ClassifierEval Plugin¶
Description¶
ClassifierEval is a specialized model evaluation plugin for CVEDIA-RT that provides comprehensive performance assessment for classification models. It implements industry-standard classification metrics and supports both single-label and multi-label classification evaluation scenarios.
This plugin is designed for AI researchers, data scientists, and system integrators who need to assess the performance of their classification models using rigorous statistical analysis. It provides detailed metrics including accuracy, precision, recall, F1-scores, and confusion matrix analysis, enabling thorough model validation and comparison.
Key Features¶
- Comprehensive Metrics: Accuracy, precision, recall, F1-score, and confusion matrix analysis
- Multi-Class Support: Handles multiple classification categories simultaneously
- Multi-Label Evaluation: Supports models that predict multiple labels per sample
- Model Comparison: Differential analysis between two classification models
- Statistical Analysis: Micro and macro averaging for aggregate performance assessment
- Flexible Integration: Works with various dataset formats and inference engines
- CLI Integration: Command-line tools for batch evaluation and reporting
- JSON Export: Structured output for automated analysis and reporting
- Cross-Platform: Compatible with Windows, Linux, and embedded platforms
Use Cases¶
- Model Validation: Assess classification model performance against ground truth datasets
- Model Comparison: Compare multiple models to select the best performer
- Performance Monitoring: Track model performance over time in production systems
- Research and Development: Detailed analysis for academic research and model development
- Quality Assurance: Automated testing of classification models in CI/CD pipelines
- Benchmarking: Compare models against industry standards and baselines
Requirements¶
Hardware Requirements¶
- CPU: Multi-core processor (Intel/AMD x64 or ARM64)
- Memory: Minimum 1GB RAM (4GB+ recommended for large datasets)
- Storage: Sufficient space for dataset and result files
Software Dependencies¶
- RTCORE: CVEDIA-RT core library for plugin infrastructure
- CMake: Build system with the WITH_EVAL=ON option enabled
- C++ Runtime: C++17-compatible standard library
- JSON Library: For result serialization and export
Platform Requirements¶
- Operating Systems: Windows 10+, Linux (Ubuntu 18.04+, CentOS 7+), embedded Linux
- Architecture: x86_64, ARM64 (AArch64)
- No GPU Dependencies: CPU-only implementation for maximum compatibility
Configuration¶
Basic Single-Label Evaluation¶
{
    "classifiereval": {
        "evaluation_type": "single_label",
        "categories": ["person", "vehicle", "bicycle", "animal"],
        "output_format": "json",
        "skip_missing_categories": true,
        "case_sensitive": false
    }
}
Multi-Label Classification Configuration¶
{
    "classifiereval": {
        "evaluation_type": "multi_label",
        "categories": ["person", "vehicle", "bicycle", "traffic_light", "sign"],
        "confidence_threshold": 0.5,
        "output_format": "detailed_json",
        "include_confusion_matrix": true,
        "calculate_per_class_metrics": true
    }
}
Model Comparison Configuration¶
{
    "classifiereval": {
        "evaluation_type": "differential",
        "categories": ["person", "vehicle", "bicycle"],
        "comparison_mode": "side_by_side",
        "output_format": "comparison_report",
        "highlight_differences": true,
        "statistical_significance": true
    }
}
Configuration Schema¶
Parameter | Type | Default | Description |
---|---|---|---|
evaluation_type | string | "single_label" | Type of evaluation ("single_label", "multi_label", "differential") |
categories | array | [] | List of classification categories to evaluate |
confidence_threshold | float | 0.5 | Threshold for positive predictions in multi-label mode |
output_format | string | "json" | Output format ("json", "detailed_json", "console", "comparison_report") |
skip_missing_categories | bool | true | Skip samples without matching categories |
case_sensitive | bool | false | Case-sensitive category matching |
include_confusion_matrix | bool | true | Include confusion matrix in output |
calculate_per_class_metrics | bool | true | Calculate metrics for each class individually |
comparison_mode | string | "side_by_side" | Comparison display mode for differential analysis |
highlight_differences | bool | false | Highlight significant differences in comparison mode |
statistical_significance | bool | false | Calculate statistical significance of differences |
API Reference¶
C++ API (ClassifierEvalImpl)¶
Core Evaluation Methods¶
class ClassifierEvalImpl {
public:
    // Single-label classification evaluation
    iface::ClassifierSummary evaluate(
        cvec classifications,                 // Model predictions
        cvec groundtruth,                     // Ground truth labels
        std::vector<std::string> categories  // Category names
    );

    // Multi-label classification evaluation
    iface::ClassifierSummary evaluateMultiLabel(
        cvec classifications,                 // Model predictions
        cvec groundtruth,                     // Ground truth labels
        std::vector<std::string> categories, // Category names
        float confidence_threshold            // Threshold for positive predictions
    );

    // Differential analysis between two models
    iface::ClassifierSummary diff(
        cvec detections1,                     // First model predictions
        cvec detections2,                     // Second model predictions
        cvec groundtruth,                     // Ground truth labels
        std::vector<std::string> categories  // Category names
    );
};
Classification Summary Structure¶
struct ClassifierSummary {
    // Confusion matrix data
    std::vector<std::vector<int>> confusionMatrix;         // Raw counts
    std::vector<std::vector<double>> confusionMatrixPerc;  // Percentages

    // System-level metrics
    float avgSystemAccuracy;  // Overall accuracy
    float systemErr;          // System error rate

    // Micro-averaged metrics (global across all samples)
    float precisionMicro;  // Micro precision
    float recallMicro;     // Micro recall
    float f1scoreMicro;    // Micro F1-score

    // Macro-averaged metrics (averaged across classes)
    float precisionMacro;  // Macro precision
    float recallMacro;     // Macro recall
    float f1scoreMacro;    // Macro F1-score

    // Per-class metrics
    std::vector<float> perClassAccuracy;   // Accuracy per class
    std::vector<float> perClassPrecision;  // Precision per class
    std::vector<float> perClassRecall;     // Recall per class
    std::vector<float> perClassF1Score;    // F1-score per class

    // Statistical data
    std::vector<int> truePositives;   // TP per class
    std::vector<int> falsePositives;  // FP per class
    std::vector<int> falseNegatives;  // FN per class
    std::vector<int> trueNegatives;   // TN per class
};
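The micro- and macro-averaged fields follow the standard definitions: micro averaging pools the per-class counts before computing the ratios (so frequent classes dominate), while macro averaging computes each metric per class and then takes the unweighted mean (so every class counts equally). The sketch below is a minimal illustration of those definitions using the count vectors above; it is not the plugin's internal implementation.

#include <cstddef>
#include <vector>

// Illustrative only: derive micro/macro precision, recall, and F1 from the
// per-class TP/FP/FN vectors exposed by ClassifierSummary.
struct AveragedMetrics {
    float precisionMicro, recallMicro, f1Micro;
    float precisionMacro, recallMacro, f1Macro;
};

static float safeDiv(float num, float den) { return den > 0.0f ? num / den : 0.0f; }
static float f1(float p, float r) { return safeDiv(2.0f * p * r, p + r); }

AveragedMetrics averageMetrics(const std::vector<int>& tp,
                               const std::vector<int>& fp,
                               const std::vector<int>& fn) {
    AveragedMetrics m{};
    int tpSum = 0, fpSum = 0, fnSum = 0;
    float pSum = 0.0f, rSum = 0.0f, f1Sum = 0.0f;
    const std::size_t n = tp.size();

    for (std::size_t i = 0; i < n; ++i) {
        tpSum += tp[i]; fpSum += fp[i]; fnSum += fn[i];
        float p = safeDiv((float)tp[i], (float)(tp[i] + fp[i]));  // per-class precision
        float r = safeDiv((float)tp[i], (float)(tp[i] + fn[i]));  // per-class recall
        pSum += p; rSum += r; f1Sum += f1(p, r);
    }

    // Micro: pool the counts across classes, then compute the ratios once.
    m.precisionMicro = safeDiv((float)tpSum, (float)(tpSum + fpSum));
    m.recallMicro    = safeDiv((float)tpSum, (float)(tpSum + fnSum));
    m.f1Micro        = f1(m.precisionMicro, m.recallMicro);

    // Macro: unweighted mean of the per-class values (one common convention).
    m.precisionMacro = n ? pSum / n : 0.0f;
    m.recallMacro    = n ? rSum / n : 0.0f;
    m.f1Macro        = n ? f1Sum / n : 0.0f;
    return m;
}

Feeding summary.truePositives, summary.falsePositives, and summary.falseNegatives to averageMetrics should produce values close to the corresponding summary fields.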
Factory Creation¶
// Create ClassifierEval instance
auto classifierEval = api::factory::ClassifierEval::create();
// Evaluate single-label classification
std::vector<std::string> categories = {"person", "vehicle", "bicycle"};
auto summary = classifierEval->evaluate(predictions, groundTruth, categories);
// Access results
float accuracy = summary.avgSystemAccuracy;
float microF1 = summary.f1scoreMicro;
float macroF1 = summary.f1scoreMacro;
Lua API¶
Classification Evaluation Setup¶
-- Create classifier evaluation instance
local classifierEval = api.factory.classifiereval.create()

-- Define evaluation categories
local categories = {"person", "vehicle", "bicycle", "animal"}

-- Perform single-label evaluation
function evaluateSingleLabel(predictions, groundTruth)
    local summary = classifierEval:evaluate(predictions, groundTruth, categories)

    api.logging.LogInfo("Classification Evaluation Results:")
    api.logging.LogInfo(" Overall Accuracy: " .. summary.avgSystemAccuracy)
    api.logging.LogInfo(" System Error Rate: " .. summary.systemErr)
    api.logging.LogInfo(" Micro Precision: " .. summary.precisionMicro)
    api.logging.LogInfo(" Micro Recall: " .. summary.recallMicro)
    api.logging.LogInfo(" Micro F1-Score: " .. summary.f1scoreMicro)
    api.logging.LogInfo(" Macro Precision: " .. summary.precisionMacro)
    api.logging.LogInfo(" Macro Recall: " .. summary.recallMacro)
    api.logging.LogInfo(" Macro F1-Score: " .. summary.f1scoreMacro)

    return summary
end
Multi-Label Evaluation¶
-- Multi-label classification evaluation
function evaluateMultiLabel(predictions, groundTruth, threshold)
    threshold = threshold or 0.5

    local summary = classifierEval:evaluateMultiLabel(
        predictions,
        groundTruth,
        categories,
        threshold
    )

    api.logging.LogInfo("Multi-Label Classification Results:")
    api.logging.LogInfo(" Confidence Threshold: " .. threshold)
    api.logging.LogInfo(" Overall Accuracy: " .. summary.avgSystemAccuracy)
    api.logging.LogInfo(" Micro F1-Score: " .. summary.f1scoreMicro)
    api.logging.LogInfo(" Macro F1-Score: " .. summary.f1scoreMacro)

    -- Print per-class metrics
    for i, category in ipairs(categories) do
        if summary.perClassPrecision[i] then
            api.logging.LogInfo(string.format(" %s - P: %.3f, R: %.3f, F1: %.3f",
                category,
                summary.perClassPrecision[i],
                summary.perClassRecall[i],
                summary.perClassF1Score[i]
            ))
        end
    end

    return summary
end
Model Comparison¶
-- Compare two classification models
function compareModels(model1_predictions, model2_predictions, groundTruth)
    local comparison = classifierEval:diff(
        model1_predictions,
        model2_predictions,
        groundTruth,
        categories
    )

    api.logging.LogInfo("Model Comparison Results:")
    api.logging.LogInfo(" Model 1 Accuracy: " .. (comparison.model1_accuracy or "N/A"))
    api.logging.LogInfo(" Model 2 Accuracy: " .. (comparison.model2_accuracy or "N/A"))
    api.logging.LogInfo(" Accuracy Difference: " .. (comparison.accuracy_difference or "N/A"))
    api.logging.LogInfo(" Statistical Significance: " .. tostring(comparison.is_significant or "N/A"))

    return comparison
end
Examples¶
Basic Single-Label Classification Evaluation¶
#include "classifiereval.h"
#include "rtcore.h"
// Basic classification evaluation system
class ClassificationEvaluator {
public:
void evaluateModel() {
// Create evaluator instance
evaluator_ = api::factory::ClassifierEval::create();
// Define categories
categories_ = {"person", "vehicle", "bicycle", "animal", "traffic_sign"};
// Load test dataset
auto testData = loadTestDataset();
// Run model inference
auto predictions = runModelInference(testData.images);
// Evaluate classification performance
auto summary = evaluator_->evaluate(predictions, testData.groundTruth, categories_);
// Display results
displayEvaluationResults(summary);
// Export detailed results
exportResults(summary, "classification_evaluation.json");
}
private:
std::unique_ptr<ClassifierEval> evaluator_;
std::vector<std::string> categories_;
void displayEvaluationResults(const iface::ClassifierSummary& summary) {
LOGI << "Classification Evaluation Results:";
LOGI << "================================";
LOGI << "Overall Accuracy: " << summary.avgSystemAccuracy * 100.0f << "%";
LOGI << "System Error Rate: " << summary.systemErr * 100.0f << "%";
LOGI << "";
LOGI << "Micro-Averaged Metrics:";
LOGI << " Precision: " << summary.precisionMicro;
LOGI << " Recall: " << summary.recallMicro;
LOGI << " F1-Score: " << summary.f1scoreMicro;
LOGI << "";
LOGI << "Macro-Averaged Metrics:";
LOGI << " Precision: " << summary.precisionMacro;
LOGI << " Recall: " << summary.recallMacro;
LOGI << " F1-Score: " << summary.f1scoreMacro;
LOGI << "";
// Display per-class metrics
LOGI << "Per-Class Metrics:";
for (size_t i = 0; i < categories_.size() && i < summary.perClassPrecision.size(); ++i) {
LOGI << " " << categories_[i] << ":";
LOGI << " Precision: " << summary.perClassPrecision[i];
LOGI << " Recall: " << summary.perClassRecall[i];
LOGI << " F1-Score: " << summary.perClassF1Score[i];
}
// Display confusion matrix
displayConfusionMatrix(summary.confusionMatrix);
}
void displayConfusionMatrix(const std::vector<std::vector<int>>& matrix) {
LOGI << "Confusion Matrix:";
LOGI << "=================";
// Header row
std::cout << std::setw(12) << "Actual\\Pred";
for (const auto& category : categories_) {
std::cout << std::setw(10) << category;
}
std::cout << std::endl;
// Matrix rows
for (size_t i = 0; i < matrix.size() && i < categories_.size(); ++i) {
std::cout << std::setw(12) << categories_[i];
for (size_t j = 0; j < matrix[i].size(); ++j) {
std::cout << std::setw(10) << matrix[i][j];
}
std::cout << std::endl;
}
}
};
Multi-Label Classification Evaluation¶
// Multi-label classification evaluation system
class MultiLabelEvaluator {
public:
    void evaluateMultiLabelModel(float confidenceThreshold = 0.5f) {
        // Initialize evaluator
        auto evaluator = api::factory::ClassifierEval::create();

        // Define multi-label categories
        std::vector<std::string> categories = {
            "person", "vehicle", "bicycle", "motorcycle",
            "traffic_light", "stop_sign", "crosswalk"
        };

        // Load multi-label dataset
        auto dataset = loadMultiLabelDataset();

        // Run multi-label inference
        auto predictions = runMultiLabelInference(dataset.images);

        // Evaluate with different thresholds
        std::vector<float> thresholds = {0.3f, 0.5f, 0.7f, 0.9f};

        for (float threshold : thresholds) {
            LOGI << "\nEvaluating with confidence threshold: " << threshold;

            auto summary = evaluator->evaluateMultiLabel(
                predictions,
                dataset.groundTruth,
                categories,
                threshold
            );

            analyzeThresholdResults(summary, threshold);
        }

        // Find optimal threshold
        float optimalThreshold = findOptimalThreshold(thresholds, categories);
        LOGI << "\nOptimal confidence threshold: " << optimalThreshold;
    }

private:
    void analyzeThresholdResults(const iface::ClassifierSummary& summary, float threshold) {
        LOGI << " Threshold: " << threshold;
        LOGI << " Micro F1-Score: " << summary.f1scoreMicro;
        LOGI << " Macro F1-Score: " << summary.f1scoreMacro;
        LOGI << " Overall Accuracy: " << summary.avgSystemAccuracy;

        // Analyze per-class performance
        float avgPrecision = 0.0f, avgRecall = 0.0f;
        int validClasses = 0;

        for (size_t i = 0; i < summary.perClassPrecision.size(); ++i) {
            if (!std::isnan(summary.perClassPrecision[i])) {
                avgPrecision += summary.perClassPrecision[i];
                avgRecall += summary.perClassRecall[i];
                validClasses++;
            }
        }

        if (validClasses > 0) {
            avgPrecision /= validClasses;
            avgRecall /= validClasses;
            LOGI << " Average Precision: " << avgPrecision;
            LOGI << " Average Recall: " << avgRecall;
        }
    }
};
Comprehensive Model Benchmarking System¶
-- Comprehensive model benchmarking and comparison system
local classifierEval = api.factory.classifiereval.create()
local fileSystem = api.factory.filesystem.create()

-- Benchmarking configuration
local benchmarkConfig = {
    models = {
        {name = "ResNet50", path = "/models/resnet50"},
        {name = "EfficientNet", path = "/models/efficientnet"},
        {name = "MobileNet", path = "/models/mobilenet"}
    },
    datasets = {
        {name = "CIFAR-10", path = "/datasets/cifar10"},
        {name = "ImageNet-Val", path = "/datasets/imagenet_val"}
    },
    categories = {"airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"},
    output_directory = "/results/classification_benchmark"
}

-- Model performance tracking
local performanceResults = {}

-- Initialize benchmarking system
function initializeBenchmark()
    api.logging.LogInfo("Initializing Classification Model Benchmark")

    -- Create output directory
    fileSystem:createDirectory(benchmarkConfig.output_directory)

    -- Initialize performance tracking
    for _, model in ipairs(benchmarkConfig.models) do
        performanceResults[model.name] = {
            evaluations = {},
            average_metrics = {},
            best_performance = {}
        }
    end

    api.logging.LogInfo("Benchmark initialization complete")
end
-- Run comprehensive benchmark
function runComprehensiveBenchmark()
    api.logging.LogInfo("Starting comprehensive classification benchmark")

    -- Benchmark each model on each dataset
    for _, model in ipairs(benchmarkConfig.models) do
        api.logging.LogInfo("Benchmarking model: " .. model.name)

        for _, dataset in ipairs(benchmarkConfig.datasets) do
            api.logging.LogInfo(" Dataset: " .. dataset.name)

            local result = benchmarkModelOnDataset(model, dataset)

            if result then
                table.insert(performanceResults[model.name].evaluations, {
                    dataset = dataset.name,
                    metrics = result,
                    timestamp = os.time()
                })

                api.logging.LogInfo(string.format(
                    " Accuracy: %.3f, F1-Micro: %.3f, F1-Macro: %.3f",
                    result.avgSystemAccuracy,
                    result.f1scoreMicro,
                    result.f1scoreMacro
                ))
            end
        end

        -- Calculate average performance for model
        calculateAveragePerformance(model.name)
    end

    -- Generate benchmark report
    generateBenchmarkReport()

    -- Perform model comparison
    performModelComparison()

    api.logging.LogInfo("Comprehensive benchmark completed")
end
-- Benchmark single model on single dataset
function benchmarkModelOnDataset(model, dataset)
    api.logging.LogInfo("Loading model: " .. model.name)

    -- Load model and dataset (implementation specific)
    local modelInstance = loadClassificationModel(model.path)
    local datasetInstance = loadDataset(dataset.path)

    if not modelInstance or not datasetInstance then
        api.logging.LogError("Failed to load model or dataset")
        return nil
    end

    -- Run inference on test set
    local predictions = {}
    local groundTruth = datasetInstance.groundTruth

    api.logging.LogInfo("Running inference on " .. #datasetInstance.images .. " images")

    for i, image in ipairs(datasetInstance.images) do
        local prediction = modelInstance:predict(image)
        table.insert(predictions, prediction)

        -- Progress logging
        if i % 100 == 0 then
            api.logging.LogInfo(" Processed " .. i .. " of " .. #datasetInstance.images .. " images")
        end
    end

    -- Evaluate classification performance
    local summary = classifierEval:evaluate(predictions, groundTruth, benchmarkConfig.categories)

    -- Export detailed results
    local outputPath = string.format("%s/%s_%s_detailed.json",
        benchmarkConfig.output_directory, model.name, dataset.name)
    exportDetailedResults(summary, outputPath)

    return summary
end
-- Calculate average performance metrics
function calculateAveragePerformance(modelName)
    local evaluations = performanceResults[modelName].evaluations
    if #evaluations == 0 then return end

    local avgMetrics = {
        avgSystemAccuracy = 0,
        f1scoreMicro = 0,
        f1scoreMacro = 0,
        precisionMicro = 0,
        recallMicro = 0,
        precisionMacro = 0,
        recallMacro = 0
    }

    -- Sum all metrics
    for _, evaluation in ipairs(evaluations) do
        local metrics = evaluation.metrics
        avgMetrics.avgSystemAccuracy = avgMetrics.avgSystemAccuracy + metrics.avgSystemAccuracy
        avgMetrics.f1scoreMicro = avgMetrics.f1scoreMicro + metrics.f1scoreMicro
        avgMetrics.f1scoreMacro = avgMetrics.f1scoreMacro + metrics.f1scoreMacro
        avgMetrics.precisionMicro = avgMetrics.precisionMicro + metrics.precisionMicro
        avgMetrics.recallMicro = avgMetrics.recallMicro + metrics.recallMicro
        avgMetrics.precisionMacro = avgMetrics.precisionMacro + metrics.precisionMacro
        avgMetrics.recallMacro = avgMetrics.recallMacro + metrics.recallMacro
    end

    -- Calculate averages
    local count = #evaluations
    for key, value in pairs(avgMetrics) do
        avgMetrics[key] = value / count
    end

    performanceResults[modelName].average_metrics = avgMetrics

    api.logging.LogInfo(string.format("Average performance for %s:", modelName))
    api.logging.LogInfo(" Accuracy: " .. tostring(avgMetrics.avgSystemAccuracy))
    api.logging.LogInfo(" F1-Score (Micro): " .. tostring(avgMetrics.f1scoreMicro))
    api.logging.LogInfo(" F1-Score (Macro): " .. tostring(avgMetrics.f1scoreMacro))
end
-- Generate comprehensive benchmark report
function generateBenchmarkReport()
    local reportPath = benchmarkConfig.output_directory .. "/benchmark_report.json"

    local report = {
        benchmark_config = benchmarkConfig,
        execution_timestamp = os.time(),
        model_results = performanceResults,
        summary = generateBenchmarkSummary()
    }

    -- Export report
    local success = fileSystem:writeJSON(reportPath, report)

    if success then
        api.logging.LogInfo("Benchmark report exported to: " .. reportPath)
    else
        api.logging.LogError("Failed to export benchmark report")
    end

    -- Generate human-readable summary
    generateHumanReadableReport()
end
-- Perform statistical comparison between models
function performModelComparison()
    api.logging.LogInfo("Performing statistical model comparison")

    local modelNames = {}
    for _, model in ipairs(benchmarkConfig.models) do
        table.insert(modelNames, model.name)
    end

    -- Compare each pair of models
    for i = 1, #modelNames do
        for j = i + 1, #modelNames do
            local model1 = modelNames[i]
            local model2 = modelNames[j]
            compareModelPair(model1, model2)
        end
    end
end

-- Compare two specific models
function compareModelPair(model1Name, model2Name)
    local metrics1 = performanceResults[model1Name].average_metrics
    local metrics2 = performanceResults[model2Name].average_metrics

    if not metrics1 or not metrics2 then
        api.logging.LogWarning("Missing metrics for comparison: " .. model1Name .. " vs " .. model2Name)
        return
    end

    api.logging.LogInfo(string.format("Comparing %s vs %s:", model1Name, model2Name))

    local accuracyDiff = metrics1.avgSystemAccuracy - metrics2.avgSystemAccuracy
    local f1MicroDiff = metrics1.f1scoreMicro - metrics2.f1scoreMicro
    local f1MacroDiff = metrics1.f1scoreMacro - metrics2.f1scoreMacro

    api.logging.LogInfo(" Accuracy Difference: " .. tostring(accuracyDiff))
    api.logging.LogInfo(" F1-Micro Difference: " .. tostring(f1MicroDiff))
    api.logging.LogInfo(" F1-Macro Difference: " .. tostring(f1MacroDiff))

    -- Determine winner
    local winner = "tie"
    if accuracyDiff > 0.01 then -- 1% threshold
        winner = model1Name
    elseif accuracyDiff < -0.01 then
        winner = model2Name
    end

    api.logging.LogInfo(" Winner: " .. winner)
end
-- Export detailed evaluation results
function exportDetailedResults(summary, outputPath)
    local detailedResults = {
        timestamp = os.time(),
        system_metrics = {
            overall_accuracy = summary.avgSystemAccuracy,
            system_error_rate = summary.systemErr
        },
        micro_averaged = {
            precision = summary.precisionMicro,
            recall = summary.recallMicro,
            f1_score = summary.f1scoreMicro
        },
        macro_averaged = {
            precision = summary.precisionMacro,
            recall = summary.recallMacro,
            f1_score = summary.f1scoreMacro
        },
        per_class_metrics = {},
        confusion_matrix = {
            raw = summary.confusionMatrix,
            percentage = summary.confusionMatrixPerc
        }
    }

    -- Add per-class metrics
    for i, category in ipairs(benchmarkConfig.categories) do
        if summary.perClassPrecision[i] then
            detailedResults.per_class_metrics[category] = {
                precision = summary.perClassPrecision[i],
                recall = summary.perClassRecall[i],
                f1_score = summary.perClassF1Score[i],
                accuracy = summary.perClassAccuracy[i]
            }
        end
    end

    -- Export to JSON
    local success = fileSystem:writeJSON(outputPath, detailedResults)

    if success then
        api.logging.LogInfo("Detailed results exported to: " .. outputPath)
    else
        api.logging.LogError("Failed to export detailed results")
    end
end
-- Initialize and run the comprehensive benchmark
initializeBenchmark()
runComprehensiveBenchmark()
api.logging.LogInfo("Classification benchmarking system completed")
Best Practices¶
Data Preparation¶
- Ground Truth Quality: Ensure accurate and consistent ground truth labels
- Category Consistency: Use consistent naming conventions for all categories
- Data Balance: Consider class imbalance when interpreting results (a quick support check is sketched after this list)
- Validation Sets: Use proper train/validation/test splits for unbiased evaluation
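Class imbalance is easy to miss once results are reduced to a single accuracy figure. Because the confusion matrix rows correspond to ground-truth classes (as in the display example above), per-class support can be recovered from the row sums. The helper below is a hypothetical sketch for that quick check; reportClassSupport is not part of the plugin API.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: recover per-class sample counts (support) from the
// confusion matrix row sums and flag strong imbalance before trusting
// aggregate metrics such as overall accuracy or micro-averaged F1.
void reportClassSupport(const std::vector<std::vector<int>>& confusionMatrix,
                        const std::vector<std::string>& categories) {
    std::vector<long long> support;
    for (const auto& row : confusionMatrix) {
        long long samples = 0;
        for (int count : row) samples += count;  // row sum = ground-truth samples of that class
        support.push_back(samples);
    }
    if (support.empty()) return;

    for (std::size_t i = 0; i < support.size() && i < categories.size(); ++i)
        std::cout << categories[i] << ": " << support[i] << " samples\n";

    long long minSupport = *std::min_element(support.begin(), support.end());
    long long maxSupport = *std::max_element(support.begin(), support.end());
    if (minSupport == 0)
        std::cout << "Warning: at least one class has no ground-truth samples\n";
    else if (maxSupport / minSupport >= 10)
        std::cout << "Warning: strong class imbalance (>= 10x); prefer macro-averaged metrics\n";
}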
Evaluation Methodology¶
- Multiple Metrics: Don't rely on accuracy alone; use precision, recall, and F1-score
- Threshold Tuning: For multi-label classification, optimize confidence thresholds
- Cross-Validation: Use cross-validation for robust performance estimation
- Statistical Significance: Test for statistical significance when comparing models (a paired-test sketch follows this list)
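When two models are evaluated on the same test set, a simple paired test such as McNemar's test indicates whether an observed accuracy difference is likely to be more than noise. The sketch below is a generic illustration operating on per-sample correctness flags; it does not describe how the plugin's statistical_significance option is implemented.

#include <cmath>
#include <cstddef>
#include <vector>

// McNemar's test (with continuity correction) on paired per-sample results.
// correct1[i] / correct2[i]: whether model 1 / model 2 classified sample i correctly.
// Returns true when the difference is significant at roughly p < 0.05
// (chi-squared > 3.841 with one degree of freedom).
bool mcnemarSignificant(const std::vector<bool>& correct1,
                        const std::vector<bool>& correct2) {
    std::size_t b = 0;  // model 1 correct, model 2 wrong
    std::size_t c = 0;  // model 1 wrong, model 2 correct
    for (std::size_t i = 0; i < correct1.size() && i < correct2.size(); ++i) {
        if (correct1[i] && !correct2[i]) ++b;
        if (!correct1[i] && correct2[i]) ++c;
    }
    if (b + c == 0) return false;  // the models never disagree
    double diff = std::fabs(static_cast<double>(b) - static_cast<double>(c)) - 1.0;
    double chi2 = (diff * diff) / static_cast<double>(b + c);
    return chi2 > 3.841;
}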
Performance Optimization¶
- Batch Evaluation: Process large datasets in batches to manage memory usage
- Parallel Processing: Utilize multi-threading for faster evaluation
- Memory Management: Monitor memory usage with large confusion matrices
- Result Caching: Cache evaluation results for repeated analysis
Integration Guidelines¶
- Automated Pipelines: Integrate evaluation into CI/CD pipelines for model validation
- Monitoring Systems: Set up alerts for performance degradation
- A/B Testing: Use for comparing model versions in production
- Documentation: Maintain detailed records of evaluation procedures and results
Troubleshooting¶
Common Issues¶
Category Mismatch Errors¶
// Handle missing or mismatched categories
std::vector<std::string> validateCategories(
    const cvec& predictions,
    const std::vector<std::string>& categories
) {
    std::set<std::string> foundCategories;

    // Extract unique categories from predictions
    for (const auto& pred : predictions) {
        if (pred->exists("label")) {
            std::string label = pred->get("label").getString();
            foundCategories.insert(label);
        }
    }

    // Check for missing categories
    std::vector<std::string> validCategories;
    for (const auto& category : categories) {
        if (foundCategories.find(category) != foundCategories.end()) {
            validCategories.push_back(category);
        } else {
            LOGW << "Category not found in predictions: " << category;
        }
    }

    return validCategories;
}
NaN Values in Metrics¶
- Cause: Division by zero when a class has no predictions or ground truth samples
- Solution: Check for empty classes and handle them gracefully (a minimal guard is sketched below)
- Prevention: Validate dataset completeness before evaluation
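A minimal mitigation, assuming the per-class vectors are copied before post-processing, is to replace NaN entries with 0 (or drop the affected classes) so one empty class does not poison averages or exported JSON:

#include <cmath>
#include <vector>

// Illustrative only: neutralize NaN per-class values (e.g. in a copy of
// summary.perClassPrecision) before averaging or exporting them.
void sanitizeMetrics(std::vector<float>& values) {
    for (float& v : values) {
        if (std::isnan(v)) v = 0.0f;  // class had no predictions or no ground truth
    }
}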
Memory Issues with Large Datasets¶
- Batch Processing: Process datasets in smaller chunks (an accumulation sketch follows this list)
- Streaming Evaluation: Implement streaming evaluation for very large datasets
- Memory Monitoring: Monitor memory usage and implement cleanup
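One workable pattern, assuming the prediction and ground-truth containers can be split into chunks, is to run evaluate() per chunk and accumulate the raw confusion-matrix counts, then recompute aggregate metrics from the summed matrix at the end. The sketch below shows only the accumulation step; the helper name is illustrative and not part of the plugin API.

#include <cstddef>
#include <vector>

// Sum per-chunk confusion matrices so aggregate metrics can be recomputed
// once all chunks have been evaluated, keeping only one chunk in memory at a time.
void accumulateConfusionMatrix(std::vector<std::vector<int>>& total,
                               const std::vector<std::vector<int>>& chunk) {
    if (total.empty()) {
        total = chunk;  // the first chunk defines the matrix dimensions
        return;
    }
    for (std::size_t i = 0; i < total.size() && i < chunk.size(); ++i)
        for (std::size_t j = 0; j < total[i].size() && j < chunk[i].size(); ++j)
            total[i][j] += chunk[i][j];
}

Each chunk's summary.confusionMatrix would be fed into this accumulator after the corresponding evaluate() call.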
Performance Degradation¶
- Profiling: Use profiling tools to identify bottlenecks
- Optimization: Optimize data structures and algorithms
- Parallel Processing: Utilize multiple threads for evaluation
Debugging Tools¶
// Classification evaluation diagnostics
void diagnoseClassificationEvaluation(
    const iface::ClassifierSummary& summary,
    const std::vector<std::string>& categories
) {
    LOGI << "Classification Evaluation Diagnostics:";
    LOGI << "=====================================";

    // Check for NaN values
    if (std::isnan(summary.avgSystemAccuracy)) {
        LOGW << "System accuracy is NaN - check data consistency";
    }

    // Check confusion matrix dimensions
    if (summary.confusionMatrix.size() != categories.size()) {
        LOGW << "Confusion matrix size mismatch: "
             << summary.confusionMatrix.size() << " vs " << categories.size();
    }

    // Analyze per-class performance
    for (size_t i = 0; i < categories.size() && i < summary.perClassPrecision.size(); ++i) {
        if (std::isnan(summary.perClassPrecision[i])) {
            LOGW << "NaN precision for category: " << categories[i];
        }
        if (summary.perClassPrecision[i] == 0.0f && summary.perClassRecall[i] == 0.0f) {
            LOGW << "Zero precision and recall for category: " << categories[i];
        }
    }

    // Check data distribution
    int totalSamples = 0;
    for (const auto& row : summary.confusionMatrix) {
        for (int count : row) {
            totalSamples += count;
        }
    }

    LOGI << "Total samples processed: " << totalSamples;

    if (totalSamples == 0) {
        LOGE << "No samples processed - check input data";
    }
}
Integration Examples¶
Automated Model Validation Pipeline¶
// Automated model validation and deployment pipeline
class ModelValidationPipeline {
public:
    struct ValidationConfig {
        float minAccuracyThreshold = 0.85f;
        float minF1ScoreThreshold = 0.80f;
        std::vector<std::string> requiredCategories;
        std::string validationDataset;
        std::string benchmarkModel;
    };

    bool validateModel(const std::string& modelPath, const ValidationConfig& config) {
        LOGI << "Starting model validation for: " << modelPath;

        // Load model and validation dataset
        auto model = loadModel(modelPath);
        auto dataset = loadValidationDataset(config.validationDataset);

        if (!model || !dataset) {
            LOGE << "Failed to load model or dataset";
            return false;
        }

        // Run inference on validation set
        auto predictions = runInference(model, dataset);

        // Evaluate performance
        auto evaluator = api::factory::ClassifierEval::create();
        auto summary = evaluator->evaluate(
            predictions,
            dataset->getGroundTruth(),
            config.requiredCategories
        );

        // Check validation criteria
        ValidationResult result = checkValidationCriteria(summary, config);

        // Generate validation report
        generateValidationReport(result, modelPath);

        // Decision on model approval
        bool approved = result.meetsAllCriteria();

        if (approved) {
            LOGI << "Model validation PASSED - ready for deployment";
            triggerModelDeployment(modelPath);
        } else {
            LOGW << "Model validation FAILED - improvement required";
            notifyDevelopmentTeam(result);
        }

        return approved;
    }

private:
    struct ValidationResult {
        bool accuracyPassed;
        bool f1ScorePassed;
        bool perClassPassed;
        float actualAccuracy;
        float actualF1Score;
        std::vector<std::string> failingCategories;

        bool meetsAllCriteria() const {
            return accuracyPassed && f1ScorePassed && perClassPassed;
        }
    };

    ValidationResult checkValidationCriteria(
        const iface::ClassifierSummary& summary,
        const ValidationConfig& config
    ) {
        ValidationResult result;

        // Check overall accuracy
        result.actualAccuracy = summary.avgSystemAccuracy;
        result.accuracyPassed = result.actualAccuracy >= config.minAccuracyThreshold;

        // Check F1-score
        result.actualF1Score = summary.f1scoreMacro;
        result.f1ScorePassed = result.actualF1Score >= config.minF1ScoreThreshold;

        // Check per-class performance
        result.perClassPassed = true;
        for (size_t i = 0; i < config.requiredCategories.size() && i < summary.perClassF1Score.size(); ++i) {
            if (summary.perClassF1Score[i] < 0.5f) { // Minimum per-class threshold
                result.failingCategories.push_back(config.requiredCategories[i]);
                result.perClassPassed = false;
            }
        }

        return result;
    }
};
Real-Time Model Monitoring System¶
// Real-time classification model monitoring
class ModelMonitoringSystem {
public:
    void initializeMonitoring() {
        // Setup periodic evaluation
        setupPeriodicEvaluation();

        // Initialize performance tracking
        initializePerformanceTracking();

        // Setup alerting system
        setupAlertingSystem();

        LOGI << "Model monitoring system initialized";
    }

    void processInferenceResults(
        const cvec& predictions,
        const cvec& groundTruth,
        const std::string& modelId
    ) {
        // Accumulate results for periodic evaluation
        accumulateResults(predictions, groundTruth, modelId);

        // Check for immediate issues
        checkImmediateIssues(predictions, modelId);
    }

private:
    struct ModelPerformanceTracker {
        std::string modelId;
        std::vector<cvec> predictions;
        std::vector<cvec> groundTruth;
        std::chrono::steady_clock::time_point lastEvaluation;
        iface::ClassifierSummary lastSummary;
        std::vector<float> performanceHistory;
    };

    std::map<std::string, ModelPerformanceTracker> trackers_;
    std::vector<std::string> monitoredCategories_;
    std::atomic<bool> running_{true};  // Controls the periodic evaluation thread

    void setupPeriodicEvaluation() {
        // Create timer for periodic evaluation (every hour)
        std::thread([this]() {
            while (running_) {
                std::this_thread::sleep_for(std::chrono::hours(1));
                performPeriodicEvaluation();
            }
        }).detach();
    }

    void performPeriodicEvaluation() {
        LOGI << "Performing periodic model evaluation";

        auto evaluator = api::factory::ClassifierEval::create();

        for (auto& [modelId, tracker] : trackers_) {
            if (tracker.predictions.empty()) {
                continue;
            }

            // Flatten accumulated results
            cvec allPredictions = flattenResults(tracker.predictions);
            cvec allGroundTruth = flattenResults(tracker.groundTruth);

            // Evaluate accumulated performance
            auto summary = evaluator->evaluate(
                allPredictions,
                allGroundTruth,
                monitoredCategories_
            );

            // Check for performance degradation
            checkPerformanceDegradation(modelId, summary, tracker.lastSummary);

            // Update tracking data
            tracker.lastSummary = summary;
            tracker.lastEvaluation = std::chrono::steady_clock::now();
            tracker.performanceHistory.push_back(summary.avgSystemAccuracy);

            // Clear accumulated data
            tracker.predictions.clear();
            tracker.groundTruth.clear();

            // Log current performance
            LOGI << "Model " << modelId << " current accuracy: " << summary.avgSystemAccuracy;
        }
    }

    void checkPerformanceDegradation(
        const std::string& modelId,
        const iface::ClassifierSummary& current,
        const iface::ClassifierSummary& previous
    ) {
        if (previous.avgSystemAccuracy == 0.0f) {
            return; // First evaluation
        }

        float accuracyDrop = previous.avgSystemAccuracy - current.avgSystemAccuracy;
        float f1Drop = previous.f1scoreMacro - current.f1scoreMacro;

        // Alert thresholds
        const float ACCURACY_ALERT_THRESHOLD = 0.05f; // 5% drop
        const float F1_ALERT_THRESHOLD = 0.05f;       // 5% drop

        if (accuracyDrop > ACCURACY_ALERT_THRESHOLD || f1Drop > F1_ALERT_THRESHOLD) {
            triggerPerformanceAlert(modelId, accuracyDrop, f1Drop);
        }
    }

    void triggerPerformanceAlert(
        const std::string& modelId,
        float accuracyDrop,
        float f1Drop
    ) {
        LOGW << "PERFORMANCE ALERT: Model " << modelId << " degradation detected";
        LOGW << " Accuracy drop: " << accuracyDrop * 100.0f << "%";
        LOGW << " F1-score drop: " << f1Drop * 100.0f << "%";

        // Send notification (implementation specific)
        sendPerformanceAlert(modelId, accuracyDrop, f1Drop);
    }
};
See Also¶
- COCOEval Plugin - Object detection evaluation metrics
- Processing Plugins Overview - All processing plugins
- Plugin Overview - Complete plugin ecosystem
- SDK Documentation - Plugin development guide
- Inference Engine Guide - Model optimization patterns