COCOEval Plugin

Description

COCOEval is a model evaluation plugin for CVEDIA-RT that implements COCO (Common Objects in Context) dataset evaluation metrics. It provides comprehensive performance assessment for object detection models using standard COCO evaluation protocols and benchmarking methodologies.

This plugin enables rigorous evaluation of object detection models against ground truth data using industry-standard COCO metrics including Average Precision (AP), Average Recall (AR), precision-recall curves, and size-based performance analysis. It supports both single-image evaluation and batch processing for comprehensive model validation and benchmarking.

Key Features

  • COCO-Compliant Evaluation: Full implementation of official COCO evaluation protocols and metrics
  • Average Precision Calculation: Comprehensive AP metrics at multiple IoU thresholds (0.50:0.95 in 0.05 steps)
  • Size-Based Analysis: Performance evaluation across small, medium, and large object categories
  • Multi-Class Support: Evaluation across multiple object classes with per-class metrics
  • Precision-Recall Curves: Generation of detailed precision-recall curves for analysis
  • F1 Score Calculation: Combined precision and recall metrics for balanced assessment
  • Batch Evaluation: Efficient processing of large evaluation datasets
  • Statistical Reporting: Comprehensive statistical summaries and performance reports
  • IoU Matching: Configurable Intersection over Union thresholds for detection matching (see the sketch after this list)
  • Ground Truth Validation: Robust validation against annotated ground truth data
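
For orientation, the snippet below illustrates the arithmetic behind two of these features: the Intersection over Union used to match detections to ground truth, and the F1 score derived from precision and recall. It is an illustrative Lua sketch, not the plugin's internal implementation; boxes use the {x, y, w, h} format that appears in the detection examples later in this document.

-- Illustrative only: IoU of two axis-aligned boxes given as {x, y, w, h}
local function iou(a, b)
    local x1 = math.max(a.x, b.x)
    local y1 = math.max(a.y, b.y)
    local x2 = math.min(a.x + a.w, b.x + b.w)
    local y2 = math.min(a.y + a.h, b.y + b.h)
    local inter = math.max(0, x2 - x1) * math.max(0, y2 - y1)
    local union = a.w * a.h + b.w * b.h - inter
    return union > 0 and inter / union or 0
end

-- F1 score: harmonic mean of precision and recall
local function f1Score(precision, recall)
    if precision + recall == 0 then return 0 end
    return 2 * precision * recall / (precision + recall)
end

print(iou({x = 10, y = 10, w = 50, h = 50}, {x = 30, y = 30, w = 50, h = 50}))  -- ~0.22
print(f1Score(0.8, 0.6))                                                        -- ~0.69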

Requirements

Hardware Requirements

  • CPU: Multi-core processor for efficient batch processing
  • Memory: Minimum 4GB RAM (8GB+ recommended for large datasets)
  • Storage: Sufficient storage for evaluation datasets and result caching

Software Dependencies

  • RTCORE: CVEDIA-RT core library for plugin infrastructure
  • Mathematical Libraries: Statistical computation and linear algebra libraries
  • JSON Library: JSON parsing for COCO format annotations
  • Image Processing: Basic image processing capabilities for bounding box operations
  • Threading Library: Multi-threading support for parallel evaluation

Data Requirements

  • COCO Format Annotations: Ground truth annotations in COCO JSON format (see the example after this list)
  • Detection Results: Model predictions in compatible format
  • Image Metadata: Image dimensions and reference information
  • Category Definitions: Object class definitions and category mappings
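
For reference, a minimal COCO-format ground truth file has the shape below (standard COCO JSON; these are the same top-level fields checked by the validation example in the Troubleshooting section, with bbox given as [x, y, width, height] in pixels):

{
  "images": [
    {"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [100.0, 120.0, 80.0, 60.0], "area": 4800.0, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "person"}
  ]
}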

Configuration

Basic Configuration

{
  "cocoeval": {
    "iou_thresholds": [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95],
    "area_ranges": {
      "small": [0, 32],
      "medium": [32, 96], 
      "large": [96, 10000000000]
    },
    "max_detections": [1, 10, 100],
    "confidence_threshold": 0.1
  }
}

Advanced Configuration

{
  "cocoeval": {
    "iou_thresholds": [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95],
    "area_ranges": {
      "small": [0, 32],
      "medium": [32, 96],
      "large": [96, 10000000000],
      "extra_large": [256, 10000000000]
    },
    "max_detections": [1, 10, 100, 300],
    "confidence_threshold": 0.05,
    "evaluation_settings": {
      "use_segm": false,
      "use_crowd": true,
      "area_rng_lbl": ["all", "small", "medium", "large"],
      "max_dets_lbl": ["all", "small", "medium", "large"]
    },
    "output_settings": {
      "save_results": true,
      "result_format": "json",
      "include_curves": true,
      "detailed_stats": true
    },
    "performance": {
      "parallel_processing": true,
      "batch_size": 100,
      "cache_results": true
    }
  }
}

Configuration Schema

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| iou_thresholds | array | [0.5:0.95] | IoU thresholds for evaluation |
| area_ranges | object | COCO standard | Object size ranges (pixels²) |
| max_detections | array | [1, 10, 100] | Maximum detections per image |
| confidence_threshold | float | 0.1 | Minimum confidence for evaluation |
| use_segm | bool | false | Use segmentation masks |
| use_crowd | bool | true | Include crowd annotations |
| save_results | bool | true | Save evaluation results |
| result_format | string | "json" | Output format ("json", "csv") |
| include_curves | bool | true | Include precision-recall curves |
| detailed_stats | bool | true | Generate detailed statistics |
| parallel_processing | bool | true | Enable parallel evaluation |
| batch_size | int | 100 | Evaluation batch size |
| cache_results | bool | true | Cache intermediate results |

API Reference

C++ API (CocoEvalImpl)

Core Evaluation Methods

class CocoEvalImpl : public CocoEval {
public:
    // Main evaluation functions
    expected<CocoSummary> evaluate(const cvec& detections, 
                                  const cvec& groundtruth,
                                  const std::vector<std::string>& categories,
                                  int refWidth, int refHeight) override;

    expected<CocoSummary> evaluateImages(const std::vector<ImageEvaluation>& imageEvals) override;

    // Statistical analysis
    expected<void> accumulate() override;
    expected<std::string> summaryToString(const CocoSummary& summary) override;

    // Configuration and setup
    expected<void> setCategories(const std::vector<std::string>& categories) override;
    expected<void> setEvaluationParams(const EvaluationParams& params) override;
    expected<EvaluationParams> getEvaluationParams() override;
};

Data Structures

struct CocoSummary {
    std::vector<APAR> apar;         // Average Precision/Average Recall results
    float f1Score = 0.0f;           // F1 score
    float ap = 0.0f;                // Average Precision (AP @0.5:0.95)
    float ap50 = 0.0f;              // AP @ IoU=0.5
    float ap75 = 0.0f;              // AP @ IoU=0.75
    float apSmall = 0.0f;           // AP for small objects
    float apMedium = 0.0f;          // AP for medium objects  
    float apLarge = 0.0f;           // AP for large objects
    float ar = 0.0f;                // Average Recall (AR)
    float arSmall = 0.0f;           // AR for small objects
    float arMedium = 0.0f;          // AR for medium objects
    float arLarge = 0.0f;           // AR for large objects

    // Per-class metrics
    std::map<std::string, float> classAP;    // Per-class Average Precision
    std::map<std::string, float> classAR;    // Per-class Average Recall
};

struct APAR {
    float ap = 0.0f;                // Average Precision
    float ar = 0.0f;                // Average Recall
    float f1 = 0.0f;                // F1 Score
    std::string category;           // Object category
    std::string areaRange;          // Size range ("small", "medium", "large")
    float iouThreshold = 0.5f;      // IoU threshold
};

struct EvaluationParams {
    std::vector<float> iouThresholds = {0.5f, 0.55f, 0.6f, 0.65f, 0.7f, 0.75f, 
                                       0.8f, 0.85f, 0.9f, 0.95f};
    std::vector<int64_t> areaRanges = {0, 32, 96, 10000000000LL};  // 64-bit: upper bound exceeds int range
    std::vector<int> maxDetections = {1, 10, 100};
    float confidenceThreshold = 0.1f;
    bool useCrowd = true;
};

Detection and Ground Truth Structures

struct Detection {
    cv::Rect2f bbox;               // Bounding box (normalized or pixel coordinates)
    float confidence = 0.0f;       // Detection confidence
    std::string category;          // Object category/class
    int imageId = -1;              // Image identifier
    std::vector<cv::Point2f> segmentation;  // Segmentation points (optional)
};

struct GroundTruth {
    cv::Rect2f bbox;               // Ground truth bounding box
    std::string category;          // Object category/class
    int imageId = -1;              // Image identifier
    bool isCrowd = false;          // Crowd annotation flag
    float area = 0.0f;             // Object area
    std::vector<cv::Point2f> segmentation;  // Ground truth segmentation
};

Lua API

COCO Evaluation Setup

-- Create COCO evaluation instance
local cocoEval = api.factory.cocoeval.create(instance, "model_evaluator")

-- Configure COCO evaluation parameters
cocoEval:configure({
    iou_thresholds = {0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95},
    area_ranges = {
        small = {0, 32},
        medium = {32, 96},
        large = {96, 10000000000}
    },
    max_detections = {1, 10, 100},
    confidence_threshold = 0.1,
    save_results = true,
    detailed_stats = true
})

-- Set object categories
cocoEval:setCategories({"person", "car", "bicycle", "dog", "cat"})

Model Evaluation Processing

-- Evaluate model performance against ground truth
function evaluateModelPerformance(detections, groundTruth, imageWidth, imageHeight)
    -- Perform COCO evaluation
    local results = cocoEval:evaluate(detections, groundTruth, 
                                     {"person", "car", "bicycle"}, 
                                     imageWidth, imageHeight)

    if results then
        api.logging.LogInfo("COCO Evaluation Results:")
        api.logging.LogInfo("  Average Precision (AP): " .. results.ap)
        api.logging.LogInfo("  AP @ IoU=0.5: " .. results.ap50)
        api.logging.LogInfo("  AP @ IoU=0.75: " .. results.ap75)
        api.logging.LogInfo("  AP (small objects): " .. results.apSmall)
        api.logging.LogInfo("  AP (medium objects): " .. results.apMedium)
        api.logging.LogInfo("  AP (large objects): " .. results.apLarge)
        api.logging.LogInfo("  Average Recall (AR): " .. results.ar)
        api.logging.LogInfo("  F1 Score: " .. results.f1Score)

        -- Print per-class results
        if results.classAP then
            api.logging.LogInfo("\nPer-Class Average Precision:")
            for class, ap in pairs(results.classAP) do
                api.logging.LogInfo(string.format("  %s: %.3f", class, ap))
            end
        end

        return results
    else
        api.logging.LogError("Evaluation failed")
        return nil
    end
end

-- Process detection results for evaluation
function processDetectionResults(modelOutput)
    local detections = {}

    for _, detection in ipairs(modelOutput) do
        table.insert(detections, {
            bbox = {
                x = detection.x,
                y = detection.y,
                w = detection.w, 
                h = detection.h
            },
            confidence = detection.confidence,
            category = detection.class,
            image_id = detection.image_id
        })
    end

    return detections
end

Batch Evaluation

-- Batch evaluation for dataset validation
function performBatchEvaluation(datasetPath, modelResultsPath)
    api.logging.LogInfo("Starting batch COCO evaluation...")

    -- Load ground truth annotations
    local groundTruth = loadCocoAnnotations(datasetPath .. "/annotations.json")

    -- Load model detection results  
    local detections = loadDetectionResults(modelResultsPath)

    -- Perform batch evaluation
    local batchResults = {}
    local totalImages = #groundTruth

    for i = 1, totalImages do
        local imageGT = groundTruth[i]
        local imageDets = filterDetectionsByImage(detections, imageGT.image_id)

        -- Evaluate single image
        local imageResult = cocoEval:evaluate(imageDets, {imageGT}, 
                                            cocoEval:getCategories(),
                                            imageGT.width, imageGT.height)

        if imageResult then
            table.insert(batchResults, imageResult)
        end

        -- Progress reporting
        if i % 100 == 0 then
            api.logging.LogInfo(string.format("Processed %d/%d images", i, totalImages))
        end
    end

    -- Accumulate results
    cocoEval:accumulate()

    -- Generate final summary
    local finalSummary = calculateBatchSummary(batchResults)

    api.logging.LogInfo("Batch evaluation completed:")
    api.logging.LogInfo("  Dataset size: " .. totalImages)
    api.logging.LogInfo("  Overall AP: " .. finalSummary.ap)
    api.logging.LogInfo("  Overall AR: " .. finalSummary.ar)

    return finalSummary
end

-- Calculate summary statistics from batch results
function calculateBatchSummary(batchResults)
    local summary = {
        ap = 0,
        ap50 = 0,
        ap75 = 0,
        ar = 0,
        f1Score = 0,
        totalImages = #batchResults
    }

    for _, result in ipairs(batchResults) do
        summary.ap = summary.ap + result.ap
        summary.ap50 = summary.ap50 + result.ap50
        summary.ap75 = summary.ap75 + result.ap75
        summary.ar = summary.ar + result.ar
        summary.f1Score = summary.f1Score + result.f1Score
    end

    -- Calculate averages
    if summary.totalImages > 0 then
        summary.ap = summary.ap / summary.totalImages
        summary.ap50 = summary.ap50 / summary.totalImages
        summary.ap75 = summary.ap75 / summary.totalImages
        summary.ar = summary.ar / summary.totalImages
        summary.f1Score = summary.f1Score / summary.totalImages
    end

    return summary
end
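
The batch example above calls a few helpers (loadCocoAnnotations, loadDetectionResults, filterDetectionsByImage) that are not shown. The simplest of these, filterDetectionsByImage, is just a filter over the detection list; a minimal sketch, assuming each detection carries the image_id field produced by processDetectionResults:

-- Minimal sketch: keep only detections belonging to a given image
function filterDetectionsByImage(detections, imageId)
    local filtered = {}
    for _, detection in ipairs(detections) do
        if detection.image_id == imageId then
            table.insert(filtered, detection)
        end
    end
    return filtered
end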

Examples

Basic Model Evaluation

#include "cocoeval.h"
#include "rtcore.h"

// Basic COCO evaluation implementation
class ModelEvaluator {
public:
    void initialize() {
        // Create COCO evaluator
        cocoEval_ = std::unique_ptr<CocoEval>(
            static_cast<CocoEval*>(
                CocoEval::create("model_evaluator").release()
            )
        );

        // Set evaluation categories
        std::vector<std::string> categories = {
            "person", "bicycle", "car", "motorcycle", "airplane",
            "bus", "train", "truck", "boat", "traffic light"
        };

        cocoEval_->setCategories(categories);

        // Configure evaluation parameters
        EvaluationParams params;
        params.iouThresholds = {0.5f, 0.55f, 0.6f, 0.65f, 0.7f, 
                               0.75f, 0.8f, 0.85f, 0.9f, 0.95f};
        params.confidenceThreshold = 0.1f;
        params.useCrowd = true;

        cocoEval_->setEvaluationParams(params);

        LOGI << "COCO evaluator initialized with " << categories.size() << " categories";
    }

    CocoSummary evaluateModel(const std::vector<Detection>& detections,
                             const std::vector<GroundTruth>& groundTruth,
                             int imageWidth, int imageHeight) {

        // Convert data to plugin format
        cvec detectionsVec = convertDetections(detections);
        cvec groundTruthVec = convertGroundTruth(groundTruth);

        // Get category names
        auto categories = getCategoryNames();

        // Perform evaluation
        auto result = cocoEval_->evaluate(detectionsVec, groundTruthVec, 
                                         categories, imageWidth, imageHeight);

        if (!result) {
            LOGE << "COCO evaluation failed: " << result.error().message();
            return CocoSummary{};
        }

        auto summary = result.value();

        // Log results
        LOGI << "COCO Evaluation Results:";
        LOGI << "  Average Precision (AP): " << summary.ap;
        LOGI << "  AP @ IoU=0.5: " << summary.ap50;
        LOGI << "  AP @ IoU=0.75: " << summary.ap75;
        LOGI << "  AP (small): " << summary.apSmall;
        LOGI << "  AP (medium): " << summary.apMedium;
        LOGI << "  AP (large): " << summary.apLarge;
        LOGI << "  Average Recall (AR): " << summary.ar;
        LOGI << "  F1 Score: " << summary.f1Score;

        return summary;
    }

private:
    std::unique_ptr<CocoEval> cocoEval_;

    cvec convertDetections(const std::vector<Detection>& detections) {
        cvec result;

        for (const auto& det : detections) {
            auto detection = CValue::create();

            // Bounding box
            auto bbox = CValue::create();
            bbox->set("x", det.bbox.x);
            bbox->set("y", det.bbox.y);
            bbox->set("w", det.bbox.width);
            bbox->set("h", det.bbox.height);
            detection->set("bbox", bbox);

            // Detection properties
            detection->set("confidence", det.confidence);
            detection->set("category", det.category);
            detection->set("image_id", det.imageId);

            result.push_back(detection);
        }

        return result;
    }

    cvec convertGroundTruth(const std::vector<GroundTruth>& groundTruth) {
        cvec result;

        for (const auto& gt : groundTruth) {
            auto truth = CValue::create();

            // Bounding box
            auto bbox = CValue::create();
            bbox->set("x", gt.bbox.x);
            bbox->set("y", gt.bbox.y);
            bbox->set("w", gt.bbox.width);
            bbox->set("h", gt.bbox.height);
            truth->set("bbox", bbox);

            // Ground truth properties
            truth->set("category", gt.category);
            truth->set("image_id", gt.imageId);
            truth->set("is_crowd", gt.isCrowd);
            truth->set("area", gt.area);

            result.push_back(truth);
        }

        return result;
    }

    std::vector<std::string> getCategoryNames() {
        return {"person", "bicycle", "car", "motorcycle", "airplane",
                "bus", "train", "truck", "boat", "traffic light"};
    }
};

Advanced Dataset Evaluation

// Advanced dataset evaluation with detailed analysis
class DatasetEvaluator {
public:
    void evaluateDataset(const std::string& datasetPath, 
                        const std::string& resultsPath) {

        // Initialize evaluation
        initialize();

        // Load dataset
        auto dataset = loadCocoDataset(datasetPath);
        auto detections = loadDetectionResults(resultsPath);

        // Perform comprehensive evaluation
        auto overallResults = performComprehensiveEvaluation(dataset, detections);

        // Generate detailed report
        generateEvaluationReport(overallResults);

        // Save results
        saveEvaluationResults(overallResults);
    }

private:
    std::unique_ptr<CocoEval> cocoEval_;

    struct DatasetResults {
        CocoSummary overallSummary;
        std::map<std::string, CocoSummary> perClassResults;
        std::map<std::string, CocoSummary> perSizeResults;
        std::vector<ImageEvaluationResult> imageResults;
        EvaluationStatistics statistics;
    };

    struct EvaluationStatistics {
        int totalImages = 0;
        int totalDetections = 0;
        int totalGroundTruth = 0;
        int truePositives = 0;
        int falsePositives = 0;
        int falseNegatives = 0;
        float averageConfidence = 0.0f;
        std::map<std::string, int> categoryDistribution;
    };

    DatasetResults performComprehensiveEvaluation(const CocoDataset& dataset,
                                                const std::vector<Detection>& detections) {
        DatasetResults results;

        // Overall evaluation
        results.overallSummary = evaluateOverall(dataset, detections);

        // Per-class evaluation
        results.perClassResults = evaluatePerClass(dataset, detections);

        // Per-size evaluation
        results.perSizeResults = evaluatePerSize(dataset, detections);

        // Per-image evaluation
        results.imageResults = evaluatePerImage(dataset, detections);

        // Calculate statistics
        results.statistics = calculateStatistics(dataset, detections);

        return results;
    }

    void generateEvaluationReport(const DatasetResults& results) {
        LOGI << "=== COCO Dataset Evaluation Report ===";
        LOGI << "";
        LOGI << "Overall Performance:";
        LOGI << "  Average Precision (AP): " << results.overallSummary.ap;
        LOGI << "  Average Recall (AR): " << results.overallSummary.ar;
        LOGI << "  F1 Score: " << results.overallSummary.f1Score;
        LOGI << "";

        LOGI << "Size-based Performance:";
        LOGI << "  Small objects AP: " << results.overallSummary.apSmall;
        LOGI << "  Medium objects AP: " << results.overallSummary.apMedium; 
        LOGI << "  Large objects AP: " << results.overallSummary.apLarge;
        LOGI << "";

        LOGI << "Per-Class Performance:";
        for (const auto& [className, summary] : results.perClassResults) {
            LOGI << "  " << className << " AP: " << summary.ap;
        }
        LOGI << "";

        LOGI << "Dataset Statistics:";
        LOGI << "  Total Images: " << results.statistics.totalImages;
        LOGI << "  Total Detections: " << results.statistics.totalDetections;
        LOGI << "  Total Ground Truth: " << results.statistics.totalGroundTruth;
        LOGI << "  True Positives: " << results.statistics.truePositives;
        LOGI << "  False Positives: " << results.statistics.falsePositives;
        LOGI << "  False Negatives: " << results.statistics.falseNegatives;

        // Calculate precision and recall (guarding against empty denominators)
        const int tp = results.statistics.truePositives;
        const int fp = results.statistics.falsePositives;
        const int fn = results.statistics.falseNegatives;
        float precision = (tp + fp) > 0 ? static_cast<float>(tp) / (tp + fp) : 0.0f;
        float recall = (tp + fn) > 0 ? static_cast<float>(tp) / (tp + fn) : 0.0f;

        LOGI << "  Precision: " << precision;
        LOGI << "  Recall: " << recall;
        LOGI << "================================";
    }
};

Complete Model Benchmarking System

-- Complete model benchmarking system with COCO evaluation
local cocoEval = api.factory.cocoeval.create(instance, "benchmark_evaluator")
local modelManager = api.factory.inference.create(instance, "benchmark_models")

-- Benchmarking configuration
local benchmarkConfig = {
    models_to_test = {
        {name = "yolov5s", path = "/models/yolov5s.onnx", type = "detection"},
        {name = "yolov5m", path = "/models/yolov5m.onnx", type = "detection"},
        {name = "efficientdet", path = "/models/efficientdet.trt", type = "detection"}
    },
    datasets = {
        {name = "coco_val", path = "/datasets/coco/val2017", annotations = "/datasets/coco/annotations/instances_val2017.json"},
        {name = "custom_test", path = "/datasets/custom/test", annotations = "/datasets/custom/annotations.json"}
    },
    evaluation_settings = {
        iou_thresholds = {0.5, 0.75, 0.9},
        confidence_thresholds = {0.1, 0.3, 0.5},
        max_detections = {100, 300, 1000}
    }
}

-- Model benchmarking results
local benchmarkResults = {
    model_results = {},
    comparative_analysis = {},
    performance_metrics = {}
}

-- Initialize comprehensive benchmarking system
function initializeBenchmarkingSystem()
    api.logging.LogInfo("Initializing COCO model benchmarking system")

    -- Configure COCO evaluation
    cocoEval:configure({
        iou_thresholds = benchmarkConfig.evaluation_settings.iou_thresholds,
        area_ranges = {
            small = {0, 32},
            medium = {32, 96},
            large = {96, 10000000000}
        },
        max_detections = benchmarkConfig.evaluation_settings.max_detections,
        confidence_threshold = 0.05,  -- Low threshold for comprehensive analysis
        save_results = true,
        detailed_stats = true,
        include_curves = true
    })

    -- Set COCO categories (80 classes)
    local cocoCategories = {
        "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck",
        "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
        "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
        "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee"
        -- ... full COCO category list
    }

    cocoEval:setCategories(cocoCategories)

    api.logging.LogInfo("Benchmarking system initialized with " .. #benchmarkConfig.models_to_test .. " models")
    return true
end

-- Run comprehensive model benchmarking
function runModelBenchmarking()
    api.logging.LogInfo("Starting comprehensive model benchmarking")

    for _, dataset in ipairs(benchmarkConfig.datasets) do
        api.logging.LogInfo("Evaluating on dataset: " .. dataset.name)

        -- Load ground truth for dataset
        local groundTruth = loadCocoGroundTruth(dataset.annotations)

        for _, model in ipairs(benchmarkConfig.models_to_test) do
            api.logging.LogInfo("Benchmarking model: " .. model.name)

            -- Initialize model
            local modelResults = benchmarkModel(model, dataset, groundTruth)

            -- Store results
            benchmarkResults.model_results[model.name] = benchmarkResults.model_results[model.name] or {}
            benchmarkResults.model_results[model.name][dataset.name] = modelResults
        end
    end

    -- Perform comparative analysis
    performComparativeAnalysis()

    -- Generate benchmark report
    generateBenchmarkReport()
end

-- Benchmark individual model
function benchmarkModel(model, dataset, groundTruth)
    local modelResults = {
        model_name = model.name,
        dataset_name = dataset.name,
        evaluation_results = {},
        performance_metrics = {},
        detailed_analysis = {}
    }

    -- Load and configure model
    modelManager:loadModel(model.path)

    -- Test different confidence thresholds
    for _, confidenceThreshold in ipairs(benchmarkConfig.evaluation_settings.confidence_thresholds) do
        api.logging.LogInfo(string.format("Testing %s with confidence threshold %.2f", 
                                     model.name, confidenceThreshold))

        -- Run inference on dataset
        local detections = runInferenceOnDataset(model, dataset, confidenceThreshold)

        -- Evaluate with COCO metrics
        local cocoResults = cocoEval:evaluate(detections, groundTruth, 
                                            cocoEval:getCategories(),
                                            dataset.image_width or 640,
                                            dataset.image_height or 480)

        if cocoResults then
            modelResults.evaluation_results[tostring(confidenceThreshold)] = cocoResults

            -- Log intermediate results
            api.logging.LogInfo(string.format(
                "Results for %s @ %.2f: AP=%.3f, AR=%.3f, F1=%.3f",
                model.name, confidenceThreshold, 
                cocoResults.ap, cocoResults.ar, cocoResults.f1Score
            ))
        end
    end

    -- Calculate performance metrics
    modelResults.performance_metrics = calculateModelPerformanceMetrics(modelResults.evaluation_results)

    -- Detailed analysis per category and size
    modelResults.detailed_analysis = performDetailedModelAnalysis(model, dataset, groundTruth)

    return modelResults
end

-- Perform comparative analysis across models
function performComparativeAnalysis()
    api.logging.LogInfo("Performing comparative analysis across models")

    local analysis = {
        best_overall_ap = {model = "", dataset = "", ap = 0},
        best_small_objects = {model = "", dataset = "", ap = 0},
        best_large_objects = {model = "", dataset = "", ap = 0},
        fastest_inference = {model = "", fps = 0},
        model_rankings = {}
    }

    -- Compare models across datasets
    for modelName, modelData in pairs(benchmarkResults.model_results) do
        for datasetName, datasetResults in pairs(modelData) do
            -- Find best confidence threshold for each model
            local bestResult = findBestConfidenceThreshold(datasetResults.evaluation_results)

            if bestResult then
                -- Track best overall performance
                if bestResult.ap > analysis.best_overall_ap.ap then
                    analysis.best_overall_ap = {
                        model = modelName,
                        dataset = datasetName,
                        ap = bestResult.ap
                    }
                end

                -- Track best small object performance
                if bestResult.apSmall > analysis.best_small_objects.ap then
                    analysis.best_small_objects = {
                        model = modelName,
                        dataset = datasetName,
                        ap = bestResult.apSmall
                    }
                end

                -- Track best large object performance
                if bestResult.apLarge > analysis.best_large_objects.ap then
                    analysis.best_large_objects = {
                        model = modelName,
                        dataset = datasetName,
                        ap = bestResult.apLarge
                    }
                end

                -- Add to rankings
                table.insert(analysis.model_rankings, {
                    model = modelName,
                    dataset = datasetName,
                    ap = bestResult.ap,
                    ar = bestResult.ar,
                    f1 = bestResult.f1Score,
                    performance = datasetResults.performance_metrics
                })
            end
        end
    end

    -- Sort rankings by AP
    table.sort(analysis.model_rankings, function(a, b) 
        return a.ap > b.ap 
    end)

    benchmarkResults.comparative_analysis = analysis
end
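
-- findBestConfidenceThreshold is referenced above and in the report below but is
-- not defined in this listing. A minimal sketch, assuming evaluation_results maps
-- each confidence threshold to a CocoSummary-style table and "best" means the
-- highest overall AP:
function findBestConfidenceThreshold(evaluationResults)
    local best = nil
    for _, result in pairs(evaluationResults) do
        if not best or result.ap > best.ap then
            best = result
        end
    end
    return best
end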

-- Generate comprehensive benchmark report
function generateBenchmarkReport()
    api.logging.LogInfo("=== COCO Model Benchmarking Report ===")
    api.logging.LogInfo("")

    -- Overall best performers
    local analysis = benchmarkResults.comparative_analysis

    api.logging.LogInfo("Best Overall Performance:")
    api.logging.LogInfo(string.format("  Model: %s on %s", 
                                  analysis.best_overall_ap.model,
                                  analysis.best_overall_ap.dataset))
    api.logging.LogInfo(string.format("  Average Precision: %.3f", analysis.best_overall_ap.ap))
    api.logging.LogInfo("")

    api.logging.LogInfo("Best Small Object Detection:")
    api.logging.LogInfo(string.format("  Model: %s on %s", 
                                  analysis.best_small_objects.model,
                                  analysis.best_small_objects.dataset))
    api.logging.LogInfo(string.format("  Small Objects AP: %.3f", analysis.best_small_objects.ap))
    api.logging.LogInfo("")

    api.logging.LogInfo("Model Rankings (by Average Precision):")
    for i, ranking in ipairs(analysis.model_rankings) do
        if i <= 10 then  -- Top 10
            api.logging.LogInfo(string.format("%d. %s (%s): AP=%.3f, AR=%.3f, F1=%.3f",
                                          i, ranking.model, ranking.dataset,
                                          ranking.ap, ranking.ar, ranking.f1))
        end
    end
    api.logging.LogInfo("")

    -- Detailed per-model analysis
    api.logging.LogInfo("Detailed Model Analysis:")
    for modelName, modelData in pairs(benchmarkResults.model_results) do
        api.logging.LogInfo(string.format("Model: %s", modelName))

        for datasetName, results in pairs(modelData) do
            local bestResult = findBestConfidenceThreshold(results.evaluation_results)
            if bestResult then
                api.logging.LogInfo(string.format("  Dataset %s: AP=%.3f, AP50=%.3f, AP75=%.3f",
                                              datasetName, bestResult.ap, 
                                              bestResult.ap50, bestResult.ap75))
                api.logging.LogInfo(string.format("    Small/Medium/Large: %.3f/%.3f/%.3f",
                                              bestResult.apSmall, bestResult.apMedium, 
                                              bestResult.apLarge))
            end
        end
        api.logging.LogInfo("")
    end

    api.logging.LogInfo("=== End of Benchmark Report ===")

    -- Save detailed results to file
    saveBenchmarkResults()
end

-- Save benchmark results to JSON file
function saveBenchmarkResults()
    local resultsFile = "/tmp/coco_benchmark_results.json"
    local jsonResults = api.json.encode(benchmarkResults)

    local file = io.open(resultsFile, "w")
    if file then
        file:write(jsonResults)
        file:close()
        api.logging.LogInfo("Benchmark results saved to: " .. resultsFile)
    else
        api.logging.LogError("Failed to save benchmark results")
    end
end

-- Initialize and run benchmarking
initializeBenchmarkingSystem()
runModelBenchmarking()

api.logging.LogInfo("COCO model benchmarking completed")

Best Practices

Evaluation Accuracy

  • Ground Truth Quality: Ensure high-quality, consistent ground truth annotations
  • IoU Threshold Selection: Use appropriate IoU thresholds based on application requirements
  • Category Consistency: Maintain consistent category definitions across datasets
  • Annotation Validation: Validate annotations for completeness and accuracy

Performance Optimization

  • Batch Processing: Use batch evaluation for large datasets to improve efficiency
  • Parallel Processing: Enable parallel processing for multi-core systems
  • Result Caching: Cache intermediate results to avoid redundant calculations
  • Memory Management: Optimize memory usage for large-scale evaluations

Statistical Rigor

  • Multiple Metrics: Use multiple evaluation metrics for comprehensive assessment
  • Cross-Validation: Perform cross-validation across different datasets
  • Statistical Significance: Ensure statistical significance in comparisons
  • Confidence Intervals: Report confidence intervals for performance metrics

Integration Guidelines

  • Model Comparison: Use consistent evaluation protocols for fair model comparison
  • Threshold Optimization: Systematically optimize confidence thresholds (see the sketch after this list)
  • Error Analysis: Perform detailed error analysis to identify model weaknesses
  • Reporting Standards: Follow standard reporting practices for reproducibility
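
As a starting point for threshold optimization, the sketch below sweeps a small set of confidence thresholds with the Lua API shown earlier and keeps the one with the highest F1 score. It assumes cocoEval is the instance created in the Lua examples above and that detections, groundTruth, categories, width, and height are already prepared:

-- Sweep confidence thresholds and keep the one with the best F1 score
function sweepConfidenceThresholds(detections, groundTruth, categories, width, height)
    local bestThreshold, bestF1 = nil, -1

    for _, threshold in ipairs({0.1, 0.2, 0.3, 0.4, 0.5}) do
        -- Keep only detections at or above the current threshold
        local filtered = {}
        for _, det in ipairs(detections) do
            if det.confidence >= threshold then
                table.insert(filtered, det)
            end
        end

        local results = cocoEval:evaluate(filtered, groundTruth, categories, width, height)
        if results and results.f1Score > bestF1 then
            bestThreshold, bestF1 = threshold, results.f1Score
        end
    end

    return bestThreshold, bestF1
end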

Troubleshooting

Common Issues

Annotation Format Problems

// Validate COCO annotation format
bool validateCocoAnnotations(const std::string& annotationFile) {
    // Check required fields
    auto annotations = loadJsonFile(annotationFile);

    if (!annotations.contains("images") || 
        !annotations.contains("annotations") ||
        !annotations.contains("categories")) {
        LOGE << "Missing required COCO annotation fields";
        return false;
    }

    // Validate bounding box formats
    for (const auto& ann : annotations["annotations"]) {
        if (!ann.contains("bbox") || ann["bbox"].size() != 4) {
            LOGE << "Invalid bounding box format in annotation ID: " << ann["id"];
            return false;
        }
    }

    return true;
}

Evaluation Discrepancies

  • Coordinate Systems: Ensure consistent coordinate systems (pixel vs. normalized); see the sketch after this list
  • Image Scaling: Account for image scaling in detection coordinates
  • Category Mapping: Verify correct category ID mappings
  • IoU Calculation: Validate IoU calculation accuracy
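
If detections arrive in normalized coordinates while the ground truth is annotated in pixels (or vice versa), convert one side before evaluating. A minimal sketch, assuming a normalized {x, y, w, h} box in the 0..1 range:

-- Convert a normalized bounding box to pixel coordinates before evaluation
function toPixelBbox(bbox, imageWidth, imageHeight)
    return {
        x = bbox.x * imageWidth,
        y = bbox.y * imageHeight,
        w = bbox.w * imageWidth,
        h = bbox.h * imageHeight
    }
end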

Performance Issues

  • Large Datasets: Optimize memory usage for large evaluation datasets
  • Batch Size: Adjust batch size based on available memory
  • Parallel Processing: Enable parallel processing for faster evaluation
  • Result Storage: Optimize result storage for large-scale evaluations

Metric Interpretation

  • AP vs AP50: AP averages precision over IoU thresholds 0.50:0.95, while AP50 uses a single IoU threshold of 0.5 and is therefore more permissive
  • Size Categories: Correctly interpret small/medium/large object categories
  • Confidence Thresholds: Select appropriate confidence thresholds for evaluation
  • Statistical Significance: Ensure sufficient sample sizes for reliable metrics

Debugging Tools

// COCO evaluation diagnostics
void diagnoseCocoEvaluation(CocoEval* cocoEval, 
                           const cvec& detections, 
                           const cvec& groundTruth) {
    // Validate input data
    LOGI << "Detection count: " << detections.size();
    LOGI << "Ground truth count: " << groundTruth.size();

    // Check category distribution
    std::map<std::string, int> detectionCounts;
    std::map<std::string, int> groundTruthCounts;

    for (const auto& det : detections) {
        detectionCounts[det->get("category").getString()]++;
    }

    for (const auto& gt : groundTruth) {
        groundTruthCounts[gt->get("category").getString()]++;
    }

    LOGI << "Category Distribution:";
    for (const auto& [category, count] : groundTruthCounts) {
        int detCount = detectionCounts[category];
        LOGI << "  " << category << ": GT=" << count << ", Det=" << detCount;
    }

    // Validate bounding boxes
    validateBoundingBoxes(detections, "detections");
    validateBoundingBoxes(groundTruth, "ground truth");
}

Integration Examples

Model Training Pipeline Integration

// Complete model training pipeline with COCO evaluation
class TrainingPipelineWithEvaluation {
public:
    void trainWithEvaluation(const TrainingConfig& config) {
        // Initialize training
        initializeTraining(config);

        // Training loop with periodic evaluation
        for (int epoch = 0; epoch < config.maxEpochs; ++epoch) {
            // Training step
            performTrainingEpoch(epoch);

            // Periodic evaluation
            if (epoch % config.evaluationInterval == 0) {
                auto evalResults = evaluateModel(epoch);

                // Log evaluation results
                logEvaluationResults(epoch, evalResults);

                // Model selection based on evaluation
                if (evalResults.ap > bestAP_) {
                    saveModelCheckpoint(epoch, evalResults);
                    bestAP_ = evalResults.ap;
                }
            }
        }

        // Final evaluation
        performFinalEvaluation();
    }

private:
    std::unique_ptr<CocoEval> cocoEval_;
    float bestAP_ = 0.0f;

    CocoSummary evaluateModel(int epoch) {
        // Load validation dataset
        auto validationData = loadValidationDataset();

        // Run inference on validation set
        auto detections = runValidationInference(validationData);

        // Perform COCO evaluation
        auto results = cocoEval_->evaluate(detections.detections, 
                                          validationData.groundTruth,
                                          validationData.categories,
                                          validationData.imageWidth,
                                          validationData.imageHeight);

        return results.value_or(CocoSummary{});
    }
};

See Also