Processing Plugin Collection¶
Description¶
The Processing plugin collection provides comprehensive pre-processing and post-processing capabilities for AI inference within CVEDIA-RT. It includes specialized processors for a broad range of AI model architectures, enabling state-of-the-art computer vision models to be integrated with optimized performance and accuracy.
Key Features¶
- Comprehensive YOLO Support: Complete support for all major YOLO variants (v4, v5, v6, v7, v10, v11, YOLOX) with optimized post-processing
- Multi-Model Architecture Support: Specialized processors for classification, OCR, face detection, pose estimation, and style transfer
- High-Performance Processing: XTensor-based operations with SIMD optimization for maximum throughput
- Flexible Configuration: Configurable confidence thresholds, NMS parameters, and class filtering
- Memory Optimization: Thread-safe tensor caching and efficient memory management
- Standardized Interface: Unified processing interface across all model types
- Debug and Profiling: Built-in Tracy profiling and comprehensive logging
Use Cases¶
- Real-Time Object Detection: High-performance YOLO-based detection pipelines
- Multi-Class Classification: Image classification with support for both single-label and multi-label scenarios
- Text Recognition: OCR processing with advanced text detection and recognition
- Pose Estimation: Keypoint detection and pose analysis for human activity recognition
- Custom AI Models: Framework for integrating custom neural network architectures
Requirements¶
Software Dependencies¶
Core Requirements:

- RTCORE (CVEDIA-RT core libraries)
- xtensor with SIMD support for high-performance tensor operations
- Tracy profiler for performance monitoring
- plog for structured logging

Optional Dependencies:

- OpenCV for advanced image processing operations
- CUDA/cuDNN for GPU acceleration (when available)
- Intel MKL for optimized CPU operations
Hardware Requirements¶
Minimum:

- x64 CPU with SSE4.2 support
- 4GB RAM for basic processing
- Storage for model weights and temporary data

Recommended:

- Multi-core CPU with AVX2 support
- 8GB+ RAM for high-throughput processing
- GPU with CUDA support for acceleration
- SSD storage for optimal I/O performance
Supported AI Models¶
YOLO Family Support¶
YOLO Version | Plugin | Architecture Type | Key Features |
---|---|---|---|
YOLOv4 | PostYoloV4 | Anchor-based | Classic YOLO with CSPDarknet53 backbone |
YOLOv5 | PostYoloV5 | Anchor-based | Improved anchor system with auto-anchor support |
YOLOv6 | PostYoloV6 | Anchor-free | Enhanced multi-scale detection |
YOLOv7 | PostYoloV7 | Anchor-based | Advanced training techniques, improved accuracy |
YOLOv10 | PostYoloV10 | NMS-free | Dual assignment, end-to-end detection |
YOLOv11 | PostYoloV11 | Next-generation | Latest architectural innovations |
YOLOX | PostYoloX | Anchor-free | Decoupled head design, advanced augmentation |
Specialized Model Support¶
Model Type | Plugin | Supported Architectures | Applications |
---|---|---|---|
Classification | PostClassifier | ResNet, EfficientNet, MobileNet, ViT | Multi-class/multi-label classification |
OCR | post_ocr | CRNN, EAST, DBNet, PaddleOCR | Text detection and recognition |
Face Detection | PostUltraface | Ultraface, MTCNN, RetinaFace | Lightweight face detection |
Point Detection | PostPoint | OpenPose, PoseNet, AlphaPose | Keypoint and pose estimation |
GFL Detection | PostGFL | GFL-based architectures | Distribution-based object regression |
Style Transfer | PostNeuralstyle | Neural style transfer models | Real-time artistic effects |
Custom Processing | PostPassthru | Any architecture | Framework for custom processing |
Configuration¶
Basic YOLO Configuration¶
{
"processing": {
"yolo": {
"confThreshold": 0.2,
"nmsIouThreshold": 0.5,
"nmsScoreThreshold": 0.1,
"filterEdgeDetections": false,
"class_ids": [0, 1, 2, 15, 16]
}
}
}
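Within a processor, these values arrive through the inference configuration pointer. A minimal sketch of reading them, assuming the CValue::getValue(key, default) accessor used in the custom-processing example later on this page:

// Illustrative only: reading the YOLO block above inside a post-processor
float confThreshold = inferenceConf->getValue("confThreshold", 0.2f);
float nmsIouThreshold = inferenceConf->getValue("nmsIouThreshold", 0.5f);
bool filterEdgeDetections = inferenceConf->getValue("filterEdgeDetections", false);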
Advanced Multi-Model Configuration¶
{
"processing": {
"detection": {
"model_type": "yolov8",
"confThreshold": 0.25,
"nmsIouThreshold": 0.45,
"nmsScoreThreshold": 0.15,
"filterEdgeDetections": true,
"nmsMergeBatches": false,
"class_filtering": {
"enabled": true,
"class_ids": [0, 1, 2, 3, 5, 7],
"label_remapping": {
"0": "person",
"1": "bicycle",
"2": "car"
}
}
},
"classification": {
"model_type": "resnet50",
"threshold": 0.5,
"top_k": 5,
"multi_label": false
},
"ocr": {
"text_detection": {
"threshold": 0.7,
"link_threshold": 0.4
},
"text_recognition": {
"beam_width": 5,
"dictionary_enabled": true
}
},
"performance": {
"enable_profiling": true,
"cache_anchors": true,
"simd_optimization": true,
"batch_processing": true
}
}
}
Configuration Schema¶
Parameter | Type | Default | Description |
---|---|---|---|
confThreshold | float | 0.2 | Minimum confidence threshold for detections |
nmsIouThreshold | float | 0.5 | IoU threshold for Non-Maximum Suppression |
nmsScoreThreshold | float | 0.1 | Score threshold for NMS filtering |
filterEdgeDetections | boolean | false | Filter detections near image edges |
nmsMergeBatches | boolean | false | Merge batches during NMS processing |
class_ids | array | [] | Filter specific class IDs (empty = all classes) |
label_remapping | object | {} | Custom label name mapping |
regMax | integer | 16 | DFL regression maximum (reg_max) for supported models |
strides | array | [8, 16, 32] | Multi-scale detection strides |
enable_profiling | boolean | false | Enable Tracy profiling |
cache_anchors | boolean | true | Cache anchor grids for performance |
simd_optimization | boolean | true | Enable SIMD acceleration |
batch_processing | boolean | true | Enable batch processing optimization |
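For models with Distribution Focal Loss heads (such as the GFL processor), regMax and strides control box decoding. A hedged sketch of how these keys could sit alongside the standard detection parameters; the exact nesting may vary per deployment:

{
  "processing": {
    "detection": {
      "confThreshold": 0.25,
      "nmsIouThreshold": 0.45,
      "regMax": 16,
      "strides": [8, 16, 32]
    }
  }
}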
API Reference¶
C++ Processing Interface¶
All post-processing plugins implement a standardized interface:
namespace cvedia::rt {
// Standard post-processing interface
template<typename T>
expected<cvec> postProcess(InferenceContext& ctx,
std::vector<xt::xarray<float>>& output,
CValue* inferenceConf);
// YOLO-specific processing
class YoloProcessor {
public:
struct Config {
float confThreshold = 0.2f;
float nmsIouThreshold = 0.5f;
float nmsScoreThreshold = 0.1f;
bool filterEdgeDetections = false;
bool nmsMergeBatches = false;
std::vector<int> class_ids;
std::map<int, std::string> label_remapping;
};
static expected<cvec> process(const Config& config,
const std::vector<xt::xarray<float>>& outputs,
const ImageMetadata& metadata);
};
// Classification processing
class ClassificationProcessor {
public:
struct Result {
int class_id;
float confidence;
std::string label;
};
static expected<std::vector<Result>> classify(
const xt::xarray<float>& output,
float threshold = 0.5f,
int top_k = 5,
bool multi_label = false);
};
}
Tensor Specifications¶
YOLO Models¶
// Input tensor format
struct YoloInput {
xt::xarray<float> image; // Shape: [B, C, H, W] or [B, H, W, C]
ImageMetadata metadata; // Original image dimensions and preprocessing info
};
// Output tensor formats
struct YoloOutput {
// Grid-based format (YOLOv4, YOLOv5, YOLOv7)
xt::xarray<float> grid_output; // Shape: [B, A, H, W, C] where C = classes + 5
// Flattened format (YOLOv6, YOLOv8)
xt::xarray<float> flat_output; // Shape: [B, N, C] where N = H*W*anchors
// NMS-free format (YOLOv10)
xt::xarray<float> direct_output; // Shape: [B, detections, 7] (x1,y1,x2,y2,conf,cls,batch_id)
};
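Grid-based and flattened YOLO heads conventionally regress boxes as (cx, cy, w, h) plus objectness and per-class scores, while the NMS-free format already carries corner coordinates. A minimal sketch of the center-to-corner conversion a decoder typically performs (the helper name is illustrative):

struct CornerBox { float x1, y1, x2, y2; };

// Convert a decoded (cx, cy, w, h) box to (x1, y1, x2, y2) corners
inline CornerBox centerToCorners(float cx, float cy, float w, float h) {
    return { cx - 0.5f * w, cy - 0.5f * h, cx + 0.5f * w, cy + 0.5f * h };
}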
Classification Models¶
struct ClassificationOutput {
xt::xarray<float> logits; // Shape: [B, num_classes]
xt::xarray<float> probabilities; // Softmax applied
std::vector<std::string> labels; // Class labels
};
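The probabilities field is derived from the raw logits. A minimal, numerically stable softmax sketch with xtensor, assuming the [B, num_classes] layout above:

#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xmanipulation.hpp>

xt::xarray<float> softmax(const xt::xarray<float>& logits) {
    // Subtract the per-row maximum for numerical stability
    auto shifted = logits - xt::expand_dims(xt::amax(logits, {1}), 1);
    xt::xarray<float> e = xt::exp(shifted);
    return e / xt::expand_dims(xt::sum(e, {1}), 1);
}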
Specialized Processing APIs¶
// OCR processing
namespace ocr {
struct TextDetection {
std::vector<cv::Point2f> bbox;
float confidence;
std::string text;
};
expected<std::vector<TextDetection>> processOCR(
const xt::xarray<float>& detection_output,
const xt::xarray<float>& recognition_output,
float det_threshold = 0.7f,
float rec_threshold = 0.5f);
}
// Point detection
namespace pose {
struct Keypoint {
cv::Point2f point;
float confidence;
int joint_id;
};
struct PoseResult {
std::vector<Keypoint> keypoints;
float overall_confidence;
cv::Rect2f bbox;
};
expected<std::vector<PoseResult>> processPose(
const xt::xarray<float>& heatmaps,
const xt::xarray<float>& pafs,
float keypoint_threshold = 0.1f);
}
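A usage sketch for the pose API, assuming heatmap and PAF tensors produced by an OpenPose-style model:

// Illustrative usage of pose::processPose declared above
auto poses = pose::processPose(heatmaps, pafs, /*keypoint_threshold=*/0.1f);
if (poses) {
    for (const auto& p : poses.value()) {
        std::cout << "Pose with " << p.keypoints.size() << " keypoints, "
                  << "confidence " << p.overall_confidence << std::endl;
    }
}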
Examples¶
Basic YOLO Object Detection¶
#include "processing/yolo/postprocessor.h"
#include "inference/inference_context.h"
// Configure YOLO post-processing
YoloProcessor::Config config;
config.confThreshold = 0.25f;
config.nmsIouThreshold = 0.45f;
config.nmsScoreThreshold = 0.15f;
config.class_ids = {0, 1, 2, 3, 5, 7}; // person, bicycle, car, motorcycle, bus, truck
// Process inference results
InferenceContext ctx;
std::vector<xt::xarray<float>> model_outputs = runInference(input_image);
// Apply post-processing
auto detections = YoloProcessor::process(config, model_outputs, image_metadata);
if (detections) {
for (const auto& detection : detections.value()) {
std::cout << "Class: " << detection.class_id
<< ", Confidence: " << detection.confidence
<< ", BBox: [" << detection.bbox.x << ", "
<< detection.bbox.y << ", "
<< detection.bbox.width << ", "
<< detection.bbox.height << "]" << std::endl;
}
} else {
std::cerr << "Detection failed: " << detections.error() << std::endl;
}
Multi-Label Classification¶
#include "processing/classification/classifier.h"
// Configure classification
float threshold = 0.3f; // Lower threshold for multi-label
int top_k = 10;
bool multi_label = true;
// Process classification output
xt::xarray<float> logits = runClassificationModel(input_image);
auto results = ClassificationProcessor::classify(logits, threshold, top_k, multi_label);
if (results) {
std::cout << "Multi-label classification results:" << std::endl;
for (const auto& result : results.value()) {
std::cout << "Label: " << result.label
<< ", Confidence: " << result.confidence << std::endl;
}
}
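For multi-label outputs, per-class scores conventionally come from an element-wise sigmoid rather than a softmax, since the classes are not mutually exclusive. A minimal sketch with xtensor:

#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>

// Element-wise sigmoid over raw logits for multi-label scoring
xt::xarray<float> sigmoid(const xt::xarray<float>& logits) {
    return 1.0f / (1.0f + xt::exp(-logits));
}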
Custom Processing Pipeline¶
#include "processing/common/post_common.h"
class CustomProcessor {
public:
static expected<cvec> processCustomModel(
InferenceContext& ctx,
std::vector<xt::xarray<float>>& outputs,
CValue* config) {
// Extract configuration
float custom_threshold = config->getValue("custom_threshold", 0.5f);
bool enable_filtering = config->getValue("enable_filtering", true);
// Process first output tensor
const auto& primary_output = outputs[0];
auto shape = primary_output.shape();
cvec detections;
// Custom processing logic
for (size_t i = 0; i < shape[0]; ++i) {
for (size_t j = 0; j < shape[1]; ++j) {
float confidence = primary_output(i, j, 4); // Confidence channel
if (confidence > custom_threshold) {
// Extract detection data
Detection detection;
detection.confidence = confidence;
detection.class_id = static_cast<int>(primary_output(i, j, 5));
// Convert normalized coordinates to absolute
detection.bbox.x = primary_output(i, j, 0) * ctx.imageWidth;
detection.bbox.y = primary_output(i, j, 1) * ctx.imageHeight;
detection.bbox.width = primary_output(i, j, 2) * ctx.imageWidth;
detection.bbox.height = primary_output(i, j, 3) * ctx.imageHeight;
detections.push_back(std::make_shared<CValue>(detection.toCValue()));
}
}
}
// Apply custom filtering if enabled
if (enable_filtering) {
detections = applyCustomFiltering(detections);
}
return detections;
}
private:
static cvec applyCustomFiltering(const cvec& input) {
// Custom filtering logic
cvec filtered;
for (const auto& detection : input) {
// Apply custom filtering criteria
if (meetsCustomCriteria(detection)) {
filtered.push_back(detection);
}
}
return filtered;
}
};
Performance Monitoring¶
#include "tracy/Tracy.hpp"
#include "processing/performance/monitor.h"
class PerformanceOptimizedProcessor {
public:
static expected<cvec> processWithProfiling(
InferenceContext& ctx,
std::vector<xt::xarray<float>>& outputs,
CValue* config) {
ZoneScoped; // Tracy profiling zone
cvec results; // accumulated detections returned below
{
ZoneScopedN("Preprocessing");
// Pre-processing operations
}
{
ZoneScopedN("Core Processing");
// Main processing logic populates results
}
{
ZoneScopedN("Post-filtering");
// NMS and filtering operations
}
return results;
}
};
// Performance monitoring
class ProcessingMonitor {
public:
void recordProcessingTime(const std::string& processor,
std::chrono::milliseconds duration) {
processing_times_[processor].push_back(duration);
// Keep only last 100 measurements
if (processing_times_[processor].size() > 100) {
processing_times_[processor].erase(processing_times_[processor].begin());
}
}
double getAverageProcessingTime(const std::string& processor) const {
const auto& times = processing_times_.at(processor);
if (times.empty()) return 0.0;
auto total = std::accumulate(times.begin(), times.end(), std::chrono::milliseconds{0});
return total.count() / static_cast<double>(times.size());
}
double getProcessingFps(const std::string& processor) const {
double avg_time = getAverageProcessingTime(processor);
return avg_time > 0.0 ? 1000.0 / avg_time : 0.0;
}
private:
std::unordered_map<std::string, std::vector<std::chrono::milliseconds>> processing_times_;
};
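A usage sketch pairing the monitor with a scoped timer; the RAII helper below is illustrative rather than part of the plugin API:

#include <chrono>
#include <string>

// Hypothetical RAII timer that reports into ProcessingMonitor on destruction
class ScopedTimer {
public:
    ScopedTimer(ProcessingMonitor& monitor, std::string name)
        : monitor_(monitor), name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start_);
        monitor_.recordProcessingTime(name_, elapsed);
    }
private:
    ProcessingMonitor& monitor_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};

// Usage: times the enclosing scope and records it as "yolo_postprocess"
// {
//     ScopedTimer timer(monitor, "yolo_postprocess");
//     // ... run post-processing ...
// }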
Performance Optimization¶
Memory Optimization¶
// Thread-safe anchor caching
class AnchorCache {
public:
static xt::xarray<float> getAnchors(const std::vector<int>& strides,
int input_height, int input_width) {
std::lock_guard<std::mutex> lock(cache_mutex_);
std::string key = createCacheKey(strides, input_height, input_width);
auto it = anchor_cache_.find(key);
if (it != anchor_cache_.end()) {
return it->second;
}
// Generate and cache anchors
auto anchors = generateAnchors(strides, input_height, input_width);
anchor_cache_[key] = anchors;
return anchors;
}
private:
// C++17 inline statics keep the sketch self-contained
// (no separate out-of-class definitions needed)
inline static std::mutex cache_mutex_;
inline static std::unordered_map<std::string, xt::xarray<float>> anchor_cache_;
};
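Usage is a single lookup per input geometry; repeated calls with the same strides and resolution return the cached grid instead of regenerating it:

// Fetch (or lazily build and cache) the anchor grid for a 640x640 input
auto anchors = AnchorCache::getAnchors({8, 16, 32}, 640, 640);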
SIMD Optimization¶
// XTensor SIMD-optimized operations
namespace simd_ops {

// Vectorized confidence thresholding
xt::xarray<bool> threshold_confidence(const xt::xarray<float>& confidences,
                                      float threshold) {
    return xt::greater(confidences, threshold);
}

// Vectorized IoU calculation over [N, 4] boxes in (x1, y1, x2, y2) format
xt::xarray<float> compute_iou_matrix(const xt::xarray<float>& boxes1,
                                     const xt::xarray<float>& boxes2) {
    // SIMD-optimized IoU computation using xtensor column views
    // (operator() takes scalar indices, so xt::view is used for columns)
    auto areas1 = (xt::view(boxes1, xt::all(), 2) - xt::view(boxes1, xt::all(), 0)) *
                  (xt::view(boxes1, xt::all(), 3) - xt::view(boxes1, xt::all(), 1));
    auto areas2 = (xt::view(boxes2, xt::all(), 2) - xt::view(boxes2, xt::all(), 0)) *
                  (xt::view(boxes2, xt::all(), 3) - xt::view(boxes2, xt::all(), 1));

    // Pairwise intersections via broadcasting: [N1, 1] against [1, N2]
    auto inter_x1 = xt::maximum(xt::expand_dims(xt::view(boxes1, xt::all(), 0), 1),
                                xt::expand_dims(xt::view(boxes2, xt::all(), 0), 0));
    auto inter_y1 = xt::maximum(xt::expand_dims(xt::view(boxes1, xt::all(), 1), 1),
                                xt::expand_dims(xt::view(boxes2, xt::all(), 1), 0));
    auto inter_x2 = xt::minimum(xt::expand_dims(xt::view(boxes1, xt::all(), 2), 1),
                                xt::expand_dims(xt::view(boxes2, xt::all(), 2), 0));
    auto inter_y2 = xt::minimum(xt::expand_dims(xt::view(boxes1, xt::all(), 3), 1),
                                xt::expand_dims(xt::view(boxes2, xt::all(), 3), 0));

    auto inter_area = xt::maximum(inter_x2 - inter_x1, 0.0f) *
                      xt::maximum(inter_y2 - inter_y1, 0.0f);
    auto union_area = xt::expand_dims(areas1, 1) + xt::expand_dims(areas2, 0) - inter_area;
    return inter_area / xt::maximum(union_area, 1e-6f);
}

}  // namespace simd_ops
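The IoU matrix feeds directly into greedy NMS. A minimal sketch, assuming detections are pre-sorted by descending score so earlier indices win overlaps:

#include <cstddef>
#include <vector>

// Greedy NMS over a precomputed IoU matrix; returns indices to keep
std::vector<std::size_t> greedy_nms(const xt::xarray<float>& iou,
                                    float iou_threshold) {
    const std::size_t n = iou.shape()[0];
    std::vector<bool> suppressed(n, false);
    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < n; ++i) {
        if (suppressed[i]) continue;
        keep.push_back(i);
        for (std::size_t j = i + 1; j < n; ++j) {
            if (iou(i, j) > iou_threshold) suppressed[j] = true;
        }
    }
    return keep;
}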
Batch Processing Optimization¶
class BatchProcessor {
public:
static expected<cvec> processBatch(const std::vector<InferenceContext>& contexts,
const std::vector<std::vector<xt::xarray<float>>>& batch_outputs,
const ProcessingConfig& config) {
cvec all_detections;
// Process batches in parallel when possible
if (config.parallel_processing && batch_outputs.size() > 1) {
std::vector<std::future<expected<cvec>>> futures;
for (size_t i = 0; i < batch_outputs.size(); ++i) {
futures.push_back(std::async(std::launch::async, [&, i]() {
return processSingle(contexts[i], batch_outputs[i], config);
}));
}
for (auto& future : futures) {
auto result = future.get();
if (result) {
all_detections.insert(all_detections.end(),
result.value().begin(),
result.value().end());
}
}
} else {
// Sequential processing
for (size_t i = 0; i < batch_outputs.size(); ++i) {
auto result = processSingle(contexts[i], batch_outputs[i], config);
if (result) {
all_detections.insert(all_detections.end(),
result.value().begin(),
result.value().end());
}
}
}
return all_detections;
}
};
Troubleshooting¶
Common Issues¶
Low Detection Accuracy
// Adjust confidence thresholds
config.confThreshold = 0.1f; // Lower threshold for more detections
config.nmsScoreThreshold = 0.05f; // Lower NMS threshold
// Disable edge filtering
config.filterEdgeDetections = false;
// Check class filtering
config.class_ids.clear(); // Allow all classes
High Memory Usage
// Enable anchor caching
config.cache_anchors = true;
// Optimize tensor operations
config.simd_optimization = true;
// Reduce batch size
config.max_batch_size = 1;
Performance Issues
// Enable profiling to identify bottlenecks
config.enable_profiling = true;
// Optimize NMS parameters
config.nmsIouThreshold = 0.7f; // Higher threshold = fewer comparisons
config.nmsScoreThreshold = 0.2f; // Higher threshold = fewer candidates
Error Recovery¶
class RobustProcessor {
public:
static expected<cvec> processWithFallback(
InferenceContext& ctx,
std::vector<xt::xarray<float>>& outputs,
CValue* config) {
try {
// Primary processing
return primaryProcessor(ctx, outputs, config);
} catch (const std::exception& e) {
PLOG_WARNING << "Primary processor failed: " << e.what();
try {
// Fallback processing
return fallbackProcessor(ctx, outputs, config);
} catch (const std::exception& e2) {
PLOG_ERROR << "Fallback processor failed: " << e2.what();
return tl::unexpected(ProcessingError::TOTAL_FAILURE);
}
}
}
private:
static expected<cvec> fallbackProcessor(
InferenceContext& ctx,
std::vector<xt::xarray<float>>& outputs,
CValue* config) {
// Simplified processing with relaxed parameters
ProcessingConfig fallback_config;
fallback_config.confThreshold = 0.1f;
fallback_config.nmsIouThreshold = 0.8f;
fallback_config.enable_simd = false;
return basicProcessor(ctx, outputs, &fallback_config);
}
};
Debug Tools¶
#include "debug/visualization.h"
#include "debug/tensor_inspector.h"
class ProcessingDebugger {
public:
static void debugDetections(const cvec& detections,
const cv::Mat& image,
const std::string& save_path) {
cv::Mat debug_image = image.clone();
for (const auto& det : detections) {
auto detection = extractDetection(det);
// Draw bounding box
cv::rectangle(debug_image, detection.bbox, cv::Scalar(0, 255, 0), 2);
// Draw confidence and class
std::string label = std::to_string(detection.class_id) +
": " + std::to_string(detection.confidence);
cv::putText(debug_image, label,
cv::Point(detection.bbox.x, detection.bbox.y - 10),
cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 0), 1);
}
cv::imwrite(save_path, debug_image);
}
static void inspectTensor(const xt::xarray<float>& tensor,
const std::string& name) {
PLOG_INFO << "Tensor " << name << " inspection:";
PLOG_INFO << " Shape: " << xt::adapt(tensor.shape());
PLOG_INFO << " Min: " << xt::amin(tensor)[0];
PLOG_INFO << " Max: " << xt::amax(tensor)[0];
PLOG_INFO << " Mean: " << xt::mean(tensor)[0];
PLOG_INFO << " Std: " << xt::stddev(tensor)[0];
// Check for NaN or infinity values
bool has_nan = xt::any(xt::isnan(tensor));
bool has_inf = xt::any(xt::isinf(tensor));
if (has_nan) PLOG_WARNING << " Contains NaN values!";
if (has_inf) PLOG_WARNING << " Contains infinite values!";
}
};
Best Practices¶
Configuration Management¶
- Threshold Tuning: Start with default values and adjust based on precision/recall requirements
- Class Filtering: Use class filtering to improve performance when only specific objects are needed
- Memory Management: Enable caching and SIMD optimization for production deployments
- Profiling: Use Tracy profiling to identify performance bottlenecks
Integration Guidelines¶
- Error Handling: Implement comprehensive error handling with fallback mechanisms
- Performance Monitoring: Track processing times and memory usage in production
- Model Compatibility: Verify tensor shapes and formats match expected model outputs
- Testing: Validate processing accuracy across different model architectures
Performance Optimization¶
- Batch Processing: Use batch processing when handling multiple images simultaneously
- Parallel Processing: Leverage multi-threading for independent processing tasks
- Memory Pooling: Implement tensor memory pooling for high-throughput scenarios (a sketch follows this list)
- Hardware Acceleration: Utilize SIMD instructions and GPU acceleration when available
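The memory pooling mentioned above can be as simple as recycling buffers by shape. A minimal sketch, illustrative only and not part of the plugin API:

#include <map>
#include <mutex>
#include <vector>
#include <xtensor/xarray.hpp>
#include <xtensor/xbuilder.hpp>

// Hypothetical pool that recycles float tensors keyed by shape
class TensorPool {
public:
    xt::xarray<float> acquire(const std::vector<std::size_t>& shape) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto& bucket = free_[shape];
        if (!bucket.empty()) {
            xt::xarray<float> t = std::move(bucket.back());
            bucket.pop_back();
            return t;
        }
        return xt::zeros<float>(shape); // allocate when the pool is empty
    }
    void release(xt::xarray<float> t) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<std::size_t> shape(t.shape().begin(), t.shape().end());
        free_[shape].push_back(std::move(t));
    }
private:
    std::mutex mutex_;
    std::map<std::vector<std::size_t>, std::vector<xt::xarray<float>>> free_;
};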