Engines Plugin Collection¶
Description¶
The Engines plugin collection provides unified AI inference acceleration across a wide range of hardware platforms and AI frameworks. It offers a common interface for neural network inference while leveraging platform-specific optimizations, enabling CVEDIA-RT to deliver optimal performance across diverse hardware accelerators.
Each engine plugin implements a standardized interface while providing hardware-specific optimizations, automatic device detection, and performance monitoring capabilities. This architecture enables seamless deployment across different platforms without code changes.
Key Features¶
- Multi-Platform Support: Unified interface across NVIDIA, Intel, ARM, and specialized AI processors
- Automatic Device Detection: Runtime discovery and selection of optimal hardware
- Hardware-Specific Optimizations: Leverage unique capabilities of each platform
- Fallback Support: Graceful degradation to software inference when hardware is unavailable
- Performance Monitoring: Unified statistics and profiling across all engines
- Model Format Support: ONNX, TensorFlow, PyTorch, and framework-native formats
- Dynamic Configuration: Runtime switching between different inference engines
Available Engine Plugins¶
Plugin | Description | Hardware Platform | Model Formats |
---|---|---|---|
ARMnn | ARM Compute Library integration for ARM Cortex processors with CPU and GPU acceleration | ARM CPUs/GPUs | ONNX, TensorFlow Lite |
Blaize | Blaize Graph API integration for ultra-low power AI inference with structured sparsity | Blaize AI Processors | Blaize Native |
CVFlow | Ambarella CVFlow AI acceleration for real-time vision processing on embedded systems | Ambarella SoCs | CVFlow Native |
Hailo | Hailo AI processor acceleration with hardware-aware model optimization and multi-stream support | Hailo-8/Hailo-15 | Hailo HEF |
MEMX | MEMX DLA integration for memory-centric AI processing with near-data computing | MEMX Accelerators | MEMX Native |
MNN | Mobile Neural Network framework for cross-platform inference on mobile and embedded devices | Mobile/Embedded | ONNX, TensorFlow |
OpenVINO | Intel OpenVINO inference platform with multi-device support and heterogeneous execution | Intel CPUs/GPUs/VPUs | IR, ONNX, PaddlePaddle |
Paddle | PaddlePaddle inference integration for industrial AI deployment with server optimization | CPUs/GPUs | PaddlePaddle Native |
RKNN | Rockchip Neural Network runtime for NPU acceleration on embedded Rockchip SoCs | Rockchip SoCs | RKNN Native |
SNPE | Qualcomm Snapdragon Neural Processing Engine for mobile and automotive applications | Qualcomm SoCs | DLC (Deep Learning Container) |
SigmaStar | SigmaStar SoC AI acceleration for embedded vision with integrated ISP pipeline | SigmaStar SoCs | SigmaStar Native |
TensorRT | NVIDIA TensorRT GPU acceleration with dynamic shapes and mixed precision (legacy) | NVIDIA GPUs | TensorRT Engines, ONNX |
TensorRT10 | NVIDIA TensorRT 10.x GPU acceleration with latest optimizations and features | NVIDIA GPUs | TensorRT Engines, ONNX |
Requirements¶
Hardware Requirements¶
Requirements vary by engine, but generally include:
- CPU: Multi-core processor (ARM, x86, x64)
- Memory: Minimum 2GB RAM; 8GB or more recommended for complex models
- Storage: Sufficient space for model files and engine-specific libraries
- Specialized Hardware: Target AI accelerators (GPUs, NPUs, VPUs) as required by specific engines
Software Dependencies¶
Common dependencies:
- CVEDIA-RT Core: Base plugin infrastructure
- XTensor: Multi-dimensional array library for tensor operations
- Tracy: Performance profiling and monitoring
- Engine-Specific SDKs: Platform-specific runtime libraries and drivers
Engine-specific requirements:
- NVIDIA TensorRT: CUDA runtime, cuDNN library
- Intel OpenVINO: Intel distribution of OpenVINO toolkit
- ARM NN: ARM Compute Library
- Qualcomm SNPE: Snapdragon Neural Processing Engine SDK
- Others: Respective hardware vendor SDKs and runtime libraries
Configuration¶
Basic Configuration¶
{
"inference": {
"engine": "auto",
"device": "auto",
"model_file": "/path/to/model.onnx",
"optimization_level": 1,
"enable_profiling": false
}
}
Advanced Multi-Engine Configuration¶
{
"inference": {
"engines": [
{
"name": "tensorrt",
"priority": 1,
"device": "GPU",
"config": {
"precision": "FP16",
"max_batch_size": 8,
"workspace_size": "1GB"
}
},
{
"name": "openvino",
"priority": 2,
"device": "CPU",
"config": {
"num_threads": 4,
"optimization_level": 2
}
},
{
"name": "armnn",
"priority": 3,
"device": "ARM_GPU",
"config": {
"fast_math": true,
"tuning_level": 2
}
}
],
"fallback_enabled": true,
"auto_selection": {
"enabled": true,
"criteria": ["performance", "power_efficiency"],
"benchmark_iterations": 10
}
}
}
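When fallback_enabled is true, engines are tried in ascending priority order until one succeeds. The sketch below illustrates that idea at the application level using only the factory and handler calls from the API Reference further down; it is not the collection's internal implementation, and it assumes expected<> results convert to false on error.

#include "inference/engines/enginefactory.h"

// Illustrative fallback loop: try engines in priority order and return the first
// successful result. Engine names follow the configuration example above.
std::vector<xt::xarray<float>> runWithFallback(std::vector<Tensor>& inputs) {
    const std::vector<std::string> enginesByPriority = {"tensorrt", "openvino", "armnn"};
    for (const auto& name : enginesByPriority) {
        if (!EngineFactory::isEngineAvailable(name)) {
            continue; // SDK or hardware for this engine is missing
        }
        auto engine = EngineFactory::createEngine(name);
        if (!engine->loadModel("/path/to/model.onnx") || !engine->loadBackend()) {
            continue; // move on to the next engine in the priority list
        }
        auto result = engine->runInference(inputs);
        if (result) {
            return result.value(); // first engine that succeeds wins
        }
    }
    return {}; // all engines failed
}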
Configuration Schema¶
Parameter | Type | Default | Description |
---|---|---|---|
engine | string | "auto" | Target engine name or "auto" for automatic selection |
device | string | "auto" | Target device or "auto" for automatic detection |
model_file | string | required | Path to model file |
optimization_level | int | 1 | Optimization level (0-3, higher = more aggressive) |
enable_profiling | bool | false | Enable performance profiling |
engines[] | array | [] | Engine priority configuration for multi-engine setups |
fallback_enabled | bool | true | Enable automatic fallback to alternative engines |
auto_selection.enabled | bool | true | Enable automatic engine selection |
auto_selection.criteria | array | ["performance"] | Selection criteria for automatic engine selection |
API Reference¶
Common Engine Interface¶
All engine plugins implement a standardized interface through the InferenceHandler class:
class InferenceHandler {
public:
// Model management
virtual expected<void> loadModel(std::string const& path) = 0;
virtual expected<void> loadBackend() = 0;
// Device management
virtual expected<void> setDevice(std::string const& device) = 0;
virtual std::string getActiveDevice() = 0;
virtual std::vector<std::pair<std::string, std::string>> getDeviceGuids() = 0;
// Inference execution
virtual expected<std::vector<xt::xarray<float>>> runInference(std::vector<Tensor>& inputs) = 0;
// Model information
virtual std::vector<int> getInputShape() = 0;
virtual std::vector<int> getOutputShape() = 0;
virtual pCValue getCapabilities() = 0;
// Performance monitoring
virtual internal::ResourceUsage getResourceUsage() = 0;
};
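For orientation, the following is a hypothetical skeleton of what a new engine plugin might look like when implementing this interface; the class name, shapes, and return values are placeholders, and a real plugin would delegate each method to its vendor SDK.

// StubEngine is a placeholder name; real plugins wrap their vendor runtime here.
class StubEngine : public InferenceHandler {
public:
    // Model management
    expected<void> loadModel(std::string const& path) override {
        modelPath_ = path; // a real engine parses/compiles the model here
        return {};
    }
    expected<void> loadBackend() override { return {}; } // initialize the runtime/SDK

    // Device management
    expected<void> setDevice(std::string const& device) override {
        device_ = device;
        return {};
    }
    std::string getActiveDevice() override { return device_; }
    std::vector<std::pair<std::string, std::string>> getDeviceGuids() override {
        return {{"CPU", "stub-cpu-0"}}; // id/description pairs for discovered devices
    }

    // Inference execution: convert inputs to the native format, execute,
    // and return outputs as xt::xarray<float>
    expected<std::vector<xt::xarray<float>>> runInference(std::vector<Tensor>& inputs) override {
        (void)inputs;
        return std::vector<xt::xarray<float>>{};
    }

    // Model information (placeholder shapes)
    std::vector<int> getInputShape() override { return {1, 3, 224, 224}; }
    std::vector<int> getOutputShape() override { return {1, 1000}; }
    pCValue getCapabilities() override { return nullptr; }

    // Performance monitoring
    internal::ResourceUsage getResourceUsage() override { return {}; }

private:
    std::string modelPath_;
    std::string device_ = "CPU";
};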
Engine Factory Pattern¶
// Engine creation through factory pattern
class EngineFactory {
public:
static std::shared_ptr<InferenceHandler> createEngine(
const std::string& engineName,
const CValue& config = {});
static std::vector<std::string> getAvailableEngines();
static EngineCapabilities getEngineCapabilities(const std::string& engineName);
};
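A short usage sketch of the factory, assuming an application-defined preference list; only factory methods shown above (plus isEngineAvailable, which appears in the examples below) are exercised, and the engine identifiers are examples.

// Pick the first engine from an ordered preference list, falling back to "auto".
std::shared_ptr<InferenceHandler> createPreferredEngine(const CValue& config) {
    const std::vector<std::string> preferred = {"tensorrt", "openvino", "armnn"};
    for (const auto& name : preferred) {
        if (EngineFactory::isEngineAvailable(name)) {
            return EngineFactory::createEngine(name, config);
        }
    }
    return EngineFactory::createEngine("auto", config); // let the runtime decide
}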
Lua Integration¶
-- Automatic engine selection
local inference = api.factory.inference.create(instance, "auto_engine")
inference:loadModel("/models/detection.onnx")
-- Specific engine selection
local tensorrt_inference = api.factory.inference.create(instance, "tensorrt_engine")
tensorrt_inference:setDevice("GPU")
tensorrt_inference:loadModel("/models/detection.engine")
-- Multi-engine configuration
local multi_engine = api.factory.inference.create(instance, "multi_engine")
multi_engine:configureEngines({
{name = "tensorrt", priority = 1, device = "GPU"},
{name = "openvino", priority = 2, device = "CPU"}
})
Examples¶
Automatic Engine Selection¶
#include "inference/engines/enginefactory.h"
// Create engine with automatic selection
auto engine = EngineFactory::createEngine("auto", {
{"model_file", "/path/to/model.onnx"},
{"optimization_level", 2},
{"enable_profiling", true}
});
// Load and initialize
engine->loadModel("/path/to/model.onnx");
engine->loadBackend();
// Get selected engine information
auto capabilities = engine->getCapabilities();
LOGI << "Selected engine: " << capabilities->get("engine_name").getString();
LOGI << "Target device: " << engine->getActiveDevice();
Multi-Engine Fallback Setup¶
// Configure primary and fallback engines
std::vector<EngineConfig> engineConfigs = {
{"tensorrt", "GPU", 1, {{"precision", "FP16"}, {"max_batch_size", 8}}},
{"openvino", "CPU", 2, {{"num_threads", 4}}},
{"armnn", "ARM_GPU", 3, {{"fast_math", true}}}
};
auto engine = EngineFactory::createMultiEngine(engineConfigs);
// Inference with automatic fallback
std::vector<Tensor> inputs = {inputTensor};
auto result = engine->runInference(inputs);
if (result) {
auto outputs = result.value();
// Process results from whichever engine succeeded
} else {
LETE << "All engines failed: " << result.error().message();
}
Platform-Specific Optimization¶
// NVIDIA GPU optimization
if (EngineFactory::isEngineAvailable("tensorrt")) {
auto trtEngine = EngineFactory::createEngine("tensorrt", {
{"precision", "FP16"},
{"use_cuda_graph", true},
{"optimization_profile", "high_throughput"}
});
trtEngine->setDevice("0"); // GPU 0
trtEngine->loadModel("/models/optimized.engine");
}
// Intel CPU/VPU optimization
if (EngineFactory::isEngineAvailable("openvino")) {
auto oviEngine = EngineFactory::createEngine("openvino", {
{"device", "AUTO"}, // Auto-select CPU/GPU/VPU
{"performance_hint", "THROUGHPUT"},
{"cache_dir", "/tmp/openvino_cache"}
});
oviEngine->loadModel("/models/model.xml");
}
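An equivalent ARM-side sketch, reusing the fast_math and tuning_level options from the multi-engine configuration earlier on this page (option names are illustrative and may differ per ARM NN release):

// ARM GPU optimization (illustrative)
if (EngineFactory::isEngineAvailable("armnn")) {
    auto armEngine = EngineFactory::createEngine("armnn", {
        {"fast_math", true},   // relaxed math for faster kernels
        {"tuning_level", 2}    // spend time tuning GPU workgroup sizes
    });
    armEngine->setDevice("ARM_GPU");
    armEngine->loadModel("/models/model.onnx");
}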
Performance Benchmarking¶
-- Benchmark multiple engines
local engines = {"tensorrt", "openvino", "armnn"}
local results = {}
for _, engine_name in ipairs(engines) do
if api.factory.inference.isAvailable(engine_name) then
local engine = api.factory.inference.create(instance, engine_name)
engine:loadModel("/models/benchmark.onnx")
-- Warmup
for i = 1, 5 do
engine:runInference({sample_input})
end
-- Benchmark
local start_time = api.system.getCurrentTime()
for i = 1, 100 do
engine:runInference({sample_input})
end
local end_time = api.system.getCurrentTime()
results[engine_name] = {
avg_latency = (end_time - start_time) / 100,
device = engine:getActiveDevice(),
memory_usage = engine:getResourceUsage().memory_mb
}
end
end
-- Select best performing engine
local best_engine = selectBestEngine(results)
print("Best engine:", best_engine)
Hardware Support Matrix¶
NVIDIA Platforms¶
- TensorRT: RTX series, Tesla, Jetson platforms
- Supported Precisions: FP32, FP16, INT8, with structured sparsity support
- Features: Dynamic shapes, CUDA graphs, DLA acceleration (Jetson)
Intel Platforms¶
- OpenVINO: Core, Xeon, Atom processors, Intel Arc GPUs, Movidius VPUs
- Supported Precisions: FP32, FP16, INT8 (INT8 accelerated via VNNI)
- Features: Heterogeneous execution, auto-batching, model caching
ARM Platforms¶
- ARM NN: Cortex-A series CPUs, Mali GPUs, Ethos NPUs
- Supported Precisions: FP32, FP16, INT8
- Features: NEON optimization, GPU compute shaders
Mobile/Edge Platforms¶
- Qualcomm SNPE: Snapdragon 855+, automotive platforms
- Rockchip RKNN: RK3588, RK3566, RK3568 series
- Hailo: Hailo-8, Hailo-15 AI processors
- Features: Low power optimization, on-device learning
Best Practices¶
Engine Selection Strategy¶
- Automatic Selection: Start with engine: "auto" for optimal performance
- Hardware Matching: Align engine choice with available hardware
- Model Format: Use native formats when available for best performance
- Fallback Planning: Configure multiple engines for reliability
- Performance Testing: Benchmark engines with representative workloads
Configuration Optimization¶
- Precision Selection: Use mixed precision (FP16/INT8) for better performance
- Batch Size Tuning: Optimize batch size for your hardware and latency requirements
- Memory Management: Configure appropriate workspace/cache sizes
- Threading: Tune thread counts for multi-threaded engines (see the configuration sketch after this list)
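The sketch below shows how these knobs map onto per-engine configuration, reusing parameters from the examples earlier on this page; the values are starting points to benchmark against, not recommendations.

// GPU engine: mixed precision with bounded batch size and workspace memory.
auto gpuEngine = EngineFactory::createEngine("tensorrt", {
    {"precision", "FP16"},
    {"max_batch_size", 4},
    {"workspace_size", "512MB"}
});

// CPU engine: match thread count to the cores you can dedicate to inference.
auto cpuEngine = EngineFactory::createEngine("openvino", {
    {"num_threads", 8},
    {"optimization_level", 2}
});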
Deployment Considerations¶
- Model Optimization: Pre-optimize models for target hardware when possible
- Driver Dependencies: Ensure the proper drivers and SDKs are installed
- Resource Monitoring: Monitor GPU/CPU utilization and memory usage
- Error Handling: Implement robust fallback mechanisms
- Version Compatibility: Test engine version compatibility with models
Troubleshooting¶
Common Issues¶
Engine Not Available¶
if (!EngineFactory::isEngineAvailable("tensorrt")) {
LETE << "TensorRT engine not available";
// Check: CUDA installation, TensorRT libraries, GPU drivers
}
Model Loading Failures¶
- Format Mismatch: Verify model format matches engine expectations
- Path Issues: Check file paths and permissions
- Version Compatibility: Ensure model version matches engine version
- Hardware Requirements: Verify hardware meets model requirements (an error-handling sketch follows this list)
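Since loadModel and loadBackend return expected<void>, load failures can be caught explicitly instead of surfacing later during inference. A minimal sketch, following the error-reporting pattern used elsewhere on this page:

auto engine = EngineFactory::createEngine("openvino", {{"device", "CPU"}});

// Check each step; expected<void> converts to false when an error is present.
if (auto loaded = engine->loadModel("/models/model.xml"); !loaded) {
    LETE << "Model load failed: " << loaded.error().message();
    // Typical causes: format mismatch, bad path or permissions, version incompatibility.
    return;
}
if (auto backend = engine->loadBackend(); !backend) {
    LETE << "Backend initialization failed: " << backend.error().message();
    return;
}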
Performance Issues¶
- Suboptimal Performance: Try different precision settings (FP16, INT8)
- High Memory Usage: Reduce batch size or model complexity
- CPU Bottleneck: Increase thread count for CPU-based engines
- GPU Underutilization: Increase batch size or use multiple streams
Diagnostic Tools¶
// Engine diagnostics
auto engines = EngineFactory::getAvailableEngines();
for (const auto& name : engines) {
auto caps = EngineFactory::getEngineCapabilities(name);
LOGI << name << " - Devices: " << caps.supportedDevices.size();
}
// Runtime diagnostics
auto usage = engine->getResourceUsage();
LOGI << "Memory: " << usage.memory_mb << "MB";
LOGI << "GPU Util: " << usage.gpu_utilization << "%";
Integration Examples¶
Computer Vision Pipeline¶
// Complete inference pipeline with automatic engine selection
class VisionPipeline {
private:
std::shared_ptr<InferenceHandler> detectionEngine_;
std::shared_ptr<InferenceHandler> classificationEngine_;
public:
void initialize() {
// Detection with GPU acceleration
detectionEngine_ = EngineFactory::createEngine("auto", {
{"model_file", "/models/yolo.onnx"},
{"preferred_device", "GPU"},
{"optimization_level", 2}
});
// Classification with CPU fallback
classificationEngine_ = EngineFactory::createEngine("auto", {
{"model_file", "/models/classifier.onnx"},
{"preferred_device", "CPU"},
{"max_batch_size", 16}
});
}
std::vector<Detection> processFrame(const cv::Mat& frame) {
// Detection inference (runInference takes a non-const tensor vector and returns expected<>)
std::vector<Tensor> detectionInputs = {frameToTensor(frame)};
auto detectionResult = detectionEngine_->runInference(detectionInputs);
if (!detectionResult) return {};
auto detections = postprocessDetections(detectionResult.value());
// Classification inference for detected objects
std::vector<cv::Mat> crops;
for (const auto& detection : detections) {
crops.push_back(cropObject(frame, detection.bbox));
}
std::vector<Tensor> classificationInputs = cropsToTensors(crops);
auto classificationResult = classificationEngine_->runInference(classificationInputs);
if (!classificationResult) return detections; // fall back to unlabelled detections
return combineResults(detections, classificationResult.value());
}
};
Edge Deployment with Power Management¶
-- Edge-optimized inference with power management
local edge_config = {
engines = {
{name = "hailo", priority = 1, device = "NPU"}, -- Lowest power
{name = "armnn", priority = 2, device = "ARM_GPU"}, -- Medium power
{name = "openvino", priority = 3, device = "CPU"} -- Highest power
},
power_management = {
enabled = true,
battery_threshold = 20, -- Switch to low power engines below 20%
thermal_throttling = true
}
}
local inference = api.factory.inference.create(instance, "edge_optimized")
inference:configure(edge_config)
inference:loadModel("/models/lightweight.onnx")
See Also¶
- Inference Plugin - Core inference coordination
- Inference Overview - Complete inference plugin ecosystem
- Platform Plugins - Hardware integration plugins
- Plugin Overview - All available plugins