
Engines Plugin Collection

Description

The Engines plugin collection provides unified AI inference acceleration across a wide range of hardware platforms and AI frameworks. It offers a common interface for neural network inference while leveraging platform-specific optimizations, enabling CVEDIA-RT to deliver optimal performance across diverse hardware accelerators.

Each engine plugin implements a standardized interface while providing hardware-specific optimizations, automatic device detection, and performance monitoring capabilities. This architecture enables seamless deployment across different platforms without code changes.

Key Features

  • Multi-Platform Support: Unified interface across NVIDIA, Intel, ARM, and specialized AI processors
  • Automatic Device Detection: Runtime discovery and selection of optimal hardware
  • Hardware-Specific Optimizations: Leverage unique capabilities of each platform
  • Fallback Support: Graceful degradation to software inference when hardware is unavailable
  • Performance Monitoring: Unified statistics and profiling across all engines
  • Model Format Support: ONNX, TensorFlow, PyTorch, and framework-native formats
  • Dynamic Configuration: Runtime switching between different inference engines

Available Engine Plugins

| Plugin | Description | Hardware Platform | Model Formats |
|--------|-------------|-------------------|---------------|
| ARMnn | ARM Compute Library integration for ARM Cortex processors with CPU and GPU acceleration | ARM CPUs/GPUs | ONNX, TensorFlow Lite |
| Blaize | Blaize Graph API integration for ultra-low power AI inference with structured sparsity | Blaize AI Processors | Blaize Native |
| CVFlow | Ambarella CVFlow AI acceleration for real-time vision processing on embedded systems | Ambarella SoCs | CVFlow Native |
| Hailo | Hailo AI processor acceleration with hardware-aware model optimization and multi-stream support | Hailo-8/Hailo-15 | Hailo HEF |
| MEMX | MEMX DLA integration for memory-centric AI processing with near-data computing | MEMX Accelerators | MEMX Native |
| MNN | Mobile Neural Network framework for cross-platform inference on mobile and embedded devices | Mobile/Embedded | ONNX, TensorFlow |
| OpenVINO | Intel OpenVINO inference platform with multi-device support and heterogeneous execution | Intel CPUs/GPUs/VPUs | IR, ONNX, PaddlePaddle |
| Paddle | PaddlePaddle inference integration for industrial AI deployment with server optimization | CPUs/GPUs | PaddlePaddle Native |
| RKNN | Rockchip Neural Network runtime for NPU acceleration on embedded Rockchip SoCs | Rockchip SoCs | RKNN Native |
| SNPE | Qualcomm Snapdragon Neural Processing Engine for mobile and automotive applications | Qualcomm SoCs | DLC (Deep Learning Container) |
| SigmaStar | SigmaStar SoC AI acceleration for embedded vision with integrated ISP pipeline | SigmaStar SoCs | SigmaStar Native |
| TensorRT | NVIDIA TensorRT GPU acceleration with dynamic shapes and mixed precision (legacy) | NVIDIA GPUs | TensorRT Engines, ONNX |
| TensorRT10 | NVIDIA TensorRT 10.x GPU acceleration with latest optimizations and features | NVIDIA GPUs | TensorRT Engines, ONNX |

Requirements

Hardware Requirements

Varies by engine, but generally includes:

  • CPU: Multi-core processor (ARM, x86, x64)
  • Memory: Minimum 2GB RAM, recommended 8GB+ for complex models
  • Storage: Sufficient space for model files and engine-specific libraries
  • Specialized Hardware: Target AI accelerators (GPUs, NPUs, VPUs) as required by specific engines

Software Dependencies

Common dependencies:

  • CVEDIA-RT Core: Base plugin infrastructure
  • XTensor: Multi-dimensional array library for tensor operations
  • Tracy: Performance profiling and monitoring
  • Engine-Specific SDKs: Platform-specific runtime libraries and drivers

Engine-specific requirements:

  • NVIDIA TensorRT: CUDA runtime, cuDNN library
  • Intel OpenVINO: Intel distribution of OpenVINO toolkit
  • ARM NN: ARM Compute Library
  • Qualcomm SNPE: Snapdragon Neural Processing Engine SDK
  • Others: Respective hardware vendor SDKs and runtime libraries

Configuration

Basic Configuration

{
  "inference": {
    "engine": "auto",
    "device": "auto",
    "model_file": "/path/to/model.onnx",
    "optimization_level": 1,
    "enable_profiling": false
  }
}

Advanced Multi-Engine Configuration

{
  "inference": {
    "engines": [
      {
        "name": "tensorrt",
        "priority": 1,
        "device": "GPU",
        "config": {
          "precision": "FP16",
          "max_batch_size": 8,
          "workspace_size": "1GB"
        }
      },
      {
        "name": "openvino",
        "priority": 2, 
        "device": "CPU",
        "config": {
          "num_threads": 4,
          "optimization_level": 2
        }
      },
      {
        "name": "armnn",
        "priority": 3,
        "device": "ARM_GPU",
        "config": {
          "fast_math": true,
          "tuning_level": 2
        }
      }
    ],
    "fallback_enabled": true,
    "auto_selection": {
      "enabled": true,
      "criteria": ["performance", "power_efficiency"],
      "benchmark_iterations": 10
    }
  }
}

Configuration Schema

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| engine | string | "auto" | Target engine name or "auto" for automatic selection |
| device | string | "auto" | Target device or "auto" for automatic detection |
| model_file | string | required | Path to the model file |
| optimization_level | int | 1 | Optimization level (0-3, higher = more aggressive) |
| enable_profiling | bool | false | Enable performance profiling |
| engines[] | array | [] | Engine priority configuration for multi-engine setups |
| fallback_enabled | bool | true | Enable automatic fallback to alternative engines |
| auto_selection.enabled | bool | true | Enable automatic engine selection |
| auto_selection.criteria | array | ["performance"] | Selection criteria for automatic engine selection |

API Reference

Common Engine Interface

All engine plugins implement a standardized interface through the InferenceHandler class:

class InferenceHandler {
public:
    // Model management
    virtual expected<void> loadModel(std::string const& path) = 0;
    virtual expected<void> loadBackend() = 0;

    // Device management
    virtual expected<void> setDevice(std::string const& device) = 0;
    virtual std::string getActiveDevice() = 0;
    virtual std::vector<std::pair<std::string, std::string>> getDeviceGuids() = 0;

    // Inference execution
    virtual expected<std::vector<xt::xarray<float>>> runInference(std::vector<Tensor>& inputs) = 0;

    // Model information
    virtual std::vector<int> getInputShape() = 0;
    virtual std::vector<int> getOutputShape() = 0;
    virtual pCValue getCapabilities() = 0;

    // Performance monitoring
    virtual internal::ResourceUsage getResourceUsage() = 0;
};

Engine Factory Pattern

// Engine creation through factory pattern
class EngineFactory {
public:
    static std::shared_ptr<InferenceHandler> createEngine(
        const std::string& engineName,
        const CValue& config = {});

    static std::vector<std::string> getAvailableEngines();
    static EngineCapabilities getEngineCapabilities(const std::string& engineName);
};
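
The sketch below ties the interface and factory together in a minimal end-to-end call sequence. It is illustrative only: the engine name, model path, and buildInputs() helper are placeholders rather than documented API, and the error handling follows the expected<> pattern used in the examples further down.

// Minimal usage sketch (illustrative; buildInputs() is a hypothetical helper
// that creates input tensors matching the reported input shape)
auto engine = EngineFactory::createEngine("openvino");
engine->loadModel("/models/example.onnx");
engine->loadBackend();

// Query the model's input shape before building input tensors
auto inputShape = engine->getInputShape();
std::vector<Tensor> inputs = buildInputs(inputShape);

auto result = engine->runInference(inputs);
if (result) {
    LOGI << "Received " << result.value().size() << " output tensor(s)";
} else {
    LETE << "Inference failed: " << result.error().message();
}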

Lua Integration

-- Automatic engine selection
local inference = api.factory.inference.create(instance, "auto_engine")
inference:loadModel("/models/detection.onnx")

-- Specific engine selection
local tensorrt_inference = api.factory.inference.create(instance, "tensorrt_engine")
tensorrt_inference:setDevice("GPU")
tensorrt_inference:loadModel("/models/detection.engine")

-- Multi-engine configuration
local multi_engine = api.factory.inference.create(instance, "multi_engine")
multi_engine:configureEngines({
    {name = "tensorrt", priority = 1, device = "GPU"},
    {name = "openvino", priority = 2, device = "CPU"}
})

Examples

Automatic Engine Selection

#include "inference/engines/enginefactory.h"

// Create engine with automatic selection
auto engine = EngineFactory::createEngine("auto", {
    {"model_file", "/path/to/model.onnx"},
    {"optimization_level", 2},
    {"enable_profiling", true}
});

// Load and initialize
engine->loadModel("/path/to/model.onnx");
engine->loadBackend();

// Get selected engine information
auto capabilities = engine->getCapabilities();
LOGI << "Selected engine: " << capabilities->get("engine_name").getString();
LOGI << "Target device: " << engine->getActiveDevice();

Multi-Engine Fallback Setup

// Configure primary and fallback engines
std::vector<EngineConfig> engineConfigs = {
    {"tensorrt", "GPU", 1, {{"precision", "FP16"}, {"max_batch_size", 8}}},
    {"openvino", "CPU", 2, {{"num_threads", 4}}},
    {"armnn", "ARM_GPU", 3, {{"fast_math", true}}}
};

auto engine = EngineFactory::createMultiEngine(engineConfigs);

// Inference with automatic fallback
std::vector<Tensor> inputs = {inputTensor};
auto result = engine->runInference(inputs);

if (result) {
    auto outputs = result.value();
    // Process results from whichever engine succeeded
} else {
    LETE << "All engines failed: " << result.error().message();
}

Platform-Specific Optimization

// NVIDIA GPU optimization
if (EngineFactory::isEngineAvailable("tensorrt")) {
    auto trtEngine = EngineFactory::createEngine("tensorrt", {
        {"precision", "FP16"},
        {"use_cuda_graph", true},
        {"optimization_profile", "high_throughput"}
    });

    trtEngine->setDevice("0"); // GPU 0
    trtEngine->loadModel("/models/optimized.engine");
}

// Intel CPU/VPU optimization
if (EngineFactory::isEngineAvailable("openvino")) {
    auto oviEngine = EngineFactory::createEngine("openvino", {
        {"device", "AUTO"}, // Auto-select CPU/GPU/VPU
        {"performance_hint", "THROUGHPUT"},
        {"cache_dir", "/tmp/openvino_cache"}
    });

    oviEngine->loadModel("/models/model.xml");
}

Performance Benchmarking

-- Benchmark multiple engines
local engines = {"tensorrt", "openvino", "armnn"}
local results = {}
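-- 'sample_input' below is assumed to be a prepared input tensor matching the benchmark model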

for _, engine_name in ipairs(engines) do
    if api.factory.inference.isAvailable(engine_name) then
        local engine = api.factory.inference.create(instance, engine_name)
        engine:loadModel("/models/benchmark.onnx")

        -- Warmup
        for i = 1, 5 do
            engine:runInference({sample_input})
        end

        -- Benchmark
        local start_time = api.system.getCurrentTime()
        for i = 1, 100 do
            engine:runInference({sample_input})
        end
        local end_time = api.system.getCurrentTime()

        results[engine_name] = {
            avg_latency = (end_time - start_time) / 100,
            device = engine:getActiveDevice(),
            memory_usage = engine:getResourceUsage().memory_mb
        }
    end
end

-- Select best performing engine (here: lowest average latency)
local function selectBestEngine(benchmark_results)
    local best_name, best_latency = nil, math.huge
    for name, r in pairs(benchmark_results) do
        if r.avg_latency < best_latency then
            best_name, best_latency = name, r.avg_latency
        end
    end
    return best_name
end

local best_engine = selectBestEngine(results)
print("Best engine:", best_engine)

Hardware Support Matrix

NVIDIA Platforms

  • TensorRT: RTX series, Tesla, Jetson platforms
  • Supported Precisions: FP32, FP16, INT8 (with structured sparsity support)
  • Features: Dynamic shapes, CUDA graphs, DLA acceleration (Jetson)

Intel Platforms

  • OpenVINO: Core, Xeon, Atom processors, Intel Arc GPUs, Movidius VPUs
  • Supported Precisions: FP32, FP16, INT8 (INT8 accelerated via VNNI)
  • Features: Heterogeneous execution, auto-batching, model caching

ARM Platforms

  • ARM NN: Cortex-A series CPUs, Mali GPUs, Ethos NPUs
  • Supported Precisions: FP32, FP16, INT8
  • Features: NEON optimization, GPU compute shaders

Mobile/Edge Platforms

  • Qualcomm SNPE: Snapdragon 855+, automotive platforms
  • Rockchip RKNN: RK3588, RK3566, RK3568 series
  • Hailo: Hailo-8, Hailo-15 AI processors
  • Features: Low power optimization, on-device learning

Best Practices

Engine Selection Strategy

  1. Automatic Selection: Start with engine: "auto" for optimal performance
  2. Hardware Matching: Align engine choice with available hardware
  3. Model Format: Use native formats when available for best performance
  4. Fallback Planning: Configure multiple engines for reliability
  5. Performance Testing: Benchmark engines with representative workloads

Configuration Optimization

  1. Precision Selection: Use reduced precision (FP16/INT8) where accuracy permits for better performance
  2. Batch Size Tuning: Optimize batch size for your hardware and latency requirements
  3. Memory Management: Configure appropriate workspace/cache sizes
  4. Threading: Tune thread counts for multi-threaded engines (a combined configuration sketch follows this list)
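
As a rough combined illustration, the snippet below applies several of these knobs using configuration keys that already appear in the examples above (precision, max_batch_size, workspace_size, num_threads). The accepted keys and values are engine-specific, so treat this as a sketch rather than a definitive reference.

// Combined tuning sketch; keys mirror the configuration examples above
auto gpuEngine = EngineFactory::createEngine("tensorrt", {
    {"precision", "FP16"},       // reduced precision for higher throughput
    {"max_batch_size", 8},       // balance latency against GPU utilization
    {"workspace_size", "1GB"}    // memory budget for engine building
});

auto cpuEngine = EngineFactory::createEngine("openvino", {
    {"num_threads", 4},          // match physical core count for CPU inference
    {"optimization_level", 2}
});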

Deployment Considerations

  1. Model Optimization: Pre-optimize models for target hardware when possible
  2. Driver Dependencies: Ensure the required drivers and SDKs are installed
  3. Resource Monitoring: Monitor GPU/CPU utilization and memory usage
  4. Error Handling: Implement robust fallback mechanisms
  5. Version Compatibility: Test engine version compatibility with models

Troubleshooting

Common Issues

Engine Not Available

if (!EngineFactory::isEngineAvailable("tensorrt")) {
    LETE << "TensorRT engine not available";
    // Check: CUDA installation, TensorRT libraries, GPU drivers
}

Model Loading Failures

  • Format Mismatch: Verify model format matches engine expectations
  • Path Issues: Check file paths and permissions
  • Version Compatibility: Ensure model version matches engine version
  • Hardware Requirements: Verify the hardware meets the model's requirements; a programmatic check of the load result is sketched below
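
Because loadModel() returns an expected<void>, load failures can also be checked programmatically. A minimal check, given an engine instance and following the error-reporting pattern used elsewhere in this document:

auto loaded = engine->loadModel("/models/detection.onnx");
if (!loaded) {
    // Typical causes: wrong format for this engine, bad path or permissions,
    // or a model built against an incompatible engine/SDK version
    LETE << "Model load failed: " << loaded.error().message();
}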

Performance Issues

  • Suboptimal Performance: Try different precision settings (FP16, INT8)
  • High Memory Usage: Reduce batch size or model complexity
  • CPU Bottleneck: Increase thread count for CPU-based engines
  • GPU Underutilization: Increase batch size or use multiple streams

Diagnostic Tools

// Engine diagnostics
auto engines = EngineFactory::getAvailableEngines();
for (const auto& name : engines) {
    auto caps = EngineFactory::getEngineCapabilities(name);
    LOGI << name << " - Devices: " << caps.supportedDevices.size();
}

// Runtime diagnostics
auto usage = engine->getResourceUsage();
LOGI << "Memory: " << usage.memory_mb << "MB";
LOGI << "GPU Util: " << usage.gpu_utilization << "%";

Integration Examples

Computer Vision Pipeline

// Complete inference pipeline with automatic engine selection
class VisionPipeline {
private:
    std::shared_ptr<InferenceHandler> detectionEngine_;
    std::shared_ptr<InferenceHandler> classificationEngine_;

public:
    void initialize() {
        // Detection with GPU acceleration
        detectionEngine_ = EngineFactory::createEngine("auto", {
            {"model_file", "/models/yolo.onnx"},
            {"preferred_device", "GPU"},
            {"optimization_level", 2}
        });

        // Classification with CPU fallback
        classificationEngine_ = EngineFactory::createEngine("auto", {
            {"model_file", "/models/classifier.onnx"},
            {"preferred_device", "CPU"},
            {"max_batch_size", 16}
        });
    }

    std::vector<Detection> processFrame(const cv::Mat& frame) {
        // Detection inference (inputs held in an lvalue so they bind to runInference's reference parameter)
        std::vector<Tensor> detectionInputs = {frameToTensor(frame)};
        auto detectionResult = detectionEngine_->runInference(detectionInputs);
        auto detections = postprocessDetections(detectionResult.value());

        // Classification inference for detected objects
        std::vector<cv::Mat> crops;
        for (const auto& detection : detections) {
            crops.push_back(cropObject(frame, detection.bbox));
        }

        auto cropTensors = cropsToTensors(crops);
        auto classificationResult = classificationEngine_->runInference(cropTensors);
        return combineResults(detections, classificationResult.value());
    }
};

Edge Deployment with Power Management

-- Edge-optimized inference with power management
local edge_config = {
    engines = {
        {name = "hailo", priority = 1, device = "NPU"}, -- Lowest power
        {name = "armnn", priority = 2, device = "ARM_GPU"}, -- Medium power
        {name = "openvino", priority = 3, device = "CPU"} -- Highest power
    },
    power_management = {
        enabled = true,
        battery_threshold = 20, -- Switch to low power engines below 20%
        thermal_throttling = true
    }
}

local inference = api.factory.inference.create(instance, "edge_optimized")
inference:configure(edge_config)
inference:loadModel("/models/lightweight.onnx")

See Also