Engines Plugin Collection¶
Description¶
The Engines plugin collection provides unified AI inference acceleration across a wide range of hardware platforms and AI frameworks. It offers a common interface for neural network inference while leveraging platform-specific optimizations, enabling CVEDIA-RT to deliver optimal performance across diverse hardware accelerators.
Each engine plugin implements a standardized interface while providing hardware-specific optimizations, automatic device detection, and performance monitoring capabilities. This architecture enables seamless deployment across different platforms without code changes.
Key Features¶
- Multi-Platform Support: Unified interface across NVIDIA, Intel, ARM, and specialized AI processors
- Automatic Device Detection: Runtime discovery and selection of optimal hardware
- Hardware-Specific Optimizations: Leverage unique capabilities of each platform
- Fallback Support: Graceful degradation to software inference when hardware is unavailable
- Performance Monitoring: Unified statistics and profiling across all engines
- Model Format Support: ONNX, TensorFlow, PyTorch, and framework-native formats
- Dynamic Configuration: Runtime switching between different inference engines
Available Engine Plugins¶
Plugin | Description | Hardware Platform | Model Formats |
---|---|---|---|
ARMnn | ARM Compute Library integration for ARM Cortex processors with CPU and GPU acceleration | ARM CPUs/GPUs | ONNX, TensorFlow Lite |
Blaize | Blaize Graph API integration for ultra-low power AI inference with structured sparsity | Blaize AI Processors | Blaize Native |
CVFlow | Ambarella CVFlow AI acceleration for real-time vision processing on embedded systems | Ambarella SoCs | CVFlow Native |
Hailo | Hailo AI processor acceleration with hardware-aware model optimization and multi-stream support | Hailo-8/Hailo-15 | Hailo HEF |
MEMX | MEMX DLA integration for memory-centric AI processing with near-data computing | MEMX Accelerators | MEMX Native |
MNN | Mobile Neural Network framework for cross-platform inference on mobile and embedded devices | Mobile/Embedded | ONNX, TensorFlow |
OpenVINO | Intel OpenVINO inference platform with multi-device support and heterogeneous execution | Intel CPUs/GPUs/VPUs | IR, ONNX, PaddlePaddle |
Paddle | PaddlePaddle inference integration for industrial AI deployment with server optimization | CPUs/GPUs | PaddlePaddle Native |
RKNN | Rockchip Neural Network runtime for NPU acceleration on embedded Rockchip SoCs | Rockchip SoCs | RKNN Native |
SNPE | Qualcomm Snapdragon Neural Processing Engine for mobile and automotive applications | Qualcomm SoCs | DLC (Deep Learning Container) |
SigmaStar | SigmaStar SoC AI acceleration for embedded vision with integrated ISP pipeline | SigmaStar SoCs | SigmaStar Native |
TensorRT | NVIDIA TensorRT GPU acceleration with dynamic shapes and mixed precision (legacy) | NVIDIA GPUs | TensorRT Engines, ONNX |
TensorRT10 | NVIDIA TensorRT 10.x GPU acceleration with latest optimizations and features | NVIDIA GPUs | TensorRT Engines, ONNX |
Requirements¶
Hardware Requirements¶
Requirements vary by engine, but generally include:
- CPU: Multi-core processor (ARM, x86, x64)
- Memory: Minimum 2GB RAM; 8GB or more recommended for complex models
- Storage: Sufficient space for model files and engine-specific libraries
- Specialized Hardware: Target AI accelerators (GPUs, NPUs, VPUs) as required by specific engines
Software Dependencies¶
Common dependencies:
- CVEDIA-RT Core: Base plugin infrastructure
- XTensor: Multi-dimensional array library for tensor operations
- Tracy: Performance profiling and monitoring
- Engine-Specific SDKs: Platform-specific runtime libraries and drivers
Engine-specific requirements:
- NVIDIA TensorRT: CUDA runtime, cuDNN library
- Intel OpenVINO: Intel distribution of OpenVINO toolkit
- ARM NN: ARM Compute Library
- Qualcomm SNPE: Snapdragon Neural Processing Engine SDK
- Others: Respective hardware vendor SDKs and runtime libraries
Configuration¶
Basic Configuration¶
{
"inference": {
"engine": "auto",
"device": "auto",
"model_file": "/path/to/model.onnx",
"optimization_level": 1,
"enable_profiling": false
}
}
Advanced Multi-Engine Configuration¶
{
"inference": {
"engines": [
{
"name": "tensorrt",
"priority": 1,
"device": "GPU",
"config": {
"precision": "FP16",
"max_batch_size": 8,
"workspace_size": "1GB"
}
},
{
"name": "openvino",
"priority": 2,
"device": "CPU",
"config": {
"num_threads": 4,
"optimization_level": 2
}
},
{
"name": "armnn",
"priority": 3,
"device": "ARM_GPU",
"config": {
"fast_math": true,
"tuning_level": 2
}
}
],
"fallback_enabled": true,
"auto_selection": {
"enabled": true,
"criteria": ["performance", "power_efficiency"],
"benchmark_iterations": 10
}
}
}
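When fallback_enabled is true, engines are tried in ascending priority order until one succeeds. The sketch below illustrates that idea at the application level using only the factory and handler calls from the API Reference further down; it is not the collection's internal implementation, and it assumes expected<> results convert to false on error.

#include "inference/engines/enginefactory.h"

// Illustrative fallback loop: try engines in priority order and return the first
// successful result. Engine names follow the configuration example above.
std::vector<xt::xarray<float>> runWithFallback(std::vector<Tensor>& inputs) {
    const std::vector<std::string> enginesByPriority = {"tensorrt", "openvino", "armnn"};
    for (const auto& name : enginesByPriority) {
        if (!EngineFactory::isEngineAvailable(name)) {
            continue; // SDK or hardware for this engine is missing
        }
        auto engine = EngineFactory::createEngine(name);
        if (!engine->loadModel("/path/to/model.onnx") || !engine->loadBackend()) {
            continue; // move on to the next engine in the priority list
        }
        auto result = engine->runInference(inputs);
        if (result) {
            return result.value(); // first engine that succeeds wins
        }
    }
    return {}; // all engines failed
}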
Configuration Schema¶
Parameter | Type | Default | Description |
---|---|---|---|
engine | string | "auto" | Target engine name or "auto" for automatic selection |
device | string | "auto" | Target device or "auto" for automatic detection |
model_file | string | required | Path to model file |
optimization_level | int | 1 | Optimization level (0-3, higher = more aggressive) |
enable_profiling | bool | false | Enable performance profiling |
engines[] | array | [] | Engine priority configuration for multi-engine setups |
fallback_enabled | bool | true | Enable automatic fallback to alternative engines |
auto_selection.enabled | bool | true | Enable automatic engine selection |
auto_selection.criteria | array | ["performance"] | Selection criteria for automatic engine selection |
API Reference¶
Common Engine Interface¶
All engine plugins implement a standardized interface through the InferenceHandler class:
class InferenceHandler {
public:
// Model management
virtual expected<void> loadModel(std::string const& path) = 0;
virtual expected<void> loadBackend() = 0;
// Device management
virtual expected<void> setDevice(std::string const& device) = 0;
virtual std::string getActiveDevice() = 0;
virtual std::vector<std::pair<std::string, std::string>> getDeviceGuids() = 0;
// Inference execution
virtual expected<std::vector<xt::xarray<float>>> runInference(std::vector<Tensor>& inputs) = 0;
// Model information
virtual std::vector<int> getInputShape() = 0;
virtual std::vector<int> getOutputShape() = 0;
virtual pCValue getCapabilities() = 0;
// Performance monitoring
virtual internal::ResourceUsage getResourceUsage() = 0;
};
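For orientation, the following is a hypothetical skeleton of what a new engine plugin might look like when implementing this interface; the class name, shapes, and return values are placeholders, and a real plugin would delegate each method to its vendor SDK.

// StubEngine is a placeholder name; real plugins wrap their vendor runtime here.
class StubEngine : public InferenceHandler {
public:
    // Model management
    expected<void> loadModel(std::string const& path) override {
        modelPath_ = path; // a real engine parses/compiles the model here
        return {};
    }
    expected<void> loadBackend() override { return {}; } // initialize the runtime/SDK

    // Device management
    expected<void> setDevice(std::string const& device) override {
        device_ = device;
        return {};
    }
    std::string getActiveDevice() override { return device_; }
    std::vector<std::pair<std::string, std::string>> getDeviceGuids() override {
        return {{"CPU", "stub-cpu-0"}}; // id/description pairs for discovered devices
    }

    // Inference execution: convert inputs to the native format, execute,
    // and return outputs as xt::xarray<float>
    expected<std::vector<xt::xarray<float>>> runInference(std::vector<Tensor>& inputs) override {
        (void)inputs;
        return std::vector<xt::xarray<float>>{};
    }

    // Model information (placeholder shapes)
    std::vector<int> getInputShape() override { return {1, 3, 224, 224}; }
    std::vector<int> getOutputShape() override { return {1, 1000}; }
    pCValue getCapabilities() override { return nullptr; }

    // Performance monitoring
    internal::ResourceUsage getResourceUsage() override { return {}; }

private:
    std::string modelPath_;
    std::string device_ = "CPU";
};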
Engine Factory Pattern¶
// Engine creation through factory pattern
class EngineFactory {
public:
static std::shared_ptr<InferenceHandler> createEngine(
const std::string& engineName,
const CValue& config = {});
static std::vector<std::string> getAvailableEngines();
static EngineCapabilities getEngineCapabilities(const std::string& engineName);
};
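A short usage sketch of the factory, assuming an application-defined preference list; only factory methods shown above (plus isEngineAvailable, which appears in the examples below) are exercised, and the engine identifiers are examples.

// Pick the first engine from an ordered preference list, falling back to "auto".
std::shared_ptr<InferenceHandler> createPreferredEngine(const CValue& config) {
    const std::vector<std::string> preferred = {"tensorrt", "openvino", "armnn"};
    for (const auto& name : preferred) {
        if (EngineFactory::isEngineAvailable(name)) {
            return EngineFactory::createEngine(name, config);
        }
    }
    return EngineFactory::createEngine("auto", config); // let the runtime decide
}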
Lua Integration¶
-- Automatic engine selection
local inference = api.factory.inference.create(instance, "auto_engine")
inference:loadModel("/models/detection.onnx")
-- Specific engine selection
local tensorrt_inference = api.factory.inference.create(instance, "tensorrt_engine")
tensorrt_inference:setDevice("GPU")
tensorrt_inference:loadModel("/models/detection.engine")
-- Multi-engine configuration
local multi_engine = api.factory.inference.create(instance, "multi_engine")
multi_engine:configureEngines({
{name = "tensorrt", priority = 1, device = "GPU"},
{name = "openvino", priority = 2, device = "CPU"}
})
Examples¶
Automatic Engine Selection¶
#include "inference/engines/enginefactory.h"
// Create engine with automatic selection
auto engine = EngineFactory::createEngine("auto", {
{"model_file", "/path/to/model.onnx"},
{"optimization_level", 2},
{"enable_profiling", true}
});
// Load and initialize
engine->loadModel("/path/to/model.onnx");
engine->loadBackend();
// Get selected engine information
auto capabilities = engine->getCapabilities();
LOGI << "Selected engine: " << capabilities->get("engine_name").getString();
LOGI << "Target device: " << engine->getActiveDevice();
Multi-Engine Fallback Setup¶
// Configure primary and fallback engines
std::vector<EngineConfig> engineConfigs = {
{"tensorrt", "GPU", 1, {{"precision", "FP16"}, {"max_batch_size", 8}}},
{"openvino", "CPU", 2, {{"num_threads", 4}}},
{"armnn", "ARM_GPU", 3, {{"fast_math", true}}}
};
auto engine = EngineFactory::createMultiEngine(engineConfigs);
// Inference with automatic fallback
std::vector<Tensor> inputs = {inputTensor};
auto result = engine->runInference(inputs);
if (result) {
auto outputs = result.value();
// Process results from whichever engine succeeded
} else {
LETE << "All engines failed: " << result.error().message();
}
Platform-Specific Optimization¶
// NVIDIA GPU optimization
if (EngineFactory::isEngineAvailable("tensorrt")) {
auto trtEngine = EngineFactory::createEngine("tensorrt", {
{"precision", "FP16"},
{"use_cuda_graph", true},
{"optimization_profile", "high_throughput"}
});
trtEngine->setDevice("0"); // GPU 0
trtEngine->loadModel("/models/optimized.engine");
}
// Intel CPU/VPU optimization
if (EngineFactory::isEngineAvailable("openvino")) {
auto oviEngine = EngineFactory::createEngine("openvino", {
{"device", "AUTO"}, // Auto-select CPU/GPU/VPU
{"performance_hint", "THROUGHPUT"},
{"cache_dir", "/tmp/openvino_cache"}
});
oviEngine->loadModel("/models/model.xml");
}
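An equivalent ARM-side sketch, reusing the fast_math and tuning_level options from the multi-engine configuration earlier on this page (option names are illustrative and may differ per ARM NN release):

// ARM GPU optimization (illustrative)
if (EngineFactory::isEngineAvailable("armnn")) {
    auto armEngine = EngineFactory::createEngine("armnn", {
        {"fast_math", true},   // relaxed math for faster kernels
        {"tuning_level", 2}    // spend time tuning GPU workgroup sizes
    });
    armEngine->setDevice("ARM_GPU");
    armEngine->loadModel("/models/model.onnx");
}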
Performance Benchmarking¶
-- Benchmark multiple engines
local engines = {"tensorrt", "openvino", "armnn"}
local results = {}
for _, engine_name in ipairs(engines) do
if api.factory.inference.isAvailable(engine_name) then
local engine = api.factory.inference.create(instance, engine_name)
engine:loadModel("/models/benchmark.onnx")
-- Warmup
for i = 1, 5 do
engine:runInference({sample_input})
end
-- Benchmark
local start_time = api.system.getCurrentTime()
for i = 1, 100 do
engine:runInference({sample_input})
end
local end_time = api.system.getCurrentTime()
results[engine_name] = {
avg_latency = (end_time - start_time) / 100,
device = engine:getActiveDevice(),
memory_usage = engine:getResourceUsage().memory_mb
}
end
end
-- Select best performing engine
local best_engine = selectBestEngine(results)
print("Best engine:", best_engine)
Hardware Support Matrix¶
NVIDIA Platforms¶
- TensorRT: RTX series, Tesla, Jetson platforms
- Supported Precisions: FP32, FP16, INT8, with structured sparsity support
- Features: Dynamic shapes, CUDA graphs, DLA acceleration (Jetson)
Intel Platforms¶
- OpenVINO: Core, Xeon, Atom processors, Intel Arc GPUs, Movidius VPUs
- Supported Precisions: FP32, FP16, INT8 (INT8 accelerated via VNNI)
- Features: Heterogeneous execution, auto-batching, model caching
ARM Platforms¶
- ARM NN: Cortex-A series CPUs, Mali GPUs, Ethos NPUs
- Supported Precisions: FP32, FP16, INT8
- Features: NEON optimization, GPU compute shaders
Mobile/Edge Platforms¶
- Qualcomm SNPE: Snapdragon 855+, automotive platforms
- Rockchip RKNN: RK3588, RK3566, RK3568 series
- Hailo: Hailo-8, Hailo-15 AI processors
- Features: Low power optimization, on-device learning
Best Practices¶
Engine Selection Strategy¶
- Automatic Selection: Start with engine: "auto" for optimal performance
- Hardware Matching: Align engine choice with available hardware
- Model Format: Use native formats when available for best performance
- Fallback Planning: Configure multiple engines for reliability
- Performance Testing: Benchmark engines with representative workloads
Configuration Optimization¶
- Precision Selection: Use mixed precision (FP16/INT8) for better performance
- Batch Size Tuning: Optimize batch size for your hardware and latency requirements
- Memory Management: Configure appropriate workspace/cache sizes
- Threading: Tune thread counts for multi-threaded engines (see the configuration sketch after this list)
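The sketch below shows how these knobs map onto per-engine configuration, reusing parameters from the examples earlier on this page; the values are starting points to benchmark against, not recommendations.

// GPU engine: mixed precision with bounded batch size and workspace memory.
auto gpuEngine = EngineFactory::createEngine("tensorrt", {
    {"precision", "FP16"},
    {"max_batch_size", 4},
    {"workspace_size", "512MB"}
});

// CPU engine: match thread count to the cores you can dedicate to inference.
auto cpuEngine = EngineFactory::createEngine("openvino", {
    {"num_threads", 8},
    {"optimization_level", 2}
});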
Deployment Considerations¶
- Model Optimization: Pre-optimize models for target hardware when possible
- Driver Dependencies: Ensure the proper drivers and SDKs are installed
- Resource Monitoring: Monitor GPU/CPU utilization and memory usage
- Error Handling: Implement robust fallback mechanisms
- Version Compatibility: Test engine version compatibility with models
Troubleshooting¶
Common Issues¶
Engine Not Available¶
if (!EngineFactory::isEngineAvailable("tensorrt")) {
LETE << "TensorRT engine not available";
// Check: CUDA installation, TensorRT libraries, GPU drivers
}
Model Loading Failures¶
- Format Mismatch: Verify model format matches engine expectations
- Path Issues: Check file paths and permissions
- Version Compatibility: Ensure model version matches engine version
- Hardware Requirements: Verify hardware meets model requirements (an error-handling sketch follows this list)
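Since loadModel and loadBackend return expected<void>, load failures can be caught explicitly instead of surfacing later during inference. A minimal sketch, following the error-reporting pattern used elsewhere on this page:

auto engine = EngineFactory::createEngine("openvino", {{"device", "CPU"}});

// Check each step; expected<void> converts to false when an error is present.
if (auto loaded = engine->loadModel("/models/model.xml"); !loaded) {
    LETE << "Model load failed: " << loaded.error().message();
    // Typical causes: format mismatch, bad path or permissions, version incompatibility.
    return;
}
if (auto backend = engine->loadBackend(); !backend) {
    LETE << "Backend initialization failed: " << backend.error().message();
    return;
}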
Performance Issues¶
- Suboptimal Performance: Try different precision settings (FP16, INT8)
- High Memory Usage: Reduce batch size or model complexity
- CPU Bottleneck: Increase thread count for CPU-based engines
- GPU Underutilization: Increase batch size or use multiple streams
Diagnostic Tools¶
// Engine diagnostics
auto engines = EngineFactory::getAvailableEngines();
for (const auto& name : engines) {
auto caps = EngineFactory::getEngineCapabilities(name);
LOGI << name << " - Devices: " << caps.supportedDevices.size();
}
// Runtime diagnostics
auto usage = engine->getResourceUsage();
LOGI << "Memory: " << usage.memory_mb << "MB";
LOGI << "GPU Util: " << usage.gpu_utilization << "%";
Integration Examples¶
Computer Vision Pipeline¶
// Complete inference pipeline with automatic engine selection
class VisionPipeline {
private:
std::shared_ptr<InferenceHandler> detectionEngine_;
std::shared_ptr<InferenceHandler> classificationEngine_;
public:
void initialize() {
// Detection with GPU acceleration
detectionEngine_ = EngineFactory::createEngine("auto", {
{"model_file", "/models/yolo.onnx"},
{"preferred_device", "GPU"},
{"optimization_level", 2}
});
// Classification with CPU fallback
classificationEngine_ = EngineFactory::createEngine("auto", {
{"model_file", "/models/classifier.onnx"},
{"preferred_device", "CPU"},
{"max_batch_size", 16}
});
}
std::vector<Detection> processFrame(const cv::Mat& frame) {
// Detection inference (runInference takes a non-const tensor vector and returns expected<>)
std::vector<Tensor> detectionInputs = {frameToTensor(frame)};
auto detectionResult = detectionEngine_->runInference(detectionInputs);
if (!detectionResult) return {};
auto detections = postprocessDetections(detectionResult.value());
// Classification inference for detected objects
std::vector<cv::Mat> crops;
for (const auto& detection : detections) {
crops.push_back(cropObject(frame, detection.bbox));
}
std::vector<Tensor> classificationInputs = cropsToTensors(crops);
auto classificationResult = classificationEngine_->runInference(classificationInputs);
if (!classificationResult) return detections; // fall back to unlabelled detections
return combineResults(detections, classificationResult.value());
}
};
Edge Deployment with Power Management¶
-- Edge-optimized inference with power management
local edge_config = {
engines = {
{name = "hailo", priority = 1, device = "NPU"}, -- Lowest power
{name = "armnn", priority = 2, device = "ARM_GPU"}, -- Medium power
{name = "openvino", priority = 3, device = "CPU"} -- Highest power
},
power_management = {
enabled = true,
battery_threshold = 20, -- Switch to low power engines below 20%
thermal_throttling = true
}
}
local inference = api.factory.inference.create(instance, "edge_optimized")
inference:configure(edge_config)
inference:loadModel("/models/lightweight.onnx")
See Also¶
- Inference Plugin - Core inference coordination
- Inference Overview - Complete inference plugin ecosystem
- Platform Plugins - Hardware integration plugins
- Plugin Overview - All available plugins