Inference Plugin¶
Description¶
Inference is the core AI inference engine plugin for CVEDIA-RT. It provides a universal interface for running machine learning models on various hardware platforms, managing model loading, execution, and result processing across different inference backends and chipsets.
The Inference plugin serves as the central hub for AI model execution in CVEDIA-RT, supporting multiple inference backends and providing both synchronous and asynchronous inference capabilities. It handles model management, backend pooling, and resource optimization for efficient AI processing.
Key Features¶
- Multi-backend inference support (TensorRT, OpenVINO, ONNX, etc.)
- Synchronous and asynchronous inference execution
- Dynamic backend pooling and load balancing
- Model configuration and capability detection
- Input/output tensor management
- Performance monitoring and resource usage tracking
- Thread-safe inference execution
- Configurable pool sizes for concurrent processing
Requirements¶
Hardware Requirements¶
- CPU: Multi-core processor for backend pooling
- Memory: Sufficient RAM for model loading and tensor operations
- GPU: Optional but recommended for accelerated inference (NVIDIA, Intel, AMD)
- AI Accelerators: Optional support for specialized hardware (Hailo, RKNN, etc.)
Software Dependencies¶
- RTCORE: CVEDIA-RT core library
- XTensor: Multi-dimensional array library for tensor operations
- Tracy: Performance profiling and monitoring
- Sol2: Lua scripting integration
- Backend libraries: TensorRT, OpenVINO, ONNX Runtime (depending on configuration)
Configuration¶
Basic Configuration¶
{
"enabled": true,
"model_file": "/path/to/model.onnx",
"poolSize": 1,
"asyncResultsCollectionDuration": 100,
"backend": {
"device": "CPU",
"batch_size": 1,
"precision": "FP32",
"normalize_input": true,
"channel_layout": "RGB"
}
}
Advanced Configuration¶
{
"enabled": true,
"model_file": "/path/to/model.engine",
"poolSize": 4,
"asyncResultsCollectionDuration": 50,
"backend": {
"device": "GPU",
"batch_size": 8,
"precision": "FP16",
"normalize_input": true,
"channel_layout": "BGR",
"optimization_level": 3,
"enable_fp16": true
}
}
Configuration Schema¶
Parameter | Type | Default | Description
---|---|---|---
enabled | bool | true | Enable/disable inference processing
model_file | string | "" | Path to the model file
poolSize | int | 1 | Number of concurrent inference backends
asyncResultsCollectionDuration | int | 100 | Async results batching window (ms)
backend.device | string | "CPU" | Target device identifier
backend.batch_size | int | 1 | Inference batch size
backend.precision | string | "FP32" | Model precision (FP32, FP16, INT8)
backend.normalize_input | bool | false | Enable input normalization
backend.channel_layout | string | "RGB" | Input channel order (RGB/BGR)
API Reference¶
C++ API (InferenceManaged)¶
Synchronous Inference¶
expected<cvec> runInference(cvec const& jobs)
Execute synchronous inference on input jobs. Thread-safe with backend pooling.
Asynchronous Inference¶
expected<void> runInferenceAsync(cvec const& jobs, int ttl = 60)
expected<cvec> getAsyncInferenceResults()
expected<void> setAsyncResultsCollectionDuration(int durationMs)
Submit jobs for asynchronous processing and retrieve results.
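A minimal sketch of tuning the results batching window before submitting work; the 50 ms window and 60-second TTL are illustrative choices, not recommended defaults:
// Shorten the window over which async results are batched (milliseconds)
inference->setAsyncResultsCollectionDuration(50);
// Submit jobs, then poll for completed results later
cvec jobs;
// ... populate jobs with input data
inference->runInferenceAsync(jobs, 60);
auto ready = inference->getAsyncInferenceResults();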
Model Management¶
expected<void> loadModel(std::string const& path)
expected<void> loadModel(std::string const& path, CValue& handlerConfig)
expected<void> loadModelFromConfig()
Load AI models with optional backend-specific configuration.
Backend Pool Management¶
expected<void> setPoolSize(int poolSize)
void setBackendConfig(pCValue conf)
Configure inference backend pool size and settings.
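A short sketch of sizing the pool before any inference runs, then loading the model described by the instance configuration; the pool size of 4 is an illustrative choice:
// Create the engine and size the backend pool up front
auto inference = InferenceManaged::create("inference_engine");
inference->setPoolSize(4);
// Load the model defined in the instance configuration (model_file, backend, ...)
inference->loadModelFromConfig();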
Model Information¶
ssize_t inputBatchSize()
ssize_t inputWidth()
ssize_t inputHeight()
ssize_t inputChannels()
std::vector<int> inputShape()
std::vector<int> outputShape()
Query model tensor dimensions and requirements.
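For example, a brief sketch (assuming a model is already loaded) that queries the expected input geometry before preprocessing frames:
// Query the loaded model's expected input tensor geometry
auto width    = inference->inputWidth();
auto height   = inference->inputHeight();
auto channels = inference->inputChannels();
auto batch    = inference->inputBatchSize();
// Resize and lay out incoming frames to width x height x channels,
// grouping at most `batch` jobs per inference call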
Lua API¶
Factory Methods¶
-- Create inference engine
local inference = api.factory.inference.create(instance, "my_inference")
-- Get existing inference engine
local inference = api.factory.inference.get(instance, "my_inference")
Basic Operations¶
-- Load model
inference:loadModel("/path/to/model.onnx")
-- Set pool size
inference:setPoolSize(4)
-- Execute inference
local jobs = {} -- populate with input data
local results = inference:runInference(jobs)
Examples¶
Basic Synchronous Inference¶
// Create inference engine
auto inference = InferenceManaged::create("inference_engine");
// Load model
inference->loadModel("/path/to/model.onnx");
// Configure backend pool
inference->setPoolSize(4); // 4 concurrent backends
// Prepare inference jobs
cvec jobs;
// ... populate jobs with input data
// Execute inference
auto result = inference->runInference(jobs);
if (result) {
auto outputs = result.value();
// Process inference results
}
Asynchronous Inference Pattern¶
// Submit jobs for async processing
cvec jobs;
// ... populate jobs
auto submitResult = inference->runInferenceAsync(jobs, 120); // 2 minute TTL
if (submitResult) {
// Jobs submitted successfully
// Later, retrieve results
auto results = inference->getAsyncInferenceResults();
if (results && !results.value().empty()) {
// Process available results
for (const auto& result : results.value()) {
// Handle each result
}
}
}
Lua Integration Example¶
-- Create and configure inference engine
local inference = api.factory.inference.create(instance, "my_inference")
inference:loadModel("/models/detection.onnx")
inference:setPoolSize(2)
-- Process frame with inference
function processFrame(frame)
local jobs = {frame} -- Create job from frame
local results = inference:runInference(jobs)
if results and #results > 0 then
-- Process detection results
for _, result in ipairs(results) do
-- Handle detection output
end
end
end
Best Practices¶
Performance Optimization¶
- Pool Sizing: Set pool size to match concurrent inference needs
- Backend Selection: Choose appropriate backend for target hardware
- Batch Processing: Use larger batches for higher throughput (see the sketch after this list)
- Async Processing: Use async inference for non-blocking operations
- Memory Management: Monitor memory usage with multiple backends
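A minimal sketch of the batching idea, assuming the backend is configured with a batch_size greater than one; job construction is deployment-specific and left as a placeholder comment:
// Collect several jobs and submit them in a single call so the backend
// can exploit its configured batch_size
cvec jobs;
// ... append one preprocessed job per frame
auto results = inference->runInference(jobs);
if (results) {
    // Expect one output entry per submitted job
}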
Threading Guidelines¶
- Synchronous calls are thread-safe with automatic backend locking
- Configure pool size and backend settings before starting inference
- Use async processing for high-throughput scenarios
- Avoid reconfiguring during active inference
Integration Patterns¶
- Integrate with input plugins for frame preprocessing
- Connect to processing plugins for post-inference analysis
- Use with tracking plugins for temporal object analysis
- Combine with output plugins for result forwarding
Troubleshooting¶
Common Issues¶
Model Loading Failures¶
auto result = inference->loadModel("/path/to/model.onnx");
if (!result) {
LETE << "Model loading failed: " << result.error().message();
// Check model file path and format compatibility
}
No Available Backends¶
auto devices = inference->getActiveDevices();
if (!devices || devices.value().empty()) {
LETE << "No inference devices available";
// Check backend configuration and hardware availability
}
Performance Issues¶
- Increase pool size for concurrent processing
- Check backend-specific optimization settings
- Monitor resource usage and memory constraints
- Verify model format matches target hardware
Error Messages¶
"Model not loaded"
: CallloadModel()
before inference"Backend not available"
: Check device configuration and drivers"Pool exhausted"
: Increase pool size or reduce concurrent load"Invalid tensor dimensions"
: Verify input tensor shapes
Integration Examples¶
With Detection Pipeline¶
// Integration with detection workflow
auto inference = InferenceManaged::create("detector");
inference->loadModel("/models/yolo.engine");
inference->setPoolSize(3);
// Process detection frames
for (const auto& frame : inputFrames) {
cvec jobs = {preprocessFrame(frame)};
auto results = inference->runInference(jobs);
if (results) {
auto detections = postprocessResults(results.value());
// Forward to tracking or output modules
}
}
With Tracking Integration¶
// Inference feeding tracking pipeline
auto inference = InferenceManaged::create("detector");
auto tracker = TrackerManaged::create("tracker");
// Configure inference for tracking workflow
inference->loadModel("/models/detection.onnx");
tracker->initialize();
// Process video stream
while (stream.hasFrame()) {
auto frame = stream.getFrame();
auto detections = inference->runInference({frame});
if (detections) {
tracker->trackObjects(frame, detections.value());
}
}
See Also¶
- Engines Plugin Collection - Backend inference engines
- Processing Plugins - Pre/post-processing modules
- Tracking Plugins - Object tracking integration
- Plugin Overview - All available plugins