Inference Plugin

Description

Inference is the core AI inference engine plugin for CVEDIA-RT that provides a universal interface for running machine learning models on various hardware platforms. It manages model loading, execution, and result processing across different inference backends and chipsets.

The Inference plugin serves as the central hub for AI model execution in CVEDIA-RT, supporting multiple inference backends and providing both synchronous and asynchronous inference capabilities. It handles model management, backend pooling, and resource optimization for efficient AI processing.

Key Features

  • Multi-backend inference support (TensorRT, OpenVINO, ONNX, etc.)
  • Synchronous and asynchronous inference execution
  • Dynamic backend pooling and load balancing
  • Model configuration and capability detection
  • Input/output tensor management
  • Performance monitoring and resource usage tracking
  • Thread-safe inference execution
  • Configurable pool sizes for concurrent processing

Requirements

Hardware Requirements

  • CPU: Multi-core processor for backend pooling
  • Memory: Sufficient RAM for model loading and tensor operations
  • GPU: Optional but recommended for accelerated inference (NVIDIA, Intel, AMD)
  • AI Accelerators: Optional support for specialized hardware (Hailo, RKNN, etc.)

Software Dependencies

  • RTCORE: CVEDIA-RT core library
  • XTensor: Multi-dimensional array library for tensor operations
  • Tracy: Performance profiling and monitoring
  • Sol2: Lua scripting integration
  • Backend libraries: TensorRT, OpenVINO, ONNX Runtime (depending on configuration)

Configuration

Basic Configuration

{
  "enabled": true,
  "model_file": "/path/to/model.onnx",
  "poolSize": 1,
  "asyncResultsCollectionDuration": 100,
  "backend": {
    "device": "CPU",
    "batch_size": 1,
    "precision": "FP32",
    "normalize_input": true,
    "channel_layout": "RGB"
  }
}

Advanced Configuration

{
  "enabled": true,
  "model_file": "/path/to/model.engine",
  "poolSize": 4,
  "asyncResultsCollectionDuration": 50,
  "backend": {
    "device": "GPU",
    "batch_size": 8,
    "precision": "FP16",
    "normalize_input": true,
    "channel_layout": "BGR",
    "optimization_level": 3,
    "enable_fp16": true
  }
}

Configuration Schema

Parameter                      | Type   | Default | Description
------------------------------ | ------ | ------- | ---------------------------------------
enabled                        | bool   | true    | Enable/disable inference processing
model_file                     | string | ""      | Path to model file
poolSize                       | int    | 1       | Number of concurrent inference backends
asyncResultsCollectionDuration | int    | 100     | Async results batching window (ms)
backend.device                 | string | "CPU"   | Target device identifier
backend.batch_size             | int    | 1       | Inference batch size
backend.precision              | string | "FP32"  | Model precision (FP32, FP16, INT8)
backend.normalize_input        | bool   | false   | Enable input normalization
backend.channel_layout         | string | "RGB"   | Input channel order (RGB/BGR)

API Reference

C++ API (InferenceManaged)

Synchronous Inference

expected<cvec> runInference(cvec const& jobs)

Execute synchronous inference on input jobs. Thread-safe with backend pooling.

Asynchronous Inference

expected<void> runInferenceAsync(cvec const& jobs, int ttl = 60)
expected<cvec> getAsyncInferenceResults()
expected<void> setAsyncResultsCollectionDuration(int durationMs)

Submit jobs for asynchronous processing and retrieve results.
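
The collection duration controls how long completed results are batched before getAsyncInferenceResults() returns them. Below is a minimal submit-and-poll sketch; the 50 ms window and 30 second TTL are illustrative values, and treating the ttl argument as seconds is an assumption based on the async example later on this page:

// Shorten the batching window so completed results become available sooner
// (50 ms here is an illustrative value, not a recommendation).
inference->setAsyncResultsCollectionDuration(50);

cvec jobs;
// ... populate jobs with input data

// Submit with a 30-second TTL (seconds assumed); jobs that outlive the TTL
// are presumably discarded.
if (auto submitted = inference->runInferenceAsync(jobs, 30); submitted) {
    // Poll for whatever has completed; an empty vector simply means nothing
    // has finished within the current collection window yet.
    if (auto ready = inference->getAsyncInferenceResults(); ready) {
        for (auto const& out : ready.value()) {
            // Handle each completed result
        }
    }
}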

Model Management

expected<void> loadModel(std::string const& path)
expected<void> loadModel(std::string const& path, CValue& handlerConfig)
expected<void> loadModelFromConfig()

Load AI models with optional backend-specific configuration.
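
Both overloads return an expected value, so failures surface as errors rather than exceptions. A minimal sketch contrasting an explicit path with loadModelFromConfig(), which is assumed to resolve model_file from the instance configuration shown above:

// Load a model from an explicit path...
if (auto loaded = inference->loadModel("/path/to/model.onnx"); !loaded) {
    // Inspect loaded.error() for the failure reason
}

// ...or let the plugin resolve the path from its own configuration
// (assumes "model_file" is set as in the Basic Configuration example).
if (auto loaded = inference->loadModelFromConfig(); !loaded) {
    // Inspect loaded.error() for the failure reason
}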

Backend Pool Management

expected<void> setPoolSize(int poolSize)
void setBackendConfig(pCValue conf)

Configure inference backend pool size and settings.

Model Information

ssize_t inputBatchSize()
ssize_t inputWidth()
ssize_t inputHeight() 
ssize_t inputChannels()
std::vector<int> inputShape()
std::vector<int> outputShape()

Query model tensor dimensions and requirements.
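
These accessors make it easy to validate preprocessing against the loaded model before submitting jobs. A short sketch; treating non-positive values as dynamic dimensions is an assumption for this example, not documented behavior:

#include <iostream>

// Query the loaded model's expected input geometry.
auto batch    = inference->inputBatchSize();
auto channels = inference->inputChannels();
auto height   = inference->inputHeight();
auto width    = inference->inputWidth();

std::cout << "Model expects " << batch << " x " << channels << " x "
          << height << " x " << width << " inputs\n";

// Non-positive dimensions are treated here as dynamic and resolved at
// inference time (an assumption for this sketch).
if (width <= 0 || height <= 0) {
    std::cout << "Model uses dynamic spatial dimensions\n";
}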

Lua API

Factory Methods

-- Create inference engine
local inference = api.factory.inference.create(instance, "my_inference")

-- Get existing inference engine
local inference = api.factory.inference.get(instance, "my_inference")

Basic Operations

-- Load model
inference:loadModel("/path/to/model.onnx")

-- Set pool size
inference:setPoolSize(4)

-- Execute inference
local jobs = {} -- populate with input data
local results = inference:runInference(jobs)

Examples

Basic Synchronous Inference

// Create inference engine
auto inference = InferenceManaged::create("inference_engine");

// Load model
inference->loadModel("/path/to/model.onnx");

// Configure backend pool
inference->setPoolSize(4);  // 4 concurrent backends

// Prepare inference jobs
cvec jobs;
// ... populate jobs with input data

// Execute inference
auto result = inference->runInference(jobs);
if (result) {
    auto outputs = result.value();
    // Process inference results
}

Asynchronous Inference Pattern

// Submit jobs for async processing
cvec jobs;
// ... populate jobs

auto submitResult = inference->runInferenceAsync(jobs, 120); // 2 minute TTL
if (submitResult) {
    // Jobs submitted successfully

    // Later, retrieve results
    auto results = inference->getAsyncInferenceResults();
    if (results && !results.value().empty()) {
        // Process available results
        for (const auto& result : results.value()) {
            // Handle each result
        }
    }
}

Lua Integration Example

-- Create and configure inference engine
local inference = api.factory.inference.create(instance, "my_inference")
inference:loadModel("/models/detection.onnx")
inference:setPoolSize(2)

-- Process frame with inference
function processFrame(frame)
    local jobs = {frame} -- Create job from frame
    local results = inference:runInference(jobs)

    if results and #results > 0 then
        -- Process detection results
        for _, result in ipairs(results) do
            -- Handle detection output
        end
    end
end

Best Practices

Performance Optimization

  • Pool Sizing: Set pool size to match concurrent inference needs (see the sizing sketch after this list)
  • Backend Selection: Choose appropriate backend for target hardware
  • Batch Processing: Use larger batches for higher throughput
  • Async Processing: Use async inference for non-blocking operations
  • Memory Management: Monitor memory usage with multiple backends
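
A rough pool-sizing sketch that derives the backend count from the host's hardware threads; the divide-by-two heuristic and the cap of four backends are illustrative choices, not CVEDIA-RT defaults:

#include <algorithm>
#include <thread>

// Derive a pool size from available hardware threads, capped at an
// arbitrary illustrative limit of four concurrent backends.
int hwThreads = static_cast<int>(std::thread::hardware_concurrency());
int poolSize  = std::clamp(hwThreads / 2, 1, 4);

if (auto configured = inference->setPoolSize(poolSize); !configured) {
    // Fall back to the default single backend if resizing fails
    inference->setPoolSize(1);
}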

Threading Guidelines

  • Synchronous calls are thread-safe with automatic backend locking (see the sketch after this list)
  • Configure pool size and backend settings before starting inference
  • Use async processing for high-throughput scenarios
  • Avoid reconfiguring during active inference
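
Because synchronous calls lock a backend from the pool internally, several worker threads can share one engine. A minimal sketch, where makeJobs() is a hypothetical per-thread job builder:

#include <thread>
#include <vector>

std::vector<std::thread> workers;
for (int i = 0; i < 4; ++i) {
    workers.emplace_back([&inference, i] {
        cvec jobs = makeJobs(i);  // hypothetical helper producing input jobs
        // Each call locks one pooled backend for its duration.
        if (auto result = inference->runInference(jobs); result) {
            // Process this thread's results
        }
    });
}
for (auto& worker : workers) {
    worker.join();
}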

Integration Patterns

  • Integrate with input plugins for frame preprocessing
  • Connect to processing plugins for post-inference analysis
  • Use with tracking plugins for temporal object analysis
  • Combine with output plugins for result forwarding

Troubleshooting

Common Issues

Model Loading Failures

auto result = inference->loadModel("/path/to/model.onnx");
if (!result) {
    LETE << "Model loading failed: " << result.error().message();
    // Check model file path and format compatibility
}

No Available Backends

auto devices = inference->getActiveDevices();
if (!devices || devices.value().empty()) {
    LETE << "No inference devices available";
    // Check backend configuration and hardware availability
}

Performance Issues

  • Increase pool size for concurrent processing
  • Check backend-specific optimization settings
  • Monitor resource usage and memory constraints
  • Verify model format matches target hardware

Error Messages

  • "Model not loaded": Call loadModel() before inference
  • "Backend not available": Check device configuration and drivers
  • "Pool exhausted": Increase pool size or reduce concurrent load
  • "Invalid tensor dimensions": Verify input tensor shapes

Integration Examples

With Detection Pipeline

// Integration with detection workflow
auto inference = InferenceManaged::create("detector");
inference->loadModel("/models/yolo.engine");
inference->setPoolSize(3);

// Process detection frames
for (const auto& frame : inputFrames) {
    cvec jobs = {preprocessFrame(frame)};
    auto results = inference->runInference(jobs);
    if (results) {
        auto detections = postprocessResults(results.value());
        // Forward to tracking or output modules
    }
}

With Tracking Integration

// Inference feeding tracking pipeline  
auto inference = InferenceManaged::create("detector");
auto tracker = TrackerManaged::create("tracker");

// Configure inference for tracking workflow
inference->loadModel("/models/detection.onnx");
tracker->initialize();

// Process video stream
while (stream.hasFrame()) {
    auto frame = stream.getFrame();
    auto detections = inference->runInference({frame});
    if (detections) {
        tracker->trackObjects(frame, detections.value());
    }
}

See Also