Inference Plugin

Description

Inference is the core AI inference engine plugin for CVEDIA-RT that provides a universal interface for running machine learning models on various hardware platforms. It manages model loading, execution, and result processing across different inference backends and chipsets.

The Inference plugin serves as the central hub for AI model execution in CVEDIA-RT, supporting multiple inference backends and providing both synchronous and asynchronous inference capabilities. It handles model management, backend pooling, and resource optimization for efficient AI processing.

Key Features

  • Multi-backend inference support (TensorRT, OpenVINO, ONNX, etc.)
  • Synchronous and asynchronous inference execution
  • Dynamic backend pooling and load balancing
  • Model configuration and capability detection
  • Input/output tensor management
  • Performance monitoring and resource usage tracking
  • Thread-safe inference execution
  • Configurable pool sizes for concurrent processing

Requirements

Hardware Requirements

  • CPU: Multi-core processor for backend pooling
  • Memory: Sufficient RAM for model loading and tensor operations
  • GPU: Optional but recommended for accelerated inference (NVIDIA, Intel, AMD)
  • AI Accelerators: Optional support for specialized hardware (Hailo, RKNN, etc.)

Software Dependencies

  • RTCORE: CVEDIA-RT core library
  • XTensor: Multi-dimensional array library for tensor operations
  • Tracy: Performance profiling and monitoring
  • Sol2: Lua scripting integration
  • Backend libraries: TensorRT, OpenVINO, ONNX Runtime (depending on configuration)

Configuration

Basic Configuration

{
  "enabled": true,
  "model_file": "/path/to/model.onnx",
  "poolSize": 1,
  "asyncResultsCollectionDuration": 100,
  "backend": {
    "device": "CPU",
    "batch_size": 1,
    "precision": "FP32",
    "normalize_input": true,
    "channel_layout": "RGB"
  }
}

Advanced Configuration

{
  "enabled": true,
  "model_file": "/path/to/model.engine",
  "poolSize": 4,
  "asyncResultsCollectionDuration": 50,
  "backend": {
    "device": "GPU",
    "batch_size": 8,
    "precision": "FP16",
    "normalize_input": true,
    "channel_layout": "BGR",
    "optimization_level": 3,
    "enable_fp16": true
  }
}

Configuration Schema

Parameter                      | Type   | Default | Description
------------------------------ | ------ | ------- | ---------------------------------------
enabled                        | bool   | true    | Enable/disable inference processing
model_file                     | string | ""      | Path to model file
poolSize                       | int    | 1       | Number of concurrent inference backends
asyncResultsCollectionDuration | int    | 100     | Async results batching window (ms)
backend.device                 | string | "CPU"   | Target device identifier
backend.batch_size             | int    | 1       | Inference batch size
backend.precision              | string | "FP32"  | Model precision (FP32, FP16, INT8)
backend.normalize_input        | bool   | false   | Enable input normalization
backend.channel_layout         | string | "RGB"   | Input channel order (RGB/BGR)

API Reference

C++ API (InferenceManaged)

Synchronous Inference

expected<cvec> runInference(cvec const& jobs)

Execute synchronous inference on input jobs. Thread-safe with backend pooling.

Asynchronous Inference

expected<void> runInferenceAsync(cvec const& jobs, int ttl = 60)
expected<cvec> getAsyncInferenceResults()
expected<void> setAsyncResultsCollectionDuration(int durationMs)

Submit jobs for asynchronous processing and retrieve results.
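
The collection duration controls how long completed results are batched before getAsyncInferenceResults() returns them. Below is a minimal submit-and-poll sketch; the 50 ms window and 30 second TTL are illustrative values, and treating the ttl argument as seconds is an assumption based on the async example later on this page:

// Shorten the batching window so completed results become available sooner
// (50 ms here is an illustrative value, not a recommendation).
inference->setAsyncResultsCollectionDuration(50);

cvec jobs;
// ... populate jobs with input data

// Submit with a 30-second TTL (seconds assumed); jobs that outlive the TTL
// are presumably discarded.
if (auto submitted = inference->runInferenceAsync(jobs, 30); submitted) {
    // Poll for whatever has completed; an empty vector simply means nothing
    // has finished within the current collection window yet.
    if (auto ready = inference->getAsyncInferenceResults(); ready) {
        for (auto const& out : ready.value()) {
            // Handle each completed result
        }
    }
}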

Model Management

expected<void> loadModel(std::string const& path)
expected<void> loadModel(std::string const& path, CValue& handlerConfig)
expected<void> loadModelFromConfig()

Load AI models with optional backend-specific configuration.
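
Both overloads return an expected value, so failures surface as errors rather than exceptions. A minimal sketch contrasting an explicit path with loadModelFromConfig(), which is assumed to resolve model_file from the instance configuration shown above:

// Load a model from an explicit path...
if (auto loaded = inference->loadModel("/path/to/model.onnx"); !loaded) {
    // Inspect loaded.error() for the failure reason
}

// ...or let the plugin resolve the path from its own configuration
// (assumes "model_file" is set as in the Basic Configuration example).
if (auto loaded = inference->loadModelFromConfig(); !loaded) {
    // Inspect loaded.error() for the failure reason
}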

Backend Pool Management

expected<void> setPoolSize(int poolSize)
void setBackendConfig(pCValue conf)

Configure inference backend pool size and settings.

Model Information

ssize_t inputBatchSize()
ssize_t inputWidth()
ssize_t inputHeight() 
ssize_t inputChannels()
std::vector<int> inputShape()
std::vector<int> outputShape()

Query model tensor dimensions and requirements.
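
These accessors make it easy to validate preprocessing against the loaded model before submitting jobs. A short sketch; treating non-positive values as dynamic dimensions is an assumption for this example, not documented behavior:

#include <iostream>

// Query the loaded model's expected input geometry.
auto batch    = inference->inputBatchSize();
auto channels = inference->inputChannels();
auto height   = inference->inputHeight();
auto width    = inference->inputWidth();

std::cout << "Model expects " << batch << " x " << channels << " x "
          << height << " x " << width << " inputs\n";

// Non-positive dimensions are treated here as dynamic and resolved at
// inference time (an assumption for this sketch).
if (width <= 0 || height <= 0) {
    std::cout << "Model uses dynamic spatial dimensions\n";
}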

Lua API

Factory Methods

-- Create inference engine
local inference = api.factory.inference.create(instance, "my_inference")

-- Get existing inference engine
local inference = api.factory.inference.get(instance, "my_inference")

Basic Operations

-- Load model
inference:loadModel("/path/to/model.onnx")

-- Set pool size
inference:setPoolSize(4)

-- Execute inference
local jobs = {} -- populate with input data
local results = inference:runInference(jobs)

Examples

Basic Synchronous Inference

// Create inference engine
auto inference = InferenceManaged::create("inference_engine");

// Load model
inference->loadModel("/path/to/model.onnx");

// Configure backend pool
inference->setPoolSize(4);  // 4 concurrent backends

// Prepare inference jobs
cvec jobs;
// ... populate jobs with input data

// Execute inference
auto result = inference->runInference(jobs);
if (result) {
    auto outputs = result.value();
    // Process inference results
}

Asynchronous Inference Pattern

// Submit jobs for async processing
cvec jobs;
// ... populate jobs

auto submitResult = inference->runInferenceAsync(jobs, 120); // 2 minute TTL
if (submitResult) {
    // Jobs submitted successfully

    // Later, retrieve results
    auto results = inference->getAsyncInferenceResults();
    if (results && !results.value().empty()) {
        // Process available results
        for (const auto& result : results.value()) {
            // Handle each result
        }
    }
}

Lua Integration Example

-- Create and configure inference engine
local inference = api.factory.inference.create(instance, "my_inference")
inference:loadModel("/models/detection.onnx")
inference:setPoolSize(2)

-- Process frame with inference
function processFrame(frame)
    local jobs = {frame} -- Create job from frame
    local results = inference:runInference(jobs)

    if results and #results > 0 then
        -- Process detection results
        for _, result in ipairs(results) do
            -- Handle detection output
        end
    end
end

Best Practices

Performance Optimization

  • Pool Sizing: Set pool size to match concurrent inference needs (see the sizing sketch after this list)
  • Backend Selection: Choose appropriate backend for target hardware
  • Batch Processing: Use larger batches for higher throughput
  • Async Processing: Use async inference for non-blocking operations
  • Memory Management: Monitor memory usage with multiple backends
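
A rough pool-sizing sketch that derives the backend count from the host's hardware threads; the divide-by-two heuristic and the cap of four backends are illustrative choices, not CVEDIA-RT defaults:

#include <algorithm>
#include <thread>

// Derive a pool size from available hardware threads, capped at an
// arbitrary illustrative limit of four concurrent backends.
int hwThreads = static_cast<int>(std::thread::hardware_concurrency());
int poolSize  = std::clamp(hwThreads / 2, 1, 4);

if (auto configured = inference->setPoolSize(poolSize); !configured) {
    // Fall back to the default single backend if resizing fails
    inference->setPoolSize(1);
}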

Threading Guidelines

  • Synchronous calls are thread-safe with automatic backend locking (see the sketch after this list)
  • Configure pool size and backend settings before starting inference
  • Use async processing for high-throughput scenarios
  • Avoid reconfiguring during active inference
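
Because synchronous calls lock a backend from the pool internally, several worker threads can share one engine. A minimal sketch, where makeJobs() is a hypothetical per-thread job builder:

#include <thread>
#include <vector>

std::vector<std::thread> workers;
for (int i = 0; i < 4; ++i) {
    workers.emplace_back([&inference, i] {
        cvec jobs = makeJobs(i);  // hypothetical helper producing input jobs
        // Each call locks one pooled backend for its duration.
        if (auto result = inference->runInference(jobs); result) {
            // Process this thread's results
        }
    });
}
for (auto& worker : workers) {
    worker.join();
}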

Integration Patterns

  • Integrate with input plugins for frame preprocessing
  • Connect to processing plugins for post-inference analysis
  • Use with tracking plugins for temporal object analysis
  • Combine with output plugins for result forwarding

Troubleshooting

Common Issues

Model Loading Failures

auto result = inference->loadModel("/path/to/model.onnx");
if (!result) {
    LETE << "Model loading failed: " << result.error().message();
    // Check model file path and format compatibility
}

No Available Backends

auto devices = inference->getActiveDevices();
if (!devices || devices.value().empty()) {
    LETE << "No inference devices available";
    // Check backend configuration and hardware availability
}

Performance Issues

  • Increase pool size for concurrent processing
  • Check backend-specific optimization settings
  • Monitor resource usage and memory constraints
  • Verify model format matches target hardware

Error Messages

  • "Model not loaded": Call loadModel() before inference
  • "Backend not available": Check device configuration and drivers
  • "Pool exhausted": Increase pool size or reduce concurrent load
  • "Invalid tensor dimensions": Verify input tensor shapes

Integration Examples

With Detection Pipeline

// Integration with detection workflow
auto inference = InferenceManaged::create("detector");
inference->loadModel("/models/yolo.engine");
inference->setPoolSize(3);

// Process detection frames
for (const auto& frame : inputFrames) {
    cvec jobs = {preprocessFrame(frame)};
    auto results = inference->runInference(jobs);
    if (results) {
        auto detections = postprocessResults(results.value());
        // Forward to tracking or output modules
    }
}

With Tracking Integration

// Inference feeding tracking pipeline  
auto inference = InferenceManaged::create("detector");
auto tracker = TrackerManaged::create("tracker");

// Configure inference for tracking workflow
inference->loadModel("/models/detection.onnx");
tracker->initialize();

// Process video stream
while (stream.hasFrame()) {
    auto frame = stream.getFrame();
    auto detections = inference->runInference({frame});
    if (detections) {
        tracker->trackObjects(frame, detections.value());
    }
}

See Also