Inference plugin¶
Intro¶
An inference plugin interfaces directly with an inference backend, which could be a driver or a piece of hardware.
It's important to understand that CVEDIA-RT does all the pre- and post-processing of the tensors; the inference plugin is responsible for the following (a rough outline of such an interface is sketched after the list):
- Setting up the device
- Identifying and registering device capabilities and driver versions
- Enumerating how many backends of the same type are available
- Thread safety
- Loading models
- Forward pass
- Unloading models
- Gracefully shutting down
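The outline below is only an orientation aid: it lists the methods discussed on this page, with signatures mirroring the MNN examples further down. The class name is hypothetical, and the real base interface plus the includes for the CVEDIA-RT types (pCValue, expected, cvedia::rt::Tensor, xt::xarray) live in the plugin repository.

// Hypothetical outline of a backend "core" class; signatures mirror the MNN examples below.
class MyBackendCore {
public:
    // Device discovery and selection
    pCValue getCapabilities();                                          // optional device/driver details
    std::vector<std::pair<std::string, std::string>> getDeviceGuids();  // enumerate addressable devices
    expected<void> setDevice(std::string const& device);                // bind the context to one device

    // Model lifecycle (by path and by raw weights)
    expected<void> loadModel(std::string const& path);
    expected<void> loadModel(std::string const& path, std::vector<unsigned char> const& weights);

    // Forward pass on pre-processed input tensors
    expected<std::vector<xt::xarray<float>>> runInference(std::vector<cvedia::rt::Tensor>& input);

    // Graceful shutdown
    void unloadBackend();
    ~MyBackendCore();
};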
Key methods¶
All sources for the methods described here can be found in our GitHub plugin repository.
We will mainly be talking about the core (mnncore.cpp) methods.
1. getCapabilities¶
Here you can return any specifics about the device; what is available depends on whether you're bound to a driver, a firmware, etc.
This is all optional, and you can simply return an empty VAL().
Example:
pCValue MNNCore::getCapabilities() {
    cmap propmap;
    propmap["firmware"] = VAL(std::string("1.0.0"));
    return VAL(propmap);
}
2. getDeviceGuids¶
This function complements the registration done in the inference handler, where you register a scheme and a file extension for the backend:
extern "C" EXPORT void registerHandler() {
api::inference::registerSchemeHandler("mnn", &MNNInferenceHandler::create);
api::inference::registerExtHandler(".mnn", &MNNInferenceHandler::create);
}
In the core method, you define which backends are available, potentially scanning the hardware for available devices.
For example, if there are multiple NVIDIA GPUs, the user can address individual cards with tensorrt.1://..., where tensorrt comes from registerHandler and 1 from getDeviceGuids.
If the device in question doesn't support or care about individual addressing, you can simply return auto, as in the MNN function below:
std::vector<std::pair<std::string, std::string>> MNNCore::getDeviceGuids() {
    std::vector<std::pair<std::string, std::string>> out;
    out.push_back(std::make_pair(string("auto"), "Runs on best available device"));
    return out;
}
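If your backend does expose multiple addressable devices, getDeviceGuids can instead enumerate them and return one GUID per device. The sketch below is hypothetical: queryDeviceCount() and queryDeviceName() stand in for whatever your driver provides and are not part of CVEDIA-RT or MNN.

// Hypothetical enumeration for a multi-device backend.
std::vector<std::pair<std::string, std::string>> MyCore::getDeviceGuids() {
    std::vector<std::pair<std::string, std::string>> out;

    int const count = queryDeviceCount();   // assumed driver call: number of devices found
    for (int i = 0; i < count; ++i) {
        // GUIDs "0", "1", ... let users address a specific card, e.g. tensorrt.1://...
        out.push_back(std::make_pair(std::to_string(i), queryDeviceName(i)));
    }

    out.push_back(std::make_pair(std::string("auto"), "Runs on best available device"));
    return out;
}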
3. setDevice¶
While getDeviceGuids lists the available devices, setDevice sets the context to a specific device.
MNN example:
expected<void> MNNCore::setDevice(std::string const& device) {
    LOGD << "Setting device to " << device;

    if (!loadBackend()) {
        return unexpected(RTErrc::NoSuchDevice);
    }

    return {};
}
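For a backend with individually addressable devices, setDevice would typically parse the GUID it receives and bind the context to that device. A hypothetical sketch, where deviceCount_ and selectedDevice_ are assumed members populated during enumeration:

// Hypothetical setDevice for a multi-device backend; "auto" falls back to a default device.
expected<void> MyCore::setDevice(std::string const& device) {
    LOGD << "Setting device to " << device;

    if (device == "auto") {
        selectedDevice_ = 0;   // assumed default device
        return {};
    }

    try {
        // Interpret the GUID as a numeric device index, as returned by getDeviceGuids above.
        int const idx = std::stoi(device);
        if (idx < 0 || idx >= deviceCount_) {
            return unexpected(RTErrc::NoSuchDevice);
        }
        selectedDevice_ = idx;
    } catch (std::exception const&) {
        return unexpected(RTErrc::NoSuchDevice);
    }

    return {};
}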
4. loadModel¶
This method should be implemented in two ways: first taking a file path, and second taking the model weights as input.
A model in a format compatible with the backend is expected to be loaded into the device; this happens once, after setDevice, on the very first runInference call.
CVEDIA-RT can load models directly from our distribution platform (model forge) or straight from disk, which is why there are two implementations of the same method.
expected<void> MNNCore::loadModel(string const& path) {
    auto weights = readFile(path);
    return loadModel(path, weights);
}

expected<void> MNNCore::loadModel(string const& path, std::vector<unsigned char> const& weights) {
    unique_lock<mutex> m(sessMux_);

    auto ptr = MNN::Interpreter::createFromBuffer(weights.data(), weights.size());
    if (ptr) {
        network_ = std::shared_ptr<MNN::Interpreter>(ptr);

        MNN::ScheduleConfig config;
        config.type = MNN_FORWARD_AUTO;

        MNN::BackendConfig backendConfig;
        backendConfig.precision = MNN::BackendConfig::Precision_Normal;
        backendConfig.memory = MNN::BackendConfig::Memory_Normal;
        backendConfig.power = MNN::BackendConfig::Power_Normal;
        config.backendConfig = &backendConfig;

        LOGD << "Creating network session";
        session_ = network_->createSession(config);
        if (!session_) {
            LOGE << "createSession returned nullptr";
            return unexpected(RTErrc::OperationFailed);
        }

        LOGD << "Input tensors";
        auto inputs = network_->getSessionInputAll(session_);
        if (inputs.size() != 1) {
            LOGE << "Found " << inputs.size() << " input tensors. Only one supported currently";
            return unexpected(RTErrc::UnsupportedModel);
        }

        for (auto const& input : inputs) {
            for (auto const& s : input.second->shape()) {
                inputShape_.push_back(s);
            }

            LOGD << "- " << input.first << " (" << shapeToString(inputShape_) << ")";
            deviceInputTensor_ = input.second;

            // Only support 1 input
            break;
        }

        LOGD << "Output tensors";
        auto outputs = network_->getSessionOutputAll(session_);
        for (auto const& output : outputs) {
            std::vector<int> op;
            op = output.second->shape();

            size_t total = 1;
            for (auto const& s : output.second->shape()) {
                total *= static_cast<size_t>(s);
            }

            std::stringstream outShapeStr;
            std::copy(op.begin(), op.end(), std::ostream_iterator<int>(outShapeStr, " "));

            deviceOutputTensors_.push_back(output.second);

            LOGD << "- " << output.first << " (" << shapeToString(op) << ")";

            outputShape_.push_back(op);
            outSize_.push_back(total);
        }

        hostInputTensor_ = new MNN::Tensor(deviceInputTensor_, deviceInputTensor_->getDimensionType());

        for (auto t : deviceOutputTensors_) {
            hostOutputTensors_.push_back(new MNN::Tensor(t, t->getDimensionType()));
        }

        modelLoaded_ = true;
        network_->releaseModel();

        return {};
    }
    else {
        modelLoaded_ = false;
        LOGE << "MNN failed to load model at " << path;
        return unexpected(RTErrc::LoadModelFailed);
    }
}
5. runInference¶
Here is where everything comes together. This method receives a tensor with the input already normalized (depending on the model configuration).
This data now needs to be sent to the backend in a type the driver understands, so you might need to manipulate this object. CVEDIA-RT uses the xtensor library, which allows easy, hardware-accelerated vector transformations.
Note that if your platform works with quantized data, you may need to transform the float tensor into a suitable type. The same applies when the backend replies: if data needs to be dequantized or otherwise transformed, you will have to handle it here.
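The MNN example below runs in float end to end, so it needs no such conversion. For an integer-only backend, the conversion could look like the hypothetical sketch below; the scale and zero-point parameters are assumptions and would normally come from the model or the driver.

#include <cstdint>
#include <xtensor/xarray.hpp>
#include <xtensor/xmath.hpp>
#include <xtensor/xoperation.hpp>

// Hypothetical int8 quantization helpers for a backend that does not accept float tensors.
static xt::xarray<int8_t> quantizeInput(xt::xarray<float> const& in, float scale, int zeroPoint) {
    // q = round(x / scale) + zeroPoint, clamped to the int8 range
    auto q = xt::round(in / scale) + static_cast<float>(zeroPoint);
    return xt::cast<int8_t>(xt::clip(q, -128.0f, 127.0f));
}

static xt::xarray<float> dequantizeOutput(xt::xarray<int8_t> const& out, float scale, int zeroPoint) {
    // x = (q - zeroPoint) * scale
    return (xt::cast<float>(out) - static_cast<float>(zeroPoint)) * scale;
}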
MNN example:
expected<vector<xt::xarray<float>>> MNNCore::runInference(std::vector<cvedia::rt::Tensor>& input) {
    unique_lock<mutex> m(sessMux_);

    vector<xt::xarray<float>> output;

    if (input.empty())
        return output;

    auto data = input[0].move<float>();
    memcpy(hostInputTensor_->host<float>(), data.data(), data.size() * sizeof(float));

    // Copy input data to MNN
    deviceInputTensor_->copyFromHostTensor(hostInputTensor_);

    network_->runSession(session_);

    for (size_t i = 0; i < outputShape_.size(); i++) {
        deviceOutputTensors_[i]->copyToHostTensor(hostOutputTensors_[i]);

        float* data = hostOutputTensors_[i]->host<float>();

        std::vector<size_t> sizetShape(outputShape_[i].begin(), outputShape_[i].end());
        auto xarr = xt::adapt(data, outSize_[i], xt::no_ownership(), sizetShape);

        output.push_back(xarr);
    }

    return output;
}
6. unloadBackend and destroy¶
When the backend stops being used, either because a new model is being loaded or because there are no more references to it, CVEDIA-RT will destroy it. This triggers an unloadBackend call followed by destruction of the class.
Your backend needs to handle this gracefully, avoiding any memory or threading issues.
MNN example:
void MNNCore::unloadBackend() {
    unique_lock<mutex> m(sessMux_);
    backendLoaded_ = false;
}

MNNCore::~MNNCore()
{
    if (session_) {
        network_->releaseSession(session_);
    }

    unloadBackend();
}
Different edges¶
CVEDIA-RT can run on different edge devices and operating systems, and we compile against a few different toolchains.
Where and how your plugin runs also depends on your backend and its driver support.
We currently support:
- GCC 5.5+ (8.4 preferred)
- x86_64
- aarch64
- armv6
- armv7