Intel OpenVINO: Inference Engine

Surya Prabhakaran
Published in Analytics Vidhya · 8 min read · Aug 4, 2020

In my previous articles, I have discussed the basics of the OpenVINO toolkit and OpenVINO’s Model Optimizer. In this article, we will be exploring:-

  • What is Inference Engine?
  • Supported Devices
  • Feeding an Intermediate Representation(IR) to the Inference Engine
  • Inference Requests
  • Handling Output

What is Inference Engine?

The Inference Engine, as the name suggests, runs the actual inference on the model. It works only with Intermediate Representations (IR), which come either from the Model Optimizer or from the Intel pre-trained models that are already provided in IR format.

While the Model Optimizer performs device-agnostic optimizations, reducing the size and complexity of a model to improve memory use and computation time, the Inference Engine applies hardware-specific optimizations to squeeze further performance out of the model.

The Inference Engine itself is built in C++ for fast execution; however, it is very common to use the built-in Python wrapper to interact with it from Python code.

Supported Devices

The supported devices for the Inference Engine are all Intel hardware:-

  • CPU (Central Processing Unit)
  • GPU (Graphics Processing Unit)
  • NCS-2 (Neural Compute Stick 2)
  • FPGA (Field Programmable Gate Array)

For the most part, operation is the same across the supported devices. Sometimes, however, a particular piece of hardware does not support certain layers (unsupported layers) when working with the Inference Engine. In that case, there are extensions available that can add support; we will discuss them later in this article.

Feeding an IR to the Inference Engine

The Inference Engine has two classes:-

  • IENetwork
  • IECore

IENetwork

This class reads the Intermediate Representation (.xml & .bin files) of a model and loads it.

Class Attributes

  • name → Name of the loaded network.
  • inputs → A dictionary which contains the inputs required by the network model.
  • outputs → A dictionary which contains the output from the network model.
  • batch_size → Batch size of the network.
  • precision → Precision of the network (INT8, FP16, FP32)
  • layers → Returns a dictionary that maps network layer names, in topological order, to IENetLayer objects containing layer properties.
  • stats → Returns a LayersStatsMap object containing a dictionary that maps network layer names to calibration statistics represented by LayerStats objects.

__init__()

It is the Class constructor. It takes two parameters:-

  • model → The path to .xml file.
  • weights → The path to .bin file.

Returns an instance of IENetwork class.
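To make this concrete, here is a minimal sketch of constructing an IENetwork and inspecting a few of its attributes. The file paths are hypothetical placeholders for your own IR files.

from openvino.inference_engine import IENetwork

### Hypothetical paths to your own IR files
model_xml = "model.xml"
model_bin = "model.bin"

### Construct the network directly from the IR
network = IENetwork(model=model_xml, weights=model_bin)

print(network.name)         ### name of the loaded network
print(network.inputs)       ### dictionary of input layers
print(network.batch_size)   ### batch size of the network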

Member Functions

from_ir()

Reads the model from the .xml and .bin files of the IR. It takes two parameters:-

  • model:- Path to .xml file of the IR
  • weights:- Path to .bin file of the IR

Returns an instance of IENetwork class.

NOTE: You can use the IENetwork class constructor instead of from_ir()

reshape()

Reshapes the network to change spatial dimensions, batch size, or any dimension. It takes one parameter:-

  • input_shapes → A dictionary that maps input layer names to tuples with the target shape

NOTE: Before using this method, make sure that the target shape is applicable to the network. Changing the network shape to an arbitrary value may lead to unpredictable behaviour.
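For example, here is a minimal sketch that changes only the batch size, assuming “network” is an IENetwork instance and that a batch size of 4 is valid for your model.

### Get the name and current shape of the first input layer
input_blob = next(iter(network.inputs))
n, c, h, w = network.inputs[input_blob].shape

### Reshape to a batch size of 4, keeping the other dimensions unchanged
network.reshape({input_blob: (4, c, h, w)})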

serialize()

Serializes the network and stores it in files. It takes two parameters:-

  • path_to_xml → Path to a file, where a serialized model will be stored.
  • path_to_bin → Path to a file, where serialized weights will be stored
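A quick sketch of its usage, with hypothetical output paths:

### Write the (possibly reshaped) network back to disk as a new IR
network.serialize("reshaped_model.xml", "reshaped_model.bin")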

IECore

This is a Python Wrapper Class to work with the Inference Engine.

Class Attributes

  • available_devices → A list of the devices available to the Inference Engine, returned as, for example, [CPU, FPGA.0, FPGA.1, MYRIAD]. If there is more than one device of a specific type, each one is listed followed by a dot and a number.
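A quick sketch of checking which devices are available on your machine:

from openvino.inference_engine import IECore

ie = IECore()
### Print every device the Inference Engine can see on this machine
print(ie.available_devices)   ### e.g. ['CPU', 'GPU', 'MYRIAD']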

Member functions

It has many member functions, but I will be focusing on three main functions:-

load_network()

Loads a network that was read from the Intermediate Representation (IR) to the plugin with the specified device name, and creates an ExecutableNetwork object from the IENetwork instance.

It takes four parameters:-

  • network → A valid IENetwork instance.
  • device_name → A device name of a target plugin.
  • config → A dictionary of plugin configuration keys and their values(optional).
  • num_requests → A positive integer value of infer requests to be created. The number of infer requests is limited by device capabilities.

Returns an ExecutableNetwork object.
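For example, a minimal sketch, assuming “network” is an IENetwork instance and “ie” is an IECore instance:

### Ask the CPU plugin for two infer requests, which comes in handy for the
### asynchronous pattern discussed later in this article
exec_net = ie.load_network(network=network, device_name="CPU", num_requests=2)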

add_extension()

Loads an extension library into the plugin with the specified device name.

It takes two parameters:-

  • extension_path → Path to the extensions library file to load to a plugin.
  • device_name → A device name of a plugin to load the extensions to.

query_network()

Queries the plugin with the specified device name to find out which network layers are supported in the current configuration.

It takes three parameters:-

  • network → A valid IENetwork instance.
  • device_name → A device name of a target plugin.
  • config → A dictionary of plugin configuration keys and their values(optional).

Returns a dictionary mapping layer names to the device names on which they are supported.

Loading a model(in IR) to the IE

We will first start by importing the required libraries(I will be working with Python)

from openvino.inference_engine import IENetwork 
from openvino.inference_engine import IECore

Let's define a function to load a model.

def load_IR_to_IE(model_xml):
    ### Load the Inference Engine API
    plugin = IECore()

    ### Load the IR files into an IENetwork object
    model_bin = model_xml[:-3] + "bin"
    network = IENetwork(model=model_xml, weights=model_bin)

Checking for unsupported layers

As mentioned above, even after a model is successfully converted to IR, some hardware devices may not support certain layers, and there are extensions that can add that support.

When the Inference Engine is used on a CPU, there might be certain layers which the CPU doesn't support. In that case, we have CPU extensions, which can be added to support the additional layers.

I will be using the “query_network()” member of the IECore class (mentioned above) to get the list of layers supported by the Inference Engine. You can then iterate through the layers in the IENetwork you created and check whether they are in the supported-layers list. If a layer is not supported, a CPU extension may be able to help.

The “device_name” argument is just a string for which device is being used “CPU”, “GPU”, “FPGA”, or “MYRIAD”(which applies for the Neural Compute Stick).

Let’s add the CPU extensions and check for unsupported layers

    ### Define the CPU extension path
    CPU_EXT_PATH = "/opt/intel/openvino/deployment_tools/inference_engine/lib/intel64/libcpu_extension_sse4.so"

    ### Add the CPU extension
    plugin.add_extension(CPU_EXT_PATH, "CPU")

These paths differ a bit by operating system (I am working on Linux), although they should be in the same overall location: navigate to your OpenVINO™ install directory, then deployment_tools, inference_engine, lib, intel64.

    ### Get the layers of the network that this device supports
    supported_layers = plugin.query_network(network=network, device_name="CPU")

    ### Find unsupported layers
    unsupported_layers = [l for l in network.layers.keys() if l not in supported_layers]

    ### Check for unsupported layers
    if len(unsupported_layers) != 0:
        print("Unsupported layers found")
        print(unsupported_layers)
        exit(1)

The above code checks for the presence of unsupported layers. Let's break it down for simplicity.

  • I have used query_network(), a member function of the IECore class (mentioned above), to get the list of supported layers.
  • “network” is an object of the IENetwork class, and its “layers” attribute (mentioned above) returns a dictionary that maps the name of each layer in the network (as key) to its properties (as value). By comparing these names against the supported layers, we find the unsupported layers (if any).
  • Finally, if any unsupported layers are present, we display a message along with the unsupported layers and exit.

Let’s load the network

    ### Load the network into the Inference Engine
    executable_net = plugin.load_network(network, "CPU")
    print("Network successfully loaded into the Inference Engine")
    return executable_net

NOTE:- CPU extensions are required only up to the 2019 R3 version of the OpenVINO™ Toolkit. From 2020 R1 onwards (and likely in future updates), CPU extensions no longer need to be added to the Inference Engine.

Inference Requests

After you load the IENetwork into the IECore, you get back an ExecutableNetwork, which is what you will send inference requests to.

There are two types of Inference Requests:-

  • Synchronous
  • Asynchronous

Synchronous

In the case of Synchronous Inference, the system waits and remains idle until the inference response is returned (blocking the main thread). Only one frame is processed at a time, and the next frame cannot be gathered until inference on the current frame is complete.

For Synchronous Inference, we use “infer()”

infer()

It takes one parameter:-

  • inputs → A dictionary that maps input layer names to numpy.ndarray objects of proper shape with input data for the layer

Returns a dictionary that maps output layer names to numpy.ndarray objects with output data of the layer

Asynchronous

As you might have guessed, in the case of Asynchronous Inference, if the response for a particular request takes a long time, you don't hold everything up; you carry on with other work while that request is still executing. Asynchronous Inference can therefore give higher overall throughput than Synchronous Inference.

Where the main thread was blocked in synchronous inference, asynchronous inference does not block it. So you could have a frame sent for inference while still gathering and pre-processing the next frame. You can make use of the “wait” process to wait for the inference result to be available.

For Asynchronous Inference, we use “start_async()”

start_async()

Takes two parameters:-

  • request_id → Index of infer request to start inference.
  • inputs → A dictionary that maps input layer names to numpy.ndarray objects of proper shape with input data for the layer

Let’s implement synchronous and asynchronous inferences

def synchronous_inference(executable_net, image):
    ### Get the input blob for the inference request
    input_blob = next(iter(executable_net.inputs))

    ### Perform synchronous inference
    result = executable_net.infer(inputs={input_blob: image})
    return result

The above code shows synchronous inference: the input_blob is used as the key of the dictionary passed to infer(), and infer() returns the result, which the function then returns.

import time

def asynchronous_inference(executable_net, image, request_id=0):
    ### Get the input blob for the inference request
    input_blob = next(iter(executable_net.inputs))

    ### Perform asynchronous inference
    executable_net.start_async(request_id=request_id, inputs={input_blob: image})

    ### Wait for the inference result to become available
    while True:
        status = executable_net.requests[request_id].wait(-1)
        if status == 0:
            break
        else:
            time.sleep(1)
    return executable_net

The above code shows asynchronous inference; here we provide the request_id of the inference request (useful when there are multiple inference requests). As mentioned above, asynchronous inference uses wait() to wait for the inference result to be available: if the result is available (status = 0), we come out of the loop, otherwise we wait for 1 second. If we call wait(0), it returns the status instantly, even if the processing is not complete, whereas wait(-1) waits for the request to complete.

Hence, asynchronous inference does not block the main thread the way synchronous inference does.
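To illustrate the non-blocking side more concretely, here is a small sketch that polls with wait(0) instead of blocking with wait(-1). It assumes “executable_net” was created as above and “image” is a pre-processed numpy array matching the network's input shape.

input_blob = next(iter(executable_net.inputs))
output_blob = next(iter(executable_net.outputs))

executable_net.start_async(request_id=0, inputs={input_blob: image})

### wait(0) returns the current status immediately; 0 means the result is ready
while executable_net.requests[0].wait(0) != 0:
    pass   ### the main thread is free here to gather and pre-process the next frame

result = executable_net.requests[0].outputs[output_blob]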

Handling Output

Let’s see how we can extract the results from an asynchronous inference request.

def get_async_output(executable_net, request_id=0):
    ### Get the output_blob
    output_blob = next(iter(executable_net.outputs))

    ### Get the status of the request
    status = executable_net.requests[request_id].wait(-1)

    if status == 0:
        ### Extract the result for this request
        result = executable_net.requests[request_id].outputs[output_blob]
        return result

As mentioned above “outputs” is a dictionary which holds the output from the network model. The “output_blob” acts as a key to access a particular output.
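Putting it all together, here is a rough end-to-end sketch that uses the helper functions defined in this article. The model path, image path, and input size (300x300) are hypothetical placeholders; use the values that match your own IR and model.

import cv2

### Hypothetical paths - replace with your own model and image
model_xml = "models/my_model.xml"
executable_net = load_IR_to_IE(model_xml)

### Pre-process the image into the NCHW layout the network expects
image = cv2.imread("images/sample.jpg")
image = cv2.resize(image, (300, 300))
image = image.transpose((2, 0, 1))       ### HWC -> CHW
image = image.reshape(1, *image.shape)   ### add the batch dimension

### Run asynchronous inference and fetch the result
executable_net = asynchronous_inference(executable_net, image)
result = get_async_output(executable_net, request_id=0)
print(result.shape)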

Thank you so much for reading this article. I hope by now you have a proper understanding of an Inference Engine.
