Solution Architectures for Computer Vision Projects

A write-up on various engineering architectures available for deploying a computer vision solution

Gopi Krishna Nuti
Analytics Vidhya
Aug 13, 2021


Recently, a colleague and I were discussing the various architectures he could use for deploying a Computer Vision solution. We sat down and created a list, which I thought would be nice to share here. Please do share your thoughts.

We were discussing the best way to deploy a deep-learning-based Computer Vision solution. There are a number of pros and cons to consider, so here is a listing of the various architectures available to engineers.

Introduction

Computer Vision techniques can be broadly grouped into use cases like Image Classification, Object Detection, Depth Perception, Image Generation etc. Requirements like OCR and face recognition can generally be fit into one of these blocks. This document lists the various architectures that can be used for deploying solutions for these use cases.

Generally speaking, the solutions can be architected using one of the approaches below. There are cost-versus-performance trade-offs for each, so choose what works for you.

1. Running on a PC
  • CPU vs GPU

2. Centralised processing of video stream
  • Dedicated server
  • Cloud CPUs, Cloud GPUs, Cloud VPUs
  • Off-the-shelf solutions

3. Edge Computing options
  • Intel OpenVINO
  • Qualcomm SNPE

Furthermore, I will be splitting the entire Computer Vision process into two parts here: Model Training and Model Inferencing. Interesting developments are happening in both areas. My discussion is limited to inferencing for the most part, though I will briefly touch upon training as well.

Out of scope: The merits/demerits of individual algorithms/models are out of scope for this discussion. Unless a model/algorithm is unsuitable for a particular architecture, I will stay away from comparing models/algorithms. In other words, if you want a head-to-head comparison of Inception and ResNet, this document is not for you.

Disambiguation of terms

· Classical Computer Vision algorithms: I am including all non-deep-learning algorithms in this category. Whether it is for changing colour space or for pattern recognition, if it is not based on deep learning, it falls under my definition of “classical”.

· Deep learning models: Irrespective of the framework used to build them (PyTorch, TensorFlow or plain C++), if a model uses perceptron- or CNN-based approaches, I will refer to it as a deep learning model. Whether it is a shallow network or an ultra-deep one you wrote from scratch, they are all deep learning models to me.

The architectural flow

The architectural components of training and inference can be depicted as below. As in any software project, Training (think Development) is separated in time from Inference (think Production).

Systems based on Classical algorithms

This is what most software developers are used to. In this architecture, deployment means rolling out the code to the production environment.

Deep Learning based systems

Notice the differences?

Model Training

Classical Algorithms

Training here follows the age-old approach that feels most natural to developers: the “model” is built by heuristically identifying the parameters of the algorithms. The training artefact is code (whether it is Python, Java or good old C).
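As a minimal sketch of what “training” looks like in the classical world, here is an OpenCV snippet where the blur kernel size and Canny thresholds are hand-tuned constants. The parameter values and the file name are illustrative placeholders, not recommendations.

```python
import cv2

# Heuristically chosen parameters: these ARE the "model" in a classical pipeline.
BLUR_KERNEL = (5, 5)              # smoothing kernel size, tuned by trial and error
CANNY_LOW, CANNY_HIGH = 50, 150   # edge thresholds, tuned on sample images

def detect_edges(image_path: str):
    """Run a hand-tuned classical edge-detection pipeline on one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(img, BLUR_KERNEL, 0)
    return cv2.Canny(blurred, CANNY_LOW, CANNY_HIGH)

# "Deployment" is simply shipping this code with its tuned constants.
edges = detect_edges("sample.jpg")
```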

Deep Learning Models

Training in this case is also fairly straightforward. Whether you are performing transfer learning or building a network from scratch, you train the model on a high-performance machine, either locally or in the cloud. Training time can be reduced significantly by harnessing GPUs; refer to this paper for a study: https://web.stanford.edu/~rezab/classes/cme323/S16/projects_reports/hedge_usmani.pdf
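As a minimal transfer-learning sketch (assuming PyTorch/torchvision and a hypothetical two-class problem), you would freeze a pretrained backbone and retrain only the final layer; the class count and the missing data loader are placeholders for your own setup.

```python
import torch
import torch.nn as nn
from torchvision import models

# Train on a GPU if one is available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pretrained backbone and freeze its weights for transfer learning.
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 2-class problem.
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)
model = model.to(device)

# Only the new layer's parameters are optimised.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...standard training loop over your DataLoader goes here...
```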

There are research papers exploring parallel and distributed computing approaches for deep learning, but these are not mainstream yet, so I will not discuss them further here. Moreover, the focus of this document is on inference.

Model Inference

Whatever technology we use, there is a point where an image is captured and converted into a colour-space matrix (RGB, CMY etc.). This is a numerical representation that enables monitors to display the captured information. Inference is all about making sense of these numbers.
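As a quick illustration (a sketch assuming OpenCV and a webcam at index 0), a captured frame is simply a NumPy array of colour values:

```python
import cv2

# Grab a single frame from the default camera (index 0 is an assumption).
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()

if ok:
    # OpenCV returns frames as a height x width x 3 array in BGR order.
    print("Frame shape:", frame.shape)          # e.g. (480, 640, 3)
    print("Top-left pixel (B, G, R):", frame[0, 0])
    # Converting colour spaces is just another numerical transformation.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
```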

There are a variety of ways to run inference for Computer Vision solutions. The choice of hardware architecture has a major impact on the CapEx, OpEx and quality of the solution. A careful evaluation of these parameters is vital for understanding the value delivered to the consumer.

Let’s look for a moment at how image data needs to be inferenced. It is either offline processing, where photos uploaded to a cloud drive etc. are processed at a later time, or real-time processing, where video frames from a camera are continuously analysed. Whatever your business requirement might be, you need to evaluate the cost implications of delivering a successful product.

There are a variety of ways to architect the product. Each approach has technical and financial implications. I am capturing some of them here.

The landscape

Classical Image Processing Techniques

Local Inferencing

In this scenario, the camera is directly connected to the machine where the code is running. This offers great flexibility in controlling the hardware and software but is inevitably constrained by the hardware capabilities of the host machine. For practical reasons, the camera cannot be too far from the host computer; imagine running a 200-300 metre camera cable just so face recognition can be performed on passengers. Streaming the video frames over the internet solves this problem but adds its own complications. This setup also does not scale well when multiple camera feeds need to be analysed; CPU limits are hit fairly fast.
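A minimal local-inference sketch with a classical detector, assuming OpenCV's bundled Haar cascade for faces and a directly attached camera at index 0:

```python
import cv2

# Load a classical (non-deep-learning) face detector shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)  # camera physically attached to this machine
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"Detected {len(faces)} face(s) in this frame")
cap.release()
```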

Cloud

Cloud offers a great simplification compared to local machines. However, cloud is not “free”. It is certainly affordable, but it is not free, and it might not even be cheap. You have to carefully build your lambda functions etc. to ensure optimal response times.

Also, network latency has to be considered while processing video frames. Your camera might be located on your front porch, but if the cloud server is running in a datacenter a few hundred miles away, that round-trip delay becomes part of your solution.
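To make the latency point concrete, here is a sketch that JPEG-encodes a frame and posts it to a cloud inference endpoint, timing the round trip. The URL is a hypothetical placeholder, not a real service.

```python
import time
import cv2
import requests

# Hypothetical cloud inference endpoint; replace with your own deployment.
ENDPOINT = "https://example.com/infer"

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()

if ok:
    # Compress the frame before sending to keep the payload (and cost) down.
    _, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
    start = time.time()
    response = requests.post(ENDPOINT, data=jpeg.tobytes(),
                             headers={"Content-Type": "image/jpeg"})
    print(f"Round trip took {time.time() - start:.2f}s, status {response.status_code}")
```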

Deep Learning Techniques

Generally speaking, a deep-learning-based model offers better accuracy. However, deep learning models are also slow, CPU hungry and memory hungry. There are multiple approaches in the industry for tackling these constraints and delivering a good solution.

Local CPUs

The deep learning solution can be run on the same machine to which the camera is connected. This is great for a niche set of use cases where the image cannot be permitted to leave the computer (perhaps for legal or privacy reasons). Scaling, however, becomes a challenge: with a deep learning model’s CPU-heavy number crunching, adding a second and third camera drives up costs very quickly.

Local GPUs

GPUs are great for reducing the computation time of deep learning models; they almost magically improve run-time performance. Please note that GPUs don’t affect the model’s accuracy, only its runtime performance. A small timing sketch follows the list below.

  • NVIDIA GPUs: The most well-known of all the GPUs. Using CUDA programming, you can harness the GPU’s awesome power. However, CUDA is restricted to NVIDIA GPUs.
  • Intel GPUs: Many recent laptops have an Intel GPU integrated into the motherboard. Alas, hardly anybody uses them. I attended an Intel training program on OpenVINO where the trainer confessed that “most of them are probably sitting idle in your system”. This was circa early 2020. Since then, Intel has invested many millions into changing this scenario.
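Here is the timing sketch mentioned above, assuming PyTorch/torchvision and a dummy input tensor standing in for a real frame. The same model and weights run on CPU and GPU, so predictions are unchanged; only the wall-clock time differs.

```python
import time
import torch
from torchvision import models

model = models.resnet50(pretrained=True).eval()
batch = torch.randn(1, 3, 224, 224)  # dummy input standing in for a real frame

def time_inference(device: str) -> float:
    """Time one forward pass of the same model on the given device."""
    m = model.to(device)
    x = batch.to(device)
    with torch.no_grad():
        start = time.time()
        m(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the GPU to finish before timing
    return time.time() - start

print("CPU seconds:", time_inference("cpu"))
if torch.cuda.is_available():
    print("GPU seconds:", time_inference("cuda"))
```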

Cloud

Cloud CPUs and GPUs are along the same lines as described above for local hardware, except that I have never heard of Intel GPUs being available in the cloud. Besides NVIDIA GPUs, there are also Vision Processing Units (VPUs) and Tensor Processing Units (TPUs) available in the cloud, which perform the same function as a GPU and reduce your latency.

Edge Computing

Whether you run the deep learning workload on the local network or in the cloud, you are inevitably constrained by the compute power of the machine. Increasing that power gets expensive very quickly, and network costs push up your OpEx as well. Is there a way to achieve your target without breaking the bank?

The answer is special-purpose devices like the Intel Movidius Neural Compute Stick, Qualcomm Snapdragon etc. In this scenario, inference is offloaded to these special-purpose devices attached to the computer. They come in all shapes and designs, from chips integrated into the motherboard to plug-and-play USB sticks.
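As an illustrative sketch only (assuming the OpenVINO 2021-era Python API and a model already converted to OpenVINO's IR format; exact calls vary across OpenVINO versions, and the file names are placeholders), offloading inference to a Movidius stick looks roughly like this:

```python
import cv2
import numpy as np
from openvino.inference_engine import IECore

# Load a model that has already been converted to OpenVINO IR format (placeholders).
ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
input_name = next(iter(net.input_info))

# "MYRIAD" targets a Movidius Neural Compute Stick; "CPU" would run locally instead.
exec_net = ie.load_network(network=net, device_name="MYRIAD")

# Prepare a frame to match the network's expected input layout (NCHW).
frame = cv2.imread("sample.jpg")
n, c, h, w = net.input_info[input_name].input_data.shape
blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis, ...]

result = exec_net.infer(inputs={input_name: blob})
```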

With some special-purpose coding, you can offload the inference processing of your deep neural network to these devices and keep your CPU free. They are cheap, portable and, more importantly, enable edge computing. Instead of sending the entire video frame to a CPU in the cloud, you process the frame on the edge device and send only the inference result to the main server. The server, instead of processing the whole image, simply receives the information that the image contains a car, a cat and a mouse. This is significantly simpler and very easy on your budget. If an edge device is damaged, replacing it isn’t too expensive either.
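To show the difference in what travels over the network, here is a sketch of the edge-side loop: the frame stays local and only a small JSON summary is posted to the server. The endpoint URL and the `run_edge_inference` helper are hypothetical placeholders for whatever device-specific call (OpenVINO, SNPE etc.) you actually use.

```python
import cv2
import requests

SERVER_URL = "https://example.com/detections"  # hypothetical central server

def run_edge_inference(frame):
    """Placeholder for the device-specific inference call (OpenVINO, SNPE, ...)."""
    return ["car", "cat", "mouse"]  # e.g. labels detected in the frame

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    labels = run_edge_inference(frame)  # heavy lifting happens on the edge device
    # Only a few bytes of JSON leave the machine, not the whole frame.
    requests.post(SERVER_URL, json={"labels": labels})
cap.release()
```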

But there is a catch. These edge devices cannot run any arbitrary neural network you might conceive; they support a pre-defined list of networks like Inception, YOLO etc. As a developer, you have the freedom to perform transfer learning and train them for your chosen objects, but if you want to build a custom network from scratch, you might have to make some compromises on activation functions, types of layers etc. That said, this is a rapidly progressing area and competition is driving companies to innovate faster. Personally, this is my bet on where the computer vision domain is headed.

Common Trade-offs

Responsiveness-vs-Cost

Irrespective of the architecture you choose, there will always be a trade-off decision between responsiveness and cost. I have seen many people who are unaware of this point, so let me spell it out here.

A video stream is a sequence of images. Every video camera has a frame rate: the number of images captured per second. The higher the frame rate, the more realistic the movement of objects on screen; at 24 frames per second you get the realistic action we see on TV and in movies. What this means for you is that, when processing video frames in real time, you have to keep up with 24 images per second. Now, deep learning networks, being CPU hungry, can take 3-4 seconds to process one frame. So by the time you have processed one frame, you already have 24*3 - 1 = 71 frames waiting in the queue, and the backlog keeps growing. Before long, you are processing a frame that is minutes out of date; your solution is not real time any more. Moving to a GPU will reduce your processing time to under a second, but it is still not 1/24th of a second (at least at the time of writing this article). Whether you choose a CPU, GPU, VPU, TPU or edge device, this problem remains. One common approach is to process only 1 frame out of every n and skip the rest, as sketched below. If you choose your architecture properly (considering all the options described above), your n can be quite small and the result almost real time.
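A minimal frame-skipping sketch, assuming OpenCV and a hypothetical `run_inference` function standing in for whatever model and back-end you actually deploy:

```python
import cv2

PROCESS_EVERY_N = 8  # tune n to how far your hardware lags behind the frame rate

def run_inference(frame):
    """Placeholder for the real model call (CPU, GPU, VPU or edge device)."""
    return []

cap = cv2.VideoCapture(0)
frame_index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_index % PROCESS_EVERY_N == 0:
        results = run_inference(frame)  # only 1 frame in every n is analysed
    frame_index += 1
cap.release()
```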

Choice of Hardware

The hardware you choose will affect your solution to a greater degree than just the cost. The choice of CPU vs GPU vs cloud vs edge involves Product Managers too. Why? In the section on edge computing, I mentioned the compromises involved in the algorithmic choices. These choices affect the accuracy and effectiveness of the model, and this is where Product Managers have to step in and weigh in on how much accuracy loss is acceptable.

There is one more catch. Once you choose your platform (CPU, GPU or VPU), it is fairly difficult to port the solution to another. The platforms, languages and internals are simply incompatible with one another. OpenCL claims to bring interoperability but has yet to deliver on its promise. These nuances have to be considered carefully when making your architectural decisions.
