Building a Real-Time Object Recognition App with TensorFlow and OpenCV

In this article, I will walk through how you can easily build your own real-time object recognition application with TensorFlow’s (TF) new Object Detection API and OpenCV in Python 3 (specifically 3.5). The focus will be on the challenges that I faced when building it. You can find the full code in my repo.

And here is the app in action:

Me trying to classify some random stuff on my desk:)


Google has just released their new TensorFlow Object Detection API. The first release contains:

I wanted to get my hands on this cool new stuff, and I had some time to build a simple real-time object recognition demo.

Object Detection Demo

First, I pulled the TensorFlow models repo and then had a look at the notebook that they released as well. It walks through all the steps of using a pre-trained model. In their example, they used the “SSD with Mobilenet” model, but you can also download several other pre-trained models from what they call the “Tensorflow detection model zoo”. Those models are, by the way, trained on the COCO dataset and vary in model speed (slow, medium and fast) and model performance (mAP, mean average precision).

Next, I ran the example, which is actually well documented. Essentially, this is what it does:

  1. Import the required packages like TensorFlow, PIL etc.
  2. Define some variables, e.g. the number of classes, the name of the model etc.
  3. Download the frozen model (.pb, protobuf) and load it into memory
  4. Load some helper code, e.g. an index-to-label translator
  5. Run the detection code itself on two test images
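
Steps 4 and 5 essentially turn the model's raw output into labeled boxes. Here is a minimal sketch of that post-processing, assuming the output layout used by the API's detection models (normalized `(ymin, xmin, ymax, xmax)` boxes, plus one score and one class id per detection) and a category index of the form `{id: {"id": ..., "name": ...}}`. The function name `label_detections`, the 0.5 threshold and the toy ids and names are my own illustrative choices, not part of the API.

```python
from typing import Dict, List, Sequence, Tuple

# Normalized box as produced by the detection models: (ymin, xmin, ymax, xmax) in [0, 1].
Box = Tuple[float, float, float, float]


def label_detections(
    boxes: Sequence[Box],
    scores: Sequence[float],
    classes: Sequence[int],
    category_index: Dict[int, Dict[str, object]],
    frame_width: int,
    frame_height: int,
    min_score: float = 0.5,
) -> List[dict]:
    """Keep confident detections, attach class names and convert boxes to pixels."""
    results = []
    for (ymin, xmin, ymax, xmax), score, class_id in zip(boxes, scores, classes):
        if score < min_score:
            continue  # skip low-confidence detections
        # The category index plays the role of the "index to label translator".
        name = category_index.get(class_id, {}).get("name", "N/A")
        results.append({
            "name": name,
            "score": score,
            # Pixel box as (left, top, right, bottom).
            "box": (
                round(xmin * frame_width),
                round(ymin * frame_height),
                round(xmax * frame_width),
                round(ymax * frame_height),
            ),
        })
    return results


# Toy inputs; the ids and names here are illustrative only.
category_index = {1: {"id": 1, "name": "person"}, 2: {"id": 2, "name": "cup"}}
example = label_detections(
    boxes=[(0.25, 0.25, 0.75, 0.5), (0.0, 0.0, 1.0, 1.0)],
    scores=[0.9, 0.3],
    classes=[1, 2],
    category_index=category_index,
    frame_width=640,
    frame_height=480,
)
print(example)  # [{'name': 'person', 'score': 0.9, 'box': (160, 120, 320, 360)}]
```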

Note: Before running the example, make sure to read the setup notes. In particular, the protobuf compilation section is important:

# From tensorflow/models/
protoc object_detection/protos/*.proto --python_out=.

Without running this command, the example won’t work.
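
Because a skipped protoc step only shows up later as an import error, a quick sanity check can save some head-scratching. This sketch assumes that protoc's Python output names the generated module `<name>_pb2.py` next to each `<name>.proto` (which is protoc's naming scheme for Python); the helper `missing_compiled_protos` is hypothetical and not part of the repo.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_compiled_protos(protos_dir: str) -> List[str]:
    """Return the .proto files in protos_dir that have no generated *_pb2.py module."""
    missing = []
    for proto in sorted(Path(protos_dir).glob("*.proto")):
        generated = proto.with_name(proto.stem + "_pb2.py")
        if not generated.exists():
            missing.append(proto.name)
    return missing


# Self-contained demo on a throwaway directory instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.proto").touch()
    (Path(tmp) / "a_pb2.py").touch()  # pretend protoc already compiled a.proto
    (Path(tmp) / "b.proto").touch()   # b.proto was never compiled
    demo_missing = missing_compiled_protos(tmp)

print(demo_missing)  # ['b.proto']
```

Run from tensorflow/models/, `missing_compiled_protos("object_detection/protos")` should return an empty list once the protoc command has been run.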

I then took their code and modified it as follows:

  • Removed the model download part
  • Dropped PIL: OpenCV already delivers video frames as numpy arrays, and PIL adds a big overhead, especially when it is used to read in images (i.e. the frames of a video stream)
  • Dropped the “with” statement for the TensorFlow session: starting a new session for every frame of the stream is a huge overhead, so the session is created once and reused
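
The last point boils down to opening one session up front and reusing it for every frame. Here is a framework-agnostic sketch of that pattern; `session_factory` is a hypothetical parameter, and with TF 1.x it could be something like `lambda: tf.Session(graph=detection_graph)`, where `detection_graph` is the graph loaded from the frozen model. The `FakeSession` below is an assumption that only counts calls, so the example runs without TensorFlow.

```python
from typing import Any, Callable


class Detector:
    """Opens one (expensive) session up front and reuses it for every frame."""

    def __init__(self, session_factory: Callable[[], Any]):
        self._session = session_factory()  # paid once, not once per frame

    def detect(self, frame: Any) -> Any:
        # With TF 1.x this would be self._session.run(fetches, feed_dict={...}).
        return self._session.run(frame)

    def close(self) -> None:
        # Without a `with` block, closing the session is our responsibility.
        self._session.close()


class FakeSession:
    """Hypothetical stand-in for tf.Session that counts creations and runs."""

    opened = 0

    def __init__(self):
        FakeSession.opened += 1
        self.runs = 0
        self.closed = False

    def run(self, frame):
        self.runs += 1
        return f"detections for {frame}"

    def close(self):
        self.closed = True


detector = Detector(FakeSession)
outputs = [detector.detect(f"frame-{i}") for i in range(3)]
detector.close()
print(FakeSession.opened, outputs[-1])  # 1 detections for frame-2
```

The trade-off of skipping the `with` statement is that the session is no longer closed automatically, hence the explicit `close()`.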

Then, I used OpenCV to connect it with my webcam. There are many examples out there that explain how to do this, including the official documentation, so I won’t dig deeper into it. The more interesting part is the optimization I did to increase the performance of the application. In my case, the metric I looked at was fps (frames per second).
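
Since fps is the metric being optimized, it helps to measure it the same way on every run. A minimal sketch, assuming you call `update()` once per processed frame in the webcam loop; `FPSCounter` is a hypothetical helper, and the injectable clock exists only so the demo is deterministic.

```python
import time
from typing import Callable, Optional


class FPSCounter:
    """Counts processed frames and reports the average frames per second."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self._start: Optional[float] = None
        self._frames = 0

    def start(self) -> "FPSCounter":
        self._start = self._clock()
        self._frames = 0
        return self

    def update(self) -> None:
        self._frames += 1  # call once per frame that went through detection

    def fps(self) -> float:
        elapsed = self._clock() - self._start
        return self._frames / elapsed if elapsed > 0 else 0.0


# Demo with a fake clock: 30 frames processed in 2 seconds.
ticks = iter([0.0, 2.0])
counter = FPSCounter(clock=lambda: next(ticks)).start()
for _ in range(30):
    counter.update()
measured = counter.fps()
print(measured)  # 15.0
```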

Generally, the plain vanilla (naive) implementations in many OpenCV examples are not really optimal; for example, some OpenCV functions are heavily I/O-bound. So I had to come up with various solutions to counter this:

  • Reading frames from the web camera causes a lot of I/O. My idea was to move this part completely into a separate Python process with the multiprocessing library. Somehow this didn’t work. There were some explanations on Stack Overflow of why it wouldn’t work, but I didn’t dig deeper into this. Fortunately, I found a very nice example by Adrian Rosebrock on his website “pyimagesearch” that uses threading instead, which improved my fps a lot. By the way, if you want to know the difference between multiprocessing and threading, there is a good explanation on Stack Overflow.
  • Loading the frozen model into memory is a big overhead every time the application starts. I already used a single TF session for each run, but it was still very slow. So what did I do to solve this problem? The solution is quite simple: I used the multiprocessing library to move the heavy workload of the object detection into multiple processes. The initial start of the application is slow, as each of those processes needs to load the model into memory and start its own TF session, but after that we benefit from parallelism 😁
  • Reducing the width and height of the frames in the video stream also improved fps a lot.
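
The threading idea from the first bullet is to let a background thread keep grabbing frames so that the main loop never waits on the camera. Below is a minimal sketch of that pattern (my own simplified version, not the pyimagesearch code). It wraps any object whose `read()` returns `(grabbed, frame)`, which is what `cv2.VideoCapture.read()` returns, so in the real app you would pass `cv2.VideoCapture(0)`. The names `ThreadedStream` and `FakeCamera` are hypothetical.

```python
import threading
import time
from typing import Any, Optional, Tuple


class ThreadedStream:
    """Grabs frames on a background thread; read() returns the most recent one."""

    def __init__(self, capture: Any):
        self._capture = capture  # e.g. cv2.VideoCapture(0) in the real app
        self._lock = threading.Lock()
        self._grabbed, self._frame = capture.read()  # prime with a first frame
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._update, daemon=True)

    def start(self) -> "ThreadedStream":
        self._thread.start()
        return self

    def _update(self) -> None:
        # The blocking camera I/O happens here, off the main thread.
        while not self._stopped.is_set():
            grabbed, frame = self._capture.read()
            with self._lock:
                self._grabbed, self._frame = grabbed, frame
            if not grabbed:
                break  # source ran dry (e.g. camera unplugged)

    def read(self) -> Tuple[bool, Optional[Any]]:
        # Never blocks on the camera: hands back the latest stored frame.
        with self._lock:
            return self._grabbed, self._frame

    def stop(self) -> None:
        self._stopped.set()
        self._thread.join()


class FakeCamera:
    """Stand-in for cv2.VideoCapture: each read() takes ~1 ms and returns a counter."""

    def __init__(self):
        self._count = 0

    def read(self) -> Tuple[bool, int]:
        time.sleep(0.001)  # simulate waiting for the camera
        self._count += 1
        return True, self._count


stream = ThreadedStream(FakeCamera()).start()
time.sleep(0.05)  # let the reader thread run for a moment
grabbed1, frame1 = stream.read()
time.sleep(0.05)
grabbed2, frame2 = stream.read()
stream.stop()
print(grabbed1, grabbed2, frame2 >= frame1)  # True True True
```

One design consequence: because `read()` always returns the latest frame, a detection loop that runs faster than the camera may see the same frame twice, while a slower loop simply skips stale frames instead of falling behind.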

Note: If you are on Mac OS X like me and you’re using OpenCV 3.1, there is a chance that OpenCV’s VideoCapture crashes after a while. There is already an issue filed for this. Switching back to OpenCV 3.0 solved the problem, though.

Conclusion & Outlook

Give me a ❤️ if you liked this post :) Pull the code and try it out yourself. And definitely have a look at the TensorFlow Object Detection API. From a first look, it’s pretty neat and simple. The next thing I want to try is training the API on my own dataset and using the pre-trained models for other applications I have in mind. I’m also not fully satisfied with the performance of the application; the fps rate is still not optimal. There are still many bottlenecks in OpenCV that I can’t influence, but there are alternatives I can try out, such as WebRTC, although that is web-based. Moreover, I’m thinking of using asynchronous method calls (async) to improve my fps rate. Stay tuned!

Follow me on Twitter: @datitran