Not just another YOLO V3 Object Detector for Python

If you are interested in computer vision, then you have probably heard about YOLO by now. YOLO (“You Only Look Once”) is a fast, real-time technique for object detection. Its latest version, YOLO version 3.0, was released recently and is still much faster than contemporary object detection approaches while producing comparable results. Popular Python wrappers and ports already exist for earlier versions of YOLO. However, I needed a quick Python wrapper for YOLO 3.0, which led me to start the project YOLO3-4-Py. At present, it is a simple, stable wrapper based on Cython.

Below is an output obtained from YOLO3-4-Py.

YOLO3-4-Py applied to the TownCentre test video from the “Coarse Gaze Estimation in Visual Surveillance Project” by the University of Oxford

What’s so special about YOLO3-4-Py?

In fact, a work-in-progress Python wrapper based on ctypes already exists within the original source code repository of darknet (the framework used to implement YOLO 3.0). You can find it here. However, I noticed a significant drawback in this wrapper that made it unsuitable for typical Python-based OpenCV development.

I needed interaction with the standard OpenCV 3 API for Python

When we use OpenCV in Python, images are represented as NumPy arrays. Naturally, we would like to feed an input image to the detector directly as a NumPy array. Despite being very useful, this facility is not implemented in the Python wrapper embedded in the darknet source code. The trivial workaround with the provided wrapper is to save the NumPy image to disk and reload it through the wrapper, which is certainly not a sensible approach for a real-time detector.

I already knew about a project called pyboostcvconverter by Gregory Kramida, which lets you convert a NumPy array to cv::Mat, the native matrix type of OpenCV 3.0. Building on this project, I was able to convert NumPy arrays to darknet images entirely in memory, without touching the disk.
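As a rough illustration of what such an in-memory conversion involves, here is a pure-Python sketch of the memory-layout change. It assumes that an OpenCV colour frame is stored as interleaved 8-bit BGR (row-major, height x width x 3) and that a darknet image is a planar, channel-major float buffer with values in [0, 1] and RGB channel order. The function name `bgr_hwc_to_darknet_chw` is hypothetical; this is not YOLO3-4-Py's actual code, which goes through cv::Mat in native code.

```python
from typing import List


def bgr_hwc_to_darknet_chw(pixels: bytes, height: int, width: int) -> List[float]:
    """Hypothetical helper: rearrange an interleaved 8-bit BGR frame
    (row-major H x W x 3) into a planar, channel-major list of floats in
    [0, 1] with RGB channel order (the assumed darknet image layout)."""
    channels = 3
    if len(pixels) != height * width * channels:
        raise ValueError("buffer length must equal height * width * 3")
    plane = height * width  # number of values in one channel plane
    out = [0.0] * (channels * plane)
    for y in range(height):
        for x in range(width):
            src = (y * width + x) * channels  # start of this pixel's B, G, R triple
            for c in range(channels):
                dst_c = channels - 1 - c  # B -> plane 2, G -> plane 1, R -> plane 0
                out[dst_c * plane + y * width + x] = pixels[src + c] / 255.0
    return out
```

For example, a 1 x 2 frame with (B, G, R) pixels (10, 20, 30) and (40, 50, 60) becomes the red plane [30, 60], then the green plane [20, 50], then the blue plane [10, 40], each divided by 255. With NumPy, the same rearrangement can be written as `frame[:, :, ::-1].transpose(2, 0, 1).astype(np.float32) / 255.0`, which is much faster than a Python loop and closer to what a real wrapper would need.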

It is still very fast

On my test bench (specifications below), YOLO3-4-Py adds approximately 7% CPU time when processing 1280 x 720 video frames. At 1920 x 1080, this overhead increases to approximately 11%. I believe this overhead can be reduced further in the future.

Test bench specifications: Intel Core i7-7700HQ (up to 3.8 GHz), 16 GB memory, NVIDIA GeForce GTX 1060 6 GB graphics card, Ubuntu 16.04, OpenCV 3.4 and TensorFlow 1.5.
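An overhead figure like the ones above can be expressed as extra CPU time relative to a baseline. The following sketch assumes that definition (the exact measurement procedure is not described here) and uses Python's `time.process_time()` to measure CPU time; the helper names `overhead_percent` and `cpu_seconds_per_frame` are hypothetical.

```python
import time
from typing import Callable, Iterable, List


def overhead_percent(baseline_seconds: float, wrapped_seconds: float) -> float:
    """Extra time spent on the wrapped path, as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (wrapped_seconds - baseline_seconds) / baseline_seconds


def cpu_seconds_per_frame(process: Callable[[object], object],
                          frames: Iterable[object]) -> float:
    """Average CPU time (time.process_time) that `process` needs per frame."""
    frame_list: List[object] = list(frames)
    if not frame_list:
        raise ValueError("need at least one frame")
    start = time.process_time()
    for frame in frame_list:
        process(frame)
    return (time.process_time() - start) / len(frame_list)
```

For example, a baseline of 1.0 s per batch and a wrapped run of 1.07 s gives `overhead_percent(1.0, 1.07)` of about 7.0. In practice, one would time the native detection call as the baseline and the conversion-plus-detection call as the wrapped path, averaging over many frames to smooth out timer noise.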

With your contribution, it can be made better!

I have not spent a lot of time on this project yet (less than two days of work as of 2018-04-07). Still, the results I’ve managed to get surprised me. The wrapper is stable and fast, so it’s a good starting point!

It can be made better. I would love to hear your thoughts and ideas about improvements that can be made to the project. Your ideas and code contributions are most welcome.

Feel free to open pull requests and make improvements, and raise issues on the GitHub repository. The project is released under the Apache 2.0 License, so anyone can use it and contribute.

And what’s so special about YOLO?

Object detection is the task of identifying objects in an image and marking their locations with bounding boxes. Conventional approaches to object detection repurposed image classifiers (techniques that identify the type of object in an image) as object detectors by means of region proposals, generated for example by a Region Proposal Network. The proposal stage suggests a redundant set of overlapping bounding boxes as candidate regions, and a classifier then tries to identify the type of object in each box. As a result, the classifiers in conventional object detectors look at the same part of an image many times. YOLO instead looks at the image only once: a single network pass over the whole image predicts all bounding boxes and their class probabilities together. The result is a much faster object detector with comparable accuracy (as of version 3.0).
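To make the “look only once” idea more concrete, here is a simplified sketch of how one output grid of a YOLO-style network can be decoded into boxes. It follows the box parameterisation described in the YOLOv3 paper (centre offset via a sigmoid within a grid cell, width and height as an exponential scaling of an anchor size), but it is an illustration, not darknet's code: class scores and non-maximum suppression are left out, and the names `decode_grid`, `Box` and the `raw[row][col][anchor]` input layout are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Raw = Tuple[float, float, float, float, float]  # (tx, ty, tw, th, to) for one anchor


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


@dataclass
class Box:
    cx: float          # box centre x, in input-image pixels
    cy: float          # box centre y, in input-image pixels
    w: float           # box width, in pixels
    h: float           # box height, in pixels
    objectness: float  # confidence that the box contains an object


def decode_grid(raw: Sequence[Sequence[Sequence[Raw]]], grid_size: int, stride: int,
                anchors: Sequence[Tuple[float, float]],
                threshold: float = 0.5) -> List[Box]:
    """Hypothetical decoder: raw[row][col][a] is the prediction of anchor `a`
    in grid cell (row, col); all cells come from the same single forward pass."""
    boxes: List[Box] = []
    for row in range(grid_size):
        for col in range(grid_size):
            for a, (pw, ph) in enumerate(anchors):
                tx, ty, tw, th, to = raw[row][col][a]
                score = sigmoid(to)
                if score < threshold:
                    continue  # drop low-confidence predictions
                boxes.append(Box(
                    cx=(sigmoid(tx) + col) * stride,  # offset inside the cell, then to pixels
                    cy=(sigmoid(ty) + row) * stride,
                    w=pw * math.exp(tw),              # scale the anchor (prior) size
                    h=ph * math.exp(th),
                    objectness=score,
                ))
    return boxes
```

For example, on a 2 x 2 grid with a stride of 32 pixels and a single 10 x 13 anchor, a prediction of (0, 0, 0, 0, 5) in cell (row 1, column 0) decodes to a 10 x 13 box centred at (16, 48). Every cell is decoded from the same network output, rather than by running a classifier separately on each proposed region.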

More details on YOLO can be found on the official website.


  1. YOLO Project Website
  2. “TownCentre” test video from “Coarse Gaze Estimation in Visual Surveillance Project” by University of Oxford.