Detecting People in Real-time Using Deep Learning

Schuman Zhang
5 min read · Jul 10, 2018


Object detection in real time is a really interesting topic. How do we reliably detect people and other real-life objects in video feeds? Recently I managed to build a very simple app that connects to the user's laptop webcam and automatically detects objects. I'd like to share with you how I built this app and some of the interesting issues and challenges I had along the way.

The full source code for my project can be found here.

Object detection is a very active research topic in the field of computer vision. The most productive approaches to detecting objects and localising them (think of this as putting bounding boxes around objects) in an image use deep learning techniques. Several neural network architectures have been designed specifically for this purpose, most notably R-CNN, Faster R-CNN, Single Shot Detection (SSD) and YOLO (You Only Look Once).

TensorFlow object detection models

You can readily find pre-trained models of the above-mentioned neural network architectures in the TensorFlow models repository. Collectively they are known as the TensorFlow detection model zoo. These models are pre-trained on the COCO dataset, whose label map covers 90 class labels (everyday objects such as people, cats and dogs). In our simple app, we will use the single shot detection method with a MobileNet backbone as the specific model. This architecture is more compact, and we get the added benefit of speed, which is important since we'll be analysing 30–50 frames per second.
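For example, the SSD MobileNet release used in the official object detection tutorial can be fetched and unpacked along these lines (the archive name reflects the 2017_11_17 snapshot of the zoo; newer releases use different names, so treat the exact URL as an assumption):

```python
import tarfile
import urllib.request

# SSD MobileNet trained on COCO, as packaged in the detection model zoo.
MODEL_NAME = 'ssd_mobilenet_v1_coco_2017_11_17'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# Download and extract the archive; the extracted folder contains
# frozen_inference_graph.pb, which is what we load in the detection code below.
urllib.request.urlretrieve(DOWNLOAD_BASE + MODEL_NAME + '.tar.gz',
                           MODEL_NAME + '.tar.gz')
tarfile.open(MODEL_NAME + '.tar.gz').extractall()
```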

I won’t go into the finer details of how these neural networks work (that’s another interesting topic altogether). In our app the focus is detecting people, and we’re trying to answer the question of whether there are people in the room and, if so, how many. That said, the model can also detect up to 90 real-world object categories, including mundane objects such as cell phones, books and laptops. In theory, we could use transfer learning to re-train the final layers of these networks to detect a greater variety of objects, but that would require additional training data and, of course, lots of computing power and time.

To summarise: we will analyse a video feed and detect people in a room using TensorFlow, OpenCV and Python.

Building the object detection app

The overall flow of the app is as follows:

  1. We will use the OpenCV Python library to read frames from the webcam of our laptop. This is done via OpenCV’s VideoCapture function.
  2. Then we will pass those frames into the MobileNet SSD model to detect objects. Any detections with a confidence level above 0.5 will be returned.
  3. Any detected objects will pass through a visualisation module, which simply puts coloured bounding boxes around the detected objects in the image.
  4. We also add a tracking module which displays whether the room is empty or occupied and the number of people in it. This data is then stored in a separate .csv file.
  5. The processed frame is passed back and we use OpenCV’s imshow function to display the processed frames, with bounding boxes, to the user.
  6. Lastly, the output of our video feed will be written into a separate .mp4 file at 20 frames per second, so that we can enjoy our work later :)

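The full project code is linked above; the snippet below is only a simplified sketch of the main loop, assuming detection runs in a background worker thread. The worker function itself and the draw_boxes helper are sketched in the snippets that follow, and names such as occupancy.csv are my own, not necessarily those used in the project.

```python
import csv
import time
import cv2
from queue import Queue
from threading import Thread

input_q = Queue(maxsize=5)    # raw frames waiting for the model
output_q = Queue(maxsize=5)   # (annotated frame, person count) pairs

# 'worker' is sketched in the next snippet; it runs outside the main thread.
Thread(target=worker, args=(input_q, output_q), daemon=True).start()

cap = cv2.VideoCapture(0)                                  # default laptop webcam
out = cv2.VideoWriter('output.mp4',
                      cv2.VideoWriter_fourcc(*'mp4v'),
                      20.0, (640, 480))                    # 20 fps output file

with open('occupancy.csv', 'w', newline='') as log:
    writer = csv.writer(log)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        input_q.put(cv2.resize(frame, (640, 480)))         # hand the frame to the worker
        annotated, num_people = output_q.get()              # frame with boxes drawn on it
        status = 'Occupied' if num_people > 0 else 'Empty'
        writer.writerow([time.time(), status, num_people])  # simple occupancy log
        out.write(annotated)                                 # save the result at 20 fps
        cv2.imshow('Detections', annotated)                  # show it to the user
        if cv2.waitKey(1) & 0xFF == ord('q'):                # press 'q' to quit
            break

cap.release()
out.release()
cv2.destroyAllWindows()
```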
In the above code, the while loop reads frames from the webcam and puts the unprocessed frames into an input queue to be passed to our deep learning model. Once we receive predictions from TensorFlow, these detections are inserted into an output queue, pass through the visualisation module via the object_tracker class, and finally we write the newly processed frames to a separate file and display the results to the user.

We will take advantage of multi-threading in Python to increase the speed at which we process our video frames. The worker function below takes frames from the input queue, loads the TensorFlow model and puts any detections back into the output queue. It runs separately from the main thread.
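A minimal version of that worker might look like the following. It loads the frozen SSD MobileNet graph once and then loops forever over the input queue; the tensor names (image_tensor:0, detection_boxes:0 and so on) are the standard outputs of models exported with the TensorFlow Object Detection API, and draw_boxes is the drawing helper sketched in the next snippet.

```python
import cv2
import numpy as np
import tensorflow as tf

def worker(input_q, output_q):
    """Pull raw frames, run SSD MobileNet, push back (annotated frame, person count)."""
    # Load the frozen graph downloaded from the model zoo (TensorFlow 1.x style).
    detection_graph = tf.Graph()
    with detection_graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile('ssd_mobilenet_v1_coco_2017_11_17/'
                            'frozen_inference_graph.pb', 'rb') as f:
            graph_def.ParseFromString(f.read())
            tf.import_graph_def(graph_def, name='')
    sess = tf.Session(graph=detection_graph)

    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes_t = detection_graph.get_tensor_by_name('detection_boxes:0')
    scores_t = detection_graph.get_tensor_by_name('detection_scores:0')
    classes_t = detection_graph.get_tensor_by_name('detection_classes:0')

    while True:
        frame = input_q.get()
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)        # model expects RGB
        boxes, scores, classes = sess.run(
            [boxes_t, scores_t, classes_t],
            feed_dict={image_tensor: np.expand_dims(rgb, axis=0)})
        keep = scores[0] > 0.5                               # confidence threshold from step 2
        annotated = draw_boxes(frame, boxes[0][keep], classes[0][keep], scores[0][keep])
        num_people = int(np.sum(classes[0][keep] == 1))      # COCO class id 1 = person
        output_q.put((annotated, num_people))
```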

And of course, to visualise the detections we need to pass the detected class labels, their respective confidence levels, and the bounding box colours and coordinates to a drawing routine that renders them onto the frame.
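A small drawing helper along these lines would do the job; the LABELS dictionary here is just an illustrative subset of the full COCO label map shipped with the Object Detection API, not the mapping used in the project.

```python
import cv2

# Illustrative subset of the COCO label map; the full mscoco_label_map.pbtxt
# covers every category the model can return.
LABELS = {1: 'person', 62: 'chair', 63: 'couch', 64: 'potted plant'}

def draw_boxes(frame, boxes, classes, scores):
    """Draw a coloured rectangle and a 'label: confidence' caption per detection.
    Boxes arrive normalised as (ymin, xmin, ymax, xmax)."""
    h, w = frame.shape[:2]
    for (ymin, xmin, ymax, xmax), cls, score in zip(boxes, classes, scores):
        p1 = (int(xmin * w), int(ymin * h))
        p2 = (int(xmax * w), int(ymax * h))
        colour = (0, 255, 0) if int(cls) == 1 else (255, 0, 0)   # green for people
        cv2.rectangle(frame, p1, p2, colour, 2)
        caption = '{}: {:.0%}'.format(LABELS.get(int(cls), 'object'), score)
        cv2.putText(frame, caption, (p1[0], max(p1[1] - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, colour, 2)
    return frame
```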

Testing and evaluating the app

The next question is: how well does this simple app perform? Running the app on my laptop, I find that detecting people is actually quite robust. I haven’t put the application through a stringent testing environment, but I can see many cases where it could be rather brittle. Firstly, when I put the Steve Jobs biography in front of the camera, the app detects another person rather than a book (so it cannot distinguish between a real person and an image of a person). Secondly, while detecting people performs well, detection of many other classes is not particularly accurate; it often mistakes my cell phone for a TV, or a laptop, for example. There is certainly a lot of room for improvement when it comes to detecting other real-world objects.

Potential real world use cases?

We can easily think of many interesting real-world applications for analysing and detecting people or other objects in a real-time video feed. We could detect the presence of people in surveillance camera footage; after all, there is a tremendous amount of security footage that nobody ever looks at. Potentially we could enable cameras to track people, count foot traffic, and even identify specific actions and behaviours in real time. Autonomous transportation is also on the horizon, and this type of technology will be crucial in helping vehicles see the road and detect pedestrians.
