An End-to-End Solution for Pedestrian Tracking on an RTSP IP Camera Feed Using PyTorch

An interaction layer with a minimal setup to manage a pedestrian detection system

Masoud Masoumi Moghadam
NATIX

--

Original Image from arkeyphoto on Unsplash

Introduction

A pedestrian detection system can come in handy in many kinds of software.
For example:

  • Software that analyzes crowds commuting through several passages and plots a crowd heat-map for each one. Reports can then be created to address human-traffic bottlenecks in public places.
  • It can help organizations check social distancing in video surveillance systems.
  • Crowd simulation systems might also use the results of such a system.
  • Pedestrian attribute recognition is another area of research that benefits from such a system.

In this post, I will explain how I implemented an interaction layer for a real-time pedestrian tracking system on surveillance cameras using PyTorch and Flask.
I will briefly introduce Deep SORT and then focus mainly on the web-server side. This project was originally developed by
Ziqiang Pei; the web-server layer was my contribution. The code for this project can be found here.

A brief description of Deep SORT

According to [1], classic methods of multi-object tracking are divided into two parts:
Detection: All desired objects are detected.
Association:
A matching is then performed between similar detections with respect to the previous frame, and the matched detections are followed through the sequence to obtain the track of each object.

In the Deep SORT algorithm, this method is further divided into three steps:
1. Detection: To compute detections, a popular CNN-based object detection method is used (in this project, YOLO).

2. Estimation: The intermediate step before data association consists of an estimation model. It represents the state of each track as a vector of eight quantities: the box center (x, y), box scale (s), box aspect ratio (a), and their time derivatives as velocities. A Kalman filter models these states as a dynamical system. If a tracked object goes undetected for a threshold number of consecutive frames, it is considered out of frame or lost. For a newly detected box, a new track is started.
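The prediction side of this estimation step can be sketched as a constant-velocity Kalman predict over the eight-dimensional state. This is a minimal NumPy sketch, not the project's actual filter; the transition matrix, noise value, and numbers below are illustrative:

```python
import numpy as np

# Track state per the parameterization above:
# [x, y, s, a] = box center, scale, aspect ratio, plus their velocities.
dim = 4
F = np.eye(2 * dim)          # constant-velocity transition matrix
F[:dim, dim:] = np.eye(dim)  # position_t+1 = position_t + velocity_t (dt = 1 frame)

def kf_predict(state, cov, q=1e-2):
    """One Kalman predict step: propagate the mean and covariance."""
    state = F @ state
    cov = F @ cov @ F.T + q * np.eye(2 * dim)  # inflate with process noise
    return state, cov

# A track centered at (100, 50), moving 2 px/frame to the right:
state = np.array([100.0, 50.0, 6400.0, 0.5, 2.0, 0.0, 0.0, 0.0])
cov = np.eye(2 * dim)
state, cov = kf_predict(state, cov)
print(state[:2])  # predicted center: [102.  50.]
```

The update step (correcting the prediction with a matched detection) is omitted here; the predicted state is what feeds the association step below.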

3. Association: In the final step, given the states predicted by the Kalman filter from previous information and the newly detected boxes in the current frame, new detections are associated with existing tracks from the previous frame. This is computed with the Hungarian algorithm as a bipartite graph matching, made more robust by weighting the matches with a distance formulation.
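The association step boils down to a minimum-cost matching between tracks and detections. Here is a toy sketch with a hand-made square cost matrix (in practice Deep SORT combines motion and appearance distances, and real code would call the Hungarian solver, e.g. scipy.optimize.linear_sum_assignment, instead of this brute force):

```python
from itertools import permutations

def associate(cost):
    """Minimum-cost matching of tracks (rows) to detections (columns).
    Brute force over permutations for clarity; assumes a square matrix."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return list(enumerate(best_perm)), best

# Illustrative cost matrix: cost[i][j] = distance between track i and detection j
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.9, 0.3],
]
matches, total = associate(cost)
print(matches)  # each pair (track, detection): [(0, 0), (1, 1), (2, 2)]
```

Each track is paired with the detection that minimizes the total matching cost; unmatched detections would start new tracks, as described above.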

Image from [1], page 127

For further study of Deep SORT, I suggest taking a look at [1], pages 126–129, and also this blog.

Motivation

I had been researching blogs for a while to find open-source approaches to pedestrian detection on RTSP video. Unfortunately, I was not able to find good resources that implemented both the AI and the backend parts. So I decided to develop one, hoping this project can serve as a basic approach that helps others build more advanced projects.

What is RTSP anyway?

Real-Time Streaming Protocol (RTSP) is a network control protocol designed for use in communications systems to control streaming media servers. In this tutorial, I assume you already have an RTSP link to the camera you want to test the service on. If not, earthcam.com offers publicly available video streams.

Architecture Design

A basic approach

Since I had little experience with design patterns and backend engineering, my first attempt produced this structure:

However, I soon noticed some drawbacks of this approach:

  • The framework was not robust against network packet loss, which produced jittery output video.
  • I used the subprocess module with a bash command to start the AI application. This left me with no control over the application (the only way to stop it was to find the PID of the AI process and kill it).
  • I was not able to switch the camera. To switch the camera in this structure, I had to restart the interaction layer (because the AI process starts/stops with the interaction layer, and there is no way to stop the AI without stopping the interaction layer).
  • For each client request, the service was started again. After several requests, the server crashed because its resources were exhausted.

I figured out that most of the problems stemmed from the tight coupling between the AI script and the interaction layer, so I decided to add Redis as an async module for better decoupling.

An improved design by adding a caching module

Thanks to this awesome post by Adrian Rosebrock, I realized the importance of caching in web servers using Redis. I reorganized the structure like this:

The 2nd architecture: Adding Redis as an async module to decouple the Interaction Layer and AI service

In the image above, the numbers in black circles indicate the following steps of the procedure:

  • 1) When the web server comes up, it starts the pedestrian detection service with default inputs.
  • 2) The pedestrian detection service takes the RTSP link of a surveillance camera as input and reads the video feed using threads.
  • 3) After the AI algorithm finishes its task on each frame, the output frame is cached in Redis (each cached frame replaces the output of the previous step).
  • 4) The interaction layer retrieves the processed frames from the Redis cache.
  • 5) The interaction layer streams the pedestrian detection output to clients at 127.0.0.1:8888/run.
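Steps 3 and 4 above are essentially a producer/consumer exchange through Redis. A minimal sketch of the frame round trip (the key name and byte encoding are my own choices, not necessarily the project's; the project could equally store JPEG bytes via cv2.imencode):

```python
import struct
import numpy as np

def encode_frame(frame: np.ndarray) -> bytes:
    """Pack a shape header plus raw pixels so the consumer can rebuild the array."""
    h, w, c = frame.shape
    return struct.pack(">III", h, w, c) + frame.tobytes()

def decode_frame(data: bytes) -> np.ndarray:
    h, w, c = struct.unpack(">III", data[:12])
    return np.frombuffer(data[12:], dtype=np.uint8).reshape(h, w, c)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a processed frame
restored = decode_frame(encode_frame(frame))
assert (restored == frame).all()

# With a Redis server running on port 6379, the AI side would do roughly:
#   r = redis.Redis(port=6379); r.set("frame", encode_frame(frame))
# and the interaction layer would do:
#   frame = decode_frame(r.get("frame"))
```

Because each SET overwrites the same key, the cache always holds only the latest processed frame, which is exactly the replacement behavior described in step 3.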

How did this structure improve performance?

  • Network packet loss is unavoidable. To improve robustness against it, frames are read from the camera feed by threads.
  • Since the interaction layer has full control over the AI process, it can switch the input camera with little trouble.
  • Using Redis as an async module improved the decoupling between the AI process and the interaction layer. This way, the interaction layer does not start multiple AI instances again and again, so the framework can scale to run pedestrian detection on multiple cameras in parallel.

Code description

Requirements

For this project I used these Python libraries:

  • Python 3.7
  • opencv-python
  • scikit-learn
  • torch > 0.4
  • torchvision >= 0.1
  • Pillow
  • easydict
  • redis
  • python-dotenv
  • Flask

My OS is Ubuntu 16.04 and my GPU is an Nvidia GeForce RTX 2080.
For caching I installed Redis, which serves on port 6379. A thorough installation guide for Redis can be found here.

Code step 1: Capture video from RTSP

In the first step, I just tried to capture video frames and cache them in Redis. OpenCV's cv2.VideoCapture objects can handle RTSP links by default, but according to [3] this module is not robust against network packet loss. To overcome this issue, I used threads for grabbing frames. Here's the initial code I came up with:

Reading frames from rtsp using threads to overcome packet loss
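The gist embedded above implements this idea; a simplified sketch of the same pattern is a background thread that keeps only the freshest frame (the class and method names here are mine, not the project's):

```python
import threading

class ThreadedCapture:
    """Continuously read frames on a daemon thread so a slow consumer
    never falls behind the stream; only the latest frame is kept."""

    def __init__(self, capture):
        # `capture` is anything with a cv2.VideoCapture-like read() method
        # returning (ok, frame), e.g. cv2.VideoCapture("rtsp://...").
        self.capture = capture
        self.frame = None
        self.running = True
        self.lock = threading.Lock()
        threading.Thread(target=self._reader, daemon=True).start()

    def _reader(self):
        while self.running:
            ok, frame = self.capture.read()
            if ok:
                with self.lock:
                    self.frame = frame  # overwrite: older frames are dropped

    def read(self):
        """Return the most recent frame (or None before the first read)."""
        with self.lock:
            return self.frame

    def stop(self):
        self.running = False
```

With OpenCV this would be used as `stream = ThreadedCapture(cv2.VideoCapture(rtsp_url))`, after which `stream.read()` always returns the freshest frame regardless of how slowly the AI loop consumes them.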

Code step 2: Adding pedestrian detection

The pedestrian detection code for simple video inputs is available here. Combining it with the code from the first step, I came up with this:

Pedestrian detection script

Let me explain the important changes:

  • In line 67, the in_progress flag is a variable that is initialized in rtsp_server.py. Whenever we want to stop the AI script, we switch it off, and whenever we want to start the AI, we switch it on. This trick lets us stop/restart the AI service from another process (such as rtsp_server, the main running thread). The ability to control the execution of the pedestrian detection process lets us switch the camera feed after a restart.
  • In lines 78 — 98, the detection function is added, which performs the detection, estimation, and association steps of Deep SORT (as explained earlier) on pedestrians.
  • In lines 100 — 117, the draw_bboxes function is added to draw the bounding boxes, along with an ID, for each pedestrian.
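The in_progress trick can be sketched with a shared flag that the processing loop checks on every iteration. Here a threading.Event stands in for the flag shared between the two processes, and the frame processing is a placeholder:

```python
import threading

in_progress = threading.Event()  # stand-in for the shared on/off flag

def detection_loop(frames, results):
    """Process frames only while the flag is on; exit promptly when switched off."""
    for frame in frames:
        if not in_progress.is_set():
            break                  # another process asked us to stop
        results.append(frame * 2)  # placeholder for detect + track + draw

results = []
in_progress.set()  # "switch it on"
worker = threading.Thread(target=detection_loop, args=(range(5), results))
worker.start()
worker.join()
print(results)     # all 5 frames processed: [0, 2, 4, 6, 8]

in_progress.clear()  # "switch it off"
stopped = []
detection_loop(range(5), stopped)
print(stopped)       # flag is off, loop exits immediately: []
```

Because the loop polls the flag rather than running unconditionally, the controlling process can stop and restart the service (and hand it a new camera URL on restart) without killing any process.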

Code step 3: Adding the web server

On the web-server side, I basically need two tools to run this application as designed. First, a tool that retrieves cached frames from Redis and serves them at an HTTP endpoint. Second, a way to start/stop the AI service.

For the first one, an HTML template named index.html is created and placed in the templates directory. This template embeds the stream of images the web server provides to clients.
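The way such a template shows live video is that its img tag points at a multipart/x-mixed-replace endpoint, which the server feeds with a generator of JPEG chunks. A minimal sketch of that generator (function name and boundary string are my own; the project's gen/stream functions play this role):

```python
def mjpeg_stream(get_frame_bytes):
    """Yield JPEG frames in multipart/x-mixed-replace format, the scheme
    an <img> tag in index.html can consume as a live stream."""
    while True:
        jpg = get_frame_bytes()  # e.g. the latest JPEG bytes from Redis
        if jpg is None:
            break                # no more frames: end the stream
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpg + b"\r\n")

# Demo with two fake JPEG payloads followed by end-of-stream:
frames = iter([b"\xff\xd8fake-jpeg-1", b"\xff\xd8fake-jpeg-2", None])
chunks = list(mjpeg_stream(lambda: next(frames)))
print(len(chunks))  # 2
```

In Flask this generator would be wrapped as `Response(mjpeg_stream(...), mimetype="multipart/x-mixed-replace; boundary=frame")`.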

Client-side view

To configure the Deep SORT and YOLO detector model parameters, a server_cfg.py script file is added. Here is the code:

deep learning model configuration

In my projects, I always create a .env file that holds the paths of static files like model weights, configuration files, or datasets.

  • In lines 16 and 28, I fetch the directory paths of the YOLO and Deep SORT models from the .env file. This trick is useful when I have to move the services directory to a server. We will use the variables model and deep_sort_dict in our rtsp_webserver.py file.
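The .env pattern can be sketched as follows. The variable names YOLO_DIR and DEEPSORT_DIR and the fallback paths are illustrative, not necessarily what the project's .env contains:

```python
import os

# python-dotenv would first load the .env file into the environment:
#   from dotenv import load_dotenv
#   load_dotenv()
# where .env might contain lines such as:
#   YOLO_DIR=/srv/models/yolov3
#   DEEPSORT_DIR=/srv/models/deep_sort

# Read the model directories from the environment, with local
# development fallbacks so the script still runs without a .env file:
model = os.getenv("YOLO_DIR", "./weights/yolov3")
deep_sort_dict = os.getenv("DEEPSORT_DIR", "./weights/deep_sort")

print(model, deep_sort_dict)
```

Because the paths live in the environment rather than in code, moving the service to another machine only requires editing .env, not the source.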

Finally, here is the code of the web server:

Web server application that controls AI-based service.

Let’s scrutinize the code together:

  • In lines 23 — 25, we initialize Redis, the Flask server, and the in_progress flag.
  • In lines 26 — 49, we initialize the argument parser.
  • In lines 52 — 59, the gen function returns the frames cached by the pedestrian detection part. To provide the processed frames to clients, I used the stream function (lines 95 — 103), which maps them to index.html.
  • In lines 62 — 71, the pedestrian_tracking function starts the pedestrian tracking process. To trigger the action, I used the trigger_process function in lines 74 — 87.
  • Finally, in lines 106 — 140, the process_manager function is the main function that handles GET requests from clients to 127.0.0.1:8888/run. Each request to this endpoint has at most 2 parameters. The first is run, which commands the service to start (if set to 1) or stop (if set to 0). The second is camera_stream, the RTSP link of a camera feed, whose value will eventually be set as the new video input of the pedestrian detection service (lines 122–128).
  • While the pedestrian detection service is running on some camera, it's impossible to trigger a new one. So every request with run=1 will be answered with "pedestrian detection is already in progress" (line 131). The same applies to stopping the service (line 137).
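The control logic described in the last two bullets can be condensed into a small pure function. This is a sketch of the behavior, not the project's exact code; the Flask wiring is indicated in comments:

```python
def handle_run(run, camera_stream, in_progress):
    """Decide the response to GET /run?run=...&camera_stream=...
    Returns (message, new_in_progress); run is "1" to start, "0" to stop."""
    if run == "1":
        if in_progress:
            return "pedestrian detection is already in progress", True
        # Here the real server would (re)start the AI process,
        # optionally pointing it at the new camera_stream URL.
        return f"started on {camera_stream or 'default camera'}", True
    if run == "0":
        if not in_progress:
            return "pedestrian detection is already stopped", False
        return "stopping pedestrian detection", False
    return "invalid request", in_progress

# In Flask this would be wired up roughly as:
#   @app.route("/run")
#   def process_manager():
#       msg, state = handle_run(request.args.get("run"),
#                               request.args.get("camera_stream"), ...)

msg, state = handle_run("1", "rtsp://example.com/stream", in_progress=False)
print(msg, state)
```

Keeping the decision logic separate from the route handler like this also makes the start/stop rules easy to test without a running server.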

I still have to add some authorization and authentication to this function, which I will push to the GitHub repo later.

Summary

In this article, a basic approach to pedestrian detection on surveillance cameras was described. We saw how architecture design is critical to scaling the application. In this project, a caching tool like Redis helped us design a better structure by decoupling the web server and the pedestrian detection process. We also provided a basic API to control the pedestrian detection process and its input.

In the end, I have to give a shout-out to Adrian Rosebrock for his awesome blog posts (especially this one), which were so useful for this project.
I also have to thank
Ziqiang Pei, who provided his open-source pedestrian detection approach and let me collaborate on the project.

References

[1] Abhinav Dadhich, Practical Computer Vision
[2] https://nanonets.com/blog/object-tracking-deepsort/
[3] https://stackoverflow.com/questions/55828451/video-streaming-from-ip-camera-in-python-using-opencv-cv2-videocapture/55838623#55838623

DISCLAIMER: This post only reflects the author’s personal opinion, not any other organization’s. This is not official advice. The author is not responsible for any decisions that readers choose to make.
