How to track objects in the real world with TensorFlow, SORT and OpenCV

Sam Blake
HAL24K TechBlog
Oct 16, 2019

Many of the solutions we work on at HAL24K involve detecting, classifying and tracking objects from camera feeds. However, for some of our customers it’s not enough to simply identify and track each object in the frame: they need to know exactly where it is located in the real world (i.e. its latitude and longitude), how large it is or how fast it is travelling.

Luckily, two factors come into play which we can use to our advantage on many of our projects. Firstly, the cameras we use are often fixed in position. This means that each pixel of the background always corresponds to the same real world location. Secondly, many of our customers are based in the Netherlands, which, as countries go, is pretty flat. This means that whatever scene our camera observes can often be approximated to a set of objects located upon a 2D surface.

In this article we’ll show how, assuming these two preconditions, you can create a matrix transformation to convert between frame position and longitude-latitude, using OpenCV. Using this transformation, we will then determine the real world positions of the objects we detect and calculate the speeds at which they are travelling.

The camera feed

For this article we’ll be using a few minutes of video taken from the livestream below, provided courtesy of Provincie Gelderland (and streamed using VidGear).
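As a rough sketch, reading frames from the livestream with VidGear’s CamGear could look like the following; the stream URL is a placeholder for the link above, and the frame count is just an illustrative cut-off.

```python
from vidgear.gears import CamGear

# Placeholder URL -- substitute the livestream linked above
stream = CamGear(source="https://<livestream-url>", stream_mode=True).start()

frames = []
while True:
    frame = stream.read()
    if frame is None:  # stream ended or dropped
        break
    frames.append(frame)
    if len(frames) >= 3000:  # a few minutes of video is enough here
        break

stream.stop()
```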

You can see that the camera is fixed in position and observes a set of objects on an approximately 2D surface — vehicles travelling around a roundabout. We should therefore be able to define a matrix transformation to map the real space positions of these vehicles. But before we can perform this mapping, we will first need to detect and track the vehicles within the video. We’ll do this with two out-of-the-box techniques: a pre-trained TensorFlow object detector and the SORT tracking algorithm.

Detecting

First we need to generate the detections. Fortunately, the category of object we wish to detect — “car” — is one of the 80 COCO object classes, so we can use a TensorFlow object detector pre-trained on the COCO dataset. For this article we’ll use an Inception SSD (ssd_inception_v2_coco) and generate the detections by following the walkthrough provided by TensorFlow, keeping only those classified as “car”.
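A minimal sketch of that detection step, assuming TensorFlow 1.x and the frozen ssd_inception_v2_coco graph from the TensorFlow detection model zoo; the graph path and score threshold are illustrative, not taken from the original walkthrough.

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x, as in the official walkthrough

PATH_TO_FROZEN_GRAPH = "ssd_inception_v2_coco_2018_01_28/frozen_inference_graph.pb"
CAR_CLASS_ID = 3  # "car" in the COCO label map

# Load the frozen detection graph
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

def detect_cars(frame, sess, score_threshold=0.5):
    """Run the detector on one BGR frame; return [x1, y1, x2, y2, score] rows for 'car'."""
    h, w = frame.shape[:2]
    image = np.expand_dims(frame[..., ::-1], axis=0)  # BGR -> RGB, add batch dimension
    boxes, scores, classes = sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": image},
    )
    keep = (classes[0] == CAR_CLASS_ID) & (scores[0] >= score_threshold)
    boxes = boxes[0][keep]  # normalised [ymin, xmin, ymax, xmax]
    return np.column_stack([
        boxes[:, 1] * w, boxes[:, 0] * h,  # x1, y1 in pixels
        boxes[:, 3] * w, boxes[:, 2] * h,  # x2, y2 in pixels
        scores[0][keep],
    ])

with tf.Session(graph=detection_graph) as sess:
    detections_per_frame = [detect_cars(frame, sess) for frame in frames]
```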

Although it struggles with distant and occluded vehicles, this object detector does a pretty good job for a pre-trained model:

Tracking

Next we need to track these detections across the video. We’ll do this by first dropping any “duplicate” detections (that have a non-negligible IoU with a more probable detection in the same frame) and then using the SORT algorithm to track the objects between frames. Finally we ignore any tracked objects that have fewer than five detections. To turn these tracked detections into paths we take the bottom center of the bounding box generated by the Kalman filter to be the vehicle’s position.
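A sketch of that tracking step, using the reference SORT implementation (github.com/abewley/sort); the IoU threshold and the minimum track length of five are illustrative values.

```python
import numpy as np
from sort import Sort  # reference SORT implementation: github.com/abewley/sort

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def drop_duplicates(dets, iou_threshold=0.3):
    """Keep only the most probable of any pair of overlapping detections."""
    kept = []
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) < iou_threshold for k in kept):
            kept.append(det)
    return np.array(kept) if kept else np.empty((0, 5))

tracker = Sort()  # default max_age / min_hits; tune for the frame rate of your video
paths = {}        # track_id -> list of (frame_idx, u, v) bottom-centre positions
for frame_idx, dets in enumerate(detections_per_frame):
    for x1, y1, x2, y2, track_id in tracker.update(drop_duplicates(dets)):
        # the vehicle's position is the bottom centre of the Kalman-filtered box
        paths.setdefault(int(track_id), []).append((frame_idx, (x1 + x2) / 2, y2))

# ignore short-lived tracks (fewer than five detections)
paths = {tid: pts for tid, pts in paths.items() if len(pts) >= 5}
```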

This approach allows us to track the individual paths of all detected vehicles around the roundabout, with only some slight confusion between tracks for close-travelling vehicles in the far distance:

Mapping

Now that we’ve run our detection and tracking algorithms, we have, for each vehicle, a set of u,v coordinates which represent the path it took in “pixel space”, i.e. the 2D frame which our camera is recording. To know the “real space” location of each vehicle, we need to map these coordinates to another (approximately) 2D plane — the Earth’s surface. We can perform this mapping with a perspective transform, a matrix operation which projects a set of points from one 2D plane to another.

Adapted from: Fundamentals of Texture Mapping and Image Warping, Paul Heckbert

The maths for deriving this transformation can be a little involved (if you like linear algebra, you can find it here), but fortunately the matrix calculation can be performed by the OpenCV function getPerspectiveTransform. All we have to do is provide the coordinates of four points, no three of which are collinear (so they define the corners of a quadrilateral), in both coordinate systems.

This is the fiddly part. To get these four coordinates, we need to identify the longitude-latitude of four pixel locations in our frame. These should be spread as far apart as possible: the greater the area covered by the quadrilateral, the more accurate the perspective transformation will be. For our video, I have identified four clear landmarks in the frame and located them on a satellite image (via Google Maps). These are defined and plotted below.
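The reference points might be stored as below; the numbers here are placeholders rather than the actual landmarks used for this video, which you would read off your own frame and the satellite image.

```python
import numpy as np

# Placeholder values only -- replace with four landmarks from your own frame
# and their longitude-latitudes read from a satellite image.
quad_coords = {
    # (lon, lat) of the four landmarks on the satellite image
    "lonlat": np.array([
        [5.9001, 51.9841],
        [5.9012, 51.9838],
        [5.9015, 51.9846],
        [5.9003, 51.9849],
    ]),
    # (u, v) pixel positions of the same landmarks in the camera frame
    "pixel": np.array([
        [180, 650],
        [1240, 620],
        [1000, 280],
        [320, 290],
    ]),
}
```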

We then define a “PixelMapper” class which, from these four points, can translate a (u,v) pixel location to a (lon,lat) real world location, and vice versa.
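A sketch of such a class, built on OpenCV’s getPerspectiveTransform and applying the resulting homography in homogeneous coordinates:

```python
import cv2
import numpy as np

class PixelMapper:
    """Maps between pixel (u, v) coordinates and (lon, lat) coordinates using a
    perspective transform defined by four reference points in each system."""

    def __init__(self, pixel_array, lonlat_array):
        assert pixel_array.shape == (4, 2), "Need four (u, v) reference points"
        assert lonlat_array.shape == (4, 2), "Need four (lon, lat) reference points"
        self.M = cv2.getPerspectiveTransform(
            np.float32(pixel_array), np.float32(lonlat_array))
        self.invM = cv2.getPerspectiveTransform(
            np.float32(lonlat_array), np.float32(pixel_array))

    @staticmethod
    def _apply(M, points):
        """Apply a 3x3 homography to an (N, 2) array of points."""
        points = np.atleast_2d(points)
        homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
        mapped = homogeneous @ M.T
        return mapped[:, :2] / mapped[:, 2:]

    def pixel_to_lonlat(self, pixel):
        """(u, v) pixel coordinates -> (lon, lat)."""
        return self._apply(self.M, pixel)

    def lonlat_to_pixel(self, lonlat):
        """(lon, lat) -> (u, v) pixel coordinates."""
        return self._apply(self.invM, lonlat)
```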

Note that PixelMapper simply applies the perspective transformation to the coordinates: there is no logic to determine whether a background pixel falls beyond the vanishing point of the 2D surface, or whether a longitude-latitude coordinate falls behind the camera’s field of view. We therefore need to be careful of these cases when making transformations.

We can then perform these transformations like so:
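For example (the pixel location here is just an illustrative point in the frame):

```python
pm = PixelMapper(quad_coords["pixel"], quad_coords["lonlat"])

lonlat = pm.pixel_to_lonlat((640, 360))  # a pixel near the frame centre -> (lon, lat)
uv = pm.lonlat_to_pixel(lonlat)          # and back again -> approximately (640, 360)
```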

And applying this transformation to the (u,v) coordinates of our tracked objects, we can now plot their real world positions as they travel around the roundabout:

Speed calculation

As our final step we want to calculate the speeds of the vehicles we have tracked. To do this we need to calculate the distance they travel, which we can do using the longitude-latitudes we’ve just calculated. Rather than doing this directly (using the haversine formula) we will instead perform another coordinate transform, this time converting the vehicle latitude-longitudes into a local Cartesian coordinate system, with units in metres. This then allows us to easily calculate lengths and areas in these units.

We will transform our tracks from longitude-latitude (epsg:4326) to RD New (epsg:28992), a Cartesian coordinate system (with units in metres) that is valid across the Netherlands. (A suitable local coordinate system for any location can be found here.) This conversion is straightforward with pyproj:
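A sketch of that conversion, assuming pyproj 2.x and a track of (lon, lat) points as produced by PixelMapper above:

```python
import numpy as np
from pyproj import Transformer  # assumes pyproj 2.x

# WGS84 longitude-latitude -> RD New (metres); always_xy keeps (lon, lat) / (x, y) order
transformer = Transformer.from_crs("epsg:4326", "epsg:28992", always_xy=True)

def lonlat_to_rd(track_lonlat):
    """Convert an (N, 2) array of (lon, lat) points to (N, 2) RD New coordinates in metres."""
    x, y = transformer.transform(track_lonlat[:, 0], track_lonlat[:, 1])
    return np.column_stack([x, y])
```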

Then for our final piece of maths: to calculate the vehicles’ speeds we need only divide the distance, calculated from the coordinates above, by the time taken. In the animation below we’re doing this using the distance travelled over the previous 10 frames (corresponding to 2 seconds of video). Fortunately for this video it seems that all drivers are well below the speed limit.
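A sketch of that calculation for a single vehicle’s track in RD New coordinates; the 5 frames-per-second rate is implied by 10 frames corresponding to 2 seconds.

```python
import numpy as np

FPS = 5       # 10 frames correspond to 2 seconds of video
WINDOW = 10   # number of frames over which the distance is measured

def speeds_kmh(track_xy, fps=FPS, window=WINDOW):
    """track_xy: (N, 2) array of RD New positions (metres), one row per frame.
    Returns the speed in km/h at each frame, based on the distance travelled
    over the preceding `window` frames."""
    steps = np.linalg.norm(np.diff(track_xy, axis=0), axis=1)      # metres per frame
    distances = np.convolve(steps, np.ones(window), mode="valid")  # metres per window
    return distances / (window / fps) * 3.6                        # m/s -> km/h
```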

About HAL24K

HAL24K is a Data Intelligence scale-up based in San Francisco, Amsterdam and Singapore, delivering operational and predictive intelligence to cities, countries and companies.
