An End to End Workflow on the Cloud to Monitor Traffic Flow using Deep Learning

Amin Tayyebi · Published in GeoAI · Jul 17, 2019 · 11 min read

An overview of integrating live feed streaming video technology with ArcGIS, the ArcGIS API for Python and deep learning models in Amazon Web Services (AWS) to monitor traffic flow.

Outline:

1. Traffic Management and Problem Statement
2. Live Feed Streaming Technology and Data Labeling
3. Object Detection: Training YOLO3 on AWS
4. End to End Workflow Architecture
5. Monitor Live Traffic Flow using the Dashboard
6. Anomaly Detection using Historical Data
7. Conclusion and Future Works
8. Acknowledgment and References

1. Traffic Management and Problem Statement

Traffic flow and intensity are important factors to monitor in urban environments. The need to control road traffic flow is not a new idea. In most big cities, busy roads, highways, and intersections are monitored using cameras. Departments of transportation are usually interested in a variety of things such as accidents, road conditions such as snow cover, crime, car breakdowns, vehicle speed, traffic jams, and the number of pedestrians. Information from live feed streaming technology can help better manage traffic and improve public safety for citizens. A study by the National Highway Traffic Safety Administration showed that 36% of crashes happen at intersections. Intersections are thus a major contributor to city congestion and delays, and a major concern of traffic management centers. To monitor and manage traffic flow, many cameras are installed at intersections. These can be cameras with a fixed position as well as remotely controlled (pan/tilt/zoom) cameras (Figure 1).

Figure 1. Example Images from Live Feed Streaming Cameras with Detected Objects at Two Intersections in Washington D.C. (Day-Time and Night-Time)

The District Department of Transportation (DDOT) in Washington D.C. asked Esri to build an end-to-end workflow on the cloud that can 1) monitor traffic (mainly cars, buses, trucks, bikes, and people) at the 111 main intersections in Washington D.C. and visualize traffic flow using GIS, 2) detect anomalies in the volume of traffic (mainly cars, buses, trucks, bikes, and people) at intersections, and 3) detect pedestrians at unsafe locations at intersections. Such an end-to-end workflow requires not only detecting the desired objects from live feed streaming video, but also integrating geospatial data with a deep learning framework.

In this blog post, I show how I built an end-to-end workflow by leveraging ArcGIS, the ArcGIS API for Python [1], AWS, and a deep learning framework (Keras here). The workflow uses AWS GPUs for training and inference to speed up the process and to serve the workflow live for the department of transportation [2]. It also uses the ArcGIS API for Python to integrate geospatial information, such as the location of objects detected in live feed streaming video, with the deep learning framework and to store the time series records using ArcGIS Enterprise [3].

2. Live Feed Streaming Technology and Data Labeling

Deep learning projects require large amounts of data to train a model, so I had to collect training samples from live cameras in Washington D.C. TrafficLand provides live traffic video for 111 cameras in Washington D.C. through a REST API [4]. To prepare the training samples, I wrote a Python script that randomly called the TrafficLand REST API and collected ~1000 images, at both day-time and night-time, from the 111 cameras. I gathered the images into a folder and labeled them using LabelImg (Figure 2) by drawing bounding boxes around the desired objects [5]. I then exported the labels in a text file format that is readable by most object detection algorithms.

Figure 2. User-interface of LabelImg Software and Example of Labeled Image in Day-Time
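For reference, here is a minimal sketch of the kind of collection script described above. The endpoint URL, authentication parameter, and response format are placeholders, not the actual TrafficLand API schema.

```python
import random
import time
from pathlib import Path

import requests

# Hypothetical endpoint and key -- the real TrafficLand REST API call
# and authentication will differ.
TRAFFICLAND_URL = "https://api.trafficland.com/feeds"
API_KEY = "YOUR_API_KEY"
OUT_DIR = Path("training_images")
OUT_DIR.mkdir(exist_ok=True)

def fetch_snapshot(camera_id: str) -> bytes:
    """Request the current JPEG frame for one camera (hypothetical schema)."""
    resp = requests.get(f"{TRAFFICLAND_URL}/{camera_id}",
                        params={"key": API_KEY}, timeout=10)
    resp.raise_for_status()
    return resp.content

def collect_samples(camera_ids, n_samples=1000, pause_s=30):
    """Randomly sample cameras over time to build a day/night training set."""
    for _ in range(n_samples):
        cam = random.choice(camera_ids)
        image_bytes = fetch_snapshot(cam)
        (OUT_DIR / f"{cam}_{int(time.time())}.jpg").write_bytes(image_bytes)
        time.sleep(pause_s)  # spread samples across the day and night
```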

3. Object Detection: Training YOLO3 on AWS

To build an end-to-end workflow, we first need to detect objects from the live feed streams. You Only Look Once (YOLO) is a popular deep learning object detection algorithm that can achieve high accuracy in real-time applications. YOLO outputs the location of each object in the image as well as its associated class, and it requires only one forward propagation pass through the network to make predictions. Earlier versions of YOLO, such as YOLO2, often missed small objects in the image due to the loss of fine-grained features as the layers downsampled the input image. In addition, YOLO2's architecture still lacked state-of-the-art components such as residual blocks, skip connections, and upsampling. YOLO3 tackles these issues by adding more layers and incorporating residual blocks, skip connections, and upsampling. YOLO3 performs detection by applying 1×1 detection kernels to feature maps of three different sizes at three different places in the network (Figure 3). There are many papers and blog posts that clearly explain YOLO and how it has progressed over the last few years, so I am going to skip covering YOLO in more detail here. You can find more about the original YOLO, as well as how the versions differ from each other, in these references [6].

Figure 3. YOLO3 Architecture

Around 1000 images from the 111 live feed cameras, captured at day-time and night-time, and their associated labels were used to train YOLO3 on AWS. I ran an EC2 instance with the AWS Deep Learning AMI [2], which comes with popular deep learning frameworks and interfaces such as TensorFlow, PyTorch, and Keras pre-installed for training sophisticated deep learning models. I started from a pre-trained YOLO3 model and did transfer learning. I then tested the trained model by comparing the predicted bounding boxes with the actual labels; the trained model achieved ~95% mean intersection over union (IoU). I used an existing GitHub repo as a reference to train the YOLO3 model [7].
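As a rough illustration of the transfer learning step, the sketch below follows the structure of the training script in the keras-yolo3 repo [7]. The helper imports (get_classes, get_anchors, create_model, data_generator_wrapper) and all paths and hyperparameters are assumptions and may differ from the exact code in that repo.

```python
import numpy as np
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

# Helpers imported from the training script of the keras-yolo3 repo [7];
# the exact signatures may differ from the version you clone.
from train import get_classes, get_anchors, create_model, data_generator_wrapper

classes = get_classes("model_data/dc_classes.txt")    # car, bus, truck, bike, person
anchors = get_anchors("model_data/yolo_anchors.txt")
input_shape = (416, 416)

# Start from pre-trained weights and freeze the backbone (transfer learning).
model = create_model(input_shape, anchors, len(classes),
                     freeze_body=2, weights_path="model_data/yolo_weights.h5")
model.compile(optimizer=Adam(lr=1e-3),
              loss={"yolo_loss": lambda y_true, y_pred: y_pred})

# One annotation line per image: image path followed by box coordinates and class ids.
with open("dc_train_annotations.txt") as f:
    lines = f.readlines()
np.random.shuffle(lines)
n_val = int(len(lines) * 0.1)
train_lines, val_lines = lines[n_val:], lines[:n_val]

batch_size = 16
model.fit_generator(
    data_generator_wrapper(train_lines, batch_size, input_shape, anchors, len(classes)),
    steps_per_epoch=max(1, len(train_lines) // batch_size),
    validation_data=data_generator_wrapper(val_lines, batch_size, input_shape, anchors, len(classes)),
    validation_steps=max(1, n_val // batch_size),
    epochs=50,
    callbacks=[ModelCheckpoint("yolo3_dc_best.h5", save_best_only=True)],
)
```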

4. End to End Workflow Architecture

To build a live workflow on AWS, we came up with the architecture below to facilitate traffic monitoring (Figure 4): 1) We use parallel processing to speed up collecting live feed images from the TrafficLand REST API. 2) The images are then passed to the trained YOLO3 model on an AWS EC2 instance, which detects the objects in each image and their associated bounding boxes. Overall, we were able to collect 111 images and run inference in 10 seconds on AWS using a single NVIDIA Tesla K80 GPU. 3) We then send the output of the YOLO3 model to GeoEvent [8] on AWS, where it is stored in the big data store and used to visualize traffic flow on the dashboard. The workflow also stores the image associated with each camera in an S3 bucket over time.

Figure 4. An End to End Workflow Architecture
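A simplified sketch of one cycle of this loop is shown below. The GeoEvent receiver URL, the S3 bucket name, and the helper functions (fetch_snapshot from the collection sketch above, plus assumed yolo3_detect, summarize_counts, and to_geojson wrappers) are hypothetical placeholders rather than the production code.

```python
import concurrent.futures
import time

import boto3
import requests

s3 = boto3.client("s3")
# Hypothetical receiver endpoint and bucket name, not the production values.
GEOEVENT_INPUT_URL = "https://geoevent-host:6143/geoevent/rest/receiver/traffic-counts"
BUCKET = "ddot-traffic-frames"

def fetch_camera(camera):
    """Download the current frame for one camera (fetch_snapshot as sketched earlier)."""
    return camera, fetch_snapshot(camera["id"])

while True:
    cycle_start = time.time()

    # 1) Fetch all 111 frames in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        frames = list(pool.map(fetch_camera, cameras))  # cameras: list of camera dicts

    for camera, frame in frames:
        # 2) Run the trained YOLO3 model on the GPU; yolo3_detect and
        #    summarize_counts are assumed wrappers around the Keras model.
        counts = summarize_counts(yolo3_detect(frame))

        # 3) Push the counts (as GeoJSON, see the conversion sketch below) to the
        #    GeoEvent input connector and archive the raw frame in S3.
        requests.post(GEOEVENT_INPUT_URL, json=to_geojson(camera, counts), verify=False)
        s3.put_object(Bucket=BUCKET,
                      Key=f"{camera['id']}/{int(cycle_start)}.jpg",
                      Body=frame)

    # Aim for a roughly 10-second cycle, matching the measured inference time.
    time.sleep(max(0.0, 10.0 - (time.time() - cycle_start)))
```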

Our IT team set up the deep learning EC2 instance and ArcGIS GeoEvent Server [8] on AWS. GeoEvent Server enables real-time, event-based data streams to be integrated as location data in a feature class or big data store [9]. To connect GeoEvent Server to the YOLO3 model on the deep learning EC2 instance, I had to configure a GeoEvent Service, which includes input connectors, processors, and output connectors. GeoEvent Services are created using a simple graphical interface similar to ModelBuilder [10].

The input connector informs GeoEvent Server about the structure of the event data coming from the YOLO3 model and passes the data to the processors. If you send data with another format or structure, GeoEvent Server won't be able to consume it. Several connectors are provided for common data formats (text, RSS, Esri feature JSON, and generic JSON) and data communication channels (system file, HTTP, TCP, UDP, WebSocket, and Esri feature service). To set up the input connector, we need to define a GeoEvent Definition, which describes the structure of the incoming event data (Figure 5a). Figure 5a shows the GeoEvent Definition in GeoJSON format and Figure 5b shows a REST endpoint as the data communication channel. Thus, the output of the YOLO3 model for each camera is converted to GeoJSON using the camera's location information.

Figure 5. GeoEvent Definition on the Left Side and Input Connector on the Right Side
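As an example of that conversion step, the sketch below builds a GeoJSON feature from a camera's location and the YOLO3 counts. The field names are illustrative, not the exact GeoEvent Definition we used in production.

```python
import datetime

def to_geojson(camera, counts):
    """Build a GeoJSON feature matching a hypothetical GeoEvent Definition."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [camera["lon"], camera["lat"]],  # camera location
        },
        "properties": {
            "camera_id": camera["id"],
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "car_count": counts.get("car", 0),
            "bus_count": counts.get("bus", 0),
            "truck_count": counts.get("truck", 0),
            "bike_count": counts.get("bike", 0),
            "person_count": counts.get("person", 0),
        },
    }
```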

The processors are configurable elements of a GeoEvent Service that perform specific actions on event data, such as identification or enrichment, as the event data is routed from inputs to outputs. Since we are not processing the data before writing it into a feature class, we did not use processors in our GeoEvent Services.

The output connectors are responsible for converting the GeoEvents back into a data stream and sending the events through a selected data communication channel. We set up two GeoEvent Services: 1) a real-time GeoEvent Service (called Update-Counts in Figure 6a), which is responsible for digesting only the current, live feed data and visualizing it on the dashboard, and 2) a historical GeoEvent Service (called Sum-Counts in Figure 6b), which is responsible for storing historical data in a feature class or big data store for further analysis such as anomaly detection. The difference between the two services is that the real-time service keeps only 111 records, which are updated with the current live feed data, while the historical service keeps appending new records. Let's do some simple math to figure out the number of records per day: if the historical GeoEvent Service consumes event data from 111 cameras every second, our system stores ~9.6 million records per day (111 cameras × 24 hours × 60 minutes × 60 seconds).

Figure 6. Real-Time GeoEvent Service Output on the Left Side and Historical GeoEvent Service Output on the Right Side

5. Monitor Live Traffic Flow using the Dashboard

To automate real-time traffic flow monitoring, we provided a dashboard [11] for the department of transportation. The dashboard shows the location of the 111 live feed cameras in D.C. as well as the counts of cars, buses, trucks, bikes, and people at each intersection. The dashboard is connected to the feature class that is populated by the real-time GeoEvent Service. Users can visually see which parts of Washington D.C. are more crowded with cars versus people and vice versa. Users can also see the live feed image associated with each camera. Figure 7 illustrates the interface of the dashboard. The dashboard also updates the statistics on the left panel when the user zooms and pans across the study area.

Figure 7. The Top Image Shows the Dashboard Interface for the Entire Washington D.C.; Each Red Dot Refers to One Camera and the Numbers on the Left Side Show the Number of Objects in the Entire City. The Lower Image Shows the Dashboard Interface for a Zoomed Area in D.C.; Zooming Further Shows the Count of Objects in the Four Corners of Each Camera as well as the Current Image Associated with the Camera.

6. Anomaly Detection using Historical Data

The department of transportation was also interested in knowing how the volume of the desired objects (mainly cars, buses, trucks, bikes, and people) changes across intersections and over time. To solve this problem, we leveraged the output of the historical GeoEvent Service after one week to calculate the volume of each object for each camera per minute per day. Let's do some simple math to figure out the number of combinations: there are ~1.1 million possible combinations (111 cameras × 7 days × 24 hours × 60 minutes). You can think of this as a lookup table for detecting anomalies. For any new count per object, we compare the result with the historical count; if it is more than 30% higher than the historical count, we call it an anomaly and visualize it on the map. We also write the anomalies to a separate feature class. Figure 8 shows an image with pedestrian and car anomalies at one of the intersections.

Figure 8. Anomalies of Pedestrians and Cars at One of the Intersections
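A minimal sketch of this lookup-table check is shown below, assuming the weekly baseline has been exported to a table with one row per camera, weekday, minute of day, and object class. The column names and file path are illustrative.

```python
import pandas as pd

# One row per camera, weekday, hour, minute, and object class, holding the count
# observed during the baseline week -- column names here are illustrative.
historical = pd.read_csv("historical_counts.csv").set_index(
    ["camera_id", "weekday", "hour", "minute", "object_class"]
)["count"]

def is_anomaly(camera_id, timestamp, object_class, new_count, threshold=0.30):
    """Flag a count that is more than 30% above the historical count for the
    same camera, weekday, and minute of day."""
    key = (camera_id, timestamp.weekday(), timestamp.hour, timestamp.minute, object_class)
    if key not in historical.index:
        return False  # no baseline for this slot yet
    return new_count > historical.loc[key] * (1 + threshold)
```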

DDOT was also interested in learning more about pedestrian behavior across intersections. They mainly wanted to know at which intersections pedestrians cross from one side of the street to the other without using the crosswalk. There are many ways to approach this problem. The typical way is to detect the crosswalks in a given image and identify pedestrians outside the crosswalks as anomalies.

I took a different approach here. I ran YOLO3 for ~5 hours at one of the intersections as an example and kept the location of pedestrians (row and column in the image) in each time frame. Since YOLO3 provides the four corner coordinates of the bounding box for a given object (top right, top left, lower right, and lower left), I averaged the row and column coordinates of the two lower corners (lower right and lower left) to convert the bounding box to a point. This is because the lower coordinates are closer to the ground and represent the crosswalks better. Each red point represents the location of a pedestrian across time (Figure 9). Such data can be used to reveal the pattern of how pedestrians walk across intersections. Since most pedestrians use the crosswalk to cross the street, there is a high density of points along the crosswalks (Figure 9). In contrast, there is a low density of points elsewhere on the street (Figure 9).

Figure 9. Raw Image on the Left Side and Pedestrian Locations across Time on the Right Side
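The bounding-box-to-point conversion can be sketched as follows. The corner ordering in the box tuple is an assumption and may differ from the detector's actual output format.

```python
def bbox_to_ground_point(box):
    """Collapse a detection box to a single 'ground contact' point by averaging
    the two lower corners (lower-left and lower-right), as described above.
    `box` is assumed to be (row_min, col_min, row_max, col_max) in image
    coordinates; the detector's actual ordering may differ."""
    row_min, col_min, row_max, col_max = box
    ground_row = row_max                      # both lower corners share the bottom row
    ground_col = (col_min + col_max) / 2.0    # midpoint between lower-left and lower-right
    return ground_row, ground_col
```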

To separate high- and low-density points, I used DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a non-parametric clustering algorithm. DBSCAN finds clusters of high-density point features within surrounding noise based on their spatial distribution [12]. DBSCAN also marks as outliers points that lie alone in low-density regions, whose nearest neighbors are too far away. Figure 10 shows the outcome of DBSCAN, where it separated the low-density points, or anomalies, from the high-density points.

Figure 10. Example of a Pedestrian Who Walked from One Side of the Street to the Other Without Using the Crosswalk
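For illustration, here is a minimal scikit-learn equivalent of that clustering step; in practice we used the ArcGIS Density-based Clustering tool [12], and the eps and min_samples values shown here are placeholders that would need tuning to the image resolution.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# points: N x 2 array of (row, col) pedestrian ground points collected over ~5 hours
points = np.load("pedestrian_points.npy")

# eps and min_samples are illustrative values, not the tuned production settings.
labels = DBSCAN(eps=15, min_samples=10).fit_predict(points)

crosswalk_points = points[labels != -1]   # dense clusters: pedestrians on crosswalks
anomaly_points = points[labels == -1]     # noise: pedestrians crossing off-crosswalk
```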

7. Conclusion and Future Works

In this article, I walked you through the end-to-end workflow that the GeoAI team developed on the cloud to monitor traffic flow. Our workflow is able to 1) access the live feed streaming technology, 2) detect cars, buses, trucks, bikes, and people from the live feed streams using YOLO3 on AWS, 3) send the YOLO3 output to GeoEvent on AWS to visualize the traffic flow on the dashboard and store it in the big data store for further analysis, and 4) perform anomaly detection on object volumes as well as identify pedestrians at unsafe locations. GIS played a key role here in mapping the location of the cameras and visualizing them on the dashboard. Our future work focuses on a diverse range of topics such as tracking objects, calculating speed, calculating the volume of objects per lane, etc.

8. Acknowledgment and References

I want to thank Daniel Wilson for making the workflow 10 times faster on AWS and setting up the S3 bucket to save images, Joel McCune who set up the dashboard on AWS, RJ Sunderman who taught me about ArcGIS GeoEvent Server, Alberto Nieto who initially started this project with DDOT and whose work we expanded by adding the YOLO3 model, ArcGIS GeoEvent Server, and AWS to make it a real-time workflow on the cloud, and finally Mark Carlson who set up ArcGIS GeoEvent Server and the AWS Deep Learning AMIs. We also replicated the entire workflow on Azure. Please let me know if you have a question or think we can collaborate on a similar project.

[1] https://developers.arcgis.com/python
[2] https://aws.amazon.com/machine-learning/amis
[3] http://www.arcgis.com/index.html
[4] http://www.trafficland.com
[5] https://github.com/tzutalin/labelImg
[6] https://arxiv.org/abs/1506.02640
https://arxiv.org/abs/1612.08242
https://arxiv.org/abs/1804.02767
https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html#yolo-you-only-look-once
https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
[7] https://github.com/qqwweee/keras-yolo3
[8] https://enterprise.arcgis.com/en/geoevent/latest/get-started/a-quick-tour-of-geoevent-server.htm
[9] https://enterprise.arcgis.com/en/geoevent/latest/administer/managing-big-data-stores.htm
[10] http://desktop.arcgis.com/en/arcmap/10.3/analyze/modelbuilder/what-is-modelbuilder.htm
[11] https://www.esri.com/en-us/arcgis/products/operations-dashboard/overview
[12] https://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/densitybasedclustering.htm
