Running Object Detection at Scale and Structuring Data

Samuel Brice
Published in Analytics Vidhya · 8 min read · Nov 21, 2020

Demystifying Clearview AI Blog Series (Part 4)

Table of Contents

Previous: Choosing the Right Model for Object Detection

Next: Tracking Vehicles with Deep Learning

Running detectors on EC2 compute-intensive C5.XLarge vs. GPU-based P2.XLarge.

Scaling an Object Detection Pipeline in the Cloud

Similar to how we scaled the data streaming pipeline using AWS EC2, we can also use EC2s to scale object detection. Given that this was my first in-depth experience with AWS EC2s, I made some critical mistakes worth sharing.

Free Tier EC2 t2.micro instances are great for running simple services like web servers but practically useless for anything even moderately compute-intensive. Whereas using EC2 to pull in data was almost "free," once the data was within AWS, it would cost "something" to process. There's an entire field of research devoted to the art and science of optimizing AWS EC2 costs and performance based on spot and on-demand instance rates.

For CCTView, the primary means of limiting costs boiled down to reducing the number of cameras of interest from all 250 in Manhattan to five specific cameras along FDR Drive. We’ll discuss the reasons behind those specific FDR Drive cameras in Part 5.

Object detection pipeline.

Structuring and Naming Data for Processing

Over 24 hours, the data streaming pipeline collected a little over 60k JPEG frames per camera, totaling more than 281 GiB of unprocessed video and images. Before running any detection pipeline on our data, we first need to do some simple preprocessing of the raw data.

NYCDOT Traffic Cam Videos — FDR @ Grand St to FDR @ E 53 St, Hourly

One of the very first steps in working with such a large dataset is removing duplicates. The simplest way to eliminate duplicate images is to generate a hash for each image with shasum, then remove any files whose hashes have already been seen.
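For illustration, here's a minimal Python sketch of that deduplication step using hashlib (SHA-1, the same digest the shasum CLI prints by default); the folder path in the example is a placeholder.

```python
import hashlib
from pathlib import Path

def dedupe_frames(frame_dir: str) -> int:
    """Remove JPEG frames whose contents hash to an already-seen digest."""
    seen = set()
    removed = 0
    for path in sorted(Path(frame_dir).rglob("*.jpeg")):
        digest = hashlib.sha1(path.read_bytes()).hexdigest()  # same digest shasum prints by default
        if digest in seen:
            path.unlink()  # duplicate frame, delete it
            removed += 1
        else:
            seen.add(digest)
    return removed

# e.g. dedupe_frames("cctview-data/2020/06/01")  # hypothetical day folder
```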

File format and naming conventions are also essential when working with such a large dataset, especially given that the data will be stored and indexed on the filesystem. With easy programmability in mind, files must be stored and named to include as much contextual information as possible.

The folder structure and file naming scheme evolved considerably over time. Too flat a folder structure would put too many files in each leaf directory, making the data harder to scale and distribute. A simple nested tree proved appropriate for the CCTV data, nesting on year, month, day, hour, minute, and second using the timestamp-based folder format YYYY/MM/DD/HH/mm/ss/{FileName}.
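As a small sketch of that layout, a helper along these lines maps a capture timestamp to its folder in the tree; the data root is an assumption.

```python
from datetime import datetime
from pathlib import Path

def frame_dir(root: str, captured_at: datetime) -> Path:
    """Map a capture timestamp to the nested YYYY/MM/DD/HH/mm/ss folder."""
    return Path(root) / captured_at.strftime("%Y/%m/%d/%H/%M/%S")

# frame_dir("cctview-data", datetime(2020, 6, 1, 14, 30, 5))
# -> PosixPath('cctview-data/2020/06/01/14/30/05')
```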

File naming needs to pack in as much contextual information as possible, which, in addition to the timestamp, includes the Camera ID and CCTV Number. The reason is that if a file gets misplaced or an entire branch of the tree is flattened, file names must remain unique so they don't conflict. Programmatically decomposing the contextual information from the filename must also be simple, using standard regex or basic delimiter-based splitting.

It wasn't evident at the time, but orientation is another piece of information that needs to be attached along with the Camera ID and CCTV Number. The reason is, again, easier programmability. For example, if a vehicle is northbound, it makes sense to focus on the cameras facing south. Having that information in the camera's composite ID removes the need to look it up from an external source.

With all the above in mind, the file naming convention for the CCTView data evolved into ABC-XYZ-O — YYYY-MM-DD — HH.mm.ss.jpeg — where ABC is the Camera ID, XYZ is the CCTV Number, O is the orientation, and the remainder is the frame timestamp.
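Here's a minimal sketch of composing and parsing that composite name; the exact delimiter characters, the example values, and the assumption that the IDs themselves contain no dashes are mine, not taken from the original pipeline.

```python
from datetime import datetime

def frame_name(camera_id: str, cctv_number: str, orientation: str,
               captured_at: datetime) -> str:
    """Compose CameraID-CCTVNumber-Orientation--YYYY-MM-DD--HH.mm.ss.jpeg."""
    return (f"{camera_id}-{cctv_number}-{orientation}--"
            f"{captured_at:%Y-%m-%d}--{captured_at:%H.%M.%S}.jpeg")

def parse_frame_name(name: str) -> dict:
    """Recover the Camera ID, CCTV Number, orientation, and timestamp from a frame name."""
    ident, date_part, time_part = name.removesuffix(".jpeg").split("--")
    camera_id, cctv_number, orientation = ident.split("-")  # assumes IDs contain no dashes
    return {
        "camera_id": camera_id,
        "cctv_number": cctv_number,
        "orientation": orientation,
        "captured_at": datetime.strptime(f"{date_part} {time_part}",
                                         "%Y-%m-%d %H.%M.%S"),
    }
```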

With our data properly structured and smartly organized, distributing and scaling on the cloud becomes a matter of time versus money. Faster data processing boils down to running more machines concurrently.

Leveraging Amazon Machine Images

Amazon Machine Images (AMIs) are an EC2 feature for launching multiple instances with the same configuration. For relatively small datasets, you can load everything onto a base instance, then create a candidate AMI from a snapshot of that properly configured instance.

Benefits of this workflow include savings in network data transfer and the ease of replicating data and instance functionality from one AWS region to another. The latter proved incredibly vital as it became clear that different regions have different CPU and GPU availability based on underlying allocations and variable utilization.

Deep Learning AMI

The most effective way I found to leverage EC2 AMIs was simply to spin up an instance from Amazon's provided Deep Learning AMI, upload the subject dataset onto EBS, configure a conda environment, then verify that the detection pipeline runs as expected.

Once a Deep Learning AMI based instance is in the desired state, you can cut a "custom" AMI from it, composed of the base Deep Learning AMI plus any data uploaded to EBS and any environment configuration made on the snapshotted instance.
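If you prefer to script that step, boto3's create_image call does the snapshot-and-register in one shot; the instance ID, AMI name, and region below are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Snapshot the configured Deep Learning AMI instance into a reusable custom AMI.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # placeholder instance ID
    Name="cctview-detector-ami",        # hypothetical AMI name
    Description="Deep Learning AMI + CCTView data on EBS + conda environment",
)
print(image["ImageId"])  # launch worker instances from this ImageId
```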

CPUs vs. GPUs

This section is mostly an aside, but so that you may learn from my experience…it turns out that the elasticity of GPU instances is very low.

Using GPUs on AWS EC2 isn't merely a matter of selecting a GPU instance at setup/creation time. To use GPUs within the desired region, you must first request an allocation, literally from a person (the squishy type). The request process surprised me and took about 2–3 days, something I hadn't built into my timeline. In the meantime, the best available option was to run the detector pipeline on CPU instances.

Playing around with different instance types will give you a general idea of how best to parallelize a detection pipeline, utilizing the chosen instance as fully as possible while balancing costs. I first experimented with some General Purpose instances, then quickly moved on to Compute Optimized instances. I eventually settled on the c5.xlarge, with 4 vCPUs, 8 GiB of memory, Elastic Block Store (EBS) only storage, network bandwidth of up to 10 Gbps, and EBS bandwidth of up to 4,750 Mbps. The selection process wasn't very scientific; I stopped at the first instance that demonstrated good FPS performance while utilizing as much CPU and memory as possible.

Reusing the data streaming pipeline utilities from earlier, I implemented a Node.js CLI script that forked one independent subprocess per hour of CCTV camera frame captures. The parent script/process was identical to the one used in the data pipeline. The detection subprocess was modified to spawn a Python subprocess running the ImageAI detector script, using IPC to monitor progress. Running five detector subprocesses concurrently on the c5.xlarge yielded performance similar to my laptop's, with very high CPU and memory utilization.
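The actual driver was a Node.js CLI, but the fan-out idea looks roughly like the Python sketch below; the detector script name and folder layout are hypothetical, and the worker count of five comes from the description above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DETECTOR_SCRIPT = "run_detector.py"           # hypothetical ImageAI detector script
CAMERA_DAY = Path("cctview-data/2020/06/01")  # hypothetical camera/day folder

def detect_hour(hour_dir: Path) -> int:
    """Run the detector over one hour of frames in its own subprocess."""
    return subprocess.run(["python", DETECTOR_SCRIPT, str(hour_dir)]).returncode

if __name__ == "__main__":
    hour_dirs = sorted(p for p in CAMERA_DAY.iterdir() if p.is_dir())
    # Five concurrent detector subprocesses roughly saturated the c5.xlarge.
    with ThreadPoolExecutor(max_workers=5) as pool:
        exit_codes = list(pool.map(detect_hour, hour_dirs))
```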

Running detector subprocesses.
EC2 C5.XLarge Utilization

The AWS CloudWatch dashboard is an excellent tool for tracking utilization on multiple instances over time.

CloudWatch Monitoring Details

By the time GPUs were eventually allocated to my account, most of my detector jobs were complete. To compare CPUs and GPUs, I used a p2.xlarge instance to finish processing a batch partially completed by a c5.xlarge instance. The Accelerated Computing p2.xlarge instance comes with 1 GPU, 4 vCPUs, 61 GiB of memory, and 12 GiB of GPU memory. As expected, the GPU instance was faster, roughly twice as fast, taking a little under one second per frame compared to over two seconds on the CPU instance. That was without any optimizations, so there are likely ways to improve performance considerably.

Restructuring Object Detection Output

A powerful feature of the ImageAI library is Object Extraction. Based on a feature flag, ImageAI will extract each detected object into a separate image file. If a frame contains six cars, running the detection model on that frame will output (1) a copy of the original image with a bounding box drawn around each detection, (2) a JSON list detailing the detected objects, and (3) six separate image files, one for each detected car. With object extraction we thus end up with eight new files from the single original camera frame, the exact count depending on the number of objects detected.
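As a rough sketch of that flow using ImageAI's documented API: the model type, model path, and file names below are placeholders rather than the exact ones used for CCTView.

```python
from imageai.Detection import ObjectDetection

detector = ObjectDetection()
detector.setModelTypeAsYOLOv3()            # model choice shown for illustration; see Part 3
detector.setModelPath("models/yolo.h5")    # placeholder model path
detector.loadModel()

# With extract_detected_objects=True, ImageAI returns the detection metadata
# plus the paths of the per-object crops it writes to disk.
detections, object_paths = detector.detectObjectsFromImage(
    input_image="frames/ABC-XYZ-S--2020-06-01--14.30.05.jpeg",       # hypothetical frame
    output_image_path="detections/ABC-XYZ-S--2020-06-01--14.30.05.jpeg",
    extract_detected_objects=True,
    minimum_percentage_probability=40,
)

for det, path in zip(detections, object_paths):
    print(det["name"], det["percentage_probability"], det["box_points"], path)
```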

The folder above does not reflect the original structure output by ImageAI, which was very bare; it reflects the eventual scheme designed to best organize the data. The file name now includes a "category" (e.g., original, detections). The detection metadata JSON is saved as a map containing a summary of all detected objects along with detailed information about each detection. Lastly, the extracted object image names were modified to serve as globally unique IDs.

As a final optimization for the demo implementation, the individually extracted object images have been renamed to expose all available metadata, including percentage_probability and bounding_boxes. This naming convention further reduces the need for additional data IO when displaying results to the user. The file name includes the same level of detail as the library metadata object.

One of the highlights from the above structure is the ease of rolling up the detection details metadata (i.e., in the JSON file). Specifically, we’re able to roll up all “time second level” detection details JSONs into a single “time minute level” detection details JSON. This metadata rollup makes it much easier to get all contextual information for a camera or block of time upfront without running multiple data queries. The utility of this will become more apparent as we dive into designing and developing the user-facing application.
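A minimal sketch of that rollup, assuming one detection-details JSON per second-level folder; the detection-details.json file name is hypothetical.

```python
import json
from pathlib import Path

def rollup_minute(minute_dir: Path) -> dict:
    """Merge every second-level detection-details JSON under a minute folder."""
    merged = {}
    for details in sorted(minute_dir.glob("*/detection-details.json")):  # hypothetical file name
        merged[details.parent.name] = json.loads(details.read_text())
    (minute_dir / "detection-details.json").write_text(json.dumps(merged, indent=2))
    return merged

# e.g. rollup_minute(Path("cctview-data/2020/06/01/14/30"))
```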

Below is the result of running object detection on the five selected FDR Drive cameras.

Having processed the target cameras using object detection and extracted all vehicles of interest from each frame, we’re now ready to implement an image recognition model.
