Collecting Data for Deep Learning Development

Samuel Brice
Published in Analytics Vidhya
5 min read · Nov 15, 2020

Demystifying Clearview AI Blog Series (Part 2)

Table of Contents

Previous: A Short History of Facial Recognition

Next: Choosing the Right Model for Object Detection

The Deep Learning Development Lifecycle

Deep Learning development can be illustrated as a cycle of steps, starting with data collection and ending with the deployment of an app or service to users. Once a Deep Learning model is deployed, monitoring its performance and how users interact with its functionality informs what new data should be collected, or what additional processing should be implemented, to rebuild and then redeploy a more performant model.

To demonstrate the Deep Learning Development Lifecycle, we'll walk through deploying a vehicle recognition app called CCTView, built on publicly available CCTV camera feeds from the New York City Department of Transportation. At each step of the process, I'll also discuss what things would look like from the perspective of developing Clearview's facial recognition app.

For this CCTView demonstration, we'll focus on the first three steps of the Deep Learning Development Lifecycle. Once our data is processed and a detection model has been built, we'll move into an application development lifecycle that involves designing, implementing, and deploying our Deep Learning-backed CCTView application.

Collecting Publicly Available Data

There are various means of collecting publicly available data from the web. In Clearview's case, they've admitted to scraping images from sites such as Facebook and Instagram. Conceptually, such a process is as simple as visiting the Instagram account of @instagram and recursively downloading images from each of its 352 million followers.

You can see how this process would work from your Chrome browser by using Chrome DevTools (on Mac, press ⌘ Cmd + ⌥ Option + J) to manually inspect the webpage's network traffic while filtering for images. To scale up to millions and billions of images, you'd need to hire a crew of engineers, each assigned the task of joining a desired social media network and then downloading such "public" information using automated programs.
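To make that concrete, below is a minimal Node.js (18+) sketch of the kind of loop such an automated program would run: fetch a public page, pull image URLs out of the HTML, and save each one to disk. The target URL is a placeholder, and real social networks render pages with JavaScript, require authentication, and rate-limit crawlers, so treat this as an illustration of the fetch-parse-download loop rather than a working scraper for any particular site.

```
// scrape-images.js -- naive sketch of an image scraper (Node 18+).
// The target URL is a placeholder; real social networks render pages
// with JavaScript and block unauthenticated crawlers.
const fs = require("node:fs/promises");

const PAGE_URL = "https://example.com/some-public-profile"; // placeholder

async function main() {
  const html = await (await fetch(PAGE_URL)).text();

  // Naively pull the src attribute of every <img> tag in the raw HTML.
  const imageUrls = [...html.matchAll(/<img[^>]+src="([^"]+)"/g)]
    .map((match) => match[1])
    .filter((src) => src.startsWith("http"));

  // Download each image and write it to disk.
  for (const [i, url] of imageUrls.entries()) {
    const bytes = Buffer.from(await (await fetch(url)).arrayBuffer());
    await fs.writeFile(`image-${i}.jpg`, bytes);
    console.log(`saved ${url} -> image-${i}.jpg`);
  }
}

main().catch(console.error);
```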

It’s important to note that some websites explicitly forbid such web scraping as part of their Terms of Service. For example, YouTube’s Terms of Service expressly prohibit collecting data that can be used to identify a person. YouTube, Google, Twitter, and Facebook have all sent cease-and-desist letters to Clearview, instructing them to delete all data collected from their websites. In February of 2020, a California resident and an Illinois resident also filed class-action lawsuits against Clearview for violations of the California Consumer Privacy Act (CCPA) and the Illinois Biometric Privacy Act (BIPA).

In our case, for CCTView, CCTV camera feeds are made publicly available by the New York City Department of Transportation. The site webcams.nyctmc.org provides real-time traffic imagery for all five NYC boroughs, with 747 cameras listed on the DOT site.

For CCTView, we’ll be focusing on the roughly 250 cameras in Lower Manhattan.
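Selecting that subset is a simple filter once you have the camera list in hand. The sketch below assumes a hypothetical cameras.json export with id, area, and url fields; the actual data behind webcams.nyctmc.org may be shaped differently.

```
// filter-cameras.js -- select the Lower Manhattan camera subset.
// Assumes a hypothetical cameras.json export with { id, area, url }
// entries; the real site's schema may differ.
const cameras = require("./cameras.json");

const lowerManhattan = cameras.filter((cam) => cam.area === "Lower Manhattan");

console.log(`${lowerManhattan.length} of ${cameras.length} cameras selected`);
module.exports = lowerManhattan;
```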

Scaling a Data Streaming Pipeline in the Cloud

Streaming 250 cameras from a single laptop is a non-starter. The best way to implement and scale such a pipeline is with cloud services; any provider, such as Azure or Google Cloud Platform, would do. If you sign up for an AWS account, you can use the 750 EC2 hours per month included in the 12-month Free Tier.

A few things are worth noting about the DOT webcams. The feeds are somewhat analog and must be pulled from the server, meaning each frame has to be requested individually rather than delivered as a continuous stream. The frames are also small (352x240) and low resolution (100x100), and the best a camera feed can deliver is one frame per second. Such a low framerate is unusual, but that's what we have to work with, so we'll just make the best of it.
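Pulling a feed therefore amounts to requesting a camera's latest JPEG once per second. A minimal polling loop might look like the sketch below; the feed URL is a placeholder, as each DOT camera serves its most recent frame at its own image endpoint.

```
// poll-camera.js -- fetch one camera's latest frame once per second
// (Node 18+). The default URL is a placeholder; a camera's real image
// endpoint can be passed as the first CLI argument.
const fs = require("node:fs/promises");

const CAMERA_URL = process.argv[2] || "https://example.org/cctv/camera-123.jpg";

async function pollOnce() {
  const res = await fetch(CAMERA_URL, { cache: "no-store" });
  if (!res.ok) return; // tolerate dropped frames; the feeds are flaky
  const frame = Buffer.from(await res.arrayBuffer());
  await fs.writeFile(`frames/${Date.now()}.jpg`, frame);
  // Heartbeat to the parent process when forked (see the supervisor
  // sketch further below).
  if (process.send) process.send({ type: "frame" });
}

async function main() {
  await fs.mkdir("frames", { recursive: true });
  // One frame per second is the best the feed serves, so a simple
  // interval timer is enough; there is no real video stream to open.
  setInterval(() => pollOnce().catch(console.error), 1000);
}

main();
```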

You can use your imagination to implement such a pipeline on your favorite cloud provider. For this demo, using AWS EC2, I implemented a Node.js CLI script that forked a subprocess for each camera and then used IPC channels to track and maintain each feed. The script proved very simple and worked well on an AWS EC2 t2.micro instance with one vCPU and 1 GiB of memory.
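In outline, the supervisor looked something like the sketch below; this is a simplified reconstruction of the approach rather than the exact script. The parent forks one worker per camera (each running a polling loop like the one above), listens for per-frame heartbeats over Node's built-in IPC channel, and kills stale workers so the exit handler respawns them.

```
// supervisor.js -- fork one polling worker per camera and watch
// heartbeats over Node's built-in IPC channel. A simplified sketch,
// not the exact production script; cameras.json is hypothetical.
const { fork } = require("node:child_process");
const cameras = require("./cameras.json");

const workers = new Map(); // camera id -> { proc, lastSeen }

function startWorker(camera) {
  const proc = fork("./poll-camera.js", [camera.url]);
  workers.set(camera.id, { proc, lastSeen: Date.now() });

  // Workers report over IPC after each saved frame.
  proc.on("message", (msg) => {
    if (msg.type === "frame") workers.get(camera.id).lastSeen = Date.now();
  });

  // If a worker dies (or is killed below), fork a replacement.
  proc.on("exit", () => startWorker(camera));
}

cameras.forEach(startWorker);

// Kill any worker that hasn't saved a frame recently; the exit
// handler above respawns it with a fresh connection.
setInterval(() => {
  for (const { proc, lastSeen } of workers.values()) {
    if (Date.now() - lastSeen > 30_000) proc.kill();
  }
}, 10_000);
```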

With 21 camera feeds per instance polling asynchronously, CPU utilization maxed out at 8%, and memory utilization stabilized at well below 1 GiB.

There are likely better ways to do this, but the above pipeline took a single day to implement and operated within the EC2 Free Tier constraints: covering 250 cameras at 21 feeds per instance takes roughly a dozen t2.micro instances, and a dozen instances running for 24 hours consume well under the 750-hour monthly allowance. Over 24 hours, the data streaming pipeline collected a little over 60k frames per camera, totaling more than 281 GiB of unprocessed video.

The basic pipeline stored data using Amazon Elastic Block Store (EBS), an easy-to-use block storage service designed for use with Amazon Elastic Compute Cloud (EC2). As revealed by Gizmodo, Clearview stored its data using the Amazon Simple Storage Service (S3), a high-performance object storage service that provides data access through a web service interface. It would have been possible to stream the CCTV video feeds directly into S3; however, EBS proved more straightforward to work with given the additional data experimentation and processing involved.
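For comparison, pushing frames straight to S3 from the polling workers would only have taken a few extra lines with the AWS SDK for JavaScript. Here's a sketch, with a hypothetical bucket name:

```
// upload-frame.js -- sketch of writing frames to S3 instead of EBS.
// Uses the AWS SDK for JavaScript (v2); the bucket name is hypothetical
// and credentials are assumed to come from the instance role.
const AWS = require("aws-sdk");
const s3 = new AWS.S3();

async function uploadFrame(cameraId, timestamp, jpegBuffer) {
  await s3
    .putObject({
      Bucket: "cctview-frames", // hypothetical bucket
      Key: `${cameraId}/${timestamp}.jpg`,
      Body: jpegBuffer,
      ContentType: "image/jpeg",
    })
    .promise();
}

module.exports = { uploadFrame };
```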

With our data now available for easy processing, the next step is choosing the right Deep Learning model based on the data available and the desired application use case.
