Building a Serverless End-to-End Solution for Brand Detection in Video

Some time ago, a client asked us to help with the technical implementation of a business venture. The idea was to detect brands and logos in TV sports broadcasts and videos in order to measure brand exposure during, say, a football match or a ski race. The client abandoned the project for commercial reasons before the MVP was completed. We decided to finish the MVP nevertheless and put part of it online, because we reckoned it was fun. Here is how we built a full end-to-end solution on the Google Cloud Platform (GCP).

The Requirements

The client wanted the solution to:

  • detect specific brands in video media (classification and localisation);
  • calculate brand exposure metrics (how often did a brand appear and how well was it visible on screen);
  • cover two use cases: 24/7 real-time analysis of TV broadcasts as well as on-demand analysis after the sports event (overnight processing of recorded videos);
  • handle HD quality video with up to 25 fps and multiple simultaneous TV broadcasts.

We had already developed a brand detection model in TensorFlow, based on the TensorFlow Object Detection API, that worked reasonably well, at least for the purposes of a prototype. So all that was left to do was to embed it into an end-to-end architecture that acquires video streams and files, preprocesses them, detects the brands in every single frame, stores the results in a data warehouse, and calculates and visualises the metrics in a dashboard; everything in almost real time and irrespective of the amount of data we throw at it (take ten simultaneous TV broadcasts at 25 fps and you’re about to run brand detection on 250 HD images per second). Piece of cake, right? All it takes is a bunch of sysops people to set up, operate, scale up and down, and maintain the necessary infrastructure. Oh, wait. We don’t have such people at Quantworks. We don’t even have an IT department. So…?

Of course, you’ll have guessed where this is going (it’s in the title of this article, after all). As a small startup with quite some data engineering capabilities but no sysops personnel, we’ve become used (more precisely: addicted) to the blessings of serverless on the GCP. Write your code, test it, deploy it to a service. Plug (the services together) and play. Bam! No capacity planning, no manual scaling of infrastructure, no configuration or maintenance of servers. (What are these “servers” anyway?)

The Building Blocks

So, let’s recap what had to be done: receive a video stream or a video file; split it into the single frames; run brand detection on each single frame using our TensorFlow model; dump the results into a data warehouse that allows online analytical processing (OLAP) on a large amount of data; visualise the metrics in a web based dashboard.

The architecture on GCP for the last three steps is pretty straightforward: deploy the TensorFlow model on Cloud ML Engine, allowing us to do inference through API requests; store the inference results in BigQuery so that we can calculate the exposure metrics on the fly with SQL queries; and use App Engine to run the web frontend. As far as the ingestion is concerned, a different pipeline is needed for each of the two use cases: a Batch one and a strEAMing one — a case for Apache BEAM.
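To make the “metrics on the fly with SQL queries” part concrete, here is a hedged sketch: assuming the detections land in a BigQuery table with one row per detected brand per frame (the table and column names below are illustrative, not our actual schema), an exposure metric such as screen time per brand reduces to a simple aggregation:

```python
# Hypothetical schema: detections(video_id, frame_ts, brand, score, box_area)
# At 25 fps, each distinct frame represents 1/25 s of screen time.
FPS = 25

def exposure_query(table: str, video_id: str, min_score: float = 0.5) -> str:
    """Build a BigQuery Standard SQL query for per-brand exposure metrics."""
    return f"""
        SELECT
          brand,
          COUNT(DISTINCT frame_ts) / {FPS} AS screen_time_seconds,
          AVG(box_area) AS avg_relative_size
        FROM `{table}`
        WHERE video_id = '{video_id}' AND score >= {min_score}
        GROUP BY brand
        ORDER BY screen_time_seconds DESC
    """

query = exposure_query("project.dataset.detections", "match_2018_07_15")
```

Because BigQuery scans columnar storage in parallel, a query like this stays fast even over millions of frame-level rows, which is exactly why no precomputed aggregates are needed.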

High-level overview of the building blocks

So let’s dig deeper into the setup of the two pipelines.

The Streaming Pipeline

The streaming pipeline has to cope with a constant inflow of media streams from different TV stations. We use OpenCV to read the video streams and split them into single frames (usually 25 frames per second per stream). The frames are saved in a Cloud Storage bucket. We tested this successfully on a Compute Engine instance. However, if we were to put this into production, we would probably deploy it as a service on App Engine for, you know, serverlessness reasons.
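A minimal Python sketch of this frame-splitting step, assuming OpenCV (`cv2`) and the `google-cloud-storage` client are installed; the naming scheme and function names are our illustration, not production code:

```python
def frame_object_name(stream_id: str, frame_index: int, fps: int = 25) -> str:
    """Derive a deterministic Cloud Storage object name that encodes the
    stream and the frame's timestamp (assuming a constant frame rate)."""
    seconds = frame_index / fps
    return f"{stream_id}/frame_{frame_index:08d}_{seconds:010.3f}s.jpg"

def split_stream(stream_url: str, stream_id: str, bucket_name: str) -> None:
    """Read a video stream with OpenCV and upload every frame to a bucket."""
    import cv2                          # assumed available on the instance
    from google.cloud import storage    # assumed available on the instance

    bucket = storage.Client().bucket(bucket_name)
    capture = cv2.VideoCapture(stream_url)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:                      # stream ended or dropped
            break
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            blob = bucket.blob(frame_object_name(stream_id, index))
            blob.upload_from_string(jpeg.tobytes(), content_type="image/jpeg")
        index += 1
    capture.release()
```

Encoding the timestamp into the object name keeps the downstream pipeline stateless: every frame carries all the context it needs.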

Images arriving in the Cloud Storage bucket trigger a Cloud Pub/Sub notification. The rest of the pipeline is built with the Apache Beam Java SDK and runs on Dataflow on the GCP. The pipeline subscribes to the Pub/Sub topic and is hence notified about every image saved in the bucket. It then loads the image and includes it in the API request to our model deployed on Cloud ML Engine. The response from Cloud ML Engine, containing the classes and bounding boxes of the detected brands, is then inserted into BigQuery.
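The actual pipeline is Java, but the per-frame glue logic is easy to sketch in Python. Cloud ML Engine online prediction accepts a JSON body with an `instances` list (binary inputs base64-encoded under a `b64` key), and the TensorFlow Object Detection API emits `detection_classes`, `detection_scores` and normalised `detection_boxes`. The BigQuery row schema below is illustrative:

```python
import base64

def predict_request(image_bytes: bytes) -> dict:
    """Body for a Cloud ML Engine online-prediction request, assuming a
    serving signature with an encoded-image input."""
    return {"instances": [{"b64": base64.b64encode(image_bytes).decode("ascii")}]}

def to_bigquery_rows(prediction: dict, frame_id: str, min_score: float = 0.5) -> list:
    """Flatten one model response into BigQuery rows, one per detected brand.

    Boxes follow the Object Detection API convention: normalised
    [ymin, xmin, ymax, xmax], so the box area is a fraction of the screen.
    """
    rows = []
    for cls, score, box in zip(prediction["detection_classes"],
                               prediction["detection_scores"],
                               prediction["detection_boxes"]):
        if score < min_score:
            continue
        ymin, xmin, ymax, xmax = box
        rows.append({
            "frame_id": frame_id,
            "brand_class": int(cls),
            "score": float(score),
            "box_area": (ymax - ymin) * (xmax - xmin),  # relative screen area
        })
    return rows
```

Storing one row per detection (rather than per frame) keeps the schema flat, which is what makes the later SQL aggregations trivial.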

Once deployed, the Dataflow pipeline runs forever, fed by the video frames dropped in the storage bucket, processing them in parallel and scaling compute instances automatically according to the data volume. Add another TV broadcast? Just make sure the frames land in the bucket; the rest, such as provisioning more resources, is taken care of by Dataflow.

Since we need neither sophisticated preprocessing of the images nor advanced Beam features like windowing or watermarks, we could also have implemented all of this with Cloud Functions. However, we wanted to keep open the option of doing more sophisticated preprocessing later, which is easier to accomplish in Java than in JavaScript (the Python runtime for Cloud Functions was not yet available at the time). Also, we could reuse most of the code for the batch pipeline.

The Batch Pipeline

The batch pipeline is triggered by the user uploading a video file via the frontend for analysis. After the video has been uploaded to a Cloud Storage bucket, a Dataflow batch job defined in a template is launched. It reads the video file and splits it into frames; the rest is the same as in the streaming pipeline (and by “the same” I mean exactly the same, as in: exactly the same code), i.e. call the Cloud ML Engine service and store the inference results in BigQuery. Since we thought it would be nice to give the user not only the metrics but also a glimpse behind the scenes of object detection, we added one more step to the batch pipeline that draws the bounding boxes and labels of the detected brands onto the frames and assembles everything back into one video file.
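The only non-obvious part of the drawing step is mapping the model's normalised box coordinates back to pixels. A sketch, with the drawing done via OpenCV (assumed available; function names are ours):

```python
def to_pixel_box(box, width, height):
    """Convert a normalised [ymin, xmin, ymax, xmax] box (TensorFlow Object
    Detection API convention) to integer pixel corners (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin * width), int(ymin * height),
            int(xmax * width), int(ymax * height))

def draw_detections(frame, detections, labels):
    """Draw labelled boxes onto one frame; `frame` is a BGR numpy array."""
    import cv2  # assumed available on the worker
    h, w = frame.shape[:2]
    for det in detections:
        left, top, right, bottom = to_pixel_box(det["box"], w, h)
        cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
        cv2.putText(frame, labels.get(det["class"], "?"), (left, max(top - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```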

The long processing time of videos is totally acceptable for the original use case (overnight processing) but somewhat unsatisfactory for the demo we have put online. Without going into too much detail, this is due to a pipeline property that limits the full parallelisation of the whole Dataflow pipeline. (For the Dataflow-savvy among you: we have to materialise the PCollection after a high fan-out ParDo step — one video file split into a lot of single frames — in order to prevent Dataflow’s fusion optimisation, which in this case is, well, not really an optimisation; see the Dataflow documentation for more details.)
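In Beam-flavoured pseudocode, the fix amounts to forcing a redistribution of elements right after the fan-out, so that Dataflow does not fuse the expensive prediction step onto the same workers; a reshuffle (a group-by-key under the hood) is the transform commonly recommended for this, and the step names here are simplified:

```
pipeline
    | ReadVideoFile(gcs_path)
    | ParDo(SplitIntoFrames())   // high fan-out: one file -> thousands of frames
    | Reshuffle()                // materialise the PCollection; breaks fusion so
                                 // the next step can spread across all workers
    | ParDo(PredictBrands())     // Cloud ML Engine call per frame
    | WriteToBigQuery(table)
```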

One could mitigate this by using Cloud Functions instead of Dataflow. However, this would come at the cost of losing the notion of the bounded PCollection, i.e. of a fixed-size data set that lets the pipeline know when it has finished processing all the frames (by contrast, Cloud Functions would execute the parallel steps autonomously, without a master process keeping track of completeness). And this boundedness comes in handy if one wants to show progress on the frontend or create the output video. We decided to add a preprocessing step that splits the video file into several parts before handing them over to Dataflow, in order to allow for more parallel processing. This is accomplished with a Node.js Cloud Function using FFmpeg. It mitigates the latency problem a little, but doesn’t eliminate it.
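Our splitting function is written in Node.js, but the FFmpeg invocation itself is language-agnostic: the segment muxer cuts a video into fixed-length chunks without re-encoding. A Python sketch (the FFmpeg flags are real options; paths and chunk length are illustrative):

```python
import subprocess

def split_command(input_path: str, output_pattern: str, chunk_seconds: int = 30) -> list:
    """FFmpeg invocation that cuts a video into fixed-length segments.

    `-c copy` avoids re-encoding (fast, lossless); `-f segment` writes one
    file per chunk, e.g. part_000.mp4, part_001.mp4, ...
    """
    return [
        "ffmpeg", "-i", input_path,
        "-c", "copy",
        "-f", "segment",
        "-segment_time", str(chunk_seconds),
        "-reset_timestamps", "1",
        output_pattern,
    ]

def split_video(input_path: str, chunk_seconds: int = 30) -> None:
    """Run the split; each resulting part becomes its own Dataflow input."""
    subprocess.run(split_command(input_path, "part_%03d.mp4", chunk_seconds),
                   check=True)
```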

Conclusion

So, there you have it: a completely serverless solution on the GCP that scales automatically with the inflowing amount of data. All it took us to extend the TensorFlow brand detection model into a full end-to-end application was a GCP architect with a bit of coding skill and an Angular-savvy intern for the frontend. Piece of cake indeed.

Complete architecture

We’ve put the batch processing-only part online as a sort of MVP. (It’s an MVP, okay? Expect bugs and a far from perfect UX.)


More to come! In another article, I’ll show you how we enhanced the solution with an administration site that facilitates the training and management of brand detection models and the management of data sets.