Realtime data processing with Apache Beam and Google Dataflow at Dailymotion

Pierre Vanacker
Oct 4, 2019

Dailymotion is a leading video platform that allows users to discover and watch videos based on their tastes and on trending subjects. This means our data team must be able to determine in realtime which videos are currently being viewed, and how, to get a precise picture of what our users actually want to see. Here is how we collect, process and redistribute billions of events across our systems in realtime using the Apache Beam framework and the Google Dataflow platform.

A significant part of Dailymotion’s traffic comes not only from videos embedded on external websites, but also from our constantly improving website. This means we need to:

In order to fulfill both of these needs, we collect, process and redistribute billions of events across our systems in realtime — on average more than 50 000 per second — without managing any processing infrastructure, using the Apache Beam framework and the Google Dataflow platform.

What’s Apache Beam?

Apache Beam is a unified programming model for defining data pipelines, originally developed by Google and now open-source and supported by the Apache Foundation. It allows us to develop our batch and streaming data processing jobs with the same codebase and to plug them into our different data storage and messaging systems through Google Cloud connectors.

We use Apache Beam in combination with the Google Dataflow platform, which provides us with a fully-managed environment to deploy our pipelines across multiple regions worldwide.

Build and Deployment

Let’s start with an overview of the build process for our pipelines developed in Java (which we mainly use for our streaming pipelines).

Beam pipelines can also be developed in Python and many people in the Dailymotion Data Team actually use it (mainly for batch processing).

We extensively use Docker to build and deploy our pipelines as the containers can carry all the required dependencies.

Build process

Beam pipelines are built by our Continuous Integration Platform (Jenkins) inside a Docker container to create an “überjar” which contains all Beam and runner (Dataflow) dependencies. The überjar is then pushed to an artifact repository for later use.

Realtime jobs build

Deployment process

A Beam pipeline is deployed through a Docker image with a ready-to-use entry point, which drains a potentially already-running instance of the job (i.e. stops it from ingesting new data, so it can be cancelled without data loss) and starts a new one with a set of pre-configured parameters.

Streaming jobs are redeployed through CI when pushing to an “environment” branch: a push to the stage branch redeploys a staging pipeline, a push to the prod branch redeploys a production pipeline.

Basically, deploying to production means deploying the job in a Google Cloud production project, reading from production sources and writing to production sinks. Multiple pipelines can be redeployed at the same time.

A few minutes after pushing the code, the job can be deployed in Google Cloud Platform and is ready to process data.

Realtime Jobs Deployment

Another way to deploy jobs is with the Cloud Dataflow templates system, which makes Dataflow jobs runnable from Google Cloud Storage with the gcloud CLI: `gcloud beta dataflow jobs run <job>`

We still chose to rely on Docker containers for the following reasons:

Beam jobs are not only deployed through CI but also, in some cases, scheduled and therefore launched as batch jobs.

Lambda architecture

We sometimes use our realtime processing jobs in combination with batch jobs, launched through our scheduler: Airflow.

The processing part of the batch job is actually the same Beam / Dataflow job as the realtime one. The only things that differ are the source (input) and the sink (output), both of which are parameterized at launch.
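As a rough sketch of that parameterization (illustrative only; a real Beam job would typically go through Beam’s `PipelineOptions` mechanism, and the argument names here are assumptions), source and sink can simply be read from `--key=value` launch arguments:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: parse "--key=value" launch arguments so the same
// job can be pointed at different sources and sinks in batch vs. streaming.
public class LaunchParams {
    /** Parses "--key=value" arguments into a map; ignores anything else. */
    public static Map<String, String> parse(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                params.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return params;
    }
}
```

The Airflow task and the CI deployment script would then pass different `--input` / `--output` values while launching the exact same artifact.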

Typically, an Airflow batch job consists of 3 steps:

Full representation of the Build, Realtime jobs and Batch launch process:

Processing and Redistribution of the Data

We extensively use the Google Cloud Platform to collect, store and analyze our data. Naturally, we then use the connectors provided by Beam’s Google Cloud runner to plug our jobs into every component of the platform.

Beam, through the Google Dataflow runner, provides connectors for BigQuery, Pub/Sub, Cloud Storage and more.

As the code is easily extensible, it’s also possible to connect it to any of our systems, such as key-value databases, or to make REST calls to APIs.

Realtime processing

Most of the data we process in realtime passes through Pub/Sub, Google Cloud’s messaging system. We receive a tremendous number of events through our own infrastructure and dispatch them to other jobs or into our databases for further transformations. Typical final sinks are Google BigQuery and key-value datastores, to be either processed in batch or queried through services.

What does it technically mean to process data in realtime with Beam? Basically: read from an unbounded source (such as a messaging system like Pub/Sub) and apply a series of transforms/enrichments (e.g. aggregations, type checks, enrichment of data through API calls…)

final Pipeline p = Pipeline.create(options);

PCollection<String> rawMessages = p.apply("Read topic",
    PubsubIO.readStrings().fromTopic("projects/yourproject/topics/yourtopic"));

// Apply some transforms...
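For instance, a transform applied after the read above could count views per video over fixed windows. Here is a plain-Java sketch of that logic (illustrative only; in a Beam pipeline this would be expressed with `Window.into(FixedWindows.of(...))` followed by `Count.perKey()`):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a tumbling-window view count, written in plain
// Java to show the logic outside of the Beam API.
public class WindowedViewCount {
    /**
     * Counts events per (videoId, window). Each event is {videoId, timestamp
     * in seconds}; the result maps "videoId@windowStart" to a count.
     */
    public static Map<String, Long> count(List<String[]> events, long windowSeconds) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] e : events) {
            long ts = Long.parseLong(e[1]);
            long windowStart = (ts / windowSeconds) * windowSeconds;
            counts.merge(e[0] + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }
}
```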

Except for some aggregation functions, transformation code should be roughly the same whether the job is launched in streaming or in batch. Only the IO part changes.

Example of an enrichment pipeline:

This is a simplified representation of an enrichment pipeline, based on Dailymotion’s API to enrich video metadata.
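The enrichment step can be thought of as a pure function that merges video metadata into each raw event. The sketch below is a hypothetical illustration: the field names and the in-memory metadata cache are assumptions, not Dailymotion’s real schema or API client.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of metadata enrichment: merge cached video
// metadata (as could be fetched from an API) into each raw event.
public class VideoEnricher {
    private final Map<String, Map<String, String>> metadataByVideo;

    public VideoEnricher(Map<String, Map<String, String>> metadataByVideo) {
        this.metadataByVideo = metadataByVideo; // e.g. a cache of API responses
    }

    /** Returns a copy of the event enriched with the video's metadata. */
    public Map<String, String> enrich(Map<String, String> event) {
        Map<String, String> out = new HashMap<>(event);
        Map<String, String> meta = metadataByVideo.get(event.get("video_id"));
        if (meta != null) {
            out.putAll(meta); // e.g. title, channel, duration
        }
        return out;
    }
}
```

In the actual pipeline, such a function would be wrapped in a Beam `DoFn` so it can run in parallel across workers.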

Architecture of an events processing pipeline

Here is an architectural representation of our events processing pipeline.

It outputs enriched events in BigQuery and forwards them to the Google Pub/Sub messaging system for more specific uses:

Within a few seconds of an event actually happening (like a user clicking on a button), it’s available in our data warehouse, ready for analysis, and integrated into our recommendation systems.

On the way to stability and sanity

As our jobs are deployed on Google Cloud Platform, we benefit from several features that help us cope with the tremendous amount of data processed every minute:

Dataflow launch parameters

--numWorkers=5
--workerMachineType=n1-standard-8

Dataflow launch parameters

--autoscalingAlgorithm=THROUGHPUT_BASED
--maxNumWorkers=50

This allows us to report in realtime the number and kinds of errors raised in a job, to monitor and dashboard them, and to automatically trigger alerts when certain thresholds are reached.
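The threshold-based alerting idea can be sketched in a few lines. This is an illustration of the logic only (the names are invented; in practice such alerting is usually delegated to monitoring tooling rather than hand-rolled in the job):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: count errors per kind and signal exactly once
// when a kind crosses a configured threshold.
public class ErrorAlerts {
    private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
    private final long threshold;

    public ErrorAlerts(long threshold) {
        this.threshold = threshold;
    }

    /** Records one error; returns true exactly when the threshold is hit. */
    public boolean record(String kind) {
        return counts.computeIfAbsent(kind, k -> new AtomicLong())
                     .incrementAndGet() == threshold;
    }
}
```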

By using the Apache Beam framework and Google Dataflow, we managed to easily deploy processing jobs that handle billions of events every hour. Our jobs can absorb large variations in traffic while giving us precise insights into what’s happening. There is still work to do to remain cost-effective as more and more events are processed and our analysis needs keep growing, while keeping our platform as resilient as possible.

