Data engineering tech at Unruly

Data engineering began at Unruly with a product experiment as with many of the products we build. As an ad-tech organisation, we wanted to demonstrate the commercial value of predicting if a user will complete watching a digital video ad. This provides value by meeting objectives (KPIs) of advertisers, saving money, providing the end user more relevant ads and helping Unruly to optimise ad serving.

Additionally, this experiment allowed the technical team to define a base architecture and technology stack for data pipelines and machine learning at Unruly. This post will share with you our technical architecture, taking a look at how it has evolved and the choices made. We assume a degree of familiarity with data engineering and machine learning technologies.

Some jargon

Events are domain events from any other systems. The user events are a subset of these events that capture user interaction with a digital ad. These are useful for some of our machine learning models (for example video complete prediction) as they represent user context and behaviour.

Transactions are a related set of events e.g. events for a single user interacting with an instance of a digital ad. Think of this as a ‘squashing’ of events into a single row.

Data scale

Anywhere from 0.5B to 1.5B events per day are received; this can be between 50 to 100GB of data (compressed, batched per hour) daily. Whilst this is not ‘big data’, our supply volumes are ever increasing and we have considered this factor in our technology choices.

A picture of things at the beginning

A picture of things now

This is a continuously evolving architecture and we expect it to change further as new and exciting challenges are faced — watch this space!

BigQuery for data warehousing

There are many comparisons of Amazon’s Redshift and Google’s BigQuery, and it is a close call as their features are very similar.

The team chose BigQuery due to its relative simplicity and low barrier to entry. We liked the fully managed database-as-a-service approach as it supports our need for a scalable, highly-performant data store without having to concern ourselves with running our own infrastructure. Even Redshift as a cloud service needs a certain degree of infrastructure configuration and optimisation that BigQuery does not require.

Events are partitioned by date — this keeps our costs low by restricting the partitions queried and query performance high by parallelising queries across partitions.

Dimensions as immutable snapshots

Dimensions imported as daily batches are stored as immutable snapshots into timestamped partitions. This is useful for reprocessing data with consistent, reproducible results.

Batch data processing

The team’s data pipelines are batch oriented rather than streaming— this is conceptually and operationally a simpler model. Right now, these are single daily batches. As we discover new business use cases or face challenges of growing latencies and larger data sets, we are prepared to consider a mini-batch or streaming oriented processing architecture.

Apache Airflow for pipelines

Apache Airflow is a data processing framework (for ETL and other workflows) with similar purpose to Luigi, Oozie or Azkaban.

Using Airflow helped to solve many observability problems. We had challenges with knowing which stages in the pipeline had failed as the original bash script was writing everything to a single log file. It was also difficult to monitor the pipeline, debug which parts had failed and alert on failures — all things that Airflow helped to solve out of the box.

Airflow was also a strategic choice as it gave a balance of simple first setup (using a Local executor) with scope to scale in the future (e.g. using a Celery or Mesos executor).

It has provided many other useful features:

  • DAGs (Directed acyclic graph) — the power of dependency graphs helps reduced completion times by utilising concurrent tasks. It also stops downstream tasks from executing when there are upstream failures and manages recoverability.
  • Backfill — useful for upstream failures or requirement for historic data — Airflow makes backfill super easy via the web UI or CLI.
  • Pipelines as code — not just any code but Python code (not XML declarations!).
  • Abstractions — putting reusable workflow components into subdags means future pipelines are easier to create. Pipelines can also be generated dynamically at run time e.g. an import pipeline per available source file, or a processing pipeline per partition of data.
  • Airflow operators — we incrementally moved bash scripts over to Airflow, and converted parts to utilise the in-built operators, reducing the amount of code to maintain and allowing the team to be more productive.

Vowpal Wabbit for machine learning

Vowpal Wabbit is an extremely fast machine learning engine since it using hashing trick. It also handles categorical and multi-label features very well.

This is vital for performant training on huge data sets and low latency model predictions. Vowpal also makes discovery of feature importance relatively easy.

Standardised on EC2 infrastructure

The pipeline was migrated to run on EC2 infrastructure from Google Compute, and all new services going forward are hosted on EC2.

This is in line with the infrastructure used by other teams, and allowed us to utilise the common Terraform and Puppet code base for easily rebuilding hosts.

Python vs Bash

Our first pipeline was written with a combination of Bash and Python. There were a few problems with this:

  1. Testing was more difficult
  2. Unnecessary knowledge of two languages
  3. Achieving reusability & consistency

Much of the data science community is using Python — so it made sense to standardise on this language.

The team have been steadily replacing Bash scripts with Airflow operators (which are coded in Python) — this has reduced the amount of code to maintain.

Custom Python code is easily unit and integration tested— this has helped to build quality into the code. Structuring and packaging of custom Python components is easier — we distribute versioned, reusable libraries via Artifactory that are pip installable.

Data analysis infrastructure

During the experiment, the definition of the model and exploratory data analysis work was done locally on our Data Scientist’s machine. Subsequently, we have built infrastructure for Jupyter Notebook with popular data libraries like Pandas and Scikit-learn. All notebooks are committed to Git repositories, retaining a traceable version history like the rest of our codebase.

This repository and central service for data analysis and model building facilitates reproducibility, knowledge sharing and increased transparency.

Model serving infrastructure

The first approach of model deployment during the experiment was to deploy manually onto the consumer hosts. This proposed some key challenges, primarily due to the dependency on other teams:

  1. Regular deployments are difficult
  2. Monitoring was challenging — especially real-time monitoring of scoring
  3. Experimentation (e.g. A/B testing, new features) was difficult

A centralised service for model scoring owned by the Data Engineering team helps solve these problems, however we also needed to consider the requirement of low latency by the consumers. A locally applied model incurred zero network latency, however a centralised service would. To solve this model serving infrastructure is deployed in the same availability zones as consumer instances to keep network latency at a bare minimum (<5ms).

What more could be done

Parallel batches or streaming in the pipeline — Pipelines currently import data in sequential batches (e.g. a batch for each source file). The main pipeline takes around 3.5 hours and since models are trained daily this is sufficient for our current needs. As our data scales or there is a need to have more regularly trained models, a quick win is to import batches in parallel. Beyond this, another option is to stream the data e.g. via Kafka.

Athena for data warehousing — with the increasing maturity of Avro and Parquet, there is opportunity to store data in one of these formats rather than CSV, and utilise AWS Athena for querying data directly from S3. Redundant infrastructure (BigQuery) could be removed, saving on costs, whilst achieving similar levels of performance and ease of use (via a SQL-like interface) as BigQuery. Here is a good overview of Parquet.

In summary

Unruly’s data engineering team has made many choices and tradeoffs for our machine learning pipelines, and we continue to adapt to change and evolve our technical stack and architectural patterns along the way.

  • Big Query is a highly scalable and performant data store that has provided the simplicity of use and low barrier to entry (a fully managed scalable service with a SQL interface); however there is opportunity to simplify by using AWS Athena.
  • Vowpal Wabbit is an extremely fast machine learning engine that uses the hashing trick and can handle large amounts of data such as real-time events.
  • Airflow provides observability, pipelines as Python code and other key features like backfill, recoverability and dependency management.
  • Python as the primary language makes sense as a large part of the data science ecosystem is Python oriented, and it has simplified automated testing and distribution of reusable libraries.
  • The use of Jupyter notebooks with pandas/scikit-learn/matplotlib has made data analysis and knowledge sharing super easy.
  • Building a model serving infrastructure has enabled regular model deployment, real-time monitoring of models in production and effective A/B experiments.

If you have a passion for machine learning and are excited about working collaboratively to develop data architecture, technologies and processes, check out our open roles.