Leverage AWS to create a data pipeline at scale in a couple of weeks

Robin Mizreh
May 13, 2019

Voodoo is the number one gaming company worldwide in terms of downloads on both the Apple and Android stores.

Worldwide free downloads iOS ranking

We’ve got 40 million daily active users (DAU) and more than 400 million monthly active users (MAU).
As such, we deal with several billion events every day, and the purpose of this article is to give you an overview of how we go about doing that.

Context of the project

The main reason we chose AWS as our cloud provider is the managed services it offers, which save us a great deal of time.

What we needed was a pipeline to collect our data, with all the tools to visualize it and use it to feed other projects. As everything moves so quickly here at Voodoo, we needed to choose an architecture that suited our needs and could be put in production very quickly. It had to:

As you can see in the diagram below, our main need was the aforementioned scalability: once we launched, we went from 0 to almost 2 billion daily events in 2–3 days.

The architecture

Compute

Container technology was mandatory for us because we truly believe that it is not necessarily the technology you use that makes the biggest difference, it’s the potential for flexibility if you need to replace any given tool that will help you in the long run. As is often said, the world of computer engineering moves so fast that nobody can say that the most appropriate technology for today will remain the same 1 or 2 years from now. For us, containers are the clear answer to that need for flexibility.

We decided to use ECS mostly because it is very well integrated in the AWS ecosystem. We needed something we could deploy easily and quickly as, at that time, we didn’t know much about Kubernetes, but we did have extensive experience with ECS & AWS, so this was the obvious choice for us.

In a subsequent article, we will cover how we carry out Blue/Green deployment on ECS to facilitate high-speed rollback (if need be), and how it allows you to run E2E tests and soft releases of new features against production.

Collect & Storage

Firehose and Glue are a helluva combo in AWS. Firehose lets you push data to a stream through the AWS SDK, so you can batch insertions to optimize your usage. It will aggregate and compress the data and deliver it to a destination such as Amazon S3, Elasticsearch or Redshift. You can choose time and/or size thresholds for each batch before it is written to the destination.
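To illustrate the batching idea, here is a minimal sketch (the stream name and event shape are hypothetical): a helper that groups events into batches respecting Firehose's documented PutRecordBatch limits of 500 records and 4 MiB per call, with the actual SDK call left as a comment.

```python
import json

# Firehose PutRecordBatch limits (per AWS docs): 500 records / 4 MiB per call.
MAX_RECORDS = 500
MAX_BATCH_BYTES = 4 * 1024 * 1024

def batch_events(events):
    """Group JSON-serializable events into Firehose-sized record batches."""
    batches, current, current_size = [], [], 0
    for event in events:
        data = (json.dumps(event) + "\n").encode("utf-8")
        if current and (len(current) >= MAX_RECORDS
                        or current_size + len(data) > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_size = [], 0
        current.append({"Data": data})
        current_size += len(data)
    if current:
        batches.append(current)
    return batches

# With boto3, each batch would then be pushed along these lines
# ("events-stream" is a placeholder name):
#   firehose = boto3.client("firehose")
#   for batch in batch_events(events):
#       firehose.put_record_batch(DeliveryStreamName="events-stream",
#                                 Records=batch)
```

Batching client-side like this keeps the number of API calls (and the cost) down when you are pushing billions of events.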

Glue, when combined with Firehose, enables file conversion, so you can create your S3 data lake very easily. Each of your Firehose streams is mapped to a Glue table, and Firehose will convert the files into Apache Parquet (or ORC) and compress them with Snappy, without you having to write a single line of code.
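For reference, the conversion is driven by the record-format settings Firehose accepts in its S3 destination configuration (boto3 `create_delivery_stream`). The sketch below shows that fragment as a Python dict; the database, table and role names are placeholders, not our actual setup.

```python
# Sketch of Firehose's DataFormatConversionConfiguration, as passed inside
# ExtendedS3DestinationConfiguration to create_delivery_stream.
# All names below are hypothetical placeholders.
data_format_conversion = {
    "Enabled": True,
    "InputFormatConfiguration": {
        "Deserializer": {"OpenXJsonSerDe": {}}  # incoming JSON records
    },
    "OutputFormatConfiguration": {
        # Write Apache Parquet compressed with Snappy; ORC is the alternative.
        "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
    },
    "SchemaConfiguration": {
        "DatabaseName": "analytics",  # Glue database holding the table
        "TableName": "events",        # Glue table that defines the schema
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role",
    },
}
```

The schema lives entirely in the Glue table, which is what makes the "no code" conversion possible.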

Your S3 files can be read by any job written to understand Parquet (or ORC). Because the data sits on S3, you can use it instead of HDFS storage, so multiple apps can read the same data at the same time without you having to worry about data consistency or availability — it will just scale to your needs.

Data Visualization

When you combine the Firehose & Glue setup with Amazon Athena, this is where the magic happens. With Athena you can query all your data on S3 with plain SQL. You don't have to maintain any HDFS cluster, or deal with the constraints a large cluster entails. You just query your data whenever you need to and it just works.
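As a sketch of what querying looks like in practice, here is a small helper that builds a partition-filtered Athena query and a comment showing how it would be submitted with boto3. The table name, the `dt` partition column and the result bucket are all assumptions for illustration, not our actual schema.

```python
def daily_events_query(table, day):
    """Build an Athena query filtered on a date partition.
    The partition column name 'dt' is a hypothetical example."""
    return (
        f"SELECT event_name, COUNT(*) AS n "
        f"FROM {table} "
        f"WHERE dt = '{day}' "
        f"GROUP BY event_name ORDER BY n DESC"
    )

# With boto3 the query would be submitted along these lines:
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=daily_events_query("analytics.events", "2019-05-13"),
#       ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
#   )
```

Filtering on a partition column matters at this scale: Athena bills per byte scanned, so partition pruning keeps both latency and cost under control.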

As the AWS console and Athena are not particularly user-friendly, we wanted a visualization tool that could work with Athena. We found several, like Superset, Redash, Periscope, … We also wanted SQL not to be the only way for users to create visualizations of the data; they should be able to build them through other interfaces too (drag & drop, combo boxes, …). So that less technical people weren't limited in the graphs they could create, we chose Superset.

In a subsequent article, we will cover what you can do with Superset and Athena in greater depth.

Monitoring & Troubleshooting

To monitor our software and check for data discrepancies, we send error logs and aggregated data to Elasticsearch through CloudWatch Logs. As the logs are streamed through AWS Lambda, we are able to troubleshoot our production with real-time data.
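A Lambda subscribed to a CloudWatch Logs log group receives its records base64-encoded and gzip-compressed. The sketch below shows the decoding step and where forwarding to Elasticsearch would happen; the "ERROR" filter and the indexing step are illustrative assumptions, not our exact pipeline.

```python
import base64
import gzip
import json

def handler(event, context=None):
    """Decode a CloudWatch Logs subscription event (base64 + gzip JSON)
    and return the error messages found in it."""
    payload = json.loads(
        gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    )
    errors = [
        e["message"] for e in payload["logEvents"] if "ERROR" in e["message"]
    ]
    # Here each message would be indexed into Elasticsearch,
    # e.g. through an ES client's bulk API (omitted in this sketch).
    return errors
```

Because the Lambda sits directly on the log stream, the dashboards built on top of Elasticsearch reflect production within seconds.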

We also build dashboards to check that everything is running smoothly across the data pipeline.

This architecture performs well but, at such a large scale, you need to be very careful about two things:

What’s next

Our pipeline is fully operational and ingests several billion events every day. The next steps for us can be split into three main topics:

Take-Aways
