Leverage AWS to create a data pipeline at scale in a couple of weeks

Robin Mizreh · Published in Voodoo Engineering · 6 min read · May 13, 2019

Voodoo is the number one gaming company worldwide in terms of downloads on both the Apple and Android stores.

Worldwide free downloads iOS ranking

We’ve got 40 million daily active users (DAU) and more than 400 million monthly active users (MAU).
As such, we deal with several billion events every day, and the purpose of this article is to give you an overview of how we go about doing that.

Context of the project

The main reason we chose AWS as our cloud provider is its managed services, which save us a great deal of time.

What we needed was a pipeline to collect our data, along with the tools to visualize it and feed other projects. As everything moves so quickly here at Voodoo, we had to choose an architecture that suited our needs and could be put into production very quickly. It had to:

  • provide the means for huge scalability (ingesting billions of events every day)
  • bring us the flexibility we need
  • allow for limited personnel to maintain and monitor the pipeline

As you can see in the picture below, our main need was the aforementioned scalability: once the pipeline went live, we went from 0 to almost 2 billion daily events in 2–3 days.

The architecture

Compute

Container technology was mandatory for us because we truly believe that it is not necessarily the technology you use that makes the biggest difference; it's the flexibility to replace any given tool that helps you in the long run. As is often said, the world of computer engineering moves so fast that nobody can say that today's most appropriate technology will still be the right one in 1 or 2 years. For us, containers are the clear answer to that need for flexibility.

We decided to use ECS mostly because it is very well integrated in the AWS ecosystem. We needed something we could deploy easily and quickly as, at that time, we didn’t know much about Kubernetes, but we did have extensive experience with ECS & AWS, so this was the obvious choice for us.

In a subsequent article, we will cover how we carry out Blue/Green deployment on ECS to facilitate high-speed rollback (if need be) and how it allows you to run E2E tests and soft releases of new features against production.

Collection & Storage

Firehose and Glue are a helluva combo in AWS: they allow you to push data to a stream through the AWS SDK, so you can batch inserts to optimize your usage. Firehose aggregates and compresses the data and delivers it to several possible destinations, including Amazon S3, Elasticsearch or Redshift. You can set time and/or size thresholds that determine when each batch is flushed to the destination.
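As a rough illustration, here is a minimal sketch of batched ingestion with boto3 (the Python AWS SDK); the "events-stream" delivery stream name and the event payloads are placeholders, not our actual setup.

```python
import json
import boto3

firehose = boto3.client("firehose")

def send_events(events):
    # Firehose accepts up to 500 records (and 4 MiB) per PutRecordBatch call,
    # so we chunk the payload before sending it.
    records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]
    for i in range(0, len(records), 500):
        response = firehose.put_record_batch(
            DeliveryStreamName="events-stream",  # placeholder name
            Records=records[i:i + 500],
        )
        # Records can fail individually; a production client would retry them.
        if response["FailedPutCount"]:
            print(f"{response['FailedPutCount']} records failed in this batch")
```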

Glue, when combined with Firehose, enables file conversion, so you can create your S3 data lake very easily. Each of your Firehose streams is mapped to a Glue table, and Firehose converts the incoming records into Apache Parquet (or ORC) files compressed with Snappy, without you having to write a single line of code.
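For reference, the record-format conversion is configured on the delivery stream itself. The sketch below shows roughly what that looks like with boto3; the role ARNs, bucket, database and table names are placeholders, and in practice you might set this up with CloudFormation or Terraform instead.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-stream",  # placeholder
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-datalake-bucket",                      # placeholder
        "Prefix": "events/",
        # Format conversion requires a buffer of at least 64 MiB.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming JSON records...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ...are converted to Snappy-compressed Parquet on delivery.
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {"Compression": "SNAPPY"}}
            },
            # The schema comes from the Glue table mapped to this stream.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "analytics",   # placeholder Glue database
                "TableName": "game_events",    # placeholder Glue table
                "Region": "eu-west-1",
            },
        },
    },
)
```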

Your S3 files can be read by any job that understands Parquet (or ORC). Because the data lives on S3, you can use it in place of HDFS storage and have multiple apps reading the same data at the same time, without worrying about data consistency or availability: it just scales to your needs.
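For example, a downstream job can read one partition of the data lake directly, here with pandas; the bucket and prefix are placeholders, and pyarrow plus s3fs are assumed to be installed.

```python
import pandas as pd

# Reads the Snappy-compressed Parquet files written by Firehose for one day.
df = pd.read_parquet("s3://my-datalake-bucket/events/2019/05/13/")
print(df.head())
```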

Data Visualization

When you combine the Firehose & Glue solution with Amazon Athena, the magic happens. With Athena you can query all your data on S3 with plain SQL. You don't have to maintain any HDFS or other large clusters, with all the constraints they entail; you just query your data whenever you need to and it works.
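As an illustration, here is a minimal sketch of running an Athena query from code with boto3; the database, table and results bucket are placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString=(
        "SELECT event_name, count(*) AS events "
        "FROM game_events WHERE dt = '2019-05-13' GROUP BY event_name"
    ),
    QueryExecutionContext={"Database": "analytics"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)

# Athena runs asynchronously, so poll until the query finishes.
query_id = query["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```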

As the AWS console and Athena are not particularly user-friendly, we wanted a visualization tool that could work with Athena. We found several, such as Superset, Redash and Periscope, among others. We also wanted to avoid SQL being the only way for users to create visualizations; they should be able to build them through other interfaces (drag & drop, combo boxes, etc.) while technical people remain free to create any graph they need. That is why we chose Superset.
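For reference, Superset talks to Athena through the PyAthena SQLAlchemy driver; a database connection is just a URI of roughly this shape (the region, schema and staging bucket below are placeholders):

```python
# SQLAlchemy URI registered in Superset's "Add database" form (sketch).
ATHENA_URI = (
    "awsathena+rest://@athena.eu-west-1.amazonaws.com/analytics"
    "?s3_staging_dir=s3://my-athena-results/staging/"
)
```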

In a subsequent article, we will cover what you can do with Superset and Athena in greater depth.

Monitoring & Troubleshooting

To monitor our software and check for data discrepancies, we send error logs and aggregated data to Elasticsearch through CloudWatch Logs. As the logs are streamed through AWS Lambda, we are able to troubleshoot production with real-time data.
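A minimal sketch of such a subscription Lambda is shown below; the exact document shape is a placeholder and the Elasticsearch client setup is omitted.

```python
import base64
import gzip
import json

def handler(event, context):
    # CloudWatch Logs delivers subscription data as base64-encoded, gzipped JSON.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    docs = [
        {
            "log_group": payload["logGroup"],
            "timestamp": entry["timestamp"],
            "message": entry["message"],
        }
        for entry in payload["logEvents"]
    ]
    # Bulk-index the documents into Elasticsearch here; the client
    # configuration is omitted from this sketch.
    return {"indexed": len(docs)}
```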

We also build dashboards to make sure everything is running smoothly in our data pipeline.

This architecture performs well but, at such a large scale, you need to be very careful of 2 things:

  • Be aware of the amount of data you send
  • Monitor the number of Lambda functions invoked by the streaming process: if a major outage occurs for any reason, it will flood your logging infrastructure and may eventually cause your Lambdas to throttle

What’s next

Our pipeline is fully operational and ingests several billion events every day. The next steps for us can be split into three main topics:

  • Automation of adding new event types to the data pipeline: it needs to be able to create Firehose streams & Glue tables on the fly, and also update the configuration to process and validate new events
  • Centralize and improve our monitoring so we can aggregate application logs with more technical metrics, like CPU spikes or latency, using tools like Grafana and Prometheus, or potentially a SaaS solution if we want to externalize it
  • Add some jobs to automatically optimize our data and remove duplicates to keep data quality as high as we can

Take-Aways

  • This architecture allows us to focus only on the API code and dashboards that bring immediate value to our data pipeline
  • Almost every piece of the infrastructure is managed, so we do not need heavy administration. The main concern is anticipating the services' soft limits and being proactive about them.
  • You pay only for what you use; unlike Redshift, there is no minimum cluster size. This works very well for tests, as you can go from 0 to 2 billion events over several days and only pay when traffic is high. Since only a few weeks and minimal human resources were needed to launch and monitor the project, the costs are very attractive even with our huge amount of traffic.
  • The Glue data schema can evolve easily, and Athena is now supported by many dataviz tools like Tableau, Superset, Periscope and Redash, to name but a few
  • This kind of architecture gives us the flexibility and time-to-market that was, and still is, so important for us
