The Billion Events Infrastructure

Tilen Kavcic
Outfit7
Published in
5 min readDec 13, 2022

TL;DR: Outfit7 is, at its core, a data-driven company. Collecting and analyzing performance metrics from our games holds a lot of interesting challenges. In this article, I’ll explore our infrastructure design decisions and what we have learned along the way.

Hi, my name is Tilen Kavčič. I’m one of the backend core members at Outfit7, specializing in cloud infrastructure and R&D. Our team’s mission is to create high performance, scalable solutions that support the growth of our company. I’m part software engineer, part data engineer, with an interest in DevOps. The best part of my job is getting the chance to create solutions that simplify development of new software. My current mission is to modernize our data stack, bringing software engineering best practices to our data teams.

Background

At Outfit7, we believe that all our games should look, play and feel good. With 430 million monthly active users, it can be a challenge to get a clear picture of how our games and company are performing. The data we gather helps us understand how we perform in the fast-paced mobile gaming market. Getting fast feedback is key. Because of this, we need infrastructure that is scalable and reliable, and which, ideally, doesn’t hurt the company’s wallet too much. Over the past decade we’ve continuously improved our infrastructure to accommodate our growing user base. Today, we handle half a million row inserts per second — let me show you how. Welcome to the Billion Events Infrastructure.

Infrastructure

Let’s start with an overview of our infrastructure. We receive data from various data sources, like game usage data, third party services, etc., which we save in persistent storage. Here, we’ll focus only on data sent by our games, because this is what forms the backbone of our analyses. From data gathering endpoints to persistent storage, all is hosted on the Google Cloud Platform. Let’s start with the first layer, at our API endpoints:

  1. API endpoints
    We run our data-gathering endpoints on different managed services (App Engine, Cloud Run and Kubernetes Engine). Our main endpoint, where we gather usage data, runs on Kubernetes Engine. We chose Kubernetes because it offers a nice balance between performance and cost. Because we have a global player base, we run three clusters, each in different regions (US, EU and Asia). Our main cluster resides in the US, where we get the most traffic, while the Asia cluster serves as the endpoint for our users in China. We also set up one cluster in Europe which is our backfall endpoint in the event of endpoint failures in US and Asia clusters. All data is validated and pushed to our data delivery layer.
  2. Data delivery
    This second layer is a middleware between our API endpoints and data processing workloads. It provides asynchronous delivery of messages with the benefit of keeping our data safe until it’s written to persistent storage. We use Pub/Sub (similar to Apache Kafka) which provides a simple, reliable and scalable message delivery service.
  3. Data processing
    In this layer, we do some light data transformation (unbatching the client data) and streaming data into persistent storage. Our main framework to do this is Apache Beam which runs on managed cloud infrastructure called Dataflow. This is a tried and tested way to do ETL pipelines because it provides a reliable way of inserting data into persistent storage.
  4. Data storage
    After all this, the data is finally written into BigQuery. It’s our main data warehouse, which powers all of our analytics workload. It’s fast, reliable and just what we need to store and analyze our data.

All these layers combined create a scalable and reliable infrastructure that can receive, process and ingest more than half a million rows of data per second, or five terabytes of new data per day. Our data teams query 60 petabytes per year with zero downtime and zero maintenance from the core engineering team, just by using BigQuery. By using a simple approach and managed services a single person can maintain this infrastructure. This enables us to focus on developing new solutions that add value to our company.

What we learned

Development environment

These environments are important because they create a safe space where engineers can create and test new features. Without them you are always one step away from breaking the production environment. Our team takes these environments seriously. At the very beginning, we separate production and development services by using access permissions and configuration files. Using infrastructure as code (we use Terraform) speeds up creating development and production environments. If you want to create a better development environment we recommend looking at The Twelve-Factor App [1].

Tests

Probably the most important part of any service is the tests. Unit and integration tests provide a good overview if your services are working as expected. And for core services that have a direct impact on the company, tests are a must-have. We had great success detecting bugs in our code before releasing our code into production. The key note here is that tests should be meaningful and should test the core functionality of our service. To build upon this we use CI/CD to automate the releasing and testing phases which cuts the manual labor immensely.

Metrics

While tests primarily cover our core business logic during development, we still need to monitor the behavior and health of our services during and after deployment. To do this, we create metrics that give us real-time insights into how our service performs. On top of that, we build alert policies that inform us when something is wrong. We also have automatic abort functionality in our CI/CD pipeline if the metrics are above a certain threshold. But creating helpful metrics can be tricky. You can either over-monitor and overload your on-call people with alerts or under-monitor and overlook issues that are bad for the company’s performance. I recommend reading Google’s SRE book, which will help you find the middle ground [2].

Simplicity

Adding new features is inevitable, even when the core infrastructure is intact. We need to keep in mind that making decisions that require complex systems will hurt your team in the long run. While designing your systems you need to balance complexity and costs. Usually managed cloud solutions like Pub/Sub are the best way to reduce complexity, but they can be more costly. Our team uses managed cloud solutions because they reduce the overall load on the engineers. We also simplify our release process. We have one rule, if you push it to main it will be tested and released. No more scripts and documents on how to release new code.

[1]: https://12factor.net/

[2]: https://sre.google/workbook/monitoring/

Conclusion

In this article we looked at our core data infrastructure, which is responsible for receiving and saving the analytical data from our games. Due to the large amount of data we receive, we needed to create a reliable infrastructure design that would scale with our growth. Choosing the correct technology is just one part of creating such a system. Any core system should be designed with simplicity in mind with tests, metrics, logging and development environments that give developers a safety net. We went through many iterations of our system over the last couple of years, each time learning something new.

--

--