Overview of Pocket Gems’ Data Pipeline
At Pocket Gems, the Data Infrastructure team is responsible for developing and managing the data pipeline. The first generation of Pocket Gems’ data pipeline was built in 2011, and we just released the third generation in February 2020.
This post describes the infrastructure of Pocket Gems' data pipeline and how it ensures data accuracy, stability, scalability, low latency, and flexibility.
Pocket Gems Data Pipeline Architecture
We have four major components: the data extractor, the message queue, the data transformer, and the data loader. Data is generated on mobile devices and game servers and sent to the data extractor, where some initial processing is done before the data is published to the message queue. The transformer subscribes to the queue and transforms the data, and loaders then use different staging strategies to stream it into the data warehouse.
Over the last decade of active games at Pocket Gems, we have collected event data from games, such as session data, battle data, and client-customized event data. The data is generated on mobile devices and sent to the backend server.
On the server, we have two types of memcache buffers for event processing: heterogeneous buffers and homogeneous buffers. All incoming data goes into a heterogeneous buffer first; the data there is not segmented and includes multiple types of events. When a heterogeneous buffer reaches a threshold, it distributes its data into homogeneous buffers based on event type.
A homogeneous buffer stores a single type of event. When a homogeneous buffer is full or times out, the data gets flushed to the message queue.
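A minimal sketch of this two-stage buffering, with in-memory lists standing in for memcache and a callback standing in for the Pub/Sub publish (the thresholds, class name, and event shape here are illustrative, not the production values):

```python
from collections import defaultdict

HETERO_THRESHOLD = 4  # assumed values; the real thresholds are internal
HOMO_THRESHOLD = 3

class EventBuffers:
    """Toy model of the two-stage buffering described above."""

    def __init__(self, flush_to_queue):
        self.hetero = []                      # mixed event types, unsegmented
        self.homo = defaultdict(list)         # one buffer per event type
        self.flush_to_queue = flush_to_queue  # stands in for the Pub/Sub publish

    def add(self, event):
        self.hetero.append(event)
        if len(self.hetero) >= HETERO_THRESHOLD:
            self._distribute()

    def _distribute(self):
        # Segment the mixed buffer into per-type homogeneous buffers,
        # flushing any buffer that fills up along the way.
        for event in self.hetero:
            bucket = self.homo[event["type"]]
            bucket.append(event)
            if len(bucket) >= HOMO_THRESHOLD:
                self.flush_to_queue(event["type"], bucket[:])
                bucket.clear()
        self.hetero.clear()
```

In production a timeout also flushes a partially filled homogeneous buffer; the sketch omits that for brevity.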
We use Google Cloud Pub/Sub as the middleware in the entire data pipeline.
Pub/Sub is a message queue system that receives an event from a publisher and delivers it to the subscribers. The core concepts of Pub/Sub are the topic, subscription, message, and message attribute. As a fully managed Google Cloud Platform service, Pub/Sub provides stability, scalability, and low latency.
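These concepts can be illustrated with a toy in-memory model (this is not the real Pub/Sub client; the class and field names are ours): a topic fans each published message out to every attached subscription, and attributes travel with the message as key-value metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    data: bytes
    attributes: dict = field(default_factory=dict)  # e.g. {"event_type": "session"}

class Topic:
    """Toy topic: publishing fans a message out to every attached subscription."""

    def __init__(self):
        self.subscriptions = []

    def subscribe(self):
        sub = []  # a subscription is modeled as a backlog of undelivered messages
        self.subscriptions.append(sub)
        return sub

    def publish(self, message):
        for sub in self.subscriptions:
            sub.append(message)
```

The real service adds durability, acknowledgment deadlines, and push delivery on top of this basic fan-out model.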
Alternatives to Pub/Sub that we considered were Kafka and AWS Kinesis. All three services provide streaming systems with similar functionality. Kafka is an open-source and highly customizable system, which means higher maintenance costs. Kinesis, a data streaming service provided by AWS, requires choosing an appropriate partition key and re-sharding in order to scale. Compared to Kafka and Kinesis, Pub/Sub autoscales and is a fully managed service.
From the previous steps, we already know that the event data is collected by the data extractor and published to the message queue.
First, the different kinds of raw data are all published into the same Pub/Sub topic, named untransformed. Whenever a new event is published, it is pushed to the untransformed subscription.
When an event is received from the untransformed subscription, we read the event type from the message attributes, use it to fetch the event definition, and transform the raw event data into its final format.
After the transformation, we publish the data to the transformed topic or the crash topic; each topic has a corresponding subscription from which the data is pulled. The crash topic and subscription carry crash report data, which requires heavy compute and memory resources and is therefore handled differently by the loader.
The transformation and validations done at this layer guarantee data accuracy.
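A sketch of the transform-and-route step under assumed event definitions (the real shared definition module is internal; EVENT_DEFINITIONS, the field names, and the message shape are illustrative):

```python
# Hypothetical event definitions keyed by event type.
EVENT_DEFINITIONS = {
    "session": {"fields": ["user_id", "duration"]},
    "crash":   {"fields": ["user_id", "stack_trace"]},
}

def transform(message):
    """Validate a raw event against its definition and route it to a topic."""
    event_type = message["attributes"]["event_type"]
    definition = EVENT_DEFINITIONS[event_type]

    # Validation: every field the definition requires must be present.
    missing = [f for f in definition["fields"] if f not in message["data"]]
    if missing:
        raise ValueError(f"{event_type} event missing fields: {missing}")

    # Transformation: keep only the defined fields, in final format.
    row = {f: message["data"][f] for f in definition["fields"]}

    # Crash reports go to their own topic so the loader can treat them separately.
    topic = "crash" if event_type == "crash" else "transformed"
    return topic, row
```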
The two steps above are built on Google Cloud Platform, but the loader step runs on Amazon Web Services EC2 instances, because we decided to keep using our existing tasking system, which runs on AWS.
Each loader fetches a fixed number of Pub/Sub messages, groups them based on the event type from the message attribute, and ingests the data into the final destination. Given that we have different topics and subscriptions, we also have different types of loaders to process them.
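The grouping step can be sketched as follows (the message shape and the ingest callback are assumptions standing in for the warehouse write):

```python
from collections import defaultdict

def group_by_event_type(messages):
    """Group a fetched batch of messages by their event_type attribute."""
    batches = defaultdict(list)
    for msg in messages:
        batches[msg["attributes"]["event_type"]].append(msg["data"])
    return batches

def load_batch(messages, ingest):
    """Ingest each per-type group into its destination in one call."""
    for event_type, rows in group_by_event_type(messages).items():
        ingest(event_type, rows)
```

Grouping by type before ingestion lets each batch go to its own destination table in a single write.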
We scale the loaders with an in-house distributed tasking system built on AWS Auto Scaling groups.
After the loaders ingest the data into the data warehouse, the messages are acknowledged and removed from Pub/Sub. If anything fails during processing, the message remains unacknowledged and is re-delivered, which prevents data loss. The error-handling system puts any invalid data into a separate location so that it does not block data processing.
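A sketch of this acknowledgment logic, assuming a dict message shape with an ack_id field: only messages that were ingested or quarantined are acknowledged, so messages hit by transient failures are re-delivered by the queue.

```python
def process_batch(messages, ingest, quarantine):
    """Return the ack IDs of messages that are done (ingested or quarantined)."""
    ack_ids = []
    for msg in messages:
        try:
            ingest(msg["data"])
        except ValueError:
            # Invalid data: route to a side location instead of blocking the pipeline,
            # then acknowledge it so it is not re-delivered.
            quarantine(msg)
        except Exception:
            # Transient failure: leave unacknowledged so the queue re-delivers it.
            continue
        ack_ids.append(msg["ack_id"])
    return ack_ids
```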
Different staging strategies are used to load the data based on the latency requirements.
These four components are decoupled. We can easily modify one of the components without impacting the others.
Performance and Monitoring
We built a system to monitor the performance of the data pipeline system. Within the system, there are dashboards for error monitoring, message queue status, and latency of the pipeline.
We decided to use managed services for most components to get stability and durability without spending engineering resources on them. Our custom data validation and error-handling processes make the system fault-tolerant.
Pub/Sub is fully managed by GCP and scales automatically. The loaders run in an AWS Auto Scaling group, which adds EC2 instances automatically as load grows.
In the shared event-definition module, we have strict definitions for standard events and dynamic definitions for custom events. From these definitions, tables are created automatically, and event data is validated, transformed, and ingested into those tables, which ensures the data is accurate.
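A sketch of how a strict definition can drive both validation and automatic table creation (the definition shape, SESSION_DEFINITION, and TYPE_MAP are illustrative assumptions, not the internal module):

```python
# Assumed shape of a strict event definition.
SESSION_DEFINITION = {
    "table": "session_events",
    "schema": {"user_id": str, "duration_seconds": int},
}

TYPE_MAP = {str: "STRING", int: "INT64", float: "FLOAT64"}

def validate_row(row, definition):
    """Type-check a row against its definition before ingestion."""
    for name, expected in definition["schema"].items():
        if name not in row:
            raise ValueError(f"missing field {name}")
        if not isinstance(row[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return row

def create_table_sql(definition):
    """Derive a CREATE TABLE statement from the same definition."""
    cols = ", ".join(f"{n} {TYPE_MAP[t]}" for n, t in definition["schema"].items())
    return f"CREATE TABLE IF NOT EXISTS {definition['table']} ({cols})"
```

Because both the table and the validation derive from one definition, a row that passes validation is guaranteed to match the table it lands in.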
The total latency of the pipeline, from when data is published to Pub/Sub until it is inserted into BigQuery, is 10 seconds.
As part of the data engineering team, we develop, monitor, and improve the performance of our pipeline system, BI tools, and several other business-critical systems for Pocket Gems (including CRM, AB tests, Data Modeling).
If you are interested in working in a high performing infrastructure team, we are hiring.