How Gamezop uses AWS S3 as a data-warehouse for terabytes of clickstream data!

Prithvi Hv · Published in Gamezop Tech · 9 min read · May 14, 2021


Introduction

Gamezop is an HTML5 game publishing platform. We have hundreds of games that are embedded across thousands of popular apps and websites. In combination, this leads to millions of monthly active users. At the time this particular service was built, we were at about 21 million monthly active users. At the time this article was written, that number had surpassed 40 million! More about Gamezop here:

Understanding “state” data and the problem at hand

Alright, coming to the service this article is about. When any user plays any game on our platform, each click and each score update needs to be tracked, stored, and made retrievable in a time-series setup. We shall refer to this data as “state” or “states” data (pardon the inconsistency 😅).

For instance, let’s say you play one of our games and make a score of 10 (starting from 0). Our system would have one row of state data for every score update: 10 rows of state data. We record the score, the device time when the score was made, the number of clicks it took you to reach that score, and a bunch of other data points around each score update. So the simple act of playing a game for 5 minutes can generate thousands of rows of states data!
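For a concrete picture, here is a rough sketch of what one such state row could look like; the struct fields below are illustrative assumptions, not our exact schema.

```go
// A sketch of one "state" row. Field names are illustrative, not the real schema.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type State struct {
	SessionID  string    `json:"session_id"`  // the gameplay session this update belongs to
	Score      int       `json:"score"`       // score value at this update
	Clicks     int       `json:"clicks"`      // clicks taken to reach this score
	DeviceTime time.Time `json:"device_time"` // client-side time of the score update
	// ...plus other data points recorded around each score update
}

func main() {
	// one such row is written for every score update during a gameplay session
	row := State{SessionID: "sess-42", Score: 10, Clicks: 37, DeviceTime: time.Now()}
	b, _ := json.Marshal(row)
	fmt.Println(string(b))
}
```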

Why do we need to store this? Well, we also have a real-money gaming product, SkillClash. Here, users play games to win real cash! And where there’s cash, there are bad actors trying to game the system. Among several other use cases, this data helps us build machine learning models to detect anomalies in user score progression patterns.

At scale, this becomes a behemoth in terms of storage costs if we are using an RDBMS. We have 125k+ entries per minute, or 2k+ per second, flowing into our states table. Beyond the actual data that needs to be stored, there are also the indices. The nature of the data is such that newer data is queried more often, by multiple real-time processes, so it must initially live in an ACID-compliant store. Older data, however, can move over to a cold storage system, with 3 core objectives:

  • The storage system of choice should be infinitely scalable
  • The storage system of choice should be cheaper than the RDBMS disk
  • Data should be easily retrievable/queryable

By the way, an RDBMS on AWS also has a hard limit of 2TB per instance. Another common issue we face is other tables being locked or slowed down by the rate at which we insert into the states table. These limitations may change over time, but we felt the need to engineer ourselves out of these problems quickly.

Cold storage system using AWS S3

We decided to build a cold storage system over AWS S3 to access and scale our states data storage while keeping costs low. We need to understand one more business object before getting to the Cold Storage System: “session”. Each state belongs to a session. A session is created when any user starts playing a game. One session can have multiple gameplay loops. We also have a table that stores the session information.

What is important to understand is that each session may have hundreds or thousands (in some cases even hundreds of thousands) of metadata-rich state entries.

[Diagram: entity relationship between sessions and states]

As mentioned earlier, we can determine which sessions are not going to see a lot of real-time read queries and can be moved to cold storage. When we set out to build a system that would allow us to use S3 as a database, the first question was: how do we define the storage unit on S3, i.e., what should one file contain? This would determine our atomic unit, and all operations would be based on this unit. Our atomic unit is a session. One file contains all the states data of one session. Each file is named based on details/fields of the session upon which we may want to run bulk queries. More on this in the S3 Reader service below.
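To make that concrete, here is a minimal sketch of such session-based file naming, i.e., the injective function described below. The session fields and the key layout here are assumptions for illustration, not our production format.

```go
// Sketch: derive a deterministic S3 key from session details. Field names and
// key layout are illustrative assumptions.
package main

import (
	"fmt"
	"time"
)

type Session struct {
	ID        string
	GameID    string
	UserID    string
	StartedAt time.Time
}

// s3KeyFor is injective: distinct sessions map to distinct keys, and the key
// embeds the fields (game, date, user) that we may want to bulk-query on later.
func s3KeyFor(s Session) string {
	return fmt.Sprintf("states/%s/%s/%s/%s.jsonl",
		s.GameID,
		s.StartedAt.Format("2006-01-02"),
		s.UserID,
		s.ID,
	)
}

func main() {
	s := Session{ID: "sess-42", GameID: "some-game", UserID: "user-7", StartedAt: time.Now()}
	fmt.Println(s3KeyFor(s)) // e.g. states/some-game/2021-05-14/user-7/sess-42.jsonl
}
```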

We have an injective function (one-to-one) that accepts information about a session and returns a file name: the name of the file that will hold all of that session’s states data. Besides moving the data, the system’s other core operation is viewing/querying it. Since our unit is a session, the sessions table in our RDBMS acts as an index for the files present in our S3 bucket. This is an important idea: for example, if we wanted to get all states data for a particular game, we could query the sessions table for all sessions that meet this requirement and then generate all possible file names that would contain data with respect to that game. Similarly, if we need to know all the files in our S3 bucket which would have data on a particular user, we can generate those file names via the data in our sessions table and then query those files out of S3. Here is the pipeline for reading data from S3 (a sketch of these steps follows the list):

  • Let’s say a service needs all the state data on a game for a date range. When this request comes in, our S3 Reader service first converts this query into a sessions query. Essentially: what should the query on the sessions table be so that it returns all session data relevant to the original query?
  • Actually query the sessions table for the session data.
  • Feed the data from each of the returned rows into the injective function to generate the file name as it would be on S3.
  • Download the files from S3 for all those sessions and merge them into a single file.
  • Return this file either as an S3 URL to a JSON file, or as a JSON response itself.
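Here is a minimal sketch of the first three steps, assuming a Postgres sessions table with game_id, user_id, and created_at columns and the illustrative key format from the earlier sketch (the download-and-merge steps are covered in the S3-Reader section below).

```go
// Sketch: turn "all state data for game X in a date range" into a sessions
// query, then map each returned session to its S3 key. Table/column names,
// connection string, and key format are illustrative assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/gamezop?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	from, to := time.Now().AddDate(0, 0, -30), time.Now()

	// Steps 1 & 2: the original states query becomes a query on the sessions table.
	rows, err := db.Query(
		`SELECT id, game_id, user_id, created_at
		   FROM sessions
		  WHERE game_id = $1 AND created_at BETWEEN $2 AND $3`,
		"some-game", from, to)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	// Step 3: feed each session row into the naming function to get its S3 key.
	for rows.Next() {
		var id, gameID, userID string
		var createdAt time.Time
		if err := rows.Scan(&id, &gameID, &userID, &createdAt); err != nil {
			log.Fatal(err)
		}
		key := fmt.Sprintf("states/%s/%s/%s/%s.jsonl",
			gameID, createdAt.Format("2006-01-02"), userID, id)
		fmt.Println("to download:", key) // these keys are then downloaded and merged
	}
}
```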

We still keep the sessions on a Postgres system; this is much smaller (relatively!) than storing the state data. Not exactly tiny, though: slightly over 1.2 billion rows of sessions data at the time of writing 😅

A more technical deep dive into different aspects of this system:

S3-Migrator

You would recall that we have some state data stored in our RDBMS, and there is a system that offloads data older than a certain time frame to S3. This is the S3-Migrator. The S3-Migrator, which moves the data from our RDBMS to S3, is programmed in a master-slave manner.

The master finds the session metadata for the state data that needs to be moved and spawns slave programs to actually move this data to S3. The slave program currently runs on AWS Lambda functions. The number of Lambda functions spawned to transfer this data depends on how much data is moved by each Lambda function. Investigating 2 possible extremes will give us a better idea of the number of slaves to spawn and the factors that affect this system:

  • A tiny amount of data is moved by each Lambda function, but many Lambda functions are spawned.
  • A huge amount of data is moved by each Lambda function, but fewer Lambda functions are spawned.

Although method 1 uses the most computation power, it’s not the most efficient: time is wasted in I/O calls to spawn Lambdas and create connections to the DB. Method 2 would essentially be synchronous, which would be too slow and would not keep up with the rate at which data flows into our RDBMS.
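For illustration, here is a rough sketch of the master side of this trade-off: it batches session IDs and fires one asynchronous Lambda invocation per batch. The batch size, function name, and payload shape are assumptions, not our actual configuration.

```go
// Sketch: the migrator "master" fans work out to Lambda "slaves". Batch size,
// function name, and payload shape are illustrative assumptions.
package main

import (
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	svc := lambda.New(session.Must(session.NewSession()))

	// In the real system this list comes from a query on the sessions table
	// (sessions older than the cutoff whose states are still in the RDBMS).
	sessionIDs := []string{"sess-1", "sess-2", "sess-3" /* ... */}

	// Tune this so each slave moves a few MB: too small wastes time spawning
	// Lambdas and opening DB connections; too large makes the run synchronous.
	const perLambda = 100

	for start := 0; start < len(sessionIDs); start += perLambda {
		end := start + perLambda
		if end > len(sessionIDs) {
			end = len(sessionIDs)
		}
		payload, _ := json.Marshal(map[string][]string{"session_ids": sessionIDs[start:end]})

		// "Event" = asynchronous invocation, so the master does not wait on each slave.
		if _, err := svc.Invoke(&lambda.InvokeInput{
			FunctionName:   aws.String("s3-migrator-slave"),
			InvocationType: aws.String("Event"),
			Payload:        payload,
		}); err != nil {
			log.Printf("failed to spawn slave for batch %d-%d: %v", start, end, err)
		}
	}
}
```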

So we had to tinker with these parameters to get an optimal throughput (MB/s). For us, each Lambda function moves ~3.6 MB of data, which gives us a throughput of 10 MB/s. Of course, these numbers are particular to our data models and will vary broadly with the data model and the size of the data. Although this speed works for us, here are some things we could do to further improve the system:

  • Output files: Currently, our output file is JSONL, which is very inefficient. Moving to CSV or a binary storage format (like Parquet) would drastically improve this system’s efficiency in reading and writing. This would not increase the earlier metric (throughput in MB/s), but it would increase the throughput of meaningful data being moved (sessions/MB). It would also reduce the size of the data stored on S3.
  • Not using Lambda: Spawning the slaves through Kubernetes or Goroutines would be much faster. For us, at our current scale and rate of data, Lambda functions do the job (and they cost close to nothing)!
  • Clean input to the slaves: To make the most of this process, the slave program should receive correct metadata. In our case, some sessions don’t have any state data; this can happen if someone opens a game but drops off before making any score. Presently, the slave system has to run this sanity check before creating files on S3, which wastes time in the slave process (a master-side pre-filter, sketched below, would avoid this).
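As a sketch of that last point, the master could pre-filter empty sessions with a single query before spawning any slaves. The table and column names and the cutoff here are assumptions for illustration.

```go
// Sketch: select only sessions that actually have state rows, so no slave is
// spawned for empty work. Table/column names and the cutoff are assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/gamezop?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	cutoff := time.Now().AddDate(0, 0, -7) // only archive sessions older than the cutoff

	rows, err := db.Query(`
		SELECT s.id
		  FROM sessions s
		 WHERE s.created_at < $1
		   AND EXISTS (SELECT 1 FROM states st WHERE st.session_id = s.id)`, cutoff)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		fmt.Println("archive:", id) // only non-empty sessions are handed to the slaves
	}
}
```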

Another note on the Migrator system: we use MongoDB to dump a report of all the data moved and purged from our RDBMS. This can be important to trace our data/execution path.

Storage Cost Analysis:

If the S3-Migrator runs once every 15 mins, it moves 50k sessions (each session is 1 file) each run. That is 4,800,000 sessions a day. AWS charges $0.005 per thousand S3 write requests, i.e., roughly $25 per day just for requests. The actual storage of the data costs another ~$0.575: AWS charges about $0.025/GB, and for us, 4.8M sessions is roughly 23GB of data. One way to compare how much we are spending on storage (S3 vs EC2 disk) is to consider the metric “Price per Session”.

Price per Session for this system is largely just the insertion cost, and that is: $0.000005 [($25 + $0.575) / 4,800,000].

We had an EC2 instance that stored this data. It used a 2TB disk and ran on a c5.large, which cost us around $300 per month. Price per Session calculated for this is $0.00000186 (300 / 160,631,316). The denominator is the number of sessions we can store on a 2TB disk (data + indices + 20% free disk space).

At first glance, it looks like we are actually paying more for storing data in S3, about 2.7x more ($0.000005 / $0.00000186). But after the initial insert, S3 only charges $0.00000011 ($0.575 / 4,800,000) per session. Compared to maintaining cold storage on an AWS EC2 machine, that makes S3 ~16x cheaper!
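For reference, here is the same back-of-the-envelope math as a small program, using the prices quoted above (the printed figures are unrounded):

```go
// Reproduces the per-session cost comparison above from the quoted prices.
package main

import "fmt"

func main() {
	const (
		sessionsPerDay = 4_800_000   // 50k sessions per run, every 15 minutes
		putPricePer1k  = 0.005       // USD per 1,000 S3 write requests
		storagePerDay  = 0.575       // USD for ~23 GB of new data at ~$0.025/GB
		ec2Monthly     = 300.0       // USD for the c5.large + 2 TB disk
		sessionsOn2TB  = 160_631_316 // sessions that fit on the 2 TB disk
	)

	s3Insert := putPricePer1k * sessionsPerDay / 1000 // daily request cost
	s3PerSession := (s3Insert + storagePerDay) / sessionsPerDay
	s3StoreOnly := storagePerDay / sessionsPerDay // recurring cost after insert
	ec2PerSession := ec2Monthly / sessionsOn2TB

	fmt.Printf("S3 (insert + store): $%.7f per session\n", s3PerSession)
	fmt.Printf("S3 (store only):     $%.9f per session\n", s3StoreOnly)
	fmt.Printf("EC2 disk:            $%.8f per session\n", ec2PerSession)
	fmt.Printf("EC2 vs S3 store-only: ~%.0fx cheaper on S3\n", ec2PerSession/s3StoreOnly)
}
```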

Here’s a summary:

[Chart: Price per Session on S3 vs. the EC2-hosted RDBMS]

S3-Reader

Unlike the Migrator, the reading mechanism would get complicated if built over AWS Lambda, so we just used Go’s concurrency and the file system for this program. The pipeline for this system was described earlier in this post. Let’s examine the 2 core operations this program performs:

  1. Download: downloads a file from S3 to a location on disk. Download speed is bottlenecked mainly by network speed, but if the hard disk runs out of IO/s, the disk will also slow down the downloading process.
  2. Merging: once the individual files/buffers are downloaded, they need to be put into a single location/file.

Here’s the concurrency model, represented in a tree-like structure:

[Diagram: Concurrency Model]

The idea here is not to block the merging and the downloading processes. We expect downloading to be slower than merging. Once a download completes, a file handler is passed on to the Merger through a channel. The Merger is configured with a buffer and uses Goroutines to read each file into that buffer. Once the buffer is full, the Merger dumps it into the output file and waits for more file handlers. In this way, the system can pull from the S3 bucket at a commendable rate. With this architecture, the S3-Reader can be optimized for fast read speeds, a larger number of requests, or low memory usage. We have multiple deployments of our reader with different configurations.
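Here is a minimal sketch of that download-then-merge pattern, with the S3 call stubbed out (in practice it would wrap the AWS SDK, e.g. an s3manager Downloader); the names, buffer handling, and sizes are simplified assumptions rather than our actual reader.

```go
// Sketch: concurrent downloaders feed a single merger through a channel so that
// merging never blocks downloading. The S3 call is stubbed; names and buffering
// are illustrative assumptions.
package main

import (
	"fmt"
	"os"
	"sync"
)

// downloadObject stands in for the real S3 download call: it would fetch the
// object for `key` into a temp file and return the file's path.
func downloadObject(key string) (string, error) {
	f, err := os.CreateTemp("", "session-*.jsonl")
	if err != nil {
		return "", err
	}
	defer f.Close()
	fmt.Fprintf(f, "{\"key\":%q,\"score\":10}\n", key) // placeholder content
	return f.Name(), nil
}

func main() {
	keys := []string{"states/a.jsonl", "states/b.jsonl", "states/c.jsonl"}

	paths := make(chan string, len(keys)) // downloaded file handles flow to the merger

	// Downloaders: one goroutine per object (a bounded worker pool in practice).
	var wg sync.WaitGroup
	for _, k := range keys {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			p, err := downloadObject(key)
			if err != nil {
				fmt.Fprintln(os.Stderr, "download failed:", key, err)
				return
			}
			paths <- p
		}(k)
	}
	go func() { wg.Wait(); close(paths) }() // close the channel when all downloads finish

	// Merger: append each downloaded file into one output file as paths arrive.
	out, err := os.Create("merged.jsonl")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	for p := range paths {
		data, err := os.ReadFile(p) // the real merger reads into a bounded buffer instead
		if err != nil {
			continue
		}
		out.Write(data)
		os.Remove(p) // clean up the per-session temp file
	}
	fmt.Println("merged", len(keys), "objects into merged.jsonl")
}
```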

Here are some areas of improvement for this system:

  1. Main memory vs. file system: skip the file system and store/merge in memory. This can be used for smaller requests.
  2. Cache: a caching layer can be built so that the same file is reused instead of downloaded again. Depending on how we plan to operate on our data, we can set up appropriate caching policies.

Conclusion:

Knowing the use case of our data allowed us to build a data processing/storage pipeline that optimizes for cost, human resources, and scale.
