By Jeff Holbrook
The General Data Protection Regulation was approved in 2016 and took effect on May 25th, 2018. Among other things, this regulation strengthens the rights of individuals over their personal data. It allows customers to request the personal information that a company keeps on them (a Subject Access Request, or ‘SAR’) and to request that their personal information be forgotten (a Right To Be Forgotten, or ‘RTBF’, request).
To a developer such as King, that means we need to be able to delete people’s data within a month following their request. We make Free-to-Play mobile games. We track things like your game progress and state. We use identifiers collected from your device and information players choose to sync via their social media accounts to make games more social.
In a data-driven industry like Free-to-Play games, it is important that we track a variety of information. Most of this data doesn’t contain any personally identifiable information (PII), but some of it does, like the device Id or social network information.
A bit of context: Our team joined King in 2015, most of the way through the development of Paradise Bay. As we had the game and backend working before the acquisition, and it was based on tech that had been battle tested, we maintained our own backend. King had a large team working on ensuring their tech stack was compliant, and it was up to the Paradise Bay team to clean up ours.
There were many sub-tasks, but on the engineering front, we had 2 main tasks:
- Scrub the incoming analytics pipeline.
- Allow for the automated scrubbing of the game database.
The analytics pipeline
For something that a player never interacts with, this took a pretty large amount of effort. Our industry lives and dies by its data. King already had a system to anonymize analytics data on demand. Rather than adapt our old cloud infrastructure to scrub PII, we decided that we should simply pipe our events through King’s Data Warehouse. Simple, right? It’s a huge system that moves a ton of data. There’s nothing simple about it.
The game servers send analytics events about player activity up to the cloud in batches of around 5000. We then insert a message with a link to the batch into a cloud queue service. Up to this point in the pipeline, nothing needed to change.
What we needed to replace is called an ETL. ETL stands for Extract, Transform, Load. Our previous ETL would check the queue for pending batches of events, then:
- Extract each batch from the cloud
- Transform it from the game server format
- Load it into the database
In its most basic form, it’s just a piece of software to download data and get it stored in the target database.
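To make the three steps concrete, here is a minimal sketch of an ETL loop. It is not the Paradise Bay code: the class name `EtlSketch`, the newline-separated batch format, and the in-memory queue and list standing in for the cloud queue and the database are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Queue;

// Hypothetical, in-memory stand-in for an ETL loop; not the original pipeline.
class EtlSketch {
    // Extract: split one downloaded batch (assumed newline-separated) into raw events.
    static List<String> extract(String batch) {
        List<String> events = new ArrayList<>();
        for (String line : batch.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                events.add(trimmed);
            }
        }
        return events;
    }

    // Transform: convert from the source format to the target format.
    // Upper-casing is only a placeholder for a real field mapping.
    static String transform(String rawEvent) {
        return rawEvent.toUpperCase(Locale.ROOT);
    }

    // Load: store the transformed event (a list stands in for the database).
    static void load(String event, List<String> database) {
        database.add(event);
    }

    // Drain the queue of pending batches, running all three steps on each one.
    static void run(Queue<String> pendingBatches, List<String> database) {
        String batch;
        while ((batch = pendingBatches.poll()) != null) {
            for (String raw : extract(batch)) {
                load(transform(raw), database);
            }
        }
    }
}
```

Calling `EtlSketch.run(queue, database)` with two queued batches leaves every event from both batches in `database`, in order. A real ETL adds the parts that make it hard: downloading, retries, and error handling.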
The first part we tackled was the Extract code, which downloads the event batches and splits them into individual events. The code that did this for the old pipeline had a lot of specific error handling and special-case code that we didn’t want to disrupt any more than we had to, so we decided to find a way to run as much of the old code as possible. This code, however, was written in Java 6 by authors who had been reading the Java 8 proposals and had implemented a large part of what became Java 8 themselves. Some of the classes were worth bringing over because they were either isolated enough or had no Java 8 analogue, but often the better choice was to replace them with the modern Java 8 equivalent. Most of the concepts were similar, but some were different enough to force some restructuring and cause some difficult debugging moments.
Our Java 6 and Guava implementation of the ETL processes each batch of Events like this:
While our Java 8 implementation solves it as such:
These 2 code samples do the same thing. They create a promise to be fulfilled upon completion of a task, then submit the batch to be processed. The promise is that the batch will be either deleted if processing succeeds or re-queued if processing fails.
You can see that the 2 implementations have many differences! We had to translate obvious API differences, such as “completeWith” becoming “complete”, and less obvious ones, such as “recoverWith” becoming “exceptionally”.
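As a rough illustration of the Java 8 side, the promise described above can be sketched with the standard `CompletableFuture` class. This is not the original Paradise Bay listing: the class `BatchPromiseSketch`, the `submit` signature, and the two lists standing in for the delete and re-queue operations are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical sketch of "delete on success, re-queue on failure".
class BatchPromiseSketch {
    // Submits one batch for processing and returns a future that
    // resolves to "deleted" or "requeued" once the outcome is known.
    static CompletableFuture<String> submit(String batchId,
                                            Consumer<String> processor,
                                            Executor executor,
                                            List<String> deleted,
                                            List<String> requeued) {
        CompletableFuture<String> promise = new CompletableFuture<>();

        // Run the processing task and fulfil the promise with its outcome.
        executor.execute(() -> {
            try {
                processor.accept(batchId);
                promise.complete(batchId);
            } catch (RuntimeException e) {
                promise.completeExceptionally(e);
            }
        });

        return promise
            // Success path: the batch is no longer needed.
            .thenApply(id -> {
                deleted.add(id);
                return "deleted";
            })
            // Failure path: put the batch back so it can be retried.
            .exceptionally(error -> {
                requeued.add(batchId);
                return "requeued";
            });
    }
}
```

Passing `Runnable::run` as the executor runs the task on the calling thread, which is handy for trying the sketch out; a real pipeline would pass a thread pool.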
We also had to translate a few structural differences. In our Java 6 implementation, we had a concept of binding parameters to a lambda, whereas in Java 8 we didn’t find one that suited our needs. The Java 6 code is more concise, but the Java 8 code is more standard, so we had to make trade-offs between what to use directly and what to translate properly. If you’ve ever tried to learn a new language, spoken or programming, you know the difference between translating literally and actually translating. The literal translation is often poor, while a translation of the intended message is more correct. We found that more modern Java was preferable because anyone who followed us would likely be more familiar with modern Java than with what Z2 had custom implemented on top of Java 6!
Once we had the Extract part of the system translated, the Transform and Load parts of the process needed to be rewritten from scratch, because Load now goes to King’s system instead of our old one and Transform has to meet those new needs. Loading is straightforward: a function call with all the right parameters fires the event, which King’s Data Warehouse receives. Each function call needs to be explicitly written out, though, so there’s a lot of code to write, or generate! We wrote a temporary transform framework that would take each event from a batch and create or update a generated Transform layer with the correct parameters and “Load” function call. We had to do some creative config to tell it what we wanted certain things named, but since all of our events start with the same base fields, it was easier than doing it all by hand. All we had to do was run the temporary transform layer to output the final code, then drop the generated code into place. The output needed a little more hand-editing, but overall, this process allowed us to start off at a run instead of a crawl.
Let’s follow some totally fake sample data through the transformation and code-gen process:
Our temporary transform layer output Java code that looked a bit like this:
The last bit remaining was to write glue code that could read each event from the batch, determine which EventType it was supposed to be, and call createLoggableEvent. With that, we had our Extract layer downloading batches and passing them along to the Transform and Load layers. Once we got it deployed and fully running, this part was done and we were able to move on.
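The glue step can be sketched roughly as follows. Only the names `EventType` and `createLoggableEvent` come from the description above; the enum values, the field names, the map-based event shape, and the `dispatch` method are assumptions, and the real generated code and King’s API will look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the glue code; not the generated Transform layer.
class GlueSketch {
    // Made-up event types for illustration.
    enum EventType { LEVEL_UP, PURCHASE }

    // Assumed shape: builds the parameters that a "Load" call would send.
    static Map<String, String> createLoggableEvent(EventType type, Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("eventType", type.name());
        // Base fields shared by every event (field names are invented).
        out.put("playerId", raw.get("playerId"));
        out.put("timestamp", raw.get("timestamp"));
        // Fields specific to each event type.
        switch (type) {
            case LEVEL_UP:
                out.put("level", raw.get("level"));
                break;
            case PURCHASE:
                out.put("itemId", raw.get("itemId"));
                break;
            default:
                throw new IllegalArgumentException("Unhandled event type: " + type);
        }
        return out;
    }

    // Glue: read the type name from the raw event, resolve it, and dispatch.
    static Map<String, String> dispatch(Map<String, String> raw) {
        String name = raw.get("type");
        if (name == null) {
            throw new IllegalArgumentException("Event has no type field");
        }
        // EventType.valueOf throws IllegalArgumentException for unknown names.
        return createLoggableEvent(EventType.valueOf(name), raw);
    }
}
```

Rejecting unknown types loudly, rather than dropping them, is a design choice for the sketch: it makes a missing mapping visible while the generated layer is still being filled in.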
Paradise Bay — Game Servers
Paradise Bay has its own game servers on its own stack. It scales well and handled all the load that we threw at it for several years before it was acquired by King. It’s written in Java, but a lot of the Paradise Bay-specific code is in Lua. It runs in AWS and stores data in a MySQL database.
At the time, we had several million players of Paradise Bay. That’s the total who had ever played, not all at once. Once you have a handle on a player’s data, it should be trivial to remove all the relevant data. Except that it’s not.
Traditional good database design is to have a single key that allows you to look up any data related to an entry. That’s fine if you always have that key. But because we use many Ids to look up a player, we have our data laid out very differently. An example of how we denormalize data for easy lookup is demonstrated by the device/social network Id and game state Id relationships.
A very traditional database would be keyed on the Game State Id, and a server would find a player’s data by searching each entry until it found a Game State Id that referenced the Device Id.
In our denormalized layout, by contrast, you look up the device or social network Id directly and it tells you which Game State Id to use.
Because we want to identify a player in many ways, rather than always signing in with a login Id, we keep track of several Ids, such as the device Ids and social network Ids that we gather on login, that point to the game state. We do this by creating an entry in the database for each Id, and each entry stores the game state that it relates to. Because of the way that we write these down, we don’t necessarily have an automated way to do a reverse lookup for all device Ids! Your device will still point to your game state, which means that your game state is PII.
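A small in-memory sketch shows why this layout makes the forward lookup cheap and the reverse lookup expensive. The class `IdLookupSketch`, its method names, and the Id formats are hypothetical; the real data lives in MySQL, not in a `HashMap`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the denormalized Id -> game state entries.
class IdLookupSketch {
    // One entry per identifier (device Id, social network Id, ...),
    // each storing the game state Id it relates to.
    private final Map<String, String> idToGameState = new HashMap<>();

    void link(String anyId, String gameStateId) {
        idToGameState.put(anyId, gameStateId);
    }

    // Forward lookup: a single key read, whichever kind of Id is used.
    String gameStateFor(String anyId) {
        return idToGameState.get(anyId);
    }

    // Reverse lookup: nothing is indexed by game state Id, so every
    // entry has to be checked. Results are sorted for stable output.
    List<String> idsPointingAt(String gameStateId) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : idToGameState.entrySet()) {
            if (entry.getValue().equals(gameStateId)) {
                matches.add(entry.getKey());
            }
        }
        Collections.sort(matches);
        return matches;
    }
}
```

`gameStateFor` is what login needs, and it is fast. `idsPointingAt` is what a deletion request needs, and its full scan is the cost that the crawler has to pay.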
We don’t know which database entries might reference the player, so we need to open and check every single entry. It’s not the most elegant solution, but it allows us to cover everything and it’s easy to implement. It turned out that our database crawler was NOT fast enough. With many million players, at least 10 documents each, and the database spread across several machines, there’s a lot to crawl. Our comfort zone was to crawl all players within about a week, well inside the one-month deadline even in an ever-changing landscape of code, so we had our work cut out for us.
The crawler needs to crawl around 100 players per second to cover the whole database in a week. The metrics showed the code was clearly able to crawl around 200 players per second, but then the crawler would appear to hang for almost half an hour before moving on to the next group. Some debugging and digging deeper revealed the trouble. The keys are prefixed by document type and were coming in alphabetical order. Alphabetically, the first set of documents in a table are the ones that relate to the player, and the rest of the keys are basically irrelevant to the crawler. In practice, that meant we were processing keys that we KNEW we would never need, MILLIONS of them.
We know the prefixes of each key type that could contain PII. Sorting through every key was taking too long, but armed with this information about the prefixes, we were able to modify our SQL select statement to return only the relevant keys for possible PII entries and scrub those. Crawling only the key types that we knew might contain PII, we were able to move along PLENTY FAST ENOUGH! We were then ready for GDPR.
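A prefix-restricted select can be sketched like this. The table name `documents`, the columns `doc_key` and `body`, and the example prefixes are assumptions, not the real Paradise Bay schema. The sketch only builds a parameterized statement and its parameters, which could then be bound to a JDBC `PreparedStatement`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder for a select that only visits known key prefixes.
class PrefixQuerySketch {
    // Builds e.g. "... WHERE doc_key LIKE ? OR doc_key LIKE ? ORDER BY doc_key".
    static String buildSelect(List<String> prefixes) {
        if (prefixes.isEmpty()) {
            throw new IllegalArgumentException("At least one prefix is required");
        }
        StringBuilder sql = new StringBuilder("SELECT doc_key, body FROM documents WHERE ");
        for (int i = 0; i < prefixes.size(); i++) {
            if (i > 0) {
                sql.append(" OR ");
            }
            sql.append("doc_key LIKE ?");
        }
        sql.append(" ORDER BY doc_key");
        return sql.toString();
    }

    // One LIKE pattern per prefix. Assumes the prefixes contain no
    // '%' or '_' characters, which LIKE would treat as wildcards.
    static List<String> buildParameters(List<String> prefixes) {
        List<String> parameters = new ArrayList<>();
        for (String prefix : prefixes) {
            parameters.add(prefix + "%");
        }
        return parameters;
    }
}
```

Because each pattern is anchored at the start of the key, a database can typically satisfy it with a range scan on an index over the key column instead of reading every row, which is the effect described above.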
When all was said and done, we had accomplished two pretty major updates to our system that allowed us to continue to serve our customers in the new GDPR landscape. The analytics pipeline was a huge effort in translating code and working across data centers, while the scrubber crawler allowed us to stretch all the way down the stack to the database and back. It was a great project that left us with some real learnings about how to plan for these kinds of events in the future.