The Zynga Analytics System, Part 1: Overview

Kendall Willets
10 min read · Feb 21, 2018


Let me preface this piece by warning the reader that this is partly a technical description and partly a personal memoir, as I am only documenting what stands out most in my memory. It is largely a story of the early days of Data Engineering, before it became buzzwordy and cargo-cultish. The best I can promise is to break out the more interesting topics in subsequent posts.

I started at Zynga in January of 2009, as the third person hired for the Analytics team. At that time Zynga had a range of smaller games with DAU in the 100k range, and their analytics consisted mainly of gathering a few stats from nightly backups of game databases, and from the payments system, which they had wisely separated completely from the rest of the company. Some studios such as Poker had better analytics that they had developed on their own.

Although the company had only about 200 employees, it was already undergoing growing pains. There were tales of fights between the founders, lawsuits over poached employees, and completely irrational stress-triggered drama. Our task was to add analytics.

Our initial objective as a group was to centralize the analytics flow, which consisted of an internal reporting website, data repositories, and data gathering. I don’t remember which part I worked on first, but I had to go through a fairly difficult time learning PHP and our execrable internal web stack. It was mostly cut-and-paste of some forms-handling Javascript and PHP; there was no framework of any significance.

Our data repository was initially MySQL, and basic DAU aggregations took over an hour. However, we were experienced enough to look for a better data warehouse, so we sprinted through an analysis of several platforms in a couple of weeks. Products based on map-reduce did not fare well; the main factor was time to first rows, since even a simple query required several minutes of mapping before reducing. Vertica, on the other hand, was impressive, with a combination of distribution, column-orientation, compression, and sorting that was hard to beat. We also got a fairly good price, as Vertica was looking for an entry into the game industry.

I ran one of the first queries on the production Vertica system, the same DAU aggregation that had taken an hour before, and I remember it took about two tenths of a second because we had aligned the projections with the aggregation. I switched our database interface to the Postgres driver (since Vertica speaks the same wire protocol), and since then we've run millions of queries through it.
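That wire-protocol compatibility meant any Postgres client could talk to the cluster. As a rough sketch of the kind of DAU query involved (the host, credentials, table, and column names below are hypothetical, and whether a current Postgres driver still connects to a current Vertica depends on versions):

```python
# Sketch only: a DAU-style aggregation run over the Postgres wire protocol.
# Connection details and schema are made up for illustration.
import psycopg2

conn = psycopg2.connect(host="vertica.example.internal", port=5433,
                        dbname="analytics", user="analyst", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT game_id, COUNT(DISTINCT user_id) AS dau
        FROM events
        WHERE event_date = CURRENT_DATE - 1
        GROUP BY game_id
        ORDER BY dau DESC
    """)
    for game_id, dau in cur.fetchall():
        print(game_id, dau)
```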

It was not all smooth sailing, however, since Vertica was going through growing pains of its own. For the first year or so we could not do non-segmented joins; all the tables in a join had to be pre-partitioned so that joined rows were on the same node. We also used to joke that one couldn’t be a real analyst until one had brought down Vertica at least once; there were a number of user-level SQL operations which would bring down the entire cluster.

In the end, though, Vertica was a key part of analyst productivity, and we eventually scaled it to hundreds of servers, loading over a terabyte per hour. Map-reduce and cloud products are still catching up to what Vertica was doing nine years ago.

ZTrack, An Instrumentation Pipeline

Concurrently with scaling our data repository, we also began to build an instrumentation infrastructure, which we named ZTrack. The main idea of instrumentation is to put hooks in application code that record certain events such as installs, visits, messaging, and any other event that product managers, developers, or analysts deem necessary.

Part of the success of ZTrack was due to organizational factors, as OKRs were tied to metrics fed through it. Unfortunately this stimulus also meant that I had a constant stream of PMs at my desk with “is stats working?” requests, and later assessments found that instrumentation put a load on developers as well. But over time these PMs took over most of the analytics cycle, defining their own metrics and shepherding them through implementation and analysis, while we focused on infrastructure and a loose schema defined around core metrics. Developers also told me they liked the system, as they could use it to log bugs and count the failures they couldn’t fix right away. I referred to it as the world’s most expensive debugger.

In our environment we had a lot of PHP code, and a little bit of Java, so we developed ZTrack as a simple PHP and Java API that allowed people to log events to some sort of black hole that we managed. This black hole was actually a transport layer that serialized and moved data from the application server, through a series of pipes to Vertica and a couple of other centralized data consumers. It was a primitive version of Kafka or Kinesis, without the 24x7 support that makes these services tolerable.
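The real API was PHP and Java, so the following is only a sketch of the shape of such a call; the function name, fields, port, and framing are placeholders, and the socket write stands in for the real handoff to the local transport.

```python
# Illustrative sketch of an instrumentation call; the real ZTrack client was
# PHP/Java, and the names, port, and wire format here are placeholders.
import json, socket, time

def ztrack_count(category: str, **fields) -> None:
    """Record one event and hand it to the local forwarding agent, fire-and-forget."""
    event = {"ts": int(time.time()), "category": category, **fields}
    payload = (json.dumps(event) + "\n").encode()
    try:
        # Stand-in for the real handoff to the local scribed daemon.
        with socket.create_connection(("127.0.0.1", 1463), timeout=0.05) as sock:
            sock.sendall(payload)
    except OSError:
        pass  # instrumentation must never take the game server down with it

# e.g. ztrack_count("install", game_id=101, user_id=12345, platform="web")
```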

We based these pipes on scribed, a Facebook internal product written in C++. One weekend morning I saw that my manager was getting segfaults from building scribed against the wrong Boost version, so I fixed it for him, static-linked it, and ended up supporting it for years to come. It’s actually a remarkably bug-free tool with a small footprint, but we still ended up doing a huge amount of DevOps-type configuration and support to keep it fully available.

We set up ZTrack to validate and serialize each event into CSV format on the client, so that the Vertica loader received a file from scribed every minute with directly loadable data. While CSV was compact, we had to fix the column list in the loader every time we added a field to the API; our scheme was quite schema-dependent.
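A sketch of that coupling, with hypothetical column and table names; the COPY statement in the comment is roughly what a loader might run against each minute-file from scribed.

```python
# Sketch of the schema-dependent CSV step; the column list and table names
# are hypothetical, but the coupling between client and loader is the point.
import csv, io

COLUMNS = ["ts", "game_id", "user_id", "category", "value"]  # must match the loader

def to_csv_line(event: dict) -> str:
    buf = io.StringIO()
    csv.writer(buf).writerow([event.get(col, "") for col in COLUMNS])
    return buf.getvalue()

# Loader side, roughly, for each minute-file delivered by scribed:
#   COPY analytics.events (ts, game_id, user_id, category, value)
#   FROM LOCAL '/incoming/events-200903011205.csv' DELIMITER ',';
# Adding a field to the API meant touching both COLUMNS and that column list.
```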

Scribed has a very simple flow-control algorithm: it buffers messages up to a configured buffer size or timeout, and then tries to send them to its list of downstream destinations. Unfortunately if even one of these destinations is backed up, the entire buffer send fails, and that server in turn becomes backed up. I spent a lot of time diagnosing and fixing these issues.
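A toy model of that behavior, not scribed's actual code, just the buffer-then-fan-out logic as I remember it:

```python
# Toy model of the flow control described above (not scribed's actual code):
# buffer until a size or age threshold, then try to flush to every downstream;
# if any destination fails, the buffer is kept and this node backs up in turn.
import time

class Forwarder:
    def __init__(self, downstreams, max_bytes=1 << 20, max_age_s=10):
        self.downstreams = downstreams      # callables that send a batch, may raise OSError
        self.max_bytes, self.max_age_s = max_bytes, max_age_s
        self.buffer, self.size, self.oldest = [], 0, None

    def append(self, msg: bytes) -> None:
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(msg)
        self.size += len(msg)
        if self.size >= self.max_bytes or time.monotonic() - self.oldest >= self.max_age_s:
            self.flush()

    def flush(self) -> None:
        try:
            for send in self.downstreams:
                send(self.buffer)           # one backed-up destination fails the whole flush
        except OSError:
            return                          # keep buffering; upstream senders back up next
        self.buffer, self.size, self.oldest = [], 0, None
```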

One of the oddest problems we had with data transport was due to sshd, which we used to ship scribed data between data centers. SSH flow control uses a fixed-size channel window, and the total amount of unacknowledged data in flight cannot exceed it. This limit means that channel capacity is inversely proportional to latency, so a 70 ms transcontinental ping time translates to about 550 kB/s.
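Back-solving from those two numbers gives a feel for how small the effective window was:

```python
# Back-of-the-envelope check on the figures above: with a fixed window,
# sustained throughput can be at most window / round-trip time.
rtt_s = 0.070                # ~70 ms transcontinental ping (round trip)
throughput_bps = 550e3       # ~550 kB/s observed ceiling
window_bytes = throughput_bps * rtt_s
print(f"implied in-flight window: about {window_bytes / 1e3:.0f} kB")  # ~38 kB
```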

What this bandwidth constriction meant at the time was that thousands of Farmville’s EC2 servers would be delayed in sending data for many hours of the day (normally we could get an event end-to-end in about three seconds).

We initially didn’t know much about the sshd problem, but I found a patch that someone had developed specifically for it. I messaged the CTO several times about it, but nobody did anything, so one night I asked the Farmville lead for an EC2 instance and applied it myself. It was non-trivial, since our production distribution was actually downlevel from the patch, and I had to search for the lowest version the patch could handle and the highest version the distro could run. Then I had to bring up the new sshd on a cloud instance whose only link to me was sshd itself. It did work, and extremely well. I manually installed the new sshd on the long-haul machines that night.

In case it’s not obvious by this point, we had root on virtually everything that wasn’t payments. In order to move fast, we had to be able to patch ZTrack or scribed on other people’s production machines, as well as keep our own transports and Vertica clusters running. The cross-organizational nature of stats meant that we had access to everything, but we went through a lot of stress making sure we didn’t break other people’s products, and we lacked a test-driven release process. As a result, we never quite got ourselves out of the alerting loop.

Experimentation

Once we had instrumentation working, we also moved full-speed into controlled experimentation, or A/B testing. We added a ztrack_experiment call that assigned the user to a variant and logged the event to the backend, so that the users who received the experiment treatment could be aggregated and their stats compared against control (some people call these triggered experiments).

We also needed a way to manage these assignments and handle ramp-up, feature deployment, and so on. We built this on our execrable LAMP stack: a table of experiments and their settings, edited through an internal web page. Every minute, a cron job would poll this database and push the settings to a production tier of memcached instances for use by the client library. ztrack_experiment would then take its parameters (game ID, experiment name, and so forth), build a key, and look up the experiment settings in one of these memcaches, as well as in the local APC cache.
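The lookup path looked roughly like the sketch below; the key format and cache clients are hypothetical, but the two-level lookup (local APC first, then the memcached tier) is the idea.

```python
# Sketch of the experiment-settings lookup; `apc` and `memcache` are stand-ins
# with get/set semantics, and the key format is made up for illustration.
def experiment_key(game_id: int, experiment: str) -> str:
    return f"ztrack:exp:{game_id}:{experiment}"

def load_settings(game_id, experiment, apc, memcache):
    key = experiment_key(game_id, experiment)
    settings = apc.get(key)              # per-server local cache first
    if settings is None:
        settings = memcache.get(key)     # then the production memcached tier
        if settings is not None:
            apc.set(key, settings, 60)   # keep a short-lived local copy
    return settings
```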

The actual variant assignment was a pseudorandom hash of the user’s ID and an experiment-specific salt (I think it was the experiment ID). The hash was then converted to a scalar in [0,100) or so, and values below the threshold (a percentage supplied by the experimenter) would be assigned to the experiment treatment. We also handled multiple treatment variants and so on, but it was all based on this number line.
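A sketch of that assignment logic; the exact hash function and salt format are lost to memory, so the specifics below are placeholders, but the bucketing is the same idea.

```python
# Sketch of hash-based variant assignment as described above; the hash and
# salt format are assumptions, only the deterministic bucketing matters.
import hashlib

def assign_variant(user_id: int, experiment_salt: str, percent_in_test: float) -> str:
    digest = hashlib.sha1(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0   # deterministic scalar in [0, 100)
    return "treatment" if bucket < percent_in_test else "control"

# The same user always lands in the same bucket for a given experiment:
# assign_variant(12345, "exp_42", 10.0) -> "treatment" or "control"
```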

This process posed a number of distributed programming problems. For instance, we didn’t want every thread on the server to try to refresh the APC entry at once after expiration, so we kept the expired cache entry around for other threads to use while a single thread did the refresh, which itself could fail. In the end I wrote a complete test harness to simulate cache failure modes and rates and do factorial testing.
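A minimal sketch of that refresh pattern (serve the stale entry while a single winner refreshes); `cache` here is a stand-in with memcache-like get/add/set semantics, not the real APC client.

```python
# Sketch of "serve stale while one worker refreshes"; `cache` is a stand-in,
# where add() succeeds only if the key is absent (used as a short-lived lock).
import time

def get_settings(cache, key, fetch, ttl=60, grace=300):
    entry = cache.get(key)                       # entry = (value, expires_at) or None
    now = time.time()
    if entry is not None:
        value, expires_at = entry
        if now < expires_at:
            return value                         # fresh copy
        if not cache.add(key + ":lock", 1, 10):  # only one worker wins the refresh
            return value                         # everyone else serves the stale copy
    try:
        value = fetch()                          # e.g. read the settings database
        cache.set(key, (value, now + ttl), ttl + grace)
    except Exception:
        if entry is not None:
            return entry[0]                      # refresh failed: fall back to stale
        raise
    return value
```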

We also encountered issues with horizontal failover and even pusher performance as we scaled out to thousands of experiments. One of my more dubious achievements was writing a socket-level asynchronous memcache interface in PHP; it won’t win any systems programming awards, but it increased performance by orders of magnitude.

A few years ago, a company called LaunchDarkly developed an experimentation service with similar architecture (but better components), and their cache expiration and refresh protocol ended up being exactly the same; I take that as some validation of our work. That said, this type of problem is probably solved in some distributed programming textbook.

Realtime And Other Efforts

As the load on Vertica grew, we found a lot of people running simple queries like DAU or installs every few minutes to monitor their games, so our initial push into realtime was to relieve that load. We built simple aggregators to sum these quantities over 1- or 5-minute intervals, directly from the incoming scribed datastream. I don’t even remember where we stored these, but they were time series of a few hundred rows per day, so I think it was a small SQL database.
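In spirit these were just tumbling-window counters over the event stream, something like the sketch below, with the event source and the storage step left abstract.

```python
# Sketch of a tumbling-window counter in the spirit of those aggregators;
# the event source and the sink are placeholders, and events are assumed
# to arrive in roughly timestamp order.
from collections import Counter

def aggregate(events, window_s=60):
    """events yields (timestamp, game_id, metric); emit per-window counts."""
    counts, current_window = Counter(), None
    for ts, game_id, metric in events:
        window = int(ts) // window_s * window_s
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)   # e.g. write a few rows to SQL
            counts.clear()
        current_window = window
        counts[(game_id, metric)] += 1
    if current_window is not None:
        yield current_window, dict(counts)
```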

A second generation of this system proved to be an embarrassment, as the organization created a system of dozens of machines that amounted to about 0.2 COST (in the “Configuration that Outperforms a Single Thread” sense). I made some efforts to turn the ship, but Zynga by that time had become top-heavy and sclerotic.

Vertica itself was also quite fast at loading, so the latency from logging to querying was under five minutes typically, and we never really broke out realtime as a strongly independent product.

Data Products

Another big area for Analytics was in what we called Data Products: useful aggregations of data that could be put back into use in games or marketing.

We pushed a nightly build of user profiles and friend lists, etc. to be used by games in whatever way they saw fit. The suggestions to send messages, gifts, and so on to other friends playing the game often originated from this data, as well as offers of discounts for frequent payers.

We also developed a system for quests, which are sequences of actions meant to result in some reward for the user, in return for, say, trying a new game or feature. Since we held all the data for all the games, we played a big role in cross-promotions, as Zynga tried to capitalize on the network effect. I was hoping to get a Rete network going for this, but we ended up just refiring all the rules on every event, which was probably more robust.

Unfortunately we also got pulled into data recovery efforts fairly regularly, as we had basic stats like level, items, and so on that could be used to restore corrupted user state. Mafia Wars at one point expanded the size of its user blobs while forgetting to increase its memcache entry size limit, so thousands of blobs were truncated. Stats filled in some of the missing data.

Social Graph Analytics

A perennial part of this user profiling was to find “influencers” or users who exerted a disproportionate amount of influence on others. We looked at a number of metrics but never really embraced global graph analysis.

It was technically interesting, however, and at one point I ran a betweenness centrality calculation in Vertica that yielded a few hundred heavy hitters. We found that other tools didn’t scale well, and it was really a problem of physically distributing the social graph across the network graph of the cluster, so Vertica seemed as good a tool as any.

I also ran some simpler clustering coefficient calculations across various games to see if they were more random or socially connective; there were some differences, but not a lot of actionable results.
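The sort of check involved looks roughly like this, comparing a game’s friend graph against a random graph of the same size and density; the data access is elided, and the library choice here is mine rather than what we actually used.

```python
# Sketch of the comparison described above: clustering coefficient of a
# game's friend graph vs. a size- and density-matched random graph.
import networkx as nx

def connectivity_check(edges):
    g = nx.Graph(edges)                      # edges: list of (user_a, user_b) pairs
    n, m = g.number_of_nodes(), g.number_of_edges()
    random_g = nx.gnm_random_graph(n, m, seed=42)
    return nx.average_clustering(g), nx.average_clustering(random_g)

# A friend graph that clusters far more than its random counterpart is
# "socially connective"; one that doesn't is closer to random.
```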

The main reason for not adopting sophisticated graph calculations is that they didn’t give us much beyond what our detailed activity and messaging stats already had; we could measure social influence just by how many messages a person sent and the amount of reciprocity.

The End Game

One of the problems with centralized analytics was that, while game developers could move around from game to game, I had a skillset that really only fit within that central group, and while I had good lateral relationships with analysts, PMs, and developers all over the company, there weren’t that many other roles for me to move into.

My last year or so I became a kind of embedded data engineer for game releases. Stats was a blocker for release, so we had to get everything configured and tested in the EC2 launch scripts, including our friend scribed, and that was actually kind of fun. I developed a number of tools for devs to debug their stats, and I enjoyed making their lives a little easier. It turned out that instrumentation took about one mythical man-month to implement on most games, so building tools to help debug and QA it was time well spent.

But the financial winds eventually caught up with us, and I went out the door in 2013 with a sense of relief.
