Zynga With Friends: Enhancing Server Side Scalability

Words With Friends Engineering
Published in Zynga Engineering · Dec 9, 2019 · 9 min read

Ojas Sangameswara | Principal Software Engineer

Introduction

Zynga® With Friends™ (ZWF) is the backend service powering games such as Words With Friends®, Chess With Friends®, and Crosswords With Friends®, handling game data, sending push notifications, and maintaining progression systems, amongst many other features in these games. In its ten-year existence, we have handled hundreds of billions of moves comprising tens of billions of games across multiple titles, with a constant stream of new features and game modes to support. This has presented a never-ending, but exciting, challenge: how can we minimize disruptions for our players while continuing to add on to an enormous system?

The Challenge

Members of the ZWF team develop and support a monolithic server codebase, which in the past powered as many as seven live game titles. In 2017, the release of Words With Friends 2®, with new features such as Solo Challenge and Lightning Round, as well as Crosswords With Friends, presented an additional challenge to the system.

In 2017, our leadership asked us to increase the number of daily active users we could support while minimizing game-disrupting issues, an uphill battle in light of the additional features and games we were to support, and all while the client-side team undertook a massive challenge of its own. The Zynga With Friends backend team spent months identifying pain points, assessing potential solutions, and finally putting into action a plan that allowed us to both increase our ability to scale and support everything we wanted to add to our games. As both a backend development team and a live operations support team, we also sought to minimize the amount of off-hours support time our team members would need to spend dealing with any issues that came up in pursuit of these goals.

The Problems

A series of team meetings in early 2017 led us to identify these primary issues we wanted to solve:

  • Lack of resiliency and quick recovery from database, Memcache, and Redis failovers
  • Lack of granular visibility into our database usage
  • Aging code in need of re-evaluation and refactoring
  • Unaudited client API usage, hiding inefficiencies and redundant calls
  • Slow identification of issues
  • Poor handling of sudden bursts of traffic

Ultimately, finding solutions to these problems would lead to these benefits:

  • Minimized downtime
  • Quicker response times
  • Fewer application servers needed to handle all traffic
  • Deeper team knowledge of a massive codebase
  • Less time spent dealing with outages and off-hours issues, ultimately leading to a happier team!

The Solutions

A company decision to return to the Amazon Web Services™ (AWS™) cloud services platform gave us a new option for resiliency: the array of managed services AWS provides. In addition, we evaluated several tools and libraries while making enhancements to our usage of services already implemented in the Zynga With Friends system.

Databases

ZWF databases initially were self-managed MySQL processes running on Amazon Elastic Compute Cloud™ (Amazon EC2™) instances. After evaluating our current setup and various alternatives, we quickly came to the decision to pursue a migration to Amazon Aurora™. Several benefits stood out:

  • We no longer had to account for both the compute power and size of our databases when choosing instance size — we could optimize for compute power and run on smaller instance sizes despite the size of our data sets.
  • We could handle failovers with minimal impact by migrating our application to hit cluster endpoints rather than direct addresses of the Amazon EC2 instances running our databases previously.
  • The existence of read-only cluster endpoints provided an ability to handle even more traffic easily by redirecting reads to multiple readers within a cluster, a huge benefit for some of our more read-heavy workloads.
  • We could add Amazon CloudWatch™ (CloudWatch™) monitoring to our array of monitoring tools, and integrate it with our On-Call notification system.
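The read-only endpoint routing in the second and third points can be sketched in a few lines. This is a minimal Python illustration, not our production code; the endpoint hostnames are hypothetical, though Aurora really does expose a writer "cluster" endpoint and a read-only "reader" endpoint per cluster:

```python
# Hypothetical Aurora endpoints: writes go to the cluster (writer) endpoint,
# reads can be spread across the readers via the read-only endpoint.
WRITER_ENDPOINT = "zwf.cluster-abc123.us-east-1.rds.amazonaws.com"
READER_ENDPOINT = "zwf.cluster-ro-abc123.us-east-1.rds.amazonaws.com"

READ_ONLY_VERBS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}

def endpoint_for(query: str) -> str:
    """Pick an endpoint based on the query's leading verb."""
    verb = query.lstrip().split(None, 1)[0].upper()
    return READER_ENDPOINT if verb in READ_ONLY_VERBS else WRITER_ENDPOINT
```

Because the endpoints are DNS names managed by Aurora, a failover changes which instance is behind the writer endpoint without the application needing to reconnect to a new address.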

With the help of our fantastic Database Administration team, we replicated our data, which lived on a set of self-managed Amazon EC2 instances, to Amazon Aurora instances. Pointing our application at cluster endpoints meant that when a failover did occur, our application did not suffer downtime and experienced only minimal impact. This left us with one remaining problem: Amazon Aurora’s connection limit, which is much lower than what we were able to set on self-managed MySQL instances. Projected future growth would put us in danger of approaching that limit, and we wanted to make sure we would not be close to it even with a dramatic increase in traffic.

The team evaluated several potential solutions before settling on the ProxySQL™ connection-pooling proxy, running on every server. We rewrote our application to redirect all queries through this proxy and, with minimal impact on application performance, cut the maximum number of connections we needed by two-thirds, leaving us ample headroom to scale within Aurora.
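We can't reproduce our ProxySQL configuration here, but the core idea of connection pooling — many callers sharing a small, fixed set of backend connections instead of each opening its own — can be illustrated in a few lines of Python. This is a toy sketch of the concept, not how ProxySQL is implemented:

```python
import queue

class ConnectionPool:
    """Toy connection pool: callers share a small, fixed set of backend
    connections instead of each opening their own."""

    def __init__(self, connect, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())   # open `size` connections up front

    def run(self, fn):
        conn = self._pool.get()         # block until a connection is free
        try:
            return fn(conn)
        finally:
            self._pool.put(conn)        # return it to the pool for reuse

# With a pool of 2, 100 queries reuse the same 2 "connections".
opened = []
pool = ConnectionPool(lambda: opened.append(1) or object(), size=2)
results = [pool.run(lambda conn: id(conn)) for _ in range(100)]
```

The key property is that the number of backend connections is fixed by the pool size, no matter how many application-level callers there are, which is exactly what kept us under Aurora's connection limit.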

Queries and Code Optimizations

In parallel, we were very much aware that, given the size of our codebase and ever-evolving client call patterns, we lacked visibility into which queries were putting the most load on our databases. Implementing the VividCortex™ database profiling service let us see which queries took particularly long or ran more frequently than we thought. In addition, we were able to find frequent queries that were missing indexes, a task that would have been very difficult otherwise given the number of distinct queries our servers run.
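A query profiler of this kind works roughly by normalizing literal values out of each query into a fingerprint, then ranking fingerprints by total time so the heaviest query shapes stand out. A toy Python sketch of that idea (the query text, timings, and the simple regex are all made up for illustration — real profilers use proper SQL parsing):

```python
import re
from collections import Counter

def fingerprint(sql):
    """Collapse numbers and string literals to '?' so queries that differ
    only in their parameters share one fingerprint."""
    return re.sub(r"\d+|'[^']*'", "?", sql.lower()).strip()

# Hypothetical (query, seconds) samples as a profiler might capture them.
samples = [
    ("SELECT * FROM games WHERE id = 7", 0.9),
    ("SELECT * FROM games WHERE id = 8", 1.1),
    ("SELECT * FROM moves WHERE game_id = 7", 0.2),
]

total_time = Counter()
for sql, seconds in samples:
    total_time[fingerprint(sql)] += seconds

heaviest, _ = total_time.most_common(1)[0]
```

Ranking by aggregate time rather than per-execution time is what surfaces cheap-but-extremely-frequent queries, which is where missing indexes tend to hide.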

Using the data from VividCortex in conjunction with our connection pooling work and the migration to Aurora, we were able to downsize our databases while minimizing issues due to load or hardware failure. As a result, problems with Zynga With Friends games caused by databases have been dramatically reduced. In addition, our databases no longer slow down during large bursts of traffic and can handle several multiples of today's peak traffic.

Memcache

Our databases are fronted by a Memcache pool. The call patterns our new games and features introduced caused rapid auto-scaling, which led to influxes of new connections as well as an array of hot-key issues.

In line with our move to Amazon Aurora for databases, we chose to move forward with Amazon ElastiCache™ (ElastiCache™) for our Memcache cluster. As with Amazon Aurora, we were able to use CloudWatch data to monitor the cluster. Hardware issues, which previously required us to manually evict or replace instances, now occur rarely. In the past, such an issue could result in significant downtime; with ElastiCache, we instead see a brief burst of errors lasting seconds, with little user-facing impact. Vertically or horizontally scaling the cluster has become a matter of a few clicks.

After some additional evaluation, we moved forward with the Twemproxy™ connection-pooling proxy for both Memcache and Redis, as a tool to minimize the number of connections we open to our Memcache instances. By eliminating the thundering-herd connection problem, this removed Memcache as a bottleneck for horizontally scaling our service.
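Beyond pooling connections, Twemproxy spreads keys across the cache instances using consistent hashing, so adding or removing an instance remaps only a fraction of the keyspace. Here is a toy Python hash ring showing the idea (server names are hypothetical, and Twemproxy's actual hash and distribution options differ):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each server is placed at many points on a
    ring, and a key maps to the first server point at or after its hash."""

    def __init__(self, servers, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
```

Because every client computes the same mapping, any key is served by exactly one instance, which is also what makes hot keys visible: a single overloaded key lands on a single server.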

Redis Instance Management

Additionally, we managed a set of Redis™ instances containing various key-value stores integral to our games. Manual intervention was previously required in the case of a hardware failure, but moving to ElastiCache allowed for failover to happen seamlessly with minimal errors.

Redis connections have not been as much of a scaling bottleneck as Memcache connections, but we are considering options for improving our connection pooling there as well, with Twemproxy for Redis being one of them.

Client Auditing

Predictably, the new games and features we introduced resulted in call patterns that differed from what we had seen historically. We chose to continue using the Instrumental™ monitoring service, which ingests data we send through StatsD™ clients running on our servers. We have used Instrumental for many years for easy-to-consume data visualization and alerting on service data including timing, counts, memory usage, and CPU utilization, amongst many other metrics.

We took data from Instrumental as a lead to analyze particularly high volume and compute-heavy endpoints, and followed that up with changes on the API side, while providing information to client teams to optimize their call patterns.

Instrumental's ability to overlay historical data on current data also helps us detect unusual behavior in new releases. We can quickly see in our dashboards when one of our key metrics reaches unexpected values, narrow down the time at which the issue started, and attribute it to a specific event, server release, or client update.

Alerting

We link callbacks from CloudWatch and Instrumental to the PagerDuty™ incident management platform through AWS Lambda™, which functions as a funnel to route and manage all our alerts. Over time, we have tuned thresholds to separate real issues from false positives. This has allowed us to create a system where we are alerted only when an issue requires action, and where the person on call is given data from Instrumental and CloudWatch, plus playbooks where applicable.
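As a sketch, such a Lambda funnel might look like the following: a hypothetical handler that normalizes a CloudWatch alarm delivered via SNS into a PagerDuty Events-style payload. The alarm-name prefixes, the severity mapping, and the redacted routing key are all assumptions for illustration, not our actual routing rules:

```python
import json

# Hypothetical routing rule: alarm-name prefix decides severity.
SEVERITY_BY_ALARM_PREFIX = {"db-": "critical", "cache-": "warning"}

def handler(event, context=None):
    """Turn a CloudWatch alarm (delivered via SNS) into a PagerDuty
    Events-style trigger payload."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    alarm = message["AlarmName"]
    severity = next(
        (sev for prefix, sev in SEVERITY_BY_ALARM_PREFIX.items()
         if alarm.startswith(prefix)),
        "warning",  # default for alarms with no matching rule
    )
    return {
        "routing_key": "REDACTED",
        "event_action": "trigger",
        "payload": {
            "summary": f"{alarm}: {message['NewStateReason']}",
            "source": "cloudwatch",
            "severity": severity,
        },
    }
```

Centralizing this mapping in one function is what makes threshold and routing tuning cheap: one deploy changes how every alert is classified.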

Additionally, over time we have reduced the number of alerts on-call developers have received, a fact truly appreciated by those who have been on-call in the past. While not every alert can have a predicted solution right away, we are able to use our history to assist in diagnosing new issues as they come up.

Spot Instances

In addition to all of these enhancements, we moved to making better use of Amazon EC2 Spot Instances™ in AWS, which allowed us to balance stability against cost. Previously, we kept a fixed array of application servers corresponding to the expected maximum we'd need at any given point, which led to overspending on AWS instances. Initially, we allowed our Auto Scaling groups to scale up as needed by requesting on-demand instances, but this led to elevated costs, as AWS charges its highest prices for instances requested in this manner. Instead, we reduced the number of reserved instances to what we'd need to cover the majority of the day, and used spot instances to fill in the gaps during the highest-traffic portions. While we sacrificed the stability of constantly running servers, choosing multiple instance types to run the service on has led to minimal issues while greatly reducing costs.
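The tradeoff can be illustrated with toy numbers. In this Python sketch, the traffic curve, prices, and reserved-instance discount are all made up; the point is only that reserving for the baseline and filling peaks with cheaper spot capacity beats reserving for the absolute peak:

```python
# Hypothetical servers needed per hour: a quiet overnight baseline,
# ramping to an evening peak.
hourly_servers_needed = (
    [40] * 7
    + [60, 80, 90, 100, 100, 95, 90, 85, 90, 100, 110, 120, 110, 90, 70, 50, 40]
)

def daily_cost(reserved, on_demand_price, spot_price, reserved_discount=0.6):
    """Cost of one day: reserved instances run (and are billed) 24h at a
    discount; demand above the reserved floor is served by spot capacity."""
    fixed = reserved * on_demand_price * reserved_discount * 24
    burst_hours = sum(max(0, need - reserved) for need in hourly_servers_needed)
    return fixed + burst_hours * spot_price

# Reserving for the absolute peak vs. reserving for the baseline:
peak_cost = daily_cost(reserved=120, on_demand_price=1.0, spot_price=0.3)
mixed_cost = daily_cost(reserved=90, on_demand_price=1.0, spot_price=0.3)
```

The risk being traded away is spot interruption during peaks, which is why spreading requests across multiple instance types matters: an interruption in one spot pool can be backfilled from another.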

Results and Looking Ahead

As a result of all these changes, the Zynga With Friends backend team can definitively say that our system is in a better place. We have minimized outages, increased performance, and cut costs where possible. Team members spend minimal time outside of work dealing with game impacting events, and correspondingly, turnover and burnout have been nearly non-existent.

[Image: Zynga System Health Monitor — GWF Health, uptime last 7 days: 100%]

However, we know this is a task that is never quite complete. There is still work to be done to improve our efficiency and resiliency as Zynga With Friends expands. As we continue to add new features to our games and deliver things sure to delight our players, we know there will be new heavy queries, compute-intensive code, and hardware issues to detect. What we now have are the tools and team knowledge to handle them quickly and cost-effectively. The Zynga With Friends system will continue to evolve while maintaining a high level of service for many years to come.

Zynga, With Friends, Words with Friends, Words with Friends 2, Chess with Friends, and Crosswords with Friends are trademarks, registered trademarks or trade dress of Zynga in the U.S. and/or other countries. All other trademarks not owned by Zynga that appear in this article are the property of their respective owners and this is an independent article and is not affiliated with, nor has it been authorized, sponsored, or otherwise approved by the respective owners.

Amazon Web Services, AWS, Amazon Elastic Compute Cloud, Amazon EC2, Amazon Aurora, Amazon CloudWatch, CloudWatch, AWS Lambda, Amazon EC2 Spot Instances, and Spot Instances are trademarks, registered trademarks or trade dress of AWS in the U.S. and/or other countries. VividCortex is a trademark of VividCortex, Inc. Twemproxy is a trademark of Twitter, Inc. ProxySQL is a trademark of René Cannaò. Instrumental is a trademark of Expected Behavior, LLC. PagerDuty is a trademark of PagerDuty, Inc. Redis is a trademark of Redis Labs, Ltd.
