Game Time!

Wealthsimple
Maker Stories by Wealthsimple
6 min read · Aug 6, 2021

How Wealthsimple engineers use Game Days to sniff out trouble before it occurs.

Artwork by Sammy Yi

Here at Wealthsimple, client trust is what keeps us up at night. Not just because we know people rely on us to grow their retirement savings, help them amass a down payment for a home, and pay for college educations — though those are responsibilities we take very seriously. It’s also because, in a 24/7 world, we know clients count on services that are robust and reliable around the clock. As our user base grows rapidly (and if you’re a client, new or old: thank you!), we’re putting the pedal to the metal proactively, scaling our systems to make personal finance simple and accessible, whenever you need it.

One element of scaling is to thoroughly check each code change before it is deployed. However, this isn’t enough (on its own) to ensure resilience. It’s possible for software to break in unexpected ways (and at unexpected times), so we need to be able to respond quickly and effectively to these failures. Anticipating the unanticipated is the only way to improve our ability to respond quickly to emergent problems.

Militaries often use war games to test strategies and probe for unforeseen contingencies. At Wealthsimple, we use Game Days to inject controlled failures and test how well we can respond to them. These exercises help us to improve the resilience of our systems, the skills of our team, and ultimately the dependability of the service we provide. Sometimes breaking a little code (on purpose) is the best way to make it more reliable.

What are Game Days? Do I Need to Bring My Console?

Game Days are when we purposely trigger a failure in our production systems and then try to deal with it as well, and as quickly, as we can.

Triggering failures in this way often reveals a lot about the way our systems work, and how we work with them. Sometimes, Game Days reveal how our systems interact with each other in unpredictable situations — like when we impact one or more critical services or infrastructure components that other services depend on.

We also track our responses to these challenges and analyze them to highlight areas of improvement, create action items for future development, and ultimately make our systems even more reliable and resilient.

Why Would You Purposely Break Things? Didn’t Your Mom Teach You That’s a Bad Idea?

The goal is to turn failure to our advantage by preemptively injecting controlled failures in production systems so that issues can be surfaced and fixed before customers encounter them.

Rules of the Game: How We Implemented the Strategy at Wealthsimple

Game Days have proved so useful that they’ve become a regular part of our development process. There are several ways of implementing them. Here’s how we do it at Wealthsimple.

As part of our scaling efforts, we have been migrating our AWS RDS databases to Aurora. This migration has put us in a slightly difficult position when it comes to upgrading and testing customer-facing systems. Essentially, it is no longer feasible for us to plan site-wide downtime for database operations. Some of our products, like Wealthsimple Crypto, need to be continuously accessible because the crypto market is always open. Furthermore, our customers are now in so many time zones that there is no suitable window for downtime. “Service unavailable” is unacceptable.

That’s why we opted to use Game Days instead of maintenance windows for migrating databases. Each of our Game Days is then implemented across two stages — planning and execution.

Step 1: Pre-Game

We start the process of planning a Game Day by communicating and coordinating with service owners and teams. This ensures that they are aware of the potential disruption it may cause. In addition, and well ahead of the Game Day, we share a document that allows service owners to plan their own response to the Game Day and forecast the impact on the services they manage.

Then, working alongside service owners, we create a list of all known failure scenarios that could result from injecting the failure. This list contains answers to a few crucial questions:

  • What services are expected to be impacted?
  • What services are expected to not be impacted?
  • What are the underlying assumptions about how these services interact with the service(s) that we are simulating failure on?
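
As a minimal sketch, the answers to these questions could be captured in a small structured record. The class, field, and service names below (`GameDayPlan`, `expected_impacted`, `accounts-db`, and so on) are hypothetical, chosen purely for illustration; they are not a description of Wealthsimple's actual tooling.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class GameDayPlan:
    """Hypothetical record of the pre-game answers for one Game Day."""

    target: str  # the service we plan to inject failure into
    expected_impacted: Set[str] = field(default_factory=set)
    expected_unaffected: Set[str] = field(default_factory=set)
    assumptions: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # A service cannot be expected to be both impacted and unaffected.
        overlap = self.expected_impacted & self.expected_unaffected
        if overlap:
            raise ValueError(f"conflicting expectations for: {sorted(overlap)}")


# Illustrative values only; these service names are made up.
plan = GameDayPlan(
    target="accounts-db",
    expected_impacted={"accounts-api"},
    expected_unaffected={"marketing-site"},
    assumptions=["accounts-api retries reads when accounts-db is unavailable"],
)
plan.validate()
```

Writing the expectations down in a checkable form makes it easier to compare them against what actually happens on the day.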

We then create a checklist for the Game Day simulation, which allows service owners to record the actual impact of the failure. This checklist also includes a few key items:

  • Is there a clear directly responsible individual (DRI) for each service in case of an incident?
  • Do downstream services and UI components behave as expected?
  • Is there a case of cascading failure? If so, is the owner or team alerted?
  • Does the system return to a healthy state once we remove the injected failure?

Step 2: Game Day Execution

With all of this preparation in place, we can then proceed with the Game Day.

Generally, we inject failure by selectively stopping services that normally communicate with the database. When migrating a database, we briefly stop only the service that communicates with it directly, and we deliberately leave maintenance mode disabled on the other services that interact with the stopped service. We then monitor the behaviour of those other services and record our observations.

These observations follow the checklist we’ve detailed above, and we use PagerDuty Incident Management to record them. The details we record include:

  • What services are impacted?
  • Are there any surprises here? Are there any services that shouldn’t have been impacted? If so, we’ve identified previously unknown dependencies or “side effects.”
  • For each impacted service, we then record key details:
      • Did it alert its owners or team?
      • Are the alerts actionable?
      • Do the alerts contain enough details to point to the failure?
      • Did the UI components communicate effectively?
      • Did the system return to its original state when we removed the injected failure?
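
The "surprises" question above amounts to comparing what was observed with what was expected before the Game Day. A rough sketch of that comparison follows; the function name `find_surprises` and the service names are assumptions made for this example, not part of any real tool.

```python
def find_surprises(expected_impacted, observed_impacted):
    """Compare observed impact against pre-game expectations.

    Both arguments are iterables of service names (hypothetical here).
    """
    expected = set(expected_impacted)
    observed = set(observed_impacted)
    return {
        # Impacted but not expected to be: a possible unknown dependency.
        "unexpected_impact": sorted(observed - expected),
        # Expected to be impacted but was not: an assumption worth revisiting.
        "unexpectedly_fine": sorted(expected - observed),
    }


result = find_surprises(
    expected_impacted={"accounts-api"},
    observed_impacted={"accounts-api", "statements-worker"},
)
print(result)
# {'unexpected_impact': ['statements-worker'], 'unexpectedly_fine': []}
```

Anything listed under `unexpected_impact` is a candidate for a previously unknown dependency or side effect.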

Finally, we analyze these observations to confirm that impacted services degrade gracefully. For services that don’t exhibit graceful degradation, we create action items to implement the necessary modifications.
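
One way to picture this last step is as a mapping from each recorded observation to a follow-up task. The sketch below assumes a hypothetical `ServiceObservation` record whose fields mirror the per-service questions listed earlier; the wording of each action item is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceObservation:
    """Hypothetical per-service record taken during a Game Day."""

    service: str
    alerted_owner: bool
    alert_actionable: bool
    alert_detailed: bool
    ui_communicated: bool
    recovered: bool


# Each check that fails maps to a suggested follow-up (illustrative wording).
FOLLOW_UPS = {
    "alerted_owner": "add an alert that reaches the owning team",
    "alert_actionable": "make the alert actionable",
    "alert_detailed": "include enough detail in the alert to point to the failure",
    "ui_communicated": "improve how the UI communicates the failure",
    "recovered": "fix recovery once the injected failure is removed",
}


def action_items(observations):
    """Return one action item per failed check, in a stable order."""
    items = []
    for obs in observations:
        for check, follow_up in FOLLOW_UPS.items():
            if not getattr(obs, check):
                items.append(f"{obs.service}: {follow_up}")
    return items


observations = [
    ServiceObservation("accounts-api", True, True, True, True, True),
    ServiceObservation("statements-worker", False, True, True, True, False),
]
for item in action_items(observations):
    print(item)
```

A service with every check passing produces no action items, so the output lists only the work that remains.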

Final Result: Game Days Are a Winner

By frequently conducting Game Days, we have not only made our production systems significantly more resilient but also taken meaningful strides on the human side of the equation by improving our interactions and protocols for dealing with incidents. We are now better prepared to deal with unexpected situations and failures, and this makes us even more confident in our ability to continue providing our customers with highly reliable and robust products. Which should help all of us sleep better at night.

Want to join in on our Game Days? Check out the open roles on our Engineering team today.

Written by Furqan Qureshi, Senior Site Reliability Engineer at Wealthsimple, in collaboration with Nahla Davies. Edited by Mark Adams.

Wealthsimple is a new kind of financial company. Invest, trade, save, spend, and even do your taxes in a better, simpler way. “Maker Stories” is an inside look at how we get things done. Interested in joining our team? Visit our “Work With Us” page to learn more and view open roles.

The content on this site is produced by Wealthsimple Technologies Inc. and is for informational purposes only. The content is not intended to be investment advice or any other kind of professional advice. Before taking any action based on this content you should consult a professional. We do not endorse any third parties referenced on this site. When you invest, your money is at risk and it is possible that you may lose some or all of your investment. Past performance is not a guarantee of future results. Historical returns, hypothetical returns, expected returns and images included in this content are for illustrative purposes only. Copyright © 2021 Wealthsimple Technologies Inc.
