My war story: Why I’m glad that I broke production

Hayden Marchant
Published in Wix Engineering
6 min read · Dec 23, 2020

Just when I thought I was getting the hang of pushing new features to hundreds of thousands of users, the inevitable happened: I broke production.

A wrecked Israeli tank during the early days of the Yom Kippur War (Wikimedia Commons)

I’ve been working as a Software Engineer at Wix for 2.5 years now. About 18 months ago, whilst on a well-deserved break from an extremely busy stretch at work involving quite a stressful upgrade to some important systems, I got a phone call that went something like this:

Hayden, we’re getting complaints from a lot of end users that the transactions that were supposed to be calculated by the new module you wrote are incorrect.

Yikes, I just broke production. Yes, that same excitement of being able to positively influence hundreds of thousands of online stores just switched, at the speed of light, into a terrible gut feeling that I might have just screwed up critical transaction calculations for those same hundreds of thousands of online stores. Every single one of them.

In this post, I will not talk about how crappy I felt during the remainder of my vacation, but rather, I’ll try and explain how this experience has changed me as a programmer, and, as strange as it sounds, why I’m actually glad that I broke production.

And don’t worry: all the incorrect transactions were quickly corrected, with no harm to our users.

Being careful

Photo by Clem Onojeghuo on Unsplash

I like to think that I’m a careful programmer. When I say careful, I mean that I try to think of the multiple scenarios in which the software I am writing can fail. I write tests to catch different edge cases, and I am an avid supporter of code reviews. And of course, I protect my code with feature toggles/flags¹, which allow new code to be activated in a controlled manner.

The unexpected

Yes, the unexpected always happens when you least expect it. In my case, I got bitten when I updated a shared library with a great new mechanism. However, I failed to realise that this library was used by more than one micro-service, and when I upgraded the first micro-service without upgrading the second, the two started producing miscalculations that were exposed to our users, and they were very much not harmless :(

Yep, a quick deployment of the second micro-service did resolve this issue, but not before a large number of transactions had been miscalculated.
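The services and the library involved are Wix-internal, but the failure mode is generic. As a hypothetical sketch (the types, function names and rates below are all invented for illustration), imagine a shared library that changes how a total is calculated: until every consumer of the library has been upgraded, two services will disagree about the very same transaction.

```typescript
interface Transaction {
  subtotal: number;  // in minor units (e.g. cents)
  discount: number;  // in minor units
  taxRate: number;   // e.g. 0.17
}

// v1 of the shared library: tax applied to the subtotal, discount subtracted afterwards.
function calculateTotalV1(tx: Transaction): number {
  return Math.round(tx.subtotal * (1 + tx.taxRate)) - tx.discount;
}

// v2 of the shared library: discount subtracted first, tax applied to the discounted amount.
function calculateTotalV2(tx: Transaction): number {
  return Math.round((tx.subtotal - tx.discount) * (1 + tx.taxRate));
}

const tx: Transaction = { subtotal: 10_000, discount: 1_000, taxRate: 0.17 };

// While service A runs v2 and service B still runs v1, the same transaction
// is priced differently depending on which service handles it.
console.log(calculateTotalV1(tx)); // 10700
console.log(calculateTotalV2(tx)); // 10530
```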

The aftermath

Well, in the aftermath of this incident, we ran a post-mortem and came away with a lot of lessons learnt. Here are some of them:

  1. Monitoring — Add monitoring to our critical services, measuring expected behaviour. We encourage our developers to fire metrics that are relevant to measuring correct business behaviour (see the metrics sketch after this list). All our metrics are written via an API to our monitoring system, and we create focused dashboards for them. We have libraries for both JVM- and Node-based development that make it super-easy for developers to fire metrics.
  2. Alerts — Configure alerts for when those business metrics exceed acceptable thresholds. These alerts are configured in the monitoring system and are routed automatically, via our alert management system, to the relevant support teams through their paging and messaging applications.
  3. Logging — Write application logs with relevant information. We like to write structured logs with informational messages along with a lot of contextual information (for example: request details, site, user and geographical identities); a logging sketch follows this list. This is exceptionally useful for investigating incidents and identifying suspicious trends. All these structured logs are written to a distributed database that allows us to run analytical queries against them, and even to create ad-hoc dashboards based on those same logs.
  4. Back-Office — Prepare back-office utilities to access and correct miscalculated transactions, all the way through the system. These utilities give us the confidence and ability to quickly identify and remediate issues as they come along. Yes, they do take time to develop, and yes, the development estimates of features do need to take this into account.
  5. Testing — Never ever assume that tests are always going to cover disasters that can befall your service.
  6. Feature Toggles — However trivial the change, use feature toggles¹ to protect it. You never know how or why something might break, and the option to turn off that new fork of code is a must (see the toggle sketch after this list). At Wix, we have our own Feature Toggle system which we love and trust, where thousands of feature experiments are conducted simultaneously.
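Wix’s metrics libraries are internal, so here is a minimal sketch of the monitoring idea using the open-source prom-client package for Node instead; the metric name, labels and helper functions are invented for illustration. The point is that the code performing a business operation also reports whether the business outcome looks correct, not just whether the request succeeded.

```typescript
import { Counter, register } from 'prom-client';

// Hypothetical business metric: how often an independently recalculated
// transaction total disagrees with the total we stored.
const totalMismatches = new Counter({
  name: 'checkout_total_mismatch_total',
  help: 'Transactions whose recalculated total differs from the stored total',
  labelNames: ['currency'],
});

export function reportTotalCheck(storedTotal: number, recalculatedTotal: number, currency: string): void {
  if (storedTotal !== recalculatedTotal) {
    // Fire the metric; a dashboard and an alert threshold are built on top of it.
    totalMismatches.inc({ currency });
  }
}

// Expose all registered metrics for the monitoring system to scrape.
export async function metricsText(): Promise<string> {
  return register.metrics();
}
```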
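The logging stack described above is likewise Wix-internal; as a sketch of what a structured log with contextual information looks like, here is the same idea using the open-source pino logger, with invented field names and values.

```typescript
import pino from 'pino';

// pino writes JSON log lines, which is what makes them easy to query
// later in a distributed log store.
const logger = pino({ name: 'transaction-calculator' });

// Hypothetical context: attach everything an incident investigation would need.
logger.info(
  {
    requestId: 'b7c2e1d4',       // correlates log lines across services
    siteId: 'site-12345',
    userId: 'user-67890',
    country: 'IL',
    transactionId: 'txn-000042',
    calculatedTotal: 10530,
  },
  'transaction total calculated'
);
```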
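Wix’s Feature Toggle system is proprietary, so this last sketch only shows the shape of the pattern, with a hypothetical in-memory toggle check, an invented toggle name, and the hypothetical pricing functions from the earlier sketch: the new fork of code runs only while the toggle is open, and closing the toggle immediately reverts every caller to the old, known-good path without a redeploy.

```typescript
interface Transaction {
  subtotal: number;  // in minor units (e.g. cents)
  discount: number;  // in minor units
  taxRate: number;   // e.g. 0.17
}

// Hypothetical toggle check; at Wix this would be the internal Feature Toggle
// system, which can also open a toggle gradually (per site, per percentage, etc.).
const openToggles = new Set<string>(['specs.checkout.NewTotalCalculation']);

function isToggleOpen(toggleName: string): boolean {
  return openToggles.has(toggleName);
}

// Old, known-good calculation path stays in the code.
function calculateTotalOld(tx: Transaction): number {
  return Math.round(tx.subtotal * (1 + tx.taxRate)) - tx.discount;
}

// New calculation path being rolled out.
function calculateTotalNew(tx: Transaction): number {
  return Math.round((tx.subtotal - tx.discount) * (1 + tx.taxRate));
}

export function calculateTotal(tx: Transaction): number {
  // However trivial the change, keep the old fork reachable:
  // closing the toggle rolls everyone back immediately.
  return isToggleOpen('specs.checkout.NewTotalCalculation')
    ? calculateTotalNew(tx)
    : calculateTotalOld(tx);
}
```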

As can be seen, there is a common theme in these items — they all contribute to the resilience of our system, through monitoring and observability, and lastly, through Feature Toggles.

Brainstorming ‘what if’ scenarios

Lastly, and most importantly, engage with fellow developers and brainstorm about what could possibly go wrong in the service. These sessions are also known as pre-mortems²: the idea is to imagine that our service is already failing, and to work out how we would both identify the failure and remediate it.

One of the things I love about this is that the group dynamics in these sessions can really help prevent critical issues, by identifying possible failure points in our service before they occur.

In these brainstorming sessions, we discuss how we are planning to roll out the change, talking about different kinds of users and the different ways they might use the new code. What will happen if the service fails and we need to roll it back? Will we have data loss, or, worst of all, might we even have data corruption?

In a positive, collaborative and constructive manner, we reduce the over-confidence that we, as human programmers, frequently have: that our services are solid and bug-free, and that we’ve thought through every scenario.

So, what is there to be glad about?

As with anything bad that happens, there is always something to be gained from it: learning from our mistakes so that the same thing doesn’t happen again.

I believe that the experience of breaking production taught me many lessons, and has hopefully made me a better developer, by making both me and my fellow developers more aware of, and better prepared for, what can go wrong when rolling out new behaviour in our services.

I have also learnt that resilience is not only about keeping my service functioning acceptably when part of the system fails. It is also about making sure that the whole process of rolling out new functionality is resilient to all types of failure, whether an accidental misconfiguration by a human, an ‘old’ version of data with a strange, unexpected format of identifiers, or a clumsy truncation of a critical table.

Having experienced such a failure up close has made me acutely aware of the cost of things going wrong and, to put it simply, has made me a much more conscious programmer than the one I used to be. Conscious of what can go wrong, conscious of the weak points in our system, and conscious that, however hard we try, there will always be failures.

As part of the lessons learnt, our team now plans rollouts with a lot more care, and tries, as much as possible, to be ready for the unpredictable, ready to defend against the next surprise.


Hayden Marchant
Wix Engineering

Server developer with over 20 years’ experience. Passionate about programming, learning new technologies, and making it work. Currently working at Wix.com