The Craft of Troubleshooting — From Art, to Science, to Automation

Aviva Peisach
Wix Engineering
Nov 18, 2022

Those who read my previous post on postmortems know that resilience is a subject close to my heart.

After all, at Wix we deal with thousands of microservices and hundreds of deployments a day, in a multi-cloud environment with numerous internal and external dependencies, all serving hundreds of millions of users. Improving resilience is one of the backbones of a stable and growing engineering environment and its products.

A couple of years back, in our quest to improve resilience, we measured our MTTR (median time to resolution of production incidents) and found that a large share of that time was spent on troubleshooting.

So the challenge at hand was “How do we optimize and minimize troubleshooting time across the company?”.

At first, this seemed like a bit of an impossible mission due to the following assumptions:

  1. We tended to view production troubleshooting as an art rather than a science, one where only the “tribe’s eldest”, the most senior and experienced members, master troubleshooting: the ones who got there after spending nights dealing with production issues and who know the system inside and out.
  2. Wix is split into dozens of groups, each with its own infra and architecture. We therefore assumed each one was a unique snowflake with its own flows and issues and, consequently, different system knowledge and troubleshooting practices.

Nothing masks reality better than making assumptions… So our first step was to map our actual different troubleshooting practices.

Reality check

We brought together these troubleshooting veterans from different groups and mapped on paper (or, more accurately, in a Google Sheet) the troubleshooting steps they typically take: how they map and rule out immediate suspects, how they decide what to do next, and what the common root causes are.

You can imagine our surprise when we found out these different practices are not that different at all! We found that ~80% of the troubleshooting steps are very common across all our different products.

They all start with similar alerts, continue by drilling deeper along similar paths to rule out similar immediate suspects, and move on to mapping the root cause and fixing the issue, all in very similar ways.

Here are several examples:

1) The first thing developers should (and would) check is whether there was a recent GA of the misbehaving service, or a change in a corresponding experiment, in proximity to when the issue started occurring. If so, the first step is to roll back the version or revert the experiment change.

2) Developers would also drill in to check whether most of the errors can be attributed to specific pods / sites / callers or callees and, if so, continue to investigate those suspects (a code sketch of both checks follows these examples).
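
To make the commonality concrete, here is a minimal Python sketch of what automating these two checks could look like. All names and data structures are made up for illustration; this is not Wix’s actual tooling.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical, simplified records; a real check would pull these from the
# deployment, experiment, and error-tracking systems.
deployments = [{"service": "checkout", "deployed_at": datetime(2022, 11, 18, 9, 50)}]
errors = ([{"pod": "checkout-7f9c", "site": "site-123"}] * 90 +
          [{"pod": "checkout-2b1a", "site": "site-456"}] * 10)

def recent_change(service, incident_start, window_minutes=60):
    """Example 1: was there a GA of the service shortly before the incident started?"""
    for d in deployments:
        if (d["service"] == service and
                timedelta(0) <= incident_start - d["deployed_at"] <= timedelta(minutes=window_minutes)):
            return d
    return None

def dominant_suspect(errors, dimension, threshold=0.8):
    """Example 2: can most of the errors be attributed to one pod / site / caller?"""
    value, count = Counter(e[dimension] for e in errors).most_common(1)[0]
    share = count / len(errors)
    return (value, share) if share >= threshold else None

incident_start = datetime(2022, 11, 18, 10, 0)
print(recent_change("checkout", incident_start))  # a hit suggests a rollback
print(dominant_suspect(errors, "pod"))            # ('checkout-7f9c', 0.9)
```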

Yay! We struck gold! Commonality meant that it was easier to solve, improve and make a cross company impact.

Action plan

We decided to address & leverage these commonalities in 4 parallel paths:

  1. Training — we built a training kit simulating the common root causes and allowing trainees to practice their troubleshooting skills in a safe and quiet environment
  2. Improved Monitoring — since we mapped the common suspects and root causes, we improved our dashboards to include high-level views of these areas to detect things at a glance
  3. Automation — since we mapped common practices and next steps, we’ve also built automation around drilling in and providing additional metrics where needed and auto-suggesting next steps
  4. Self healing — we’re not there yet, but this is our plan for the future: find the most useful and recurring practices with definitive root causes and resolutions, and fully automate both the troubleshooting and the recovery steps, removing the need for a person in the middle

Hands-on troubleshooting training

The best way to learn how to troubleshoot is to do it hands-on.
However, the cost of having each developer experience the vast range of actual production issues in real time is high, and gaining that experience takes a long time.
It is also hard to build, maintain, and actually use very detailed troubleshooting guides and cookbooks.

When you’re awakened in the middle of the night with the CEO / CTO breathing down your neck to solve the issue at hand, following a cookbook (no matter how detailed and well maintained) is not an easy task.

Thus we decided to create a hands-on training kit: we built a “trouble-maker” service which simulates production issues on a synthetic service.

The trainee needs to monitor and troubleshoot in real time until they identify the root cause and fix it. We simulate memory issues, faulty pods, misbehaving dependencies (calling or being called by our synthetic service), and much more.
That way the trainee gets to experience actual troubleshooting practices in a safe environment, which prepares them and increases their confidence when they face the real thing.
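
Conceptually, the “trouble-maker” boils down to fault-injection toggles wired into a synthetic service. The following is a rough Python sketch under that assumption; every name here is illustrative, not the actual implementation.

```python
import random
import time

# Illustrative fault toggles; the real "trouble-maker" drives a synthetic
# service that is deployed and monitored like any other production service.
ACTIVE_FAULTS = {"memory_leak": False, "slow_dependency": True, "faulty_pod": False}

_leak = []  # grows unboundedly while the memory-leak fault is on

def handle_request(payload: dict) -> dict:
    """Synthetic endpoint whose behavior degrades according to the active faults."""
    if ACTIVE_FAULTS["memory_leak"]:
        _leak.append(bytearray(1024 * 1024))         # retain ~1MB per request
    if ACTIVE_FAULTS["slow_dependency"]:
        time.sleep(random.uniform(0.5, 2.0))         # simulate a misbehaving callee
    if ACTIVE_FAULTS["faulty_pod"] and random.random() < 0.3:
        raise RuntimeError("simulated pod failure")  # shows up as 5xx in monitoring
    return {"ok": True, "echo": payload}

# The trainee only sees the resulting alerts and dashboards; their task is to
# work back from those signals to the fault that was switched on.
```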

Monitoring — seeing is troubleshooting

Since we mapped the common suspects, we now know what to look for, so we improved our dashboards to include high-level distribution histograms of these areas to detect problematic ones at a glance.

These high-level dashboards include error distributions by user agent, site, DC, and pod.

This way you can see at a glance which are the dominant contributors to errors and then drill down into the relevant traces, quickly closing in on the actual root cause.
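
The underlying idea is to pre-aggregate errors along the suspect dimensions so the dominant contributor stands out immediately. Here is a minimal sketch of such an aggregation, with made-up event data; in practice the dashboards run an equivalent query over the metrics store.

```python
from collections import Counter

# Illustrative error events; real ones come from the metrics / logging pipeline.
events = [
    {"user_agent": "bot",    "site": "s1", "dc": "us-east", "pod": "api-1"},
    {"user_agent": "bot",    "site": "s1", "dc": "us-east", "pod": "api-1"},
    {"user_agent": "chrome", "site": "s2", "dc": "eu-west", "pod": "api-2"},
]

DIMENSIONS = ("user_agent", "site", "dc", "pod")

def error_distributions(events):
    """Per-dimension error breakdown, as rendered in the high-level dashboards."""
    total = len(events)
    return {
        dim: [(value, count / total)
              for value, count in Counter(e[dim] for e in events).most_common()]
        for dim in DIMENSIONS
    }

for dim, dist in error_distributions(events).items():
    print(dim, dist)  # e.g. pod -> [('api-1', 0.67), ('api-2', 0.33)] (rounded)
```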

Automation — if you map it, you can automate it

Since we mapped common troubleshooting paths, we wanted to find a way to automate them as much as possible.

We’ve built the “Alert Enricher”: a system that listens in on our alerts (which are published to dedicated Slack channels), checks for immediate and less immediate suspects, and enriches the alert with additional insights and metrics, suggestions for the root cause, directions for further investigation, and possible remediation.

A few examples:

  1. We check and display the recent GA / experiment change of the misbehaving service and allow a quick rollback / revert
  2. We check if a specific pod is misbehaving, and allow a quick reboot if needed
  3. We check if a large percentage of our traffic is triggered by automation tests and, if so, provide an option to block the specific automations
  4. We also list the top 5 exceptions, which usually spot the “culprit” right there and then, since the top exception usually dominates with a ratio of ~90% over all the others.
  5. If a new exception (one that was not raised before) was introduced in the recent timeframe, we list it as well. In many cases it is the root cause of a new alert.

We started off with these rules and enrichments hardcoded in the system, but have since extended it to allow developers to easily script their own rules, enrichments, and suggestions.
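
To give a feel for what such a scripted rule might look like, here is an illustrative Python sketch of a pluggable rule registry. The Alert fields, thresholds, and rule names are hypothetical; the actual Alert Enricher API differs.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Alert:
    service: str
    metadata: dict
    enrichments: List[str] = field(default_factory=list)

# A rule is just a function that inspects an alert and may append a suggestion.
Rule = Callable[[Alert], None]
RULES: List[Rule] = []

def rule(fn: Rule) -> Rule:
    """Register a team-specific enrichment rule."""
    RULES.append(fn)
    return fn

@rule
def recent_deployment_rule(alert: Alert) -> None:
    # Hypothetical metadata field; a real enricher would query the deployment system.
    if alert.metadata.get("minutes_since_last_ga", 999) < 30:
        alert.enrichments.append("Recent GA detected - consider rolling back.")

@rule
def noisy_pod_rule(alert: Alert) -> None:
    if alert.metadata.get("top_pod_error_share", 0) > 0.8:
        alert.enrichments.append("Most errors come from a single pod - consider rebooting it.")

def enrich(alert: Alert) -> Alert:
    """Run every registered rule; the enriched alert is what gets posted to Slack."""
    for r in RULES:
        r(alert)
    return alert

print(enrich(Alert("checkout", {"minutes_since_last_ga": 12,
                                "top_pod_error_share": 0.92})).enrichments)
```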

Self healing — automation: the next frontier

The next automation step would be to fully automate the troubleshooting process from detection to verification of root cause to fixing the issue and verifying the fix.

We’ve already automated part of this process via our home-grown automatic deployment system, which is based on our A/B test infra (Wix Petri): it checks for changes in metrics between the old and the new version and automatically rolls back when such a change is found.
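
As a simplified illustration of that comparison, here is a Python sketch of a rollback decision based on old-vs-new metrics. The thresholds and metric names are invented for the example; the real system relies on Wix Petri’s comparison across many more metrics.

```python
def should_roll_back(old_metrics: dict, new_metrics: dict,
                     max_error_rate_increase: float = 0.01,
                     max_latency_increase: float = 0.10) -> bool:
    """Compare the old vs. new version and decide whether to roll back.

    Thresholds are illustrative; a real comparison would be statistical.
    """
    error_regression = (new_metrics["error_rate"] - old_metrics["error_rate"]
                        > max_error_rate_increase)
    latency_regression = (new_metrics["p95_latency_ms"] / old_metrics["p95_latency_ms"] - 1
                          > max_latency_increase)
    return error_regression or latency_regression

old = {"error_rate": 0.002, "p95_latency_ms": 120}
new = {"error_rate": 0.021, "p95_latency_ms": 118}
print(should_roll_back(old, new))  # True -> the deployment system rolls back automatically
```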

However, before automating the troubleshooting and fixing of a fully deployed production version, we need to gather much more data from our semi-automated process to identify the high-probability root causes and fixes; otherwise we may cause more harm than good.

The outcome — fruits of our labor

The steps we’ve taken so far have significantly reduced our overall troubleshooting times.

We’ve seen a 27% decrease in median time to resolution between 2020 and 2021.

Many teams confess that they don’t know how they previously managed without it!

We’ve reduced human single points of failure, as more (and in some teams all) members are now capable of performing production troubleshooting, and at the very least we’ve reduced the fear factor.

We’ve also folded the training kit into our standard onboarding (about 3 months after initial onboarding), and as we enhance the Alert Enricher with more scripting and customizations, we expect to see an even more dramatic impact going forward.

Takeaways and lessons learned

This journey taught us yet again to make no assumptions and to base our decisions on research and data rather than gut feelings. It also reminded us that there’s no one silver bullet solution when it comes to resilience.

Resilience is an ongoing journey with no finish line. One where you need to work on multiple fronts with the goal of continuous learning and improvement in mind.

As this journey continues, stay tuned for more revelations on this front. Want to join us or consult on the resilience journey? Don’t hesitate to reach out.
Note: This article (among many others) is also posted on the Wix Engineering Blog.
