Walking backwards: a simple technique to guide live issue resolution

Thomas Weiss
Dec 20, 2017 · 5 min read

Those of you who have been exposed to issues happening in production systems know how difficult those situations are to deal with. There’s a lot of pressure to fix things as soon as possible, and since those issues are by nature unexpected, we would rather be doing something else (like sleeping!).


Keep calm and trust only your data

  • resist the urge to rush, keep a cool head, and “slow down to move fast”
  • trust data, not your intuition (that is, if you have telemetry data to work with!)

Those are excellent pieces of advice, but even when you manage to keep calm and have valuable data at hand, it can still be challenging to decide where to start and how to proceed. Data can be overwhelming and point you in the wrong direction. Sometimes you may even come across data that reveals a totally different issue, and it’s tempting to explore that path and lose focus on the most urgent problem.

To handle those situations efficiently, I apply a technique I’ve nicknamed “walking backwards”. Now, I’m not claiming this is my invention; it’s nothing more than common sense, but I thought I would explain its rationale and illustrate it with an example. Here goes.

Walking backwards along the positive path

The logic behind it is that, in the complex systems we deal with, the root cause of a problem often resides many layers away from the visible symptoms. How many times have you wondered, “How can that be broken? There’s no reason it shouldn’t work”? That’s usually because thinking about the potential reasons that may explain the direct symptoms is a dead end. So we start looking at random data, hoping to make sense of the issue by accident. But following what I call the “positive path”, that is, the backwards chain of events that should have happened for the issue not to occur, sets a trail we can walk until we eventually reach the root cause.
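
Before we dive in, here’s a minimal sketch of the loop this technique boils down to, in Python. The Condition type and its check callbacks are hypothetical scaffolding I’m using for illustration, not code from any real system:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Condition:
    """One link in the 'positive path': something that must hold for the system to work."""
    name: str
    check: Callable[[], bool]                        # does this condition currently hold?
    preconditions: List["Condition"] = field(default_factory=list)

def walk_backwards(symptom: Condition) -> Condition:
    """Step back through failing preconditions until none fail: that's the root cause."""
    current = symptom
    while True:
        failing = [p for p in current.preconditions if not p.check()]
        if not failing:
            return current        # nothing upstream explains it: stop here
        current = failing[0]      # walk one level back and ask again
```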

Let me illustrate this technique with a real-life example. I’m currently working on the back-end systems of Keakr, a France-based social network app for urban music lovers. Keakr users can upload videos and share them with their friends and followers.

In the context of a real production issue

[Diagram: the video upload and transcoding pipeline, steps (1) to (4)]

Every time a video is created (1), the frontline web servers store the raw mp4 file coming from the app in blob storage (2), then dispatch a request (3) that is asynchronously picked up by background workers, which transcode the video into a streaming format using Azure Media Services (4). Because transcoding is not a quick operation, and because there may be a queue of videos waiting to be transcoded, we serve the raw mp4 until the video is ready to be streamed.
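
For reference, here’s a heavily condensed sketch of that flow in Python. The in-memory queue and dictionary stand in for the real message queue, blob storage, and database; all names are hypothetical:

```python
import queue

transcode_requests: "queue.Queue[dict]" = queue.Queue()  # stands in for the real message queue
videos: dict = {}                                        # stands in for blob storage + database

def handle_video_upload(video_id: int, raw_mp4: bytes) -> None:
    # (1)-(2): store the raw mp4 so it can be served right away
    videos[video_id] = {"raw": raw_mp4, "streaming_url": None}
    # (3): dispatch an asynchronous transcoding request
    transcode_requests.put({"video_id": video_id})

def transcode_worker() -> None:
    # (4): a background worker transcodes into a streaming format
    # (Azure Media Services does the actual work in production)
    while not transcode_requests.empty():
        video_id = transcode_requests.get()["video_id"]
        videos[video_id]["streaming_url"] = f"https://cdn.example.com/{video_id}.m3u8"
```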

Now back to our problem. Follow me as we apply the “walking backwards” technique, starting from the reported symptoms:

Some videos are slow to load

What makes videos fast to load?

  • Serving them in streaming format. A quick check in the database revealed that many recent videos had not been transcoded.
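
That check can be as simple as the query below. It’s a sketch against a hypothetical schema where streaming_url stays NULL until transcoding completes; I’m using an in-memory SQLite table as a stand-in for our actual database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a stand-in; the real database is not SQLite
conn.execute("CREATE TABLE videos (id INTEGER, created_at TEXT, streaming_url TEXT)")
conn.execute("INSERT INTO videos VALUES (1, datetime('now'), NULL)")  # a stuck video

rows = conn.execute(
    """
    SELECT id, created_at
    FROM videos
    WHERE streaming_url IS NULL                  -- never transcoded
      AND created_at > datetime('now', '-1 day') -- recent uploads only
    ORDER BY created_at DESC
    """
).fetchall()
print(f"{len(rows)} recent videos are still waiting for transcoding")
```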

Transcoding of videos fails

What makes transcoding of videos succeed?

  • Successful operation of the background workers. Our logs and metrics showed that the workers were running fine, processing other requests without any problem.
  • The reception and execution of transcoding requests. As workers store incoming requests in cold storage for auditing, we found that no such request had been stored recently.
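
That audit trail makes this kind of check straightforward. Here’s a sketch of the idea, assuming incoming requests are archived as timestamped JSON files; the path and layout are hypothetical:

```python
import pathlib
from datetime import datetime, timedelta, timezone

AUDIT_DIR = pathlib.Path("/var/audit/transcode-requests")  # hypothetical location
cutoff = datetime.now(timezone.utc) - timedelta(hours=6)

# Keep only the requests archived recently, judging by file modification time
recent = [
    p for p in AUDIT_DIR.glob("*.json")
    if datetime.fromtimestamp(p.stat().st_mtime, timezone.utc) > cutoff
]
print(f"{len(recent)} transcoding requests archived in the last 6 hours")
```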

Transcoding requests don’t arrive at the workers

What makes those requests arrive?

  • Successful operation of the message queue. As stated previously, other requests were being processed by the workers, so the message queue was running fine.
  • The logical deduction was that the requests were never fed into the message queue.
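
The cross-check behind that deduction is simple: if the queue itself were broken, all message types would stall, not just the transcoding ones. A toy sketch with hypothetical data:

```python
from collections import Counter

# `processed` stands in for whatever your worker metrics expose: the type
# of each message handled in the last hour (hypothetical data).
processed = ["push_notification", "thumbnail", "push_notification", "thumbnail"]

counts = Counter(processed)
print(counts)                       # other message types flow through fine...
print(counts.get("transcode", 0))  # ...but zero transcode messages: never enqueued
```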

Transcoding requests are not issued

What makes the requests get issued?

  • Successful execution of the “create video” HTTP request handler on the web server. That request handler (1) stores the raw mp4 in blob storage and some metadata in the database, (2) dispatches a push notification to the user, and finally (3) sends the transcoding request to the message queue. Looking at the blob storage and database, we knew that step (1) had completed, so we dug into the dispatch of push notifications… to realize that the certificate we were using to interface with APNS (Apple’s push notification system) had expired! This led to an uncaught exception, stopping the execution at that point and preventing the transcoding requests from being issued (see the sketch below).
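
To make the failure mode concrete, here’s a sketch of the handler’s shape along with one possible hardening: isolating the notification step so that its failure can’t abort the rest of the handler. All names are hypothetical, and I’m not claiming this is exactly how the fix was shipped:

```python
import logging

def create_video(video_id: int, raw_mp4: bytes, store, send_push, enqueue) -> None:
    store(video_id, raw_mp4)            # (1) blob storage + database metadata
    try:
        send_push(video_id)             # (2) push notification via APNS
    except Exception:
        # This is where the expired certificate blew up: the uncaught
        # exception aborted the handler before step (3) could run.
        # Isolating the failure keeps transcoding requests flowing even
        # when notifications break.
        logging.exception("push notification failed for video %s", video_id)
    enqueue({"video_id": video_id})     # (3) transcoding request

# Simulate the incident: the push step raises, but step (3) still happens.
def broken_push(video_id: int) -> None:
    raise RuntimeError("APNS certificate expired")

create_video(42, b"\x00", store=lambda vid, data: None,
             send_push=broken_push, enqueue=print)
```

Running this logs the simulated certificate failure but still emits the transcoding request, which is exactly the behavior that would have contained this incident.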

From the symptoms to a very unexpected root cause

We could have spent an awful lot of time investigating the CDN or the video transcoding pipeline, sending test videos to Azure Media Services, only to eventually find out that this part was working fine. It is only by applying a rather simple technique that we were able to guide our analysis down to the source of the problem in as little time as possible.

Are you following similar methods when troubleshooting production systems? Maybe variants of what I’ve described, or some totally different approach? Please share your thoughts and suggestions in the comments!
