Quick Guide to Production Debugging

May 8, 2020

Debugging code in the development or staging phase is a lot easier than debugging it in production. During development, you have the comfort of your favourite debugger or IDE: setting breakpoints, stepping over, stepping into, and shipping fixes.

But when the app leaves this stage and is pushed to production, many unexpected factors can make your code unstable. These include operating at scale, a heavy database query load, high concurrency, unpredicted states or system behavior, and more. As a result, all sorts of wild things may happen.

To debug errors that happen in this environment, you have to embrace a new debugging mindset and plan ahead.

The biggest challenge here is capturing the actual state that led to an error, including the specific threads, variable values, and everything else that particular piece of code was trying to do.

The Production Debugging Crisis

Every developer has encountered bugs in production, from runtime performance issues and crashes to deadlocks and memory leaks. Some are easy to reproduce; others are almost impossible to reproduce.

Whatever the case, ignoring a production bug is never an option. It could mean lost money, lost time, or, even worse, an outage of critical services.

This puts a lot of pressure on development teams to resolve issues as quickly as possible. Some reasons why debugging in production is more challenging include:

Bugs that weren’t identified in the development and testing stages are more severe
It is hard to reproduce errors
Simulating production environments is a challenge
Uptime is critical, so normal operations must resume as soon as possible

Diagnosing Production Bugs

When an end-user reports a bug in your application, they rarely provide enough context on what caused that particular problem. This is why the ability to diagnose production bugs is essential for any developer or front-end engineer.

Before getting to the debugging approaches, here’s some insight into the common types of production bugs.

Application-level bugs

These are bugs associated with the application’s logic or syntax and are often the result of end-user operations.

Application-level errors are common and can usually be diagnosed from the application’s log files. They are considered easier to debug and fix because the application often keeps running despite the bug.

Service-level bugs

Service-level bugs are associated with web servers, databases, and other components that serve or run together with your application. Unlike application-level errors, their logs do not reside in the application’s regular log folder.

They are usually found in a separate folder belonging to the service. For instance, Nginx log files are typically found in the /var/log/nginx directory, while MySQL logs are usually under /var/log.

Debugging in production is an art that takes time to master. Here are two main approaches that can be used to resolve problems in production.

Logging

Logging is undoubtedly one of the most effective production debugging strategies. It should always be your starting point when investigating a production bug. First, set up the application’s logging mechanism to send data to a secure server where you can inspect it later.

However, sifting through raw log files is not always easy. A better approach is to use a log management solution such as Splunk or Loggly. These can help you look at all logs associated with a particular user or session ID.

It’s like seeing how the user interacted with your application during the entire session.
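For that kind of per-session view to work, every log line has to carry the session identifier in the first place. The sketch below shows one way to do this using only java.util.logging. The class name SessionLog, the "session=" prefix, and the in-memory handler are illustrative assumptions; in a real setup the handler would be replaced by whatever ships records to your log server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class SessionLog {
    // Hypothetical in-memory sink; a real setup would ship records to a log server.
    static final List<String> LINES = new ArrayList<>();
    static final Logger LOG = Logger.getLogger("app.session");

    static {
        LOG.setUseParentHandlers(false); // keep the demo records off the console
        LOG.setLevel(Level.ALL);
        LOG.addHandler(new Handler() {
            @Override public void publish(LogRecord record) {
                synchronized (LINES) {
                    LINES.add(record.getLevel() + " " + record.getMessage());
                }
            }
            @Override public void flush() {}
            @Override public void close() {}
        });
    }

    // Prefix every message with the session ID so it can be filtered on later.
    public static void log(String sessionId, Level level, String message) {
        LOG.log(level, "session=" + sessionId + " " + message);
    }

    // Return every captured line that belongs to one session, in logging order.
    public static List<String> linesFor(String sessionId) {
        List<String> matches = new ArrayList<>();
        synchronized (LINES) {
            for (String line : LINES) {
                if (line.contains("session=" + sessionId + " ")) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        log("abc123", Level.INFO, "opened checkout page");
        log("zzz999", Level.INFO, "opened home page");
        log("abc123", Level.SEVERE, "payment request failed with status 502");
        linesFor("abc123").forEach(System.out::println);
    }
}
```

Filtering on the session ID here returns only the two "abc123" lines, which is the same query a log management tool would run across all servers.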

Here are some things you should log to get the full context of an error.

XHR Network Failures: Most web applications load data using XHR. If a request times out or returns a failure status code, you should be able to correlate the unsuccessful request with a bug report. Most HTTP libraries allow you to inject a failure handler. A good approach is to use this handler to send the details of the failed request to the server.

JavaScript Errors: Runtime errors are essential when diagnosing problems in an application. Because most of these never leave the browser, logging them to the server can help identify the cause of a bug.

User Actions: It is not enough to find the source of a bug. Most bugs are triggered by something the user did, so it is important to identify the actions that led to the problematic state.

A great way to get all logs on the fly for debugging purposes without affecting application performance and security is using Rookout.

It pipelines all your debugging data, so you can easily fetch it wherever you need it in real time. This is especially important when debugging remotely.

Rookout collects full-stack data without stopping code execution

Using Memory Dumps

Another approach is to create memory dumps of the application and analyze them.

A memory dump is a snapshot of the application’s processes at a given moment. It contains information like memory allocation, state of threads, and objects.

On Windows, for example, a memory dump can be generated from Task Manager to provide valuable information for debugging a live app.

A simple memory dump, usually referred to as a minidump, contains information about the stack, state of processes, called functions, and so on.

This information, when analyzed, provides insight into the state of an application and all processes involved before the error.

For instance, a simple stack trace might show that func1 called func2, which in turn called func4. The function func4 then created an instance and called funcX. Execution continued with funcX calling funcY, and so on.
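That call chain can be reproduced in a few lines of Java to see what such a trace looks like. This is only a sketch: the class names CallChainDemo and Worker are made up for illustration, and an in-process stack capture like this is far smaller than a real memory dump, which also includes heap contents and every other thread.

```java
import java.util.ArrayList;
import java.util.List;

class CallChainDemo {
    // Hypothetical helper type, standing in for the instance that func4 creates.
    static class Worker {
        List<String> funcX() {
            return funcY();
        }

        List<String> funcY() {
            // Capture the current call stack without actually throwing anything.
            List<String> frames = new ArrayList<>();
            for (StackTraceElement element : new Throwable().getStackTrace()) {
                frames.add(element.getMethodName());
            }
            return frames;
        }
    }

    public static List<String> func1() { return func2(); }
    public static List<String> func2() { return func4(); }
    public static List<String> func4() { return new Worker().funcX(); }

    public static void main(String[] args) {
        // Frames are listed innermost first: funcY, funcX, func4, func2, func1, main, ...
        func1().forEach(System.out::println);
    }
}
```

Reading the list from the bottom up gives the order in which the calls were made, which is how a stack trace in a dump is normally read.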

When All Else Fails, Catch All Exceptions

Uncaught exceptions provide a lot of information that can help with debugging. They mark the point where most threads die, so they hold a large share of the evidence about what happened just before the application ran into an error.

The good thing is that almost every framework provides a way to contain exceptions and show error messages.

The best and arguably last line of defence for catching exceptions is a global exception handler. In Java, for example, Thread.setDefaultUncaughtExceptionHandler installs one for every thread that does not have a handler of its own.
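A minimal sketch of such a handler in Java follows. Thread.setDefaultUncaughtExceptionHandler is part of the standard library; the class name GlobalHandlerDemo, the LAST field, and the report format are assumptions made for this example, and a production handler would normally write to your logging system rather than to System.err.

```java
import java.util.concurrent.atomic.AtomicReference;

class GlobalHandlerDemo {
    // Last failure seen by the handler; a real handler would log or report it.
    static final AtomicReference<String> LAST = new AtomicReference<>();

    // Install a process-wide fallback for exceptions that no code caught.
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            String report = "Uncaught in thread '" + thread.getName() + "': " + error;
            LAST.set(report);
            System.err.println(report);
        });
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("order 42 not found");
        }, "worker-1");
        worker.start();
        worker.join(); // the handler runs on the dying thread before join() returns
    }
}
```

Because the handler receives both the thread and the throwable, anything encoded in the thread name ends up in the report for free.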

To extract more debug data when an uncaught exception happens, consider the following:

Thread names: Modify the thread name to match the request at hand. For instance, when processing a transaction, you can append its ID to the thread name.

Thread-local storage (TLS): An efficient way to keep thread-specific information outside the thread object itself. It can hold data that helps identify what was happening when an error occurred, such as the time, username, transaction ID, and more.

Mapped Diagnostic Context (MDC): This is similar to the TLS concept, but approaches it from a logging perspective. MDC is a feature of logging frameworks such as Logback. It maintains a per-thread map of key-value pairs at the logging level and enables more advanced features than plain TLS.
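The first two ideas can be sketched with the standard library alone. In the example below, RequestContextDemo, TX_ID, and the "[tx=...]" naming scheme are hypothetical choices for illustration. With Logback on the classpath, the MDC.put and MDC.remove calls from SLF4J would take the place of the ThreadLocal, and the value could then be printed on every log line.

```java
class RequestContextDemo {
    // Per-thread slot for the transaction ID currently being processed.
    static final ThreadLocal<String> TX_ID = new ThreadLocal<>();

    // Runs one unit of work with the transaction ID attached to the current thread.
    public static String process(String txId, Runnable work) {
        Thread current = Thread.currentThread();
        String originalName = current.getName();
        current.setName(originalName + " [tx=" + txId + "]");
        TX_ID.set(txId);
        try {
            work.run();
            return "ok";
        } catch (RuntimeException e) {
            // Both pieces of context are still available at the failure point.
            return "failed on " + current.getName() + " tx=" + TX_ID.get() + ": " + e.getMessage();
        } finally {
            // Pooled threads are reused, so always restore the previous state.
            TX_ID.remove();
            current.setName(originalName);
        }
    }

    public static void main(String[] args) {
        System.out.println(process("tx-1001", () -> {
            throw new IllegalStateException("insufficient funds");
        }));
    }
}
```

The cleanup in the finally block matters as much as the setup: without it, a reused thread would report a stale transaction ID for the next request it handles.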

A Better Way for Production Debugging

While there is no such thing as a perfect, effortless solution for resolving production bugs, there are some qualities that describe an ideal tool for debugging in production environments.

These are:

Provides all the stack and variable data needed for debugging
Does not require you to restart or redeploy the application
Does not disrupt application users when debugging
Does not slow down application performance to gather debug data

Using a debugger that’s non-breaking can provide all the benefits of logging while eliminating the drawbacks associated with the logging approach.

A non-breaking debugger satisfies the above four points so you can create a reliable debugging workspace for your application. It is also easier to set breakpoints and view debugging data when your breakpoint fires.

More importantly, collaboration in the context of production debugging is key in modern development. So, using a tool that provides all the needed debugging data and allows real-time collaboration among teams can help resolve bugs faster.