10 easy steps to solve production defects like a pro!

Published in

Walmart Global Tech Blog

3 min read · Jan 22, 2024

What is the most challenging task for a developer? Is it developing a complex feature, crafting non-functional requirements for a large-scale application, or keeping up with ever-changing tech?

Nah! All of these are cakewalks for a passionate developer!

The most dreaded challenge, even for the most efficient developer, is a ‘non-reproducible/intermittent defect’ that stubbornly shows up only in the production environment!

Typically, these non-reproducible defects cause developers more anxiety because they do not know what to fix at that moment (not to mention the pressure from stakeholders if it is a major incident!).

If you are a developer, you will face this situation at least once in your career. How do you solve a problem if you cannot reproduce it and understand its root cause?

Steve Jobs once said, “If you define the problem correctly, you almost have the solution.”

These 10 easy steps, aka a cheat sheet, will help you define your problem and find the root cause!

1. Accept that there could be an issue (instead of denying it). This is the first step in approaching the problem with an open mind, and it sets the stage for further thinking.
2. Look for the exact steps to reproduce the issue, with all the logs captured.
3. Verify the artifacts and their version compatibility between production and the environment where the defect does not reproduce. Your application might be fine, but a third-party library running at a different version could cause the problem.
4. If the application involves databases, check whether the database version differs between the environments.
5. Check whether a server restart is required to bring all the artifacts back up. (This is probably the most common fix, but beware: in most cases it solves the problem only temporarily. If you do not pay attention to the root cause, the defect will come knocking at your door again in no time.)
6. Check whether any unintended inputs or variables are passed to the APIs (along with the valid inputs).
7. Look for the possibility of a race condition. Is the expected behaviour overwritten by another thread? Is there a deadlock?
8. Check whether this is a long-running application. Does extended uptime or a large volume of data cause memory spikes or a resource crunch?
9. All the logs that we write in the application come into play now; make liberal use of them.
10. And finally, if none of the above works, go back to step 1 and think again with an open mind :). You may add more points to this list.
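The version-compatibility check in the list above is easy to automate. Here is a minimal sketch in Python, assuming the application is a Python service and that you can run a small script in each environment; the helper names `installed_versions` and `diff_versions`, and the sample package versions, are made up for illustration and are not from any particular tool.

```python
from importlib.metadata import distributions


def installed_versions():
    """Return {package name: version} for the current Python environment."""
    versions = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that report no name
            versions[name.lower()] = dist.version
    return versions


def diff_versions(reference, production):
    """Return {name: (reference_version, production_version)} for every mismatch.

    A package missing from one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(reference) | set(production)):
        ref_v = reference.get(name)
        prod_v = production.get(name)
        if ref_v != prod_v:
            mismatches[name] = (ref_v, prod_v)
    return mismatches


# Hypothetical snapshots; in practice each one would come from running
# installed_versions() in that environment and saving the result to a file.
staging = {"requests": "2.31.0", "urllib3": "2.0.7", "redis": "5.0.1"}
production = {"requests": "2.31.0", "urllib3": "1.26.18"}

for name, (ref_v, prod_v) in diff_versions(staging, production).items():
    print(f"{name}: staging={ref_v} production={prod_v}")
```

With the sample data, the script flags `redis` (missing in production) and `urllib3` (different versions) and stays silent about `requests`. The same comparison works for any stack as long as each environment can export a name-to-version map.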

Additional pro-tip: Once you can handle the issue calmly, you should be able to spread the same calm to your stakeholders as well. The best way to do this is to proactively keep them updated on the progress at regular intervals. If you assure them from time to time that you are working on it, you can count on getting all the support you need to solve the issue.


Written by Ramya Gopalakrishnan