Efficiently determining whether a process is hung
It is common, and probably inevitable, for software engineers to run into a process hang, which essentially means a process makes no progress for a long period of time. After a few years working on a database kernel and a build engine, I have troubleshot many challenging software bugs. In this article, I will share how to efficiently determine whether a process is hung, and discuss a few potential causes of hangs. Let’s get started!
Before diving into how to determine whether a process is hung, it helps to understand the possible causes of a hang; this will make some of the troubleshooting techniques self-explanatory! A process is considered hung when it satisfies both of the following:
- it runs longer than expected
- it makes no progress for an extended period of time
This should be no surprise. The key is to confirm that the process has made no progress for a long time after it has already run longer than expected. So what would cause a process to make no progress for a while? The answer is waiting… for an unexpectedly long period of time. Once we understand that a hang is caused by waiting, we can start to reason about what the process is waiting for. This leads to 3 main types of process hang.
Legitimate Wait:
Let’s start with the easy cause, where a process appears to be hung because it is simply waiting for something to return. For example, it could be waiting for a handshake packet from the network, or it could be waiting synchronously on a child process that takes a really long time to complete. This “hang” is legit, and the investigation should lean towards the resource that the process is waiting for.
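As a rough illustration of a legitimate wait, here is a minimal Python sketch of a parent that waits synchronously on a slow child. The function name `wait_for_child` and its return strings are hypothetical, chosen only for this example; the point is that the parent is healthy and merely blocked, so the child is what deserves investigation.

```python
import subprocess
import sys


def wait_for_child(child_seconds, timeout):
    """Start a child that sleeps for `child_seconds`, then wait on it synchronously."""
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import sys, time; time.sleep(float(sys.argv[1]))",
         str(child_seconds)]
    )
    try:
        # The parent blocks here, exactly like a "hung" synchronous caller would.
        child.wait(timeout=timeout)
        return "child finished"
    except subprocess.TimeoutExpired:
        # The parent itself is fine; the thing to investigate is the child.
        child.kill()
        child.wait()
        return "still waiting on child"
```

Calling `wait_for_child(5, 0.3)` gives up after the short timeout because the child is still sleeping, while a child that exits immediately is reported as finished.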
Deadlock:
If you see a hang, in my experience there is roughly a 90% chance it is due to a deadlock. Deadlock simply means one party cannot get a resource because someone else already holds it, while that “someone” is in turn waiting for another resource that the first party already holds. Of course, there are a few variations of deadlock in practice, and based on my own experience, self-deadlock is the more common cause of hangs: the process is waiting on a resource that it already holds. Troubleshooting hangs caused by deadlock can be challenging. Software engineers normally need to construct a compelling scenario and confirm it with evidence before a conclusion can be drawn. Luckily, a few debuggers offer deadlock detection functionality.
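Self-deadlock is easy to reproduce in a few lines. The sketch below is only an illustration in Python, with a hypothetical helper named `try_reacquire`: a thread takes a lock and then tries to take the same lock again. A timeout is used so the example returns instead of hanging forever.

```python
import threading


def try_reacquire(lock, timeout=0.2):
    """Acquire `lock`, then try to acquire it again from the same thread."""
    with lock:
        # With a plain (non-reentrant) Lock this second acquire can never
        # succeed, because the only thread that could release it is this one.
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it


print(try_reacquire(threading.Lock()))   # non-reentrant lock
print(try_reacquire(threading.RLock()))  # reentrant lock
```

With a plain `threading.Lock` the second acquire times out, and without the timeout the thread would wait forever. A reentrant `threading.RLock` lets the same thread acquire it again, which is one common way to avoid this particular pattern.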
Bad Program Handling:
Bad program handling refers to a program that is badly written, does not handle abnormal events well, and runs into an infinite loop as a result. This cause should normally be caught during the functional testing phase. However, it still happens in practice, especially in large software that is hard to test line by line.
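A typical shape of this problem is a retry loop that assumes the failure is temporary. The following Python sketch is a hypothetical example (the name `call_with_bounded_retries` and the choice of `ConnectionError` are assumptions for illustration); the comment marks the one check whose absence would turn the loop into an infinite one.

```python
def call_with_bounded_retries(operation, max_attempts=3):
    """Call `operation`, retrying on ConnectionError at most `max_attempts` times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except ConnectionError:
            # Without this check, a permanent failure would keep the loop
            # spinning forever and the process would look hung.
            if attempts >= max_attempts:
                raise
```

If `operation` fails every time, the error surfaces after `max_attempts` tries instead of the process silently looping.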
Now that we have covered some of the common causes of hangs, it is time to share some of the techniques that I normally use for debugging them! The first thing on my mind when deciding whether a process is hung is to determine whether it is making progress. This can be done by dumping the process stack periodically. A process stack contains the list of functions that the process/thread is currently executing. By dumping this information periodically (manually or through scripts) and comparing the dumps, you can see whether the process is making progress in a breeze :).
In some cases, it is challenging to identify which process is hanging, because a process invokes lots of other processes and each of those has even more child processes. The investigation method is still the same, but the key is to identify which process to investigate in the first place. As a thank-you for reading along, I am sharing a free tool from Microsoft that is extremely helpful for identifying which process is hanging and for collecting dumps of those processes right from the tool. This tool is “Process Explorer”, also known as “procexp”, and it can be downloaded from the Microsoft Sysinternals website.
As you can see from the output of procexp, it shows the process tree rather than the flat list of individual processes shown in Task Manager. By right-clicking an individual process, you can find out more information about it, such as its start time, and create dumps from there.
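To make the periodic stack dump idea from earlier concrete, here is a small in-process Python sketch. It is only an analogy for dumping a real process with a debugger or procexp: it samples the stack of one thread several times and checks whether anything changed. The helper names `sample_stack` and `looks_stuck` are hypothetical, and `sys._current_frames()` is a CPython-specific function. Identical samples are a strong hint, not proof, since a busy thread could happen to be caught at the same spot every time.

```python
import sys
import threading
import time
import traceback


def sample_stack(thread_id):
    """Return the thread's current call stack as a tuple, or None if it is gone."""
    frame = sys._current_frames().get(thread_id)
    if frame is None:
        return None
    return tuple(
        (f.filename, f.name, f.lineno) for f in traceback.extract_stack(frame)
    )


def looks_stuck(thread_id, samples=5, interval=0.1):
    """Sample the stack several times; True if every sample is identical."""
    seen = set()
    for _ in range(samples):
        seen.add(sample_stack(thread_id))
        time.sleep(interval)
    # A thread that has already exited (None) is finished, not stuck.
    return len(seen) == 1 and None not in seen
```

For example, a thread started with `threading.Thread(target=some_event.wait)` sits in the same frames on every sample until the event is set, so `looks_stuck(worker.ident)` flags it. The same compare-the-dumps approach applies when the dumps come from an external tool.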
Thank you for reading along! I hope this article will be helpful when you need to investigate a process hang. If you have questions, please feel free to leave a comment or reach out to me!