How to detect bugs?
and why you need to monitor your code in production!
As we have seen before, there are as many types of bugs as species of finches in the Galapagos islands. For instance, one well-known type of bug occurs when you work with an array and try to read or write outside its bounds. In C, this typically results in a segmentation fault or in silent memory corruption; in Java and other managed languages, it raises an exception (ArrayIndexOutOfBoundsException).
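As a minimal sketch of the Java side of this example, the snippet below reads past the end of an array and catches the resulting exception. The class and method names (ArrayBoundsDemo, readOrDefault) are illustrative assumptions, not part of any library.

```java
// Hypothetical demo class; names are illustrative only.
class ArrayBoundsDemo {
    // Returns values[index], or fallback when the index is out of bounds.
    static int readOrDefault(int[] values, int index, int fallback) {
        try {
            return values[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            // The JVM checks every array access and raises this exception
            // instead of touching memory outside the array.
            return fallback;
        }
    }

    public static void main(String[] args) {
        int[] values = {10, 20, 30};
        System.out.println(readOrDefault(values, 1, -1)); // valid index
        System.out.println(readOrDefault(values, 3, -1)); // one past the end
    }
}
```

In real code, checking the index against values.length is usually preferable to catching the exception; the catch is used here only to make the symptom visible.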
Detecting bugs is like fishing: you know what fish you’re looking for, and you set up your fishing strategy accordingly. Of course, you would not fish for Mexican barracudas the same way as for mackerel from the Channel.
The same applies to bugs: there are different strategies to detect them. Let’s go through them.
1. Perfect programming
The first strategy is to recruit a perfect programmer. She (perfection is a woman’s value!) perfectly understands the requirements, never makes a mistake, and, last but not least, perfectly predicts all users, all inputs in the field, and all environment configurations of the production servers (including the latest upgrade of your-favorite-Linux-distro).
Pros: no bugs
Cons: if she exists, she has already been recruited by the competitor.
2. Static Analysis
Remember CS101. You learned that a compiler checks all kinds of things and detects bugs in advance.
For instance, consider:
int f(int x, int y) { return x + y; }
A call such as f(3, "42") would not compile, because a string cannot be passed where an integer is expected.
The vision of static analysis is to statically detect bugs, as an extension of the compiler, a kind of super compiler check. It is indeed powerful. For instance, static analysis for C code as done in Astrée detects many runtime errors, such as null pointer dereferences and out-of-bounds array accesses.
That’s great, but there’s no free food. Static analysis has two major limitations:
- First, it detects only the bugs it has been built for, mostly runtime errors. Outside those types of bugs, static analysis catches nothing. Basically, you don’t catch Mexican barracudas with mackerel fishing lines.
- Second, it requires a lot of assumptions in order to create a closed world. For instance, static analysis for C code can give you strong guarantees on the absence of memory errors… only if dynamic allocation is forbidden. No malloc, period.
Dynamicity and static analysis address different concerns: usage adaptation and ease of configuration for the former, absence of certain types of bugs for the latter.
Let us now dwell on dynamicity and openness and consider a standard Java enterprise application:
- It can be configured with XML files specifying the classes to be instantiated and the domain objects that can be created.
- It uses a plugin mechanism based on dynamic code loading.
- It calls five different external web services, which can fail without notice. The service API or the output data format can change at any service update made by third-party partners.
As a result of dynamicity and openness, this app may encounter:
- a ClassCastException if the configuration file does not match the binary dependencies,
- a NoSuchMethodException if the plugin architecture exploits too many undocumented assumptions,
- and many I/O exceptions related to changes in the third-party services.
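The first two failure modes in this list can be sketched with plain reflection. The snippet below is only an illustration under stated assumptions: the Plugin interface, the HelloPlugin class and the helper names are hypothetical, and the class names passed to load stand in for values read from a configuration file.

```java
// Hypothetical demo; Plugin, HelloPlugin, load and lookup are illustrative names.
class ConfigLoadingDemo {
    // Contract the application expects every configured class to implement.
    interface Plugin { String name(); }

    public static class HelloPlugin implements Plugin {
        public String name() { return "hello"; }
    }

    // Instantiates the class named in the "configuration" and reports what happened.
    static String load(String className) {
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            Plugin plugin = (Plugin) instance; // fails if the class is not a Plugin
            return "loaded " + plugin.name();
        } catch (ClassCastException e) {
            return "ClassCastException";
        } catch (ReflectiveOperationException e) {
            return e.getClass().getSimpleName();
        }
    }

    // Looks up a no-argument public method that a plugin architecture assumes exists.
    static String lookup(Class<?> type, String methodName) {
        try {
            return type.getMethod(methodName).getName();
        } catch (NoSuchMethodException e) {
            return "NoSuchMethodException";
        }
    }

    public static void main(String[] args) {
        System.out.println(load(HelloPlugin.class.getName()));  // matching configuration
        System.out.println(load("java.lang.StringBuilder"));    // wrong type in configuration
        System.out.println(load("com.example.Missing"));        // class absent from the classpath
        System.out.println(lookup(String.class, "undocumentedHook"));
    }
}
```

None of these errors is visible to the compiler, because the class and method names only exist as strings until the program runs.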
The point here is that dynamicity and static analysis pull in opposite directions. You can have safe Java:
- if you disable “null”,
- if you don’t use third-party libraries and services,
- if you don’t use dynamic data structures,
- and if you deploy on a single fully-controlled server.
That’s not possible in most cases. Our apps evolve in a dynamic world, and it’s impossible to give up dynamicity for static analysis (except in a handful of specific domains).
To sum up about static analysis
Pros: detects all instances of specific classes of bugs
Cons: requires a static and closed world, produces false positives
3. Testing
Testing is about assessing the quality of a piece of software before it is shipped to production. One classical way of testing is to take the specification used for development and to check that the app does what it is expected to do. This has to be done by someone other than the developer, in order to catch ambiguities and hidden assumptions.
Testing can be automated in order to run correctness checks as often and as cheaply as possible. There are many test automation strategies, including “unit testing” with JUnit if you deal with plain objects, and web testing with Selenium if you test the front-end of a web app.
Well-done testing enables developers to obtain very high-quality software. Automated testing gives them enough confidence to deploy automatically to production, with no further manual tests. This has given birth to the concept of “continuous deployment”, where the production environment is updated several times a day with bug fixes and new features.
However, testing is no silver bullet. It suffers from two problems. The first is that it is impossible to test all possible inputs or sequences of inputs. Even after thorough testing, there may remain corner cases where the software does not behave as expected.
The second problem is… dynamicity and openness! A test is a concrete execution: it runs in a concrete execution environment, with concrete configuration parameters. However, what works on one configuration may fail on another. So there is not only the problem of covering the input space, there is also the problem of covering the environment space.
This environment space is huge. Let us optimistically assume that your app may run on 4 different machines, 3 different operating systems and 2 different virtual machines, using 15 libraries with 5 different versions each. Even the crude product 4 × 3 × 2 × 15 × 5 already gives 1800 execution environments; counting every combination of library versions gives 4 × 3 × 2 × 5^15, roughly 7 × 10^11. Not only is this space large, it is also quite hard to explore in a systematic manner. Now imagine what it is like for Android applications, where there are thousands of devices and myriad variations of the OS…
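To illustrate the first problem, corner cases in the input space, here is a small sketch of a method that passes an ordinary test yet misbehaves on an extreme input. The class and method names are hypothetical, and the method is a textbook example chosen for this illustration, not code from a particular project.

```java
// Hypothetical demo; CornerCaseDemo and midpoint are illustrative names.
class CornerCaseDemo {
    // Intended to return the value halfway between low and high.
    // The sum overflows the int range when both arguments are large.
    static int midpoint(int low, int high) {
        return (low + high) / 2;
    }

    public static void main(String[] args) {
        // A typical test input: the result is the expected 5.
        System.out.println(midpoint(2, 8));
        // A corner case rarely covered by tests: the sum wraps around to -2,
        // so the method returns -1 instead of Integer.MAX_VALUE.
        System.out.println(midpoint(Integer.MAX_VALUE, Integer.MAX_VALUE));
    }
}
```

A test suite built from typical values would report this method as correct, which is exactly how such a bug reaches production.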
To sum up about testing
Pros: detects many bugs, can be done fully automatically, enables continuous deployment
Cons: the testing space is enormous (input space × environment space), and testing is costly if it includes a manual part.
4. Bug monitoring in production
The ultimate goal of software is to be used in production, in the field. Handling bugs in production may seem too late, but errors are inevitable, so it is still very important to detect them. When a bug is detected in the field, two actions are important. The first is to fix it, so that it does not happen again. The second is to improve the pre-production engineering process (development, testing) so that a bug of the same kind is caught earlier.
Detecting bugs in production is called bug monitoring (or, more precisely, failure monitoring). Many production errors can be detected if appropriate measures are taken. However, not all production bugs can be detected. Why? Because detection requires an observable symptom, such as a crash. This means that bug monitoring is always built on top of symptom characterization, such as crashing processes (e.g. segfaults), crashing requests (e.g. HTTP 500), crashing threads, the presence of an exception, etc. In other words, bug monitoring in production is like fishing: one catches the bugs for which one has designed the lines.
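One of these symptoms, the crashing thread, can be observed with the standard Thread.setDefaultUncaughtExceptionHandler hook. The sketch below is a minimal illustration, not how any particular monitoring product works: the class name, the in-memory SYMPTOMS list and the helper methods are assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo; a real monitor would send symptoms to a backend instead.
class CrashMonitorDemo {
    // In-memory record of observed symptoms.
    static final List<String> SYMPTOMS = new CopyOnWriteArrayList<>();

    // Registers a JVM-wide handler called whenever a thread dies from an uncaught exception.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) ->
                SYMPTOMS.add(thread.getName() + ": " + error.getClass().getSimpleName()));
    }

    // Runs the task in a named thread and waits for that thread to terminate.
    static void runAndWait(String name, Runnable task) {
        Thread worker = new Thread(task, name);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        install();
        runAndWait("worker-1", () -> { throw new IllegalStateException("boom"); });
        runAndWait("worker-2", () -> { int[] a = new int[1]; a[2] = 0; });
        System.out.println(SYMPTOMS);
    }
}
```

The handler only sees exceptions that escape a thread. A bug that yields a wrong result without throwing leaves no symptom for it to catch, which is the limitation described above.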
Of course, bug monitoring has drawbacks:
- it requires adding a component to the application (the monitoring library),
- it has some overhead (no free food ever :)),
- and it generates a whole lot of monitoring data that has to be stored somewhere for some time and displayed somehow.
To sum up
Bug monitoring complements static analysis and testing and addresses the bugs that make it into production.
Pros: detects the production bugs that hit real users, provides execution data (e.g. actual values) to aid diagnosis.
Cons: overhead and need to handle monitoring data
5. Wrap up
There is no silver bullet to combat and prevent bugs, especially for open and dynamic software applications that run in diverse environments. Bug monitoring in production is an essential measure to keep users happy and to continuously improve your software quality. Therefore, you need to find a monitoring solution.
Makitoo provides monitoring and analytics for Java applications.
Why does Makitoo catch many production bugs?
Because Makitoo’s monitoring is very fine-grained and considers most production symptoms. For instance, crashes, exceptions, null values, OutOfMemoryErrors, infinite loops, deadlocks, and many I/O problems (e.g. network latency) are detected.
We keep adding new detectors to give you the most precise picture of your production bugs. We take great care to keep the overhead very small (below 5%), and we handle the monitoring data within our solution. Our dashboard gives you aggregated and actionable insights about the bugs, so that you can understand their root causes at a glance.
How does Makitoo work?
We invite you to read it here, everything is well-explained ;)
In short, Makitoo automatically inserts monitoring probes into your code in order to catch every error, so it requires no manual error tracking and it is exhaustive.
Discover more features and details on the Makitoo website: makitoo.com
This article has been written by Martin Monperrus, Associate Professor at University of Lille, Adjunct Researcher at Inria and Co-founder of Makitoo