What are the typical words in exception messages?
In an earlier blog post we examined the frequency of Java exceptions in the Samebug database.
This time we’ll take a look at the messages of Java exceptions. Let’s take the different messages from the exception database and count each of them only once, regardless of how many times it occurred.
Since the messages come from stack traces collected from the web, some of them exist in several different readings because of copy-pasting between sites or possible parsing problems. With this in mind, the most frequent message is currently "An error occurred while executing doInBackground()", with circa 65 different versions (e.g. with the typo "occured"; we treated messages within a Levenshtein distance limit as versions of the same message).
Unfortunately, it is quite hard to reduce them to a common denominator, so we’ve kept one instance of every variation of this kind. There are also messages which contain program-generated parts, for example the name of a class or an index number. Some of them can be reduced easily, while others require great effort, and it is not obvious that it is worth it.
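To make the "within a Levenshtein distance limit" idea concrete, here is a minimal sketch of how two readings of a message could be compared. The function names and the threshold of 3 are assumptions chosen for illustration; the post does not state which limit was actually used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def is_variant(a: str, b: str, max_distance: int = 3) -> bool:
    # max_distance is a hypothetical threshold, not the one used for the post.
    return levenshtein(a, b) <= max_distance


canonical = "An error occurred while executing doInBackground()"
typo = "An error occured while executing doInBackground()"
print(levenshtein(canonical, typo), is_variant(canonical, typo))
```

The two readings differ by a single missing letter, so any small threshold would group them. Messages with long program-generated parts are harder, because their distance grows with the length of the generated part.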
Below are the most frequent words in the messages (stopwords excluded). There’s no surprise among the top words.
We can try to reduce the above-mentioned similar messages if we count every distinct word list derived from a message only once (through tokenization, i.e. splitting the message into an array of words, and stopword removal).
Example: message A, "String index out of range: 4", and message B, "String index out of range: -1", will both result in this list: [string,index,out,of,range]. Unfortunately, we also lose information this way, because it is not even obvious how exception messages should be tokenized. What should we do with dot or backslash characters, for instance: should we treat them as separators or not? Just think of qualified method names or various path expressions. Nevertheless, we did our grouping and nothing changed in the top ten. In the lower slots there were minor changes: "http" improved its ranking by one, "expected" fell out of the top 25 and was replaced by "activity", etc.
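The grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the tokenizer keeps runs of letters (so digits, dots and backslashes all act as separators, which is just one of the possible choices discussed above), and STOPWORDS is a tiny hypothetical list, since the list used for the post is not given.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list (an assumption for this sketch).
STOPWORDS = frozenset({"the", "a", "an", "is", "to"})


def tokenize(message, stopwords=STOPWORDS):
    # Lowercase, keep runs of letters only, then drop stopwords.
    words = re.findall(r"[a-z]+", message.lower())
    return tuple(w for w in words if w not in stopwords)


def word_counts(messages):
    # Every distinct word list contributes only once, however many messages produce it.
    distinct_lists = {tokenize(m) for m in messages}
    counts = Counter()
    for words in distinct_lists:
        counts.update(words)
    return counts


messages = [
    "String index out of range: 4",
    "String index out of range: -1",
    "Unable to start activity",
]
counts = word_counts(messages)
print(tokenize(messages[0]))
print(counts["string"], counts["activity"])
```

Both "String index out of range" messages collapse to the same word list, so "string" is counted once rather than twice.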
We’ve also examined the n-grams in the messages (with and without stopwords, considering distinct 2-grams only). In the case of 2-grams with stopwords, the top ones were "unable to", "failed to", "could not", "cannot be", "is not", "not found", "cast to", "be cast", "to start", "does not", "no such" and "exception is", so we turned to those without stopwords.
The last one in the diagram, "major.minor version", is in the list because we did not handle the dot character as a separator. As for the fourth-placed "ljava lang", it results from tokenizing a somewhat unusual Java class name for a String array, [Ljava.lang.String;, which is seemingly quite frequent in exception messages.
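As a rough illustration of how such n-grams arise, the sketch below extracts word n-grams from a message. It rests on assumptions: the tokenizer splits on every non-letter character (which differs from the handling of the dot described above, so "major.minor" is split here), "distinct" is interpreted as counting each n-gram at most once per message, and the two sample messages are merely illustrative.

```python
import re
from collections import Counter


def tokenize(message):
    # Assumption: every non-letter character, including the dot, is a separator.
    return re.findall(r"[a-z]+", message.lower())


def ngrams(words, n):
    # All runs of n consecutive words, joined by a single space.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def distinct_ngram_counts(messages, n):
    counts = Counter()
    for message in messages:
        # set() makes each n-gram count at most once per message.
        counts.update(set(ngrams(tokenize(message), n)))
    return counts


messages = [
    "[Ljava.lang.String; cannot be cast to java.lang.String",
    "Unsupported major.minor version 51.0",
]
bigrams = distinct_ngram_counts(messages, 2)
print(bigrams["ljava lang"], bigrams["lang string"])
```

With this tokenizer the array class name [Ljava.lang.String; falls apart into "ljava", "lang" and "string", which is how a 2-gram like "ljava lang" can end up near the top of the list.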
Diagram 3 shows the most frequent 3-grams without stopwords. Expressions start to make more sense here, and specific brand or product names like catalina, apache or android appear in the top 50.
And lastly, let’s take a look at the 4-grams. Version numbers appear among the top results.
We’ve grouped the messages by exception type as well and compared the top words by type, choosing only a few of the most important types. As we can see, the most frequent words are quite strongly related to the specific exception types and their role. For example, the words in ClassNotFoundException messages relate to class, module, lib and path, and to finding and loading the class. In contrast, the top words of RuntimeException are quite a medley, with two exception class names among them.
Finally, some statistics about the most frequent words.
The 1st and 2nd exception types are the types in whose messages the given word occurs most frequently. We also give the contribution of these types to all occurrences of the word in the message corpus, as a percentage. For example, the word "unable" occurs most frequently in RuntimeException, with such a high contribution (35.6%) that we can rightly say that RuntimeException is the characteristic exception for this word. RuntimeException is often in first place, which is why we’ve listed the second most important type as well.
Some words, like "exception" or "found", seemingly have no characteristic exception type. The last column shows the total number of different exception types where the given word occurs in a message.
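The columns described above could be computed along the following lines. This is a sketch, not the procedure used for the post: the input format (one pair of exception type and word per occurrence), the function name and the toy data are all invented for illustration and do not come from the Samebug database.

```python
from collections import Counter, defaultdict


def word_type_stats(occurrences):
    """occurrences: iterable of (exception_type, word) pairs, one per word occurrence."""
    per_word = defaultdict(Counter)
    for exception_type, word in occurrences:
        per_word[word][exception_type] += 1

    stats = {}
    for word, by_type in per_word.items():
        total = sum(by_type.values())
        # The two types with the most occurrences and their share of all occurrences in percent.
        top_two = [(t, round(100.0 * c / total, 1)) for t, c in by_type.most_common(2)]
        stats[word] = {"top_types": top_two, "type_count": len(by_type)}
    return stats


# Invented toy data, only to show the shape of the result.
occurrences = (
    [("RuntimeException", "unable")] * 3
    + [("IllegalStateException", "unable")]
    + [("ClassNotFoundException", "found")]
    + [("FileNotFoundException", "found")]
    + [("NoSuchMethodException", "found")]
)
stats = word_type_stats(occurrences)
print(stats["unable"])
print(stats["found"]["type_count"])
```

A word whose first-place share is high, like "unable" in the toy data, has a characteristic exception type, while a word spread evenly over many types, like "found" here, does not.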