Building a Part-of-Speech (POS) tagger for domain-specific words in bug reports — Part 4

The deadline is approaching

Luis Figueroa
6 min read · Nov 5, 2018

Quick Update

Previously, I talked to you about the underlying architecture of the POS tagger that I’m using. It was difficult to fully understand the Maximum Entropy Markov Model (or MEMM, as you may remember), but it was fun to try to explain it and to fully grasp it myself.

In this week’s blog post, things turn a little more serious… they turn spooky. Spooky in the sense that the deadline for our comprehensive project is really at the end of the street — and I’m not ready to get there yet.

Current Progress

One of the biggest challenges I had to solve was overcoming the “I don’t know what this does” feeling from before. Weeks ago, I talked about how important it was for me to understand the architecture of the POS tagger: if I didn’t understand it, how would I be able to optimize it?

Last time, I talked about how the main model behind the POS tagger architecture was the MEMM.

Starting at the fifth or sixth line (depending on where you start counting), you can see all the possible parameters you can use to train the POS tagger.

One of my biggest personal breakthroughs was realizing that all of the parameters shown in the figure above are the features the MEMM uses in its maximum entropy calculation, which determines the probability of each candidate tag so that an appropriate tag can be assigned (if this sounds unfamiliar, go ahead and read my previous post!).
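
To make this concrete, here is a minimal sketch of what a training properties file for the Stanford tagger could look like. The file names and the exact feature templates in arch are my own illustrative picks, not the actual configuration from my project:

    # Hypothetical training configuration for the Stanford MaxEnt tagger.
    # Where to save the trained model:
    model = bug-report-tagger.model
    # Tagged training data, with tokens in word_TAG format:
    trainFile = bug-reports-train.txt
    tagSeparator = _
    encoding = UTF-8
    # Feature templates ("architecture"): context words, tag history,
    # prefixes/suffixes, and word-shape features:
    arch = words(-1,1),order(2),prefix(6),suffix(6),unicodeshapes(1)

Each entry in arch turns on one family of features, which is why sweeping over these parameters changes what the model is able to learn.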

From this insight, I was able to sweep through a couple of parameters to try to understand their effect on the results. However, I ran into several problems as I attempted to streamline this workflow, so I would love to talk to you about the problems I have been facing. In my last post, I avoided being gloomy and talking only about problems; this week I cannot avoid it!

Challenges Thus Far

Every week is met with challenges. This is one thing I like and dislike about comps. While all the challenges I run into set my progress back, they also push me to solve problems and to learn more about my project.

In the process of streamlining the tests of the POS tagger, I ran into a couple of problems with the tokenization of my test sentences. It’s been a while since I last mentioned this word, so let’s review it. Tokenization is the process of dividing a string of text into tokens, which essentially means that we divide a sentence into words. The ideal tokenization of the sentence identity theft is not a joke would be a list of tokens such as [identity, theft, is, not, a, joke]. Notice that I said “ideal,” and the reason is that there are many ways to tokenize a sentence. For example, some tokenization processes throw out certain characters or punctuation marks. Let’s take another sentence: Bears. Beets. Battlestar Galactica. If we decide to tokenize this sentence and ignore the punctuation, we get the list [Bears, Beets, Battlestar, Galactica]. This is simple to achieve when dealing with natural English sentences, right? It seems like the tokenization process is rather trivial: you can just chop up sentences based on white space and get rid of all punctuation marks. But unfortunately, that’s not the case.

So let’s be more specific. There are a lot of tricky cases in which we really need to think about the punctuation marks and understand what is important for us to keep and what is not. Let’s consider one last sentence: I'm going to O'Reilly because my mom's car needs to be fixed. How would we properly tokenize this sentence? Would the ideal tokenization of the word I'm be [I, m], [Im], or [I'm]? How about O'Reilly? In this case we cannot get rid of all the punctuation marks, so we need to be careful and thoughtful about which ones we keep.
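
To see what a real tokenizer actually does with that sentence, here is a minimal sketch using the PTBTokenizer class that ships with Stanford CoreNLP (the class names should be accurate, but treat the exact output as something to verify against your own CoreNLP version):

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;

    import java.io.StringReader;

    public class TokenizerDemo {
        public static void main(String[] args) {
            String sentence = "I'm going to O'Reilly because my mom's car needs to be fixed.";
            // An empty options string asks for the tokenizer's default behavior.
            PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                    new StringReader(sentence), new CoreLabelTokenFactory(), "");
            while (tokenizer.hasNext()) {
                System.out.print(tokenizer.next().word() + " | ");
            }
        }
    }

With the defaults, clitics like I'm and mom's are split into [I, 'm] and [mom, 's], while O'Reilly stays as a single token: the tokenizer treats apostrophes differently depending on their context.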

This becomes a lot more important when you’re dealing with sentences that contain lots of punctuation marks. What would that occasion be, you ask? Well… bug reports!

There are many bug reports in which the punctuation marks are important, so I can’t build a general tokenizer that simply removes them. Let’s take a look at a few bug reports, where we see that punctuation marks such as underscores (_), commas (,), periods (.), hyphens (-), or quotation marks ("), among others, are important.

         5 example bug reports from the Firefox OS project:

  1. Aurora desktop B2G localizer builds failing with "Can't checkout https://hg.mozilla.org/releases/gaia-l10n/v1_2/eo!"
  2. [WebApp][Manifest] Use "chrome": { "navigation": false } in app manifest makes a packaged app's chrome visible
  3. crash in mozilla::layout::RenderFrameParent::RenderFrameParent
  4. Remove setCapture() from touch.js as it interfere with mouse events
  5. Switch Telephony.cpp to use nsTArrayHelpers.h implementation of nsTArrayToJSArray

From these examples, we can see that we need to be careful about the punctuation marks we throw away during the tokenization process. Looking only at the first example, the bug report Aurora desktop B2G localizer builds failing with "Can't checkout https://hg.mozilla.org/releases/gaia-l10n/v1_2/eo!" would ideally be tokenized into the following list: [Aurora, desktop, B2G, localizer, builds, failing, with, Can't, checkout, https://hg.mozilla.org/releases/gaia-l10n/v1_2/eo]. As you might be able to infer at this point, this is not the result I have been able to achieve.

In fact, for this particular sentence, my current tokenizer divides it into the following list: [Aurora, desktop, B2G, localizer, builds, failing, with, ", Ca, n't, checkout, https://hg.mozilla.org/releases/gaia-l10n/v1_2/eo!]. This is almost the ideal tokenization, but not exactly. For tokens like [WebApp], the ideal tokenization would be [[WebApp]], but the tokenizer outputs [-LSB-, WebApp, -RSB-], where it entirely replaces the brackets with “-LSB-” and “-RSB-” (standing for Left and Right Square Bracket). Since the tokenizer I am using is the default tokenizer of the Stanford CoreNLP API, I need to figure out how to prevent it from splitting up the bug reports into these tokens.
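
One lead I am looking at: PTBTokenizer accepts an options string that can turn some of this normalization off. Here is a minimal sketch of that idea (the option names are documented for PTBTokenizer, but I still need to verify how they interact with the tagger):

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;

    import java.io.StringReader;

    public class BracketDemo {
        public static void main(String[] args) {
            String title = "[WebApp][Manifest] Use \"chrome\" in app manifest";
            // normalizeOtherBrackets=false keeps [ and ] as-is instead of
            // rewriting them to -LSB- and -RSB-; normalizeParentheses=false
            // does the same for ( and ) versus -LRB- and -RRB-.
            String options = "normalizeOtherBrackets=false,normalizeParentheses=false";
            PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                    new StringReader(title), new CoreLabelTokenFactory(), options);
            while (tokenizer.hasNext()) {
                System.out.print(tokenizer.next().word() + " | ");
            }
        }
    }

Note that this only stops the escaping: [WebApp] would still be split into the three tokens [, WebApp, and ], so keeping it as a single token will probably require a custom tokenization rule or some post-processing on my end.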

The problem doesn’t stop here.

Why is it so important to fix this problem? Since the tokenizer is not splitting up the sentences as I want, I’m unable to properly calculate the accuracy of my model.

Why is that the case? Before I ran into these problems, I was dealing with sentences that allowed the tokenizer to be well behaved. By that I mean that the tokenizer was able to ideally tokenize my sentences. My current method of calculating the accuracy of my model is dividing the number of correctly tagged words by the total number of words in a sentence.

    accuracy = (number of correctly tagged tokens / N) × 100%

Formula for calculating the accuracy of a given sentence, where the numerator is the total number of correctly tagged tokens and N is the total number of tokens.

So if I have a sentence of 10 words and seven of those are properly tagged, then my accuracy would be (7/10) × 100% = 70%. For a given set of sentences, these individual accuracies are then averaged over the total number of sentences. The problem is that my method assumes that the number of tagged tokens is the same as the number of ground-truth tokens. As a result, if my number of predictions is larger than the number of ground-truth tokens, I have extra words that I am unable to compare to their true values, and my method breaks down.
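
Here is a minimal sketch of that accuracy computation with made-up tag lists, showing exactly where it falls apart when the token counts disagree:

    import java.util.List;

    public class SentenceAccuracy {
        // Per-sentence accuracy: correctly tagged tokens over total tokens.
        // This only works when predictions and ground truth align one-to-one.
        static double accuracy(List<String> predicted, List<String> gold) {
            if (predicted.size() != gold.size()) {
                // This is where my method breaks down: with different token
                // counts there is no obvious one-to-one alignment anymore.
                throw new IllegalStateException("Token counts differ: "
                        + predicted.size() + " vs. " + gold.size());
            }
            int correct = 0;
            for (int i = 0; i < gold.size(); i++) {
                if (predicted.get(i).equals(gold.get(i))) {
                    correct++;
                }
            }
            return 100.0 * correct / gold.size();
        }

        public static void main(String[] args) {
            List<String> gold = List.of("NN", "VBZ", "DT", "NN");
            List<String> predicted = List.of("NN", "VBZ", "DT", "JJ");
            System.out.println(accuracy(predicted, gold)); // prints 75.0
        }
    }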

Next Steps

There are a couple of things I know I should be working on for the final weeks of my comprehensive senior project.

  1. The tokenizer really set me back on the progress I would have loved to have made by now. However, it is an important setback that has made me reconsider the way I should be tokenizing the sentences. An important part of this research is understanding the role punctuation marks play in the structure of bug reports, so it is important for me to know which punctuation is necessary to keep and which is not.
  2. As I talked to my advisor about different methods to evaluate the accuracy of my model, I was pointed to two different things: 1) confusion matrices and 2) k-fold cross-validation. There is not much I can tell you about them yet; I still need to research these methods and see how I can interpret the new results. (For the first one, see the sketch below.)
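
In case it helps to picture a confusion matrix over POS tags, here is a minimal sketch with made-up data; this reflects my current understanding of the idea rather than anything already in my project:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TagConfusionMatrix {
        public static void main(String[] args) {
            // Hypothetical aligned ground-truth and predicted tags.
            List<String> gold      = List.of("NN", "VB", "NN", "JJ", "NN");
            List<String> predicted = List.of("NN", "NN", "NN", "JJ", "VB");

            // counts.get(g).get(p) counts how often true tag g was predicted
            // as p; the off-diagonal entries reveal systematic confusions.
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i < gold.size(); i++) {
                counts.computeIfAbsent(gold.get(i), k -> new HashMap<>())
                      .merge(predicted.get(i), 1, Integer::sum);
            }
            counts.forEach((g, row) -> System.out.println(g + " -> " + row));
        }
    }

Unlike a single accuracy number, a table like this would tell me which tags my tagger mixes up most often.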

This was a more serious blog post than the last one, and I think by now you can see why. There is not much time left to work on my project, and the clock keeps ticking. There’s lots of stuff to work on. However, I’m all the more excited and motivated to overcome the challenges I have encountered and to push my project even further.
