Building a Part-of-Speech (POS) tagger for domain-specific words in bug reports — Part 5

We’re almost there!

Luis Figueroa
7 min read · Dec 3, 2018

I have been M.I.A. for the past few weeks. Not that I was sent off somewhere and disappeared, but that I was lost in the never-ending amount of work I have had to do for my classes. Seriously, every introduction to my blogs seems like a rant about how little time I have had to work on my comps project, but quite frankly, it’s the truth.

So instead of going into depth about that issue, let me tell you about the work I have been doing for my comps. Over the past few weeks, I have been running into a problem: how do I properly evaluate my data? In my last blog post, I talked about the problems with the POS tagger’s tokenizer and what that meant for evaluating my results.

Briefly, there were two things I noted at the end of my previous blog that I needed to work on: 1) figure out how to improve the tokenization of the POS tagger, and 2) find out more about the evaluation methodology for POS taggers.

In this blog, I’ll focus on the second point: I’ll explain some of the different evaluation methodologies and which one best applies to my project.

The Evaluation Methodology

Arguably, the most “intuitive” way of evaluating the performance of a tagged sentence is by its accuracy. Properly defined, accuracy is the ratio of the number of words that are correctly tagged to the total number of words tagged. You probably recognize this, since it has been the method I have used to evaluate the performance of the POS tagger.
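To make that concrete, here is a minimal Python sketch of that ratio (my own illustration, with made-up tags), assuming the predicted tags and the ground-truth tags are already aligned one-to-one:

```python
# A minimal sketch of token-level accuracy, assuming the predicted tags and
# the ground-truth tags are already aligned one-to-one (hypothetical data).
predicted = ["NN", "VB", "DT", "NN", "JJ"]
gold      = ["NN", "VB", "DT", "NN", "NN"]

correct = sum(1 for p, g in zip(predicted, gold) if p == g)
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2f}")  # 4 correct out of 5 -> 0.80
```

That zip-and-compare step is exactly what breaks down when the two token sequences have different lengths, which is the issue I quote below.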

Without a doubt, this is a sound way to evaluate the performance of the POS tagger. But as I talked about previously, “the problem is that my method assumes that the number of tagged tokens is the same as the number of ground truth tokens. As a result, if my number of predictions is larger than the number of ground truth tokens, I have extra words that I am unable to compare to its true value and my method breaks down” [from my previous blog post].

In fact, in a paper by Patrick Paroubek, he summarized the work by Adda et al. on the “conditions” for “correct tagging” used in the evaluation of accuracy:

  1. The “segmentation convention”, in other words the tokenizer, used by the tagger has to be the same as the one used for the ground truth data. If this is not met, then some “realignment” process is needed. This is what happens with my data: the tokenization produced by the POS tagger does not match the ideal tokenization of my ground truth data.
  2. The tagset used by the tagger has to be the same as the one used to annotate the ground truth data. Otherwise, “specific mapping procedures” have to be applied. In my case, this condition is met, since both my training set and my evaluation set are tagged with the same tagset.

In my project, my tokenizer is unable to perform the ideal tokenization, and thus I am not able to calculate the accuracy of the POS tagger because condition #1 is not met.

Faced with this dilemma, I was directed to look at other evaluation methods I could possibly use, so I will talk about two: confusion matrices and k-fold cross validation.

Confusion Matrix

If you’re reading this post based on the fact that my comps is a Machine Learning/Natural Language Processing project, then you may be familiar with the term “confusion matrix.”

Essentially, a confusion matrix is a table that is used to describe the performance of a classification model. As you may know, you need an evaluation set with ground truth data to create one.

In our case, we will start with an example of a confusion matrix for a binary classifier. Recall that a binary classifier is a classifier that can only say “yes” or “no” (then, this can be extended to more than two classes, without loss of generality — look at my math major shine!).

In our example, we will have two possible predicted classes, “right” or “wrong”, which indicate whether the tag for a given word was correct or incorrect. Suppose the POS tagger made 200 predictions (in total, we have 20 sentences with 10 words each). Out of those, the POS tagger predicted the “right” tag 140 times and the “wrong” tag 60 times (at this point, you might realize how a confusion matrix may not be the best evaluation method for this project, but let’s continue for now).

The confusion matrix would look something like this:

Now here comes the confusing part: let’s define a couple of terms that many people (myself included) have found confusing at some point.

  1. True positive (TP): a case when the classifier predicts yes, and the actual answer is yes.
  2. True negative (TN): a case when the classifier predicts no, and the actual answer is no.
  3. False positive (FP): a case when the classifier predicts yes, but the actual answer is no.
  4. False negative (FN): a case when the classifier predicts no, but the actual answer is yes.
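To make these four terms concrete, here is a small sketch using scikit-learn’s confusion_matrix on made-up yes/no data (the library call and the data are my own illustration, not something from the project):

```python
# A small sketch of a binary confusion matrix with scikit-learn (toy data).
# Rows of the matrix are the true labels, columns are the predicted labels.
from sklearn.metrics import confusion_matrix

y_true = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "yes", "no", "yes", "no"]

# With labels=["no", "yes"], ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["no", "yes"]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```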

Because of a realization I had while investigating this evaluation method, I will stop right here and point out why it would not work for a POS tagger.

Problems with the Confusion Matrix

At this point, you might also pose the question: “If the POS tagger predicted a wrong tag, how is the ground truth tag also ‘wrong’?” This is the subtle issue: tags are not inherently right or wrong; a prediction only becomes “right” or “wrong” depending on whether the POS tagger matched the ground truth tag. In other words, “right” and “wrong” are not real classes with their own ground truth, so there is no “false positive” or “false negative” for a “classifier” like this one, and we move on. So much for nothing, confusion matrix; thanks.

k-Fold Cross Validation

Before we get to cross validation, we first have to talk about the problems of splitting a data set into a test (or evaluation) set and a training set.

Let’s use the image below as a reference:

If the bar above represents our data set, we have to decide what fraction of that data will become our test data and what remaining fraction will become our training data. The problem is that we want to maximize both of these sets: we want as many data points as possible in the training set to get the best learning results, and similarly, we want to maximize the test set in order to get the best validation.

If you noticed the highlighted splitting line, that is because it indicates a small “trade-off”: all the data that is put into the test set is lost to the training set, and vice versa.
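As a tiny illustration of that trade-off, here is what a plain split could look like with scikit-learn’s train_test_split (my own sketch with made-up numbers; the project itself does not necessarily use this function):

```python
# A minimal sketch of the train/test trade-off using scikit-learn's
# train_test_split (hypothetical data, not the project's actual pipeline).
from sklearn.model_selection import train_test_split

data = list(range(200))  # stand-in for 200 tagged data points

# Reserving 20% for testing means only 80% is left for training.
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(len(train), len(test))  # 160 40
```

Every point that ends up in the 40-element test set is a point the model never gets to learn from.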

This is where cross validation (CV) comes in! I am not a master in this, but I will explain it as best as I can:
The basic idea of k-fold CV is that if you have a data set, you will partition it into k bins of equal size. Suppose you have a data set of 200 data points (tokens, words, etc) and you want to create 10 bins, so k = 10.

This means that from our data set, we will have 20 data points per bin (10 bins x 20 data points per bin = 200 data points). From here comes the cool part: in k-fold CV, we will run k separate experiments. In this case, we would run 10 separate experiments.

Here’s how it goes: if we have k bins, we pick one of those k bins as our test set. Then, the remaining k-1 bins are collected into the training set. And as with any other machine learning setup, we train on the training set and measure performance on the test set.

Once again, the key is that we run this experiment k times, each time using a different bin as our test set and collecting the rest as the training set. In our example, we would run 10 separate experiments. At the end, we average the results of the 10 experiments.
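Here is a hedged sketch of that k = 10 setup using scikit-learn’s KFold; the evaluate function is a placeholder I made up to stand in for whatever train-and-score step the POS tagger would actually run:

```python
# A sketch of k-fold cross validation with scikit-learn's KFold (k = 10).
# The "evaluate" function and the data are placeholders, not the project's code.
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(200)  # stand-in for 200 data points
kf = KFold(n_splits=10, shuffle=True, random_state=42)

def evaluate(train_idx, test_idx):
    # Placeholder: train a tagger on data[train_idx], score it on data[test_idx].
    return np.random.rand()  # pretend this is the accuracy of one experiment

# Run the k experiments, one per held-out bin, and average the scores.
scores = [evaluate(train_idx, test_idx) for train_idx, test_idx in kf.split(data)]
print(f"mean accuracy over {len(scores)} folds: {np.mean(scores):.3f}")
```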

Conclusions

After reading about these evaluation methods, I believe that k-fold CV would be a very interesting way to evaluate the performance of the POS tagger. I realized that this method still requires the accuracy evaluation to work, since I would have to score each of the k experiments. On the other hand, k-fold CV is a good method to use when running experiments on a small data set, which is also the case for me.

For the next month, as my semester comes to an end, I have decided to test out this evaluation technique and “see how it goes”!

Disclaimer: If you find anything wrong with my explanation of the evaluation methods above, please let me know and I will change it ASAP! This is a learning process for me, and any opportunity to learn is more than welcome. Thank you!!
