Build An NLP Project From Zero To Hero (4): Data Labeling

Khaled Adrani
Published in UBIAI NLP
Jan 11, 2022

Data labeling is without a doubt a critical phase in the workflow of any Machine Learning project. Here, we are preparing the study material for our students, or rather, our Machine Learning models.

Data Labeling or Data Annotation is defined as the process of tagging data, be it images, text files, or videos, and adding meaningful labels to provide context so that a machine learning model can learn from it.

First, we will talk about Data Labeling in general, and then we will apply it to our project: analyzing Stock Market tweets using a NER model.

Introduction to Data Labeling

The majority of ML models nowadays are Supervised Learning Models. They rely heavily on their training data to learn to generalize a given task. With every training iteration, the model adjusts its weights to better predict the correct labels provided by the human annotator.

Annotators are tasked with using their own judgment to annotate every training example: Is this email spam or not? Is this image a cat or a dog?

A more complicated example would be identifying entities such as email addresses, person names, and company names in every email. Here, the labeler is required to provide the span (start index, end index) of every entity in the text, as sketched below.
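To make this concrete, here is a minimal Python sketch of how such span annotations are often represented, using a spaCy-style (start, end, label) convention; the example text and offsets are made up for illustration:

# A hypothetical span-based annotation: each entity is (start_char, end_char, label).
example = (
    "Please contact jane.doe@acme.com about the Acme Corp invoice.",
    {"entities": [(15, 32, "EMAIL"), (43, 52, "COMPANY")]},
)

text, annotations = example
for start, end, label in annotations["entities"]:
    print(label, "->", text[start:end])
# EMAIL -> jane.doe@acme.com
# COMPANY -> Acme Corp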

As you can see, the labeling or tagging process ranges from a very simple binary choice to a complex and granular one. To make sure the annotation is successful, there are a few requirements:

  • User-Friendly Labeling Interface: In cases of complex tagging, labelers can be overwhelmed easily. Time is precious, and you want to finish the process rapidly and correctly. Developing or using an easy-to-use and efficient Data Annotation Tool can certainly help you.
  • Domain Knowledge Consensus: Every labeler needs to produce consistent and correct tags. If the data belongs to a specialized domain such as health care, finance, or scientific research, the labelers need subject matter expertise to perform the annotation. For this project, I had to read articles to understand the jargon of the Stock Market world. This knowledge also helps resolve conflicts when multiple labelers annotate the same corpus: labelers can make mistakes out of a lack of knowledge or out of bias, after all.
  • Assessing Data Quality: Always verify the accuracy of your labels throughout the entire phase. As I realized in this project, the dataset I collected covered certain labels well but not others. For example, I included the PERSON entity in the labels, but the dataset did not provide enough examples of that specific entity; on the other hand, there were plenty of examples of COMPANY names and their TICKERs.

This does not encompass all of the requirements for data labeling, but understanding these points will give you a very good start and save you a lot of headaches.

Here is a little story of mine: in a previous NLP project, I was developing a model to distinguish credible news sources from biased ones based on a dataset of articles. I labeled the data as follows: I had a list of news sources and relied on my cultural knowledge of them. Certainly, there are well-established sources that most people consider credible, and the rest can be considered less credible. So I labeled the articles based on a social consensus, for lack of better words.

The model got stuck at around 87% accuracy: I tried various architectures and many featurization techniques, but saw no improvement. Then I realized that among the articles from the less credible sources were good articles on par with those from the socially accepted sources. In other words, the model was given wrong labels, which resulted in poor performance.

I hope this section made you realize how crucial data labeling is. Now, let us do some practice!

NER Annotation

There exist many tools for NLP text annotation: paid ones like Amazon SageMaker Ground Truth or Prodigy, and open-source ones like Doccano. In this tutorial, we are going to try the UBIAI Annotation Tool.

The tool specializes in NER, Relation Extraction, and Text Classification. It has many fascinating features, such as auto-labeling with rules and dictionaries, and model-assisted labeling.

Defining the Labels

But before starting the work, we need to know exactly which labels we want to extract. We talked about labels in the pre-annotation section of the previous episode. The generic spaCy model is not fine-tuned on stock market text, so we need to redefine the labels:

  • COMPANY: company names.
  • TICKER: the special symbol identifying each company on a stock exchange: Apple’s ticker is AAPL on the NASDAQ stock exchange.
  • TIME: We regrouped the TIME and DATE labels into one label.
  • MONEY: self-explanatory
  • MONEY_LABEL: this one is a little tricky, and it came out after a lot of research. MONEY entities on their own are not really helpful without mentioning what they refer to, so this label simply indicates what the MONEY is about. For example, is it a target price? Or a new rise in a company’s share value? This label captures the jargon of the Stock Market.
  • PERCENT: A number indicating a percentage, a statistic.
  • CARDINAL: A number on its own, not a TIME, a DATE, MONEY, or anything else.
  • PRODUCT: any mention of a product.
  • PERSON: A real-life person name like a CEO or a journalist.
  • GPE: Geopolitical Entity.
  • EVENT: like a financial summit.
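For reference, here is the label set written out as a plain Python list. The names are our own choices, and the mapping from spaCy's generic labels to ours is only a hedged sketch of how the pre-annotations could be remapped (treating ORG as COMPANY is an assumption):

# Our custom label set for the stock market NER task.
LABELS = [
    "COMPANY", "TICKER", "TIME", "MONEY", "MONEY_LABEL",
    "PERCENT", "CARDINAL", "PRODUCT", "PERSON", "GPE", "EVENT",
]

# Assumed mapping from generic spaCy labels to our scheme
# (DATE and TIME are merged, ORG is treated as COMPANY).
SPACY_TO_CUSTOM = {
    "ORG": "COMPANY",
    "DATE": "TIME",
    "TIME": "TIME",
}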

Workflow

First, we need to create a project that will host our data and our task. The steps to do so are well covered in the Documentation. Basically, you will define your project as a span-based annotation project, configure its settings by adding the labels to use, and then import a dataset in a supported format: ours is a pre-annotated list of dictionaries produced by the default spaCy NER model. Check out the previous article for more information on how to preprocess your data.
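To give an idea of what the imported file contains, here is a minimal sketch of the pre-annotation step from the previous article. The exact JSON schema expected by the import is an assumption, so adapt it to the format described in the documentation:

import json
import spacy

# A minimal sketch of the pre-annotation step covered in the previous article.
# The output schema below is an assumption; adapt it to the tool's documented format.
nlp = spacy.load("en_core_web_sm")

tweets = [
    "$AAPL target price raised to $200 by Morgan Stanley",  # placeholder example
]

docs = []
for text in tweets:
    doc = nlp(text)
    docs.append({
        "document": text,
        "annotation": [
            {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
            for ent in doc.ents
        ],
    })

with open("preannotated_tweets.json", "w") as f:
    json.dump(docs, f, indent=2)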

Annotation Interface

UBIAI Annotation Interface

Most of your work will be in the Annotation Tab. You can add labels directly in the Entities, Relations, or Classification interfaces. The workflow is simple: first select the label, then highlight the words in the document text interface with your mouse. When you are done, validate your document and move on to the next one, as shown below:

Annotation routine

Pretty straightforward.

In the Details Tab, you can modify project settings such as the project name and description, as well as the existing labels for every type of NLP task (NER, relation extraction, or classification).

In the Documents Tab, you can import or delete documents.

Now, we reach more interesting features.

Pre-Annotation Interface

Before we even attempt to annotate on our own, maybe we have some pre-existing domain knowledge that can help us initially.

I have compiled a list of tickers and company names from this source. Refer to the documentation for more details, but here is how it looks:

Dictionary-based annotation

The other feature is rule-based matching: you can use patterns such as regular expressions to tag words. Before pre-annotating with a dictionary, I used a pattern to tag tickers. A ticker is a symbol of one to four uppercase letters. Tweets usually mention a ticker preceded by a dollar sign ($), but I could not rely on that, because tickers can also appear without it. I also found out that some jargon follows the same pattern, such as EPS (earnings per share).
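To illustrate, here is what such a ticker pattern could look like in plain Python; the regex syntax accepted by the tool's rule-based matching may differ slightly, so treat this as a sketch:

import re

# A ticker: one to four uppercase letters, optionally preceded by a dollar sign.
# Word boundaries keep the pattern from matching fragments of longer words.
TICKER_PATTERN = re.compile(r"\$?\b[A-Z]{1,4}\b")

text = "$AAPL and TSLA beat estimates, EPS up 12%"
print(TICKER_PATTERN.findall(text))
# ['$AAPL', 'TSLA', 'EPS']  <- note the false positive on EPS, as mentioned above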

Once you finish defining your dictionaries and rules, you are ready to launch the pre-annotation.

How to add a Python regex as a rule-based matching pattern

Metrics Interface

This is one of my favorite features. It helps you assess the quality of your data by showing the distributions of documents and their labels. If we want our model to perform well, we must make sure that the labels are balanced; otherwise, the model will not perform well on every label.

Specifically, the document label distribution was very helpful in monitoring the annotation process: after labeling just 100 documents, I realized that COMPANY, TICKER, MONEY, and MONEY_LABEL were the only well-represented labels in the corpus.
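You can reproduce a rough version of this check outside the tool as well. Here is a small sketch that counts labels in the pre-annotated JSON file from the earlier sketch (the file name and schema are assumptions carried over from that sketch):

import json
from collections import Counter

# Count how often each label appears across the annotated documents.
with open("preannotated_tweets.json") as f:
    docs = json.load(f)

label_counts = Counter(ann["label"] for doc in docs for ann in doc["annotation"])

for label, count in label_counts.most_common():
    print(f"{label:12s} {count}")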

Remember to always check your labeling metrics!

Model Assisted Labeling

I loved this feature because I had never worked with text annotation tools this way before.

Let's say you have labeled 100 documents: you can train an NER model within the platform. Afterwards, you can use this model to predict entities for each document, or auto-annotate the rest of the documents if you are satisfied with its performance!

Go to the Models Tab, select Named Entity Recognition, and then press Add a new Model.

A new model will be added to the list.

Prepare your model for training

You can prepare your model for training by opening the Train Model action in the Action column. You can enter some basic configuration such as the dropout rate and the train/validation ratio. Do not forget to select your model by clicking its checkbox.
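If you are curious what settings like the dropout rate and the train/validation ratio roughly correspond to, here is a hedged sketch of a plain spaCy training loop. This is not the tool's internal code, just an analogy using the same hyperparameters and placeholder examples:

import random
import spacy
from spacy.training import Example

# A rough analogy for the dropout rate and train/validation ratio settings.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TICKER")

# Placeholder annotated examples.
data = [
    ("$AAPL gains 3%", {"entities": [(0, 5, "TICKER")]}),
    ("$TSLA drops 2%", {"entities": [(0, 5, "TICKER")]}),
    ("Markets closed flat today", {"entities": []}),
]

split = int(len(data) * 0.8)                 # train/validation ratio of 0.8
train_data, dev_data = data[:split], data[split:]

optimizer = nlp.initialize()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], drop=0.2, sgd=optimizer, losses=losses)  # dropout rate 0.2
    print(epoch, losses)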

Return to the Annotation Tab and then press the Train Model button in the model labeling interface. After training finishes, you can use the Predict button to annotate the current document.

Train your model and then let it help you annotate!

Why is this so helpful? You can easily see what your model is successfully learning. In my case, it confirmed some of my observations from the Metrics Tab: the model learns the most represented labels, like MONEY_LABEL and TICKER, much better.

You can see the history of the model training below. In the Models Tab, click on the model's row and it will take you to this page.

History of Model Training

You can see that the model's performance fluctuated a bit. Click on Entity details for any model to see the individual scores for each label.

Model classification scores for each label

The model is doing well with COMPANY, TICKER, MONEY_LABEL, TIME, and CARDINAL.

The model is struggling with MONEY and PERCENT, but there is room for improvement. MONEY is especially puzzling because it was supposedly well represented; it is possible that the model is confusing this label with other numeric entities.

The model is underfitting on PRODUCT, EVENT, and PERSON. This is problematic because, in our project hypothesis, we believe PERSON entities are very influential in the stock market (for example, Elon Musk). We conclude that the current labeled dataset does not represent this entity well.
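Outside the tool, you could compute the same kind of per-label scores for a spaCy model with its built-in evaluation. The model path and the held-out examples below are placeholders, so this is only a sketch:

import spacy
from spacy.training import Example

# Evaluate a trained NER model per label on a held-out set.
# "my_stock_ner_model" is a placeholder path to a trained pipeline.
nlp = spacy.load("my_stock_ner_model")

dev_data = [
    ("$AAPL gains 3%", {"entities": [(0, 5, "TICKER")]}),  # placeholder example
]

examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in dev_data
]

scores = nlp.evaluate(examples)
print(scores["ents_per_type"])   # per-label precision, recall, and F-score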

In a real-life project, we can keep labeling more documents and retraining the model, and if we still do not see significant progress, we should rethink our dataset.

In conclusion, the model-assisted labeling feature was eye-opening: I watched the model learn, step by step, up close.

Conclusion

In this article, we learned what data labeling is about and saw how to perform the task using a text annotation tool. We discovered various techniques: dictionary-based labeling, rule-based labeling, and model-assisted labeling. We also saw the problems and difficulties that can arise in such a task.

Did you know that the NER models trained within the UBIAI tool are full-fledged models? We have already trained a model in this episode. However, we also want to improve the current model and explore training your own custom NER model using spaCy.

If you are curious about the tool and want to try it yourself, you can request a demo here.

If you have questions, do not hesitate to contact me through Linkedin or Twitter.

Happy learning and see you in the next article!
