Build An NLP Project From Zero To Hero (3): Data Preprocessing

Khaled Adrani
Published in UBIAI NLP
Dec 27, 2021
How do we make sure we have the right data to analyze the stock market?

We continue the series, and in this article we focus on Data Preprocessing. It is usually deemed even more important than training the model itself: garbage in, garbage out.

Data Preprocessing is the process of transforming raw data into an understandable format that suits your task. Naturally, we cannot work with raw data directly, let alone demand that machines understand it. Keep in mind that our data quality must stay above a certain threshold; we will elaborate on that later.

First, we will talk about Data Preprocessing in general, and then we will explain the techniques we used.

So let us dive in!

A Gentle Introduction to Data Preprocessing

Raw data is chaotic, ambiguous, and unclear. It needs to be preprocessed before it becomes useful for training a model.

One good example is training a sentiment analysis model, a model that predicts whether the sentiment of a text is positive, negative, or neutral. You encode the text as vectors and train your model, but it does not seem to improve.

You decide to peek at the most frequent words in the model’s vocabulary and find words like “I”, “they”, “is”, “and”, and so on. These are stopwords that bring no new information to your model. As a result, the model was never given meaningful embedding vectors from which it could recognize the sentiment behind each text.

This example is a bit oversimplified, but it illustrates why preprocessing your data is so important, especially in NLP.

The general Steps of Data Preprocessing in a Machine Learning Project:

  • Exploratory Data Analysis: or EDA is the initial step of investigating the data. It allows you to discover clues and patterns that will help you make the right decisions for the next steps.
  • Data Cleaning: Data can be ‘dirty’. It can have missing values and unwanted errors or ‘noise’. You must remove or alter these anomalies.
  • Data Integration: If you work with multiple datasets, you should aim to merge or unify them into a single source. This step requires knowing the metadata of each dataset well, detecting common entities and elements across datasets, and reconciling conflicting value representations (such as date and time formats).
  • Data Reduction: Data can be voluminous in size and complexity. This step aims to reduce the size without sacrificing the valuable information contained within the data. Techniques like dimensionality reduction and data compression are very helpful in this case. A very simple example is to use only the most relevant columns in a dataset for your problem.
  • Data Transformation: This step involves changing the data into a more accessible form or structure. There are many techniques, such as smoothing, aggregating data into summaries, discretizing continuous data into intervals, and normalizing data into a predefined range of values (a small pandas sketch of a few of these steps follows this list).
  • Data Labeling: It is the process of identifying raw data (images, text files, videos, etc.) and adding one or more meaningful and informative labels to provide context so that a machine learning model can learn from it.
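To make a few of these steps concrete, here is a tiny, generic pandas sketch. The ‘price’ and ‘volume’ columns are invented for illustration and have nothing to do with our tweets data:

import pandas as pd

# A toy DataFrame with a missing value in each numeric column
df = pd.DataFrame({
    'price':  [101.2, None, 98.7, 103.5],
    'volume': [1200, 950, None, 1800],
    'note':   ['ok', 'ok', 'ok', 'ok'],
})

df = df.dropna()                              # Data Cleaning: drop rows with missing values
df = df[['price', 'volume']]                  # Data Reduction: keep only the relevant columns
df = (df - df.min()) / (df.max() - df.min())  # Data Transformation: normalize into the [0, 1] range
print(df)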

You will not need to use every step and every technique. It all depends on the requirements of your project. For our project, our data is text, so we need to know the steps of Text Preprocessing:

  • Tokenization: Dividing a text string into smaller parts or “tokens”. Tokens can be words, characters, or subwords, so tokenization comes in three forms: word, character, and subword (n-gram character) tokenization. Tokens are the building blocks of natural language and the most common way of processing raw text, since their ultimate goal is to build the vocabulary of a given corpus.
  • Normalization: Putting all text into a predefined standard, such as converting all characters to lowercase and lemmatizing tokens.
  • Denoising: Removing any undesired ‘noise’ in the data, for example extra white space and special characters like punctuation (a small Spacy sketch of all three steps follows this list).
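Here is a small sketch of all three steps applied to a made-up tweet, using Spacy (which we install later in this article):

import spacy

nlp = spacy.load('en_core_web_md')

raw = "  $AAPL is looking GREAT today!!! https://example.com  "
doc = nlp(raw.strip())  # denoising: trim the surrounding white space

clean_tokens = [
    token.lemma_.lower()  # normalization: lowercase and lemmatize
    for token in doc      # tokenization: Spacy splits the text into tokens
    if not (token.is_punct or token.is_space or token.like_url)  # denoising: drop punctuation, white space, URLs
]
print(clean_tokens)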

You can see that there are similarities between the general steps of data preprocessing and Text preprocessing. Denoising is Data Cleaning, and Normalization is a type of Data Transformation.

Sometimes these three steps overlap and follow no particular order. For example, you can consider tokenization a form of normalization, or you can denoise before tokenizing.

The steps mentioned in this section serve as a guideline or roadmap for your future NLP projects, so you won’t get lost or forget what needs to be done!

Now, it is time to apply this knowledge, and for every technique we use, we will certainly explain our reasons for choosing it.

Text Preprocessing for an NER model

Our objective is to train a Named Entity Recognition (NER) model, and for that we need annotated data.

The data we are working with is the financial tweets dataset from Kaggle.

We need to preprocess this data to obtain a clean and rich corpus for our model.

Prepare your workspace

I am using Google Colab. To download the dataset, follow the steps included in the previous article.

We will need Spacy as well:

!pip install -U spacy
!python -m spacy download en_core_web_md

Data Observation

Let us explore the data first.

We have two datasets: stockerbot-export.csv, which contains the text of the tweets, and stocks_cleaned.csv, which contains the ticker symbol for every company mentioned in the tweets. A ticker, or stock symbol, is a unique series of letters assigned to a security for trading purposes. For example, the ticker for Apple is ‘AAPL’.

import pandas as pd

# some rows are not parsed correctly; use error_bad_lines=False to skip them
# (in newer pandas versions, use on_bad_lines='skip' instead)
tweets = pd.read_csv('/content/stockerbot-export.csv', error_bad_lines=False)
tickers = pd.read_csv('/content/stocks_cleaned.csv')
A sample of the tweets’ dataset
A sample of the tickers’ dataset

The first dataset has 8 columns and 28264 rows and contains the textual data we need. The second dataset has 583 rows, each representing a company and its ticker symbol.
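If you want to verify these figures yourself, a quick check is enough:

# Quick sanity check of both DataFrames
print(tweets.shape)             # (28264, 8)
print(tickers.shape)            # 583 rows, one per company
print(tweets.columns.tolist())  # list the available columns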

This is what we have concluded upon investigating the most important columns of the data:

  • ‘text’: Essential, as it contains the textual data we need for our model. However, some tweets do not provide complete information (for example, tweets that cut off mid-statement and link to external articles instead). There are also tweets that do not talk about the stock market at all but about other topics, such as political news. Such tweets may still matter for predicting the behavior of the stock market, but to simplify things we will work only with tweets directly involved with the domain.
  • ‘source’: According to the uploader of the dataset, the tweets of a list of influencers (platforms or persons) were monitored: [‘MarketWatch’, ‘YahooFinance’, ‘TechCrunch’, ‘WSJ’, ‘Forbes’, ‘FT’, ‘TheEconomist’, ‘nytimes’, ‘Reuters’, ‘GerberKawasaki’, ‘jimcramer’, ‘TheStreet’, ‘TheStalwart’, ‘TruthGundlach’, …]. On the other side, there are lesser-known sources with only a handful of tweets each. In total, there are 5879 sources, so we should filter the tweets according to their sources.
  • ‘company_names’: We found that some rows include names that are not company names. For the tweet “Who says the American Dream is dead?”, the company_names value was ‘American’. The tweet itself is too abstract, so we will rely on the tickers dataset instead to check whether a tweet references any company.

Data Cleaning

Through our observations, there are two columns we can focus on: “source” and “text”.

Source-Based Filtering:

First, we decided to give priority to the most influential, or primary, sources. For the secondary sources, we will keep those with the most tweets:

all_sources = list(tweets.source.unique())

primary_sources = ['MarketWatch', 'business', 'YahooFinance', 'TechCrunch', 'WSJ', 'Forbes',
                   'FT', 'TheEconomist', 'nytimes', 'Reuters', 'GerberKawasaki', 'jimcramer',
                   'TheStreet', 'TheStalwart', 'TruthGundlach', 'CarlCIcahn', 'ReformedBroker',
                   'benbernanke', 'bespokeinvest', 'BespokeCrypto', 'stlouisfed', 'federalreserve',
                   'GoldmanSachs', 'ianbremmer', 'MorganStanley', 'AswathDamodaran', 'mcuban',
                   'muddywatersre', 'StockTwits', 'SeanaNSmith']

# filter secondary sources by the frequency of their tweets, the higher the better
ls = []
for e in tweets.value_counts(['source']).head(100).index:
    ls.append(e[0])

secondary_sources = [s for s in ls if s not in primary_sources]
my_sources = primary_sources + secondary_sources

len(my_sources)
# Output: 130

Then we filtered out the data to contain only the selected sources:

data = pd.DataFrame()
for s in my_sources:
    data = pd.concat([tweets[tweets['source'] == s][:1000], data])

data.shape
# Output: (10873, 8)

You could argue that we should have filtered even further among the secondary sources, because they are much more dispersed, and we actually did: some sources contribute only a single tweet, while the sources at the top contribute between 300 and 900 tweets each.

In fact, the primary sources provide only 50 tweets in total, which was somewhat disappointing.

# Frequency of tweets from primary sources
tweets[tweets.source.isin(primary_sources)].value_counts(['source']).head(10)

Primary sources only provide very few tweets

# Frequency of tweets from secondary sources
tweets[tweets.source.isin(secondary_sources)].value_counts(['source']).head(10)

Secondary sources’ volume is much higher

# Frequency of tweets from all sources
tweets[tweets.source.isin(all_sources)].value_counts(['source']).head(10)

Interestingly, the secondary sources have the most tweets in the entire dataset

We have obtained 10873 tweets that fit our source criteria.

But through our exploration we discovered that, of the 50 primary-source tweets, only 29 are actually usable. Some tweets are too abstract or too far from the domain, and some do not provide the full information.

Text Quality Filtering:

We believe that cleaning the tweets based on their text is a better solution.

We filtered out any tweet that ends with three dots (indicating it links to an external source and is therefore missing information) and any tweet longer than 200 characters. In fact, we found an abnormal tweet with more than 16,000 characters. Strange!

We also removed any URLs found in the tweets, as they are just noise.

After that, we added another filter that keeps only tweets that mention companies by name or by ticker, that is, the tweets most relevant to the stock market:
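The sketch below shows roughly how this whole text-quality filter could look. It assumes the filtered DataFrame from the previous step is called data and that the tickers dataset exposes ‘company_name’ and ‘ticker’ columns; these column names are assumptions, so adjust them to match stocks_cleaned.csv.

import re

url_pattern = re.compile(r'https?://\S+|www\.\S+')
# Column names below are assumptions; check them against stocks_cleaned.csv
company_names = set(tickers['company_name'].str.lower())
ticker_symbols = set(tickers['ticker'].str.lower())

def mentions_company(text):
    # look for a ticker symbol ($AAPL or AAPL) or a company name in the tweet
    lowered = text.lower()
    tokens = set(lowered.replace('$', ' ').split())
    return bool(tokens & ticker_symbols) or any(name in lowered for name in company_names)

res = []
for text in data['text']:
    if text.endswith('...') or text.endswith('…'):  # truncated tweet linking to an external source
        continue
    if len(text) > 200:                             # abnormally long tweet
        continue
    text = url_pattern.sub('', text).strip()        # remove URLs, they are noise
    if mentions_company(text):
        res.append(text)

len(res)  # roughly 4124 tweets after this filtering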

The result contains 4124 tweets. We are now confident that we got enough data to train our NER model.

For now, we will keep only 700 tweets, which is more than enough.

corpus = res[:700]

We have completed the data cleaning. Now, to prepare for the next phase of the project, Data Labeling, we need to pre-annotate the data using Spacy.

Pre-annotation with Spacy

Data Labeling is critical for Machine Learning projects: by labeling every data observation correctly, you help the model generalize as well as possible over the phenomenon it is trying to learn.

We will be using Spacy, an open-source library for getting things done in NLP. It is intuitive and easy to use, and it makes a lot of tasks much simpler.

We will load the ‘en_core_web_md’ English model. Its pipeline contains many components, such as tokenization, part-of-speech tagging, and dependency parsing, but what interests us here is detecting entities, their labels, and their offsets:

import spacy

nlp = spacy.load('en_core_web_md')

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

To learn more about Spacy, check out this intro article.

Why do we need pre-annotation?

We are going to do Data Labeling; in other words, we are going to annotate, or mark, every entity in every text. Spacy provides basic functionality for that. However, it is not trained for our very specific domain (stock market and finance text), so we will need to correct the labels Spacy provides.

To import the pre-labeled documents into our annotation tool UBIAI, we need to follow a specific format. Each document is defined as a JSON object with two fields: ‘document’, which contains the text, and ‘annotation’, which contains the list of entities with their offsets.

Each entity entry will include the text, the spacy label, the starting position, and the ending position.

An example of an entity entry
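A minimal sketch of building this structure could look like the following, assuming corpus is the list of 700 cleaned tweets and nlp is the Spacy model loaded above. The entity key names (‘text’, ‘label’, ‘start’, ‘end’) are illustrative; check them against UBIAI’s import format.

# Pre-annotate each tweet with Spacy's entities
data = []  # reuse the name 'data' for the list of documents we will export
for text in corpus:
    doc = nlp(text)
    annotation = [
        {'text': ent.text,          # the entity text
         'label': ent.label_,       # the Spacy label
         'start': ent.start_char,   # starting position
         'end': ent.end_char}       # ending position
        for ent in doc.ents
    ]
    data.append({'document': text, 'annotation': annotation})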

Finally, we save the output as a JSON file:

import json

with open('text_data.json', 'w') as f:
    json.dump(data, f)

Conclusion

We have prepared our data for the next phase, Data Labeling. As you can see, it was not easy, even though this is a tutorial project. There were issues along the way, but we overcame them by staying as methodical as possible. This proves once more that data preprocessing is crucial for any Machine Learning project.

We will be using the UBIAI text annotation tool. We will demo the tool and present its many exciting features.

You can request a demo yourself by emailing admin@ubiai.tools or reaching out on Twitter.

Happy learning and see you in the next article!
