Build An NLP Project From Zero To Hero (2): Data Collection

Khaled Adrani · Published in UBIAI NLP
7 min read · Dec 15, 2021
How to collect data the right way in order to predict the future of the Stock Market?

We continue the series, and in this article we are going to talk about Data Collection. As stated in the previous article, our objective is to train a Named Entity Recognition (NER) model and use it to extract meaningful information from stock tweets, and hopefully derive investment decisions from it. To this end, we need high-quality annotated data ready to be fed to our model, and that is the goal of this article.

I also want to mention that we will explore multiple methods of Data Collection and choose one of them for this article. This is because I want every article to contain a roadmap showcasing the overall steps needed to execute each phase of the project.

Let’s dive in!

Data Collection

Roughly speaking, Data Collection is what the name implies: collecting all kinds of information (data) needed for the project. Rigorously speaking, it is the process of collecting reliable and rich data for the project at hand. Certain techniques are needed to measure and analyze the collected data and to make sure that it serves our objective well.

We can collect data through various ways:

  • Open Datasets: Some companies, organizations, and institutions share their data publicly. Anyone can access, use, or share it.
  • Public APIs: A service or website provides a programmatic way to acquire its data, subject to certain rules (for example, an API usually enforces a rate limit on incoming requests, or it can cost you money). In most cases you will need an API key, which allows your application or script to communicate with the API.
  • Web Scraping: Another programmatic way to acquire data from the internet. You program a script to harvest whatever data is accessible on a website. However, this approach can cause legal problems, as many websites do not like to be crawled by every bot in the world. So before you decide to scrape a website, check its robots.txt file to see which routes are allowed (see the sketch just after this list). Web scraping can be done manually, but automating the process is significantly more efficient.
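As a quick illustration of that robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module. The URL and path are placeholders, not a site we actually scraped:

from urllib import robotparser

# Hypothetical example: check whether a path on example.com may be crawled
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if any user agent ("*") is allowed to fetch this page
print(rp.can_fetch("*", "https://example.com/some/page"))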

There are also other types of data sources, such as surveys, interviews, and libraries. They usually require a certain level of domain expertise, unlike the previous sources, which mostly need programming skills. They also do not scale well.

Practically, since we want documents discussing the stock market, we decided to collect financial tweets as our data. With the exponential growth of social media, big data has become one of the hottest means for researchers and experts to analyze stock market tendencies. This has been influencing the financial and economic domains, suggesting that social mood can help inform future investment and business decisions.

Yet, it was not a simple decision.

At first, we wanted to scrape a few websites (we really wanted to demonstrate some web scraping fundamentals), but we were not lucky, as those websites are restrictive. For your reference, this is a non-exhaustive list of tools you can use:

  • Requests: An elegant and simple HTTP library for Python, and your first friend to meet when going for web scraping.
  • BeautifulSoup: A Python library that parses data out of HTML and XML files. A typical routine is to get the page source code with Requests and then parse it with BeautifulSoup.
  • Scrapy: An entire open-source framework for web scraping. You can use it to build crawlers that ‘navigate’ entire websites efficiently.
  • Selenium: An automation tool primarily used for testing web applications. It can also be used for web scraping by simulating the behavior of a human user through a webdriver. This helps greatly with dynamic websites that load their data through JavaScript requests, which the previous tools will certainly struggle with. However, Selenium is not really scalable.

There are certainly other tools, but these are really all you need if you are just starting out with web scraping.
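To make the Requests + BeautifulSoup routine mentioned above concrete, here is a minimal sketch. The URL is a placeholder and the h2 tag is an assumed page structure you would adapt to the site you are scraping:

import requests
from bs4 import BeautifulSoup

# Hypothetical target page; always check robots.txt and the site's terms first
url = "https://example.com/financial-news"
response = requests.get(url, headers={"User-Agent": "my-scraper/0.1"}, timeout=10)
response.raise_for_status()

# Parse the HTML and extract headline text (assuming headlines live in h2 tags)
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headlines[:5])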

We decided to go for open datasets and public APIs.

A Public API Example: PRAW

A good option we found is to use PRAW, the Python Reddit API Wrapper, to collect posts from r/stocks. The API is easy to get started with.

  1. First, log in to Reddit and create a new application at https://ssl.reddit.com/prefs/apps/
  2. Install the praw library in your Python environment:
!pip install praw

3. Let your script access the Reddit API

import praw

reddit_read_only = praw.Reddit(
    client_id="your client id, you will find it near the application name",
    client_secret="you will find it after creating your app",
    user_agent="scraper by u/username, some text to identify your app, mention your username",
)

subreddit = reddit_read_only.subreddit("stocks")

# Display the name of the Subreddit
print("Display Name:", subreddit.display_name)

# Display the title of the Subreddit
print("Title:", subreddit.title)

# Display the description of the Subreddit
print("Description:", subreddit.description)

4. You can get the five hottest posts in the subreddit, or you can filter last month’s posts according to their link flair text (a tag that labels a post), such as Industry News, as shown in the sketch below.
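Here is a minimal sketch of both ideas, reusing the reddit_read_only client created above; the "Industry News" flair value comes from the example in step 4:

subreddit = reddit_read_only.subreddit("stocks")

# Five hottest posts right now
for post in subreddit.hot(limit=5):
    print(post.title)

# Top posts of the past month, keeping only those tagged "Industry News"
industry_news = [
    post for post in subreddit.top(time_filter="month", limit=100)
    if post.link_flair_text == "Industry News"
]
print(len(industry_news), "posts with the 'Industry News' flair")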

A sample of collected Reddit posts

Reddit is a great source of textual data of all sorts in all fields, a literal gold mine. But that does not mean every post is valuable data: we need to make sure each one contributes to the overall dataset and therefore to the model.

A Public Open Dataset: Financial Tweets, Kaggle

Kaggle is a well-known website for data scientists and machine learning enthusiasts. It hosts open datasets, notebooks, and challenges. Through Kaggle, we were able to find an interesting dataset of financial tweets.

Social Media Big Data, a hidden gold mine

According to this insight, tweets can tell us more than just sentiment (which is what the literature usually focuses on); they can also help with topic modeling and, most importantly, Named Entity Recognition. While reading a document, we humans usually focus on the entities and the relationships that exist between them. A tweet, being short, should convey most of its information in a very efficient way.
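To illustrate what entity extraction looks like on a tweet, here is a minimal sketch using spaCy's small pretrained English model. The sample tweet is made up, and a generic model only produces labels like ORG or PERCENT; the custom model we train later will target finance-specific entities:

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sample_tweet = "Apple shares rose 3% after the company reported record quarterly revenue."
doc = nlp(sample_tweet)

# Print each detected entity with its label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)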

If you are working with Google Colab, get your Kaggle API JSON file (kaggle.json) and follow these steps to get the data into your environment:

! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
# --unzip extracts stockerbot-export.csv so pandas can read it directly
! kaggle datasets download -d davidwallach/financial-tweets --unzip

import pandas as pd

# error_bad_lines skips malformed rows (renamed to on_bad_lines in newer pandas)
tweets = pd.read_csv('/content/stockerbot-export.csv', error_bad_lines=False)
tweets.head(3)

The dataset ‘stockerbot-export.csv’ contains 8 columns and 28,264 rows. We expect the ‘text’, ‘source’, ‘timestamp’, ‘company_names’, and ‘verified’ columns to be the most useful later for analysis.
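As a small sketch of that assumption, we can keep only those columns and take a quick look (the column names are the ones listed above):

# Keep only the columns we expect to use in later episodes
useful_cols = ['text', 'source', 'timestamp', 'company_names', 'verified']
tweets_subset = tweets[useful_cols]

print(tweets_subset.shape)
tweets_subset.head(3)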

A sample of the tweets dataset

For the sake of the series, we will start by annotating a small quantity of data, around 200 to 300 training examples, using the UBIAI text annotation tool, as we will see in the next episode. This number of annotated tweets should be enough to train our initial custom NER model using a spaCy model as a base. Don’t worry about the details for now; we will discuss them in more depth in the next episode.

So, the decision was to work with a sample from the financial tweets dataset and possibly use the Reddit posts as a portion of the testing data.

But, wait, we should make sure that our decision is reasonable…

A Gentle Introduction to Data Quality

The way to Quality is the right way

High-quality data will certainly be a requirement for this project. Not only will it improve the very product it powers, but it will also improve the business outcome.

To simplify things, here are five measures that we can use to determine the quality of our data:

  • Accuracy: The data needs to be accurate; it either is or it is not, especially in the financial domain, where we are dealing with precise numbers.
  • Relevancy: The data should meet the requirements of the intended use. We want documents discussing finance and stock market news, nothing else.
  • Completeness: The data should not have missing values or missing records. This usually concerns structured data more.
  • Timeliness: The data should be up to date. More specifically, we prefer not to have large temporal gaps between the documents’ creation dates, as we might miss relevant information.
  • Consistency: The data format should be identical across all data sources (such as databases).

We are taking the temporal component as our highest priority. It is true that the Kaggle dataset covers tweets from 2018; however, we noticed that they are very close together in time, unlike the collected Reddit posts, which are much more dispersed. We don’t want to miss a thing! And in this series we are learning, not working for production, so we will choose the dataset that fits our criteria.
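Here is a minimal sketch of how we can eyeball the completeness and timeliness of the tweets DataFrame loaded earlier; the timestamp format is an assumption, so parsing is done leniently:

import pandas as pd

# Completeness: how many missing values per column of interest?
print(tweets[['text', 'timestamp', 'company_names', 'verified']].isna().sum())

# Timeliness: how spread out are the tweets in time?
timestamps = pd.to_datetime(tweets['timestamp'], errors='coerce', utc=True)
print("Earliest:", timestamps.min())
print("Latest:", timestamps.max())
print("Span:", timestamps.max() - timestamps.min())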

Conclusion

Hopefully, you have learned a thing or two from this article. We really focused on the roadmap rather than the technicalities, as we have limited the explanations to what we deemed most fit for our project. However, if you liked our style of explaining things, feel free to request any specific topic to be explained; contact me through the comments or my LinkedIn. I believe the secret of mastery is to understand a concept the way a five-year-old would. Smart people look like geniuses because they grasp complex concepts with simple methods, not complex ones.

Feel free also to contact UBIAI for any text annotation requests at admin@ubiai.tools or Twitter.

In the next article, we will go into more depth on Data Preprocessing and Data Labeling. Things are going to get a little tougher, but we will make sure that everything remains as clear as possible. See you in the next article!
