How Can Machine Learning Help Us Predict Financial Markets?

Published in Sysco LABS Sri Lanka, Sep 7, 2023

Hi everyone,

Among all the programming-focused articles available here, I thought I would talk about an approach I experimented with while completing the research project I chose for my master's in computer science. This one is especially interesting because, while learning a few cool machine learning techniques, you can apply the method in practice at next to no cost. You can try it out for yourself with either the local stock market or the global crypto market, and maybe earn some decent gains down the line (like I did ..*wink*).

So let's get down to business. First, I will introduce the general pipeline that can be used in any financial market, and then share the specific implementation I built for the crypto market, with code examples.

The Common Pipeline

Generally, as in any machine learning approach, you need the following:

Plenty of raw data.
A mechanism to filter, clean, aggregate and create a meaningful dataset.
A machine learning model to train and fine-tune to get your results.
A set of actual end results vs. your model's predictions to check the accuracy.

So let's see how we can get these components to work together to produce a prediction about a financial market.

What are we trying to accomplish?

Here, we will be collecting significant events related to a given financial market. These can be political news such as the passing of a new bill related to imports, economic analysis such as inflation indexes, social events such as riots, or even technical news such as new blockchain advancements.

Then we group these under a few common categories such as political, economic, and social, and examine the impact they have on the price change of a financial market (such as cryptocurrency) on the day the event occurred. We assign each event a significance score using Google Trends so that we can compare the significance of events on a given day, and learn which type of event had the most impact on the price change of a given stock or currency.

With this in hand, when we learn of an event on some future day, we can use its category and significance to predict what sort of price change it might cause that day, and on that basis we might be able to make a smart trade.

Let's now get into more detail and see how we are to accomplish each of these.

Gathering Data

An established way to easily capture accurate events related to financial markets is through tweets. I have done a lengthy background study on why tweets are suitable, and further details can be found in chapter 2.2 of my research article.

The source code to gather tweets using Tweepy, pre-process them, and do the sentiment analysis using “vaderSentiment” can be downloaded here.
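To give a feel for the pre-processing step, here is a minimal sketch that uses only the Python standard library. The cleaning rules (dropping retweet prefixes, links and mentions, then removing duplicates) and the function names clean_tweet and deduplicate are my assumptions for illustration, and the downloadable code may do this differently.

```python
import re

RT_RE = re.compile(r"^RT\s+@\w+:\s*")     # leading "RT @user:" marker
URL_RE = re.compile(r"https?://\S+")      # links
MENTION_RE = re.compile(r"@\w+")          # user mentions
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip retweet prefixes, URLs, @mentions and '#' signs from a tweet."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = text.replace("#", "")          # keep the hashtag word, drop the sign
    return WHITESPACE_RE.sub(" ", text).strip()


def deduplicate(tweets):
    """Keep the first occurrence of each cleaned tweet (case-insensitive)."""
    seen, unique = set(), []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


print(clean_tweet("RT @user: Bitcoin hits new high! https://t.co/abc #BTC"))
```

Cleaning before removing duplicates means that retweets of the same headline collapse into a single event.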

Training an Event Classifier

To train an event classifier, we will use the scikit-learn (“SKLearn”) Python library and a linear support vector model, trained under supervised machine learning on one year of past tweet data that has been manually annotated.

The training data set can be found here.

To train these models, a data set of over a thousand tweets was manually created and annotated. The labels were handled with the “OneVsRest” multi-label strategy, which fits one binary classifier per label and accepts the labels as a binary mask over multiple labels. The result of each prediction is an array of 0s and 1s marking which class labels apply to each input sample.
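As a rough sketch of what multi-label training with one-vs-rest and a linear support vector classifier can look like in scikit-learn, consider the snippet below. The tweets, the labels and the TF-IDF text features are hypothetical choices of mine for illustration, not the actual training data or feature pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical annotated tweets (the real data set has over a thousand rows)
tweets = [
    "Parliament passes new import tax bill",
    "Inflation index rises for the third month",
    "Riots break out in the capital over fuel prices",
    "Central bank raises interest rates after election",
    "Senate votes on crypto regulation bill",
    "Protest march blocks city centre",
]
labels = [
    ["political"],
    ["economic"],
    ["social", "economic"],
    ["economic", "political"],
    ["political"],
    ["social"],
]

# Turn the label lists into a binary mask: one 0/1 column per category
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# One linear SVM is fitted per category on TF-IDF features (assumed features)
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(tweets, y)

prediction = model.predict(["New bill on import duties passed"])
print(list(binarizer.classes_))  # column order of the mask
print(prediction)                # one row of 0s and 1s per input tweet
```

Calling binarizer.inverse_transform(prediction) maps the 0/1 row back to category names.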

The entire code to train the model can be downloaded from here. This Python code produces a model that, once you pass a news event to it, predicts which category the event most likely falls into, with an accuracy of 94%.

Getting a Significance Score For Events

Okay, now that we have our events classified into categories, we next need a way to give each event in a category a score that reflects how popular, or how significant, that event was compared to the other events in that category.

These scores should reflect the sentiment (negative or positive).
They should measure how significant the event is compared to the rest.

To get the sentiment value of each event, we can use “vaderSentiment”, which is an open-source Python library. To get the significance, we can use a combination of values:

Popularity score — obtained from Twitter
Re-tweet score — obtained from Twitter
Google trend score — obtained from PyTrends API

The details as to how to get each score for each event can be found in the following source code.

Next, we have to do a few calculations to normalize the values so that they can be compared against each other. The entire code to normalize the score values can be found here.
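To make the idea of normalization concrete, here is a minimal sketch. The formula is purely my illustration: it assumes min-max scaling of each raw score, equal weights for the three scores, and the VADER compound sentiment (a value between -1 and 1) as a signed multiplier. The field names are hypothetical too, and the actual calculation in the linked code may differ.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:                      # avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def significance_scores(events):
    """Return one signed score in [-100, 100] per event."""
    popularity = min_max([e["popularity"] for e in events])
    retweets = min_max([e["retweets"] for e in events])
    trend = min_max([e["trend"] for e in events])

    scores = []
    for event, p, r, t in zip(events, popularity, retweets, trend):
        magnitude = (p + r + t) / 3      # equal weights: an assumption
        # The sentiment value both scales the score and gives it a sign
        scores.append(round(100 * magnitude * event["sentiment"], 2))
    return scores


# Hypothetical raw values for three events in one category
events = [
    {"popularity": 120, "retweets": 10, "trend": 80, "sentiment": 0.6},
    {"popularity": 20, "retweets": 30, "trend": 20, "sentiment": -0.5},
    {"popularity": 70, "retweets": 20, "trend": 50, "sentiment": 0.0},
]
scores = significance_scores(events)
print(scores)
```

Scaling every raw value to the same range first keeps a large retweet count from drowning out a Google Trends value that only runs from 0 to 100.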

In your final dataset, each day should have one record listing the categories of events that occurred and the significance score for each category, which should look as follows:

Daily event classes with significance scores

Training a Price Prediction Model

For this step, we need to combine 2 data sets.

The daily event dataset with significance scores.
The daily price change for a given stock/crypto.

Once we combine these, we have a data set that holds, for each day, the event categories that occurred, their significance scores, and the resulting price fluctuation. The final data set can be seen here. The source code to combine the data can be found here.
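A small sketch of this combining step in plain Python is shown here. It assumes ISO-formatted dates as keys and computes the daily price change as the percentage difference between consecutive closing prices. The column names and sample values are hypothetical, and the linked source code may define the price change differently.

```python
def daily_change_pct(closes):
    """Map each date to its percentage change from the previous close."""
    dates = sorted(closes)               # ISO dates sort chronologically
    return {
        today: (closes[today] - closes[prev]) / closes[prev] * 100
        for prev, today in zip(dates, dates[1:])
    }


def combine(event_scores, closes):
    """Join daily event scores with daily price changes on the date."""
    changes = daily_change_pct(closes)
    return [
        {"date": date, **scores, "price_change_pct": round(changes[date], 3)}
        for date, scores in sorted(event_scores.items())
        if date in changes               # keep days present in both sets
    ]


# Hypothetical closing prices and per-day category scores
closes = {"2023-01-01": 100.0, "2023-01-02": 102.0, "2023-01-03": 99.96}
event_scores = {
    "2023-01-02": {"marketCategoryScore": 40.0, "otherCategoryScore": 5.0},
    "2023-01-03": {"marketCategoryScore": -12.5, "otherCategoryScore": 0.0},
    "2023-01-04": {"marketCategoryScore": 3.0, "otherCategoryScore": 1.0},
}
rows = combine(event_scores, closes)
for row in rows:
    print(row)
```

Days that appear in only one of the two data sets are dropped, so every remaining row has both the scores and the price change the model needs.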

To make predictions, the model takes an event category with a significance score and should predict the price change percentage that an event under that category could cause. For this purpose, the best results came from a “Support Vector Machine” supervised machine learning model. A comparison of the models evaluated, and which gave the best results, can be seen in the table below.

The source code, where I used scikit-learn's regression models, and the relevant training code can be seen here.
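For orientation, a support vector regression in scikit-learn can be sketched as follows. The data rows, the RBF kernel, the hyperparameters and the feature scaling step are all assumptions of mine for illustration, and the linked training code may use different settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical rows: [marketCategoryScore, otherCategoryScore] per day
X = np.array([
    [50.0, 20.0], [46.0, 10.0], [10.0, 5.0], [4.0, 0.0],
    [0.0, 2.0], [-5.0, 8.0], [-20.0, 18.0], [-25.0, 3.0],
])
# Hypothetical price change (%) observed on the same days
y = np.array([3.0, 3.6, -0.6, -0.5, 0.3, 0.3, -1.0, -1.6])

# Scale the scores, then fit the regressor (kernel and C are assumed values)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X, y)

# Predict the price change for a new day's category scores
predicted = model.predict([[45.0, 12.0]])[0]
print(f"Predicted price change: {predicted:.2f}%")
```

With a real data set you would hold out some days for testing and compare the predictions against the actual price changes before trusting any number the model prints.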

Okay, so up to now:

We have created an entire pipeline to gather events related to a financial market for each day.
We have classified those events into 5 categories.
We have assigned a significance score for each event.
We have normalized those scores and got a final score for each category per day.
We have trained a machine learning model to find a correlation between an event category with a specific significance score, and the resultant price change of a financial subject per day.
Now we can use this model to predict a possible price change percentage once we enter an event category with a score on a future day.

The overall flow of this entire process is shown below.

What's next?

As a final step let's try to use another machine-learning tool, and generate a set of human-readable rules that can point out some connection between the events and the price changes.

To create a human-readable rule set from the regression predictions made by the price prediction model, the scikit-learn DecisionTreeRegressor is used to build a decision tree out of the regression model. Each branch of the tree is then translated into a human-readable rule by writing its comparisons as “if-then” statements.

Below is a set of interesting connections that were generated between events and the Bitcoin price changes.

if (marketCategoryScore <= 44.132) and (marketCategoryScore > -15.918) and (marketCategoryScore > 3.158) then response: -0.589 | based on 7 samples
if (marketCategoryScore <= 44.132) and (marketCategoryScore > -15.918) and (marketCategoryScore <= 3.158) then response: 0.304 | based on 7 samples
if (marketCategoryScore <= 44.132) and (marketCategoryScore <= -15.918) and (otherCategoryScore > 15.472) then response: -1.03 | based on 2 samples
if (marketCategoryScore > 44.132) and (otherCategoryScore > 19.454) then response: 3.02 | based on 1 sample
if (marketCategoryScore > 44.132) and (otherCategoryScore <= 19.454) then response: 3.63 | based on 1 sample
if (marketCategoryScore <= 44.132) and (marketCategoryScore <= -15.918) and (otherCategoryScore <= 15.472) then response: -1.58 | based on 1 sample

The code for the above rule set generation can be found here.
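One way to produce rules in this shape is to walk the fitted tree through scikit-learn's tree_ attribute. The sketch below is an assumption about how that can be done rather than the linked code itself. The helper name tree_to_rules and the sample data are hypothetical, and the tree is fitted here on made-up values standing in for the regression model's predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def tree_to_rules(model, feature_names):
    """Walk a fitted tree and return one 'if ... then' rule per leaf."""
    tree = model.tree_
    rules = []

    def walk(node, conditions):
        left, right = tree.children_left[node], tree.children_right[node]
        if left == right:                # both are -1 at a leaf
            response = tree.value[node][0][0]
            samples = tree.n_node_samples[node]
            rules.append(
                f"if {' and '.join(conditions)} then response: "
                f"{response:.3g} | based on {samples} samples"
            )
            return
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        walk(left, conditions + [f"({name} <= {threshold:.3f})"])
        walk(right, conditions + [f"({name} > {threshold:.3f})"])

    walk(0, [])
    return rules


# Hypothetical category scores and predicted price changes (%)
names = ["marketCategoryScore", "otherCategoryScore"]
X = np.array([
    [50.0, 20.0], [46.0, 10.0], [10.0, 5.0], [4.0, 0.0],
    [0.0, 2.0], [-20.0, 18.0], [-25.0, 3.0],
])
predicted_changes = np.array([3.0, 3.6, -0.6, -0.5, 0.3, -1.0, -1.6])

tree_model = DecisionTreeRegressor(max_depth=3, random_state=0)
tree_model.fit(X, predicted_changes)

rules = tree_to_rules(tree_model, names)
for rule in rules:
    print(rule)
```

Limiting max_depth keeps the rules short enough to read, at the cost of a coarser approximation of the underlying model.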

As a further improvement, machine learning pipelines could fetch data daily, retrain the classification and prediction models automatically, and use the updated models to make predictions. This would result in accurate and timely predictions that can adapt to new events. More details about a few pipeline options are as follows.

Try it yourself!!!

The entire source code and the required datasets for the 2 cryptocurrencies can be found in the following file.

If you would like to read about all the literature on this topic, the detailed implementation, and the results, refer to this document.

Hope you have fun trying this, and hope you will be able to get some gains at the end of it. Thank you and see you!