10 Years of Financial Data from 500 Traded Companies Led Me to a 295% Profitable ML Model

Oleg Kazanskyi
9 min read · Sep 11, 2022



FOREWORD

Four months ago I started a journey to build a stock price predictive ML model. I post all the developments on Medium to keep a record and share something valuable with the audience.

Here is a short recap of what was done in case you've missed it:

Now it's time to push this project further. As the next step before going live, I am going to test the ML model on the S&P 500 companies with the historical data extended to 10 years.

Let's go!

PART I. INPUT DATA

I accumulated 10 years of trading and financial data for the 500 companies on the S&P 500 list.

In total, there are a little more than 1.1 million rows of data.

As for the variables we use for prediction, there are 25 in total, separated into four groups:

  • one variable is the company’s business sector;
  • 15 variables are financial KPI values that give a snapshot of a company’s health;
  • 6 describe expected vs. actual earnings and dividends (all of the above are described in detail in the previous article);
  • 3 variables are related to external factors: the 10-Year U.S. Treasury Note yield, its month-over-month change, and the Volatility Index (VIX, a market fear gauge).

A quick reminder - I am avoiding any time-series analysis, as I believe that stock trading can’t be explained with the same approach we use to predict bike-sharing demand. I assume that a company's actual performance, along with some market factors, can impact its valuation, and this is where I am going to try my chances.

To convert the time-series data into a regression problem, I measure how much the stock rises or drops over a fixed window (for example, 30 days) and use that as the predicted variable.

To find the best-matching forecast time frame, I am going to use 6 different time periods (read: 6 different predicted variables):

  • Stock % change in 15 days.
  • Stock % change in 30 days.
  • Stock % change in 60 days.
  • Stock % change in 90 days.
  • Stock % change in 120 days.
  • Stock % change in 150 days.

It means I would need to test at least 6 models and choose the best one in the end.
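To make this concrete, here is a minimal sketch of how such targets could be built with pandas. It assumes a long-format DataFrame with hypothetical column names ticker, date, and close; the original code may differ.

```python
import pandas as pd

HORIZONS = [15, 30, 60, 90, 120, 150]  # forecast windows in trading days

def add_price_change_targets(df: pd.DataFrame) -> pd.DataFrame:
    """Add one 'future % price change' column per horizon, computed per company."""
    df = df.sort_values(["ticker", "date"]).copy()
    for n in HORIZONS:
        # Close price n trading days ahead, taken within each company's own history
        future_close = df.groupby("ticker")["close"].shift(-n)
        df[f"future_{n}dprice_change"] = (future_close - df["close"]) / df["close"]
    return df
```

The last n rows of each company have no future price yet and end up as NaN, so those rows would need to be dropped before training.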

To save time, I will work with only one ML technique that has shown good performance with excellent execution time: LightGBM.
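For reference, a bare-bones LightGBM regression setup could look like the sketch below. The feature matrix X (the 25 predictors) and the target y are assumed to be prepared already, and the hyperparameters are placeholders, not the ones used in the project.

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# X holds the 25 predictors (sector, financial KPIs, earnings surprises,
# 10Y Treasury yield and its MoM change, VIX); y is one of the
# future_*dprice_change targets built above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
reg.fit(X_train, y_train, eval_set=[(X_test, y_test)])

print("R^2 on the test split:", reg.score(X_test, y_test))
```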


PART II. ANALYTICS PROCESS

I have to admit the data analysis part of the project took a massive share of my time with little to no result. None of the independent variables correlate with the targets, even after data cleaning and additional processing.

The only highly correlated values are the target variables themselves, which is self-explanatory.

Luckily, I am using an ML model that requires minimal data massaging and can handle weakly correlated data. I know it is not the best practice, but let's keep it a secret, as I am not defending an academic degree here 😊

I did a little data trimming to avoid overfitting and removed 80% of the data. As a result, we got about one record per trading week per company.
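The post doesn't show the trimming itself; one simple way to get roughly one record per trading week per company is to keep only the last trading day of each week, assuming the same ticker and date columns as in the earlier sketch.

```python
import pandas as pd

# Keep the last trading day of each (company, week) pair, which drops
# most of the heavily overlapping daily rows.
df["week"] = pd.to_datetime(df["date"]).dt.to_period("W")
trimmed = (
    df.sort_values("date")
      .groupby(["ticker", "week"], as_index=False)
      .last()
)
```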

I don’t want you to get bogged down in all the details of the minor data processing steps and splitting the data frame into train/test, so let's skip that and jump to the most exciting part.

PART III. CHANGING THE PERSPECTIVE

It is common knowledge that there are two major types of market participants: investors and traders. Traders play on short-term market fluctuations, usually do not hold stocks for long, and can go into short or long positions with equal probability.

I personally find the approach of selling stock that you don't own highly risky, because your losses can quickly go beyond a bearable threshold. As you may understand, I am mentally more of the investor type, even though I am trying to build a trading model here.

I want to buy a stock, hold it for a while, and sell it at a profit. With that in mind, I want to change my model from a regression model with the lowest error to a classification model with the highest precision. The best-case scenario is a model that gives me a signal to buy a stock with a 99% probability that it will grow within some time frame. Then even betting a million dollars wouldn't be a risky business.

A model with 99% precision would be a red flag for overfitting, so maybe it's too early to talk about zero risk here 😊. But we are going to check what precision level we can achieve with the data we have.

Here is the value distribution of the dependent variables over the 10 years of history for the S&P 500 companies.

You can see that in 25% of cases companies grow by more than 3.8% in 15 days, 6% in 30 days, and above 17% in 150 days. It should be safe to say that if we were able to pick out that top 25%, we would have a good chance of at least not losing our money.

To convert our regression problem into a classification one, I will change all the values above the top-25% threshold to "1" and everything below it to "0".
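A sketch of that conversion and the classifier training, reusing the hypothetical names from the earlier snippets (y_train and y_test hold the raw % price changes from the split above). The 75th-percentile threshold is taken from the training split only, to avoid leaking information from the test set.

```python
import lightgbm as lgb
from sklearn.metrics import precision_score

target = "future_30dprice_change"  # one of the six horizons

# Top 25% of price changes (by the training distribution) become class "1".
threshold = y_train.quantile(0.75)
y_train_cls = (y_train >= threshold).astype(int)
y_test_cls = (y_test >= threshold).astype(int)

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_train, y_train_cls)

pred = clf.predict(X_test)
print(f"{target} precision: {precision_score(y_test_cls, pred):.2%}")
```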

After training the classification models, here are the results I achieved:

Testing six classification models with the six dependent variables

These are the calculated precision scores for each date range:

6 prediction periods and 6 precision scores

We achieved a 71–76% precision score across all date ranges. It sounds promising, but so far it's not much better than what the regression model did.

However, recall that when converting to the classification model, we did not use "0" as the threshold; we tried to predict the top 25% of performers.

Now let's add back the numeric values of the price changes so we can check how many of the predicted stocks actually lost value.
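In other words: among the stocks the classifier flagged as "buy", how many actually ended up with a positive price change? A small sketch of that re-scoring, continuing the names from the previous snippet:

```python
# Among the predicted "buy" signals, count how many actually gained value.
buy_mask = pred == 1
actual_change = y_test[buy_mask]           # raw future % price changes
precision_vs_zero = (actual_change > 0).mean()
print(f"{target} precision vs. a 0% threshold: {precision_vs_zero:.2%}")
```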

Here is the output:

future_15dprice_change precision is 85.52%
future_30dprice_change precision is 86.82%
future_60dprice_change precision is 88.06%
future_90dprice_change precision is 90.05%
future_120dprice_change precision is 91.6%
future_150dprice_change precision is 94.33%

94.33%!

Based on our data, we achieved a maximum precision of 94% with a 150-day investing period!

It sounds like magic. Let's validate the results!


PART IV. DATA VALIDATION

I used all the data obtained so far to train and test the model. To get new, unseen data for validation, I kept re-running the data collection script. This is the latest date in the training dataset:

Max Date

The maximum date in the training dataset is July 1st, 2022, and we will use the unseen data after that day for validation purposes.

The validation dataset has 54 days

I am writing this post at the end of August, and there are only 54 days in the validation dataset. So I can run a test for two target variables: the ones predicting 15 and 30 days ahead.

As we are simulating a real-world scenario, our models will try to guess whether the price will go above "0" (meaning a win, or class "1" for the model) or drop below "0" (class "0").

Here are the results:

15d and 30d validation dataset results

For the model with the 15-day prediction period, the precision went down from 85% to 74%, and for the 30-day model, the score dropped from 87% in the training dataset to 63%.

We got a significant drop on the validation dataset vs. the training/testing data. It tells us there is room to improve the model and bring the numbers closer together.

At the same time, the precision we achieved is higher than 50%, which makes it a workable model in a highly unpredictable field.

PART V. LET'S TALK NUMBERS

To understand what we can potentially earn or lose, we need to look at the average numbers in each scenario where our model is right or wrong.

To do this, let's pair the predicted classification categories with the actual categories and the real price changes.
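A sketch of that aggregation, again reusing the hypothetical variables from the earlier snippets: the true class is whether the stock actually gained value, the predicted class is the model's buy signal, and the real % change is aggregated within each (true, predicted) bucket.

```python
import pandas as pd

results = pd.DataFrame({
    "true_class": (y_test > 0).astype(int).values,  # did the stock actually gain?
    "pred_class": pred,                              # did the model say "buy"?
    "price_change": y_test.values,                   # real future % price change
})

summary = (
    results.groupby(["true_class", "pred_class"])["price_change"]
           .agg(["count", "mean", "min", "max"])
)
print(summary)
```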

The result:

In the first column, you can find the "True" vs. "Predicted" categories. For example, the value "0.0 1.0" refers to stocks that lost value in reality ("0.0") but were forecast to grow ("1.0"). The other columns hold the aggregated values: count, mean, min, and max.

For the validation dataset, our model flagged 674 cases where a stock was expected to increase in value. Out of these 674, 497 were predicted correctly, and in 177 cases we made a mistake.

Let’s assume we bet $1,000 on every deal and take the average percentage gain/loss from the table above.

The "true positive" total calculation is:

497 deals × $1,000 × 1.0679 = $530,746 (we earned $33,746)

For the "false positive" the calculation is:

177 deals × $1,000 × (1 − 0.0264) = $172,327 (we lost $4,673)

Total "true positive" combined with the "false positive" outcome:

$497K + $177K → $530.7K + $172.3K

$674K → $703.1K (in total we earned $28,803)

Percentage of the total amount we earned:

$28.8K / $674K = 4.27%

It shows that we are able to earn an average of 4.27% of the invested amount every 2 weeks (15 days).

Let's plug all the values into the compound interest formula, with 52 weeks in a year (26 periods of 2 weeks each).

15Days Forecast Model Annual Interest
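The same calculation as a couple of lines of Python, assuming the 4.27% per-cycle gain simply repeats for all 26 cycles:

```python
periods_per_year = 26       # 52 weeks / 2-week (15-day) holding periods
gain_per_period = 0.0427    # average gain on the invested amount per cycle

final_multiple = (1 + gain_per_period) ** periods_per_year
print(f"After one year, every invested dollar grows to about ${final_multiple:.2f}")
# -> roughly $2.97, i.e. close to the ~295% figure in the title
```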

That is almost a 300% annual return, based on the validated precision.

That is better than I expected, and I definitely need to give this a try!


WHAT'S NEXT

I feel that now comes the stage of going live!

Before that, I will try to fine-tune the model for the top 5% of performers instead of the top 25%. The reason is that I believe the minimum bet should be about $1K, and as a guy with a limited budget, I should reduce the number of deals I bet on while keeping at least the same precision rate.

I also want to test the performance of the models with 60-day and 90-day forecast periods. My expectation is that they won't show the same outcome as on the testing dataset, simply because we are in a bear market now and the model was trained mostly on bull-market history. Anyway, I shall test them without personal prejudice.

There are a ton of optimizations that can be done here, starting with fine-tuning the model and adding more variables.

Other than that, I will set up a daily job to pull and combine the data and send email notifications with the list of most likely bets.

I will keep recording my adventures and share information here.

I hope you find it valuable.

Hey Guys,

I am pretty new to writing on Medium and would love to hear your feedback. If you like what I am writing about, don’t hesitate to give it a thumbs-up. Please feel free to leave a comment if you have a question or recommendation. I read every message and try to answer as soon as I can.

You may also visit my Patreon page to keep me motivated to write more posts about stock analysis and related topics here on medium.com.

www.patreon.com/GenerousDataAnalyst
