75% Accuracy Stock Prediction with Financial KPIs

Oleg Kazanskyi
9 min read · Jun 27, 2022


Photo by Yiorgos Ntrahas on Unsplash

FOREWORD

In this series of posts, I collect data from the financial markets and build a working ML model for predicting future stock price fluctuations.

As a recap, I have already created scripts to collect the 15 most important financial KPIs for publicly traded companies including historical data.

I also have scripts on hand to scrape information about market conditions and future market expectations for the same stocks.

As the final result, I combined the information above and now I am ready to start building an ML model. I am covering the results in today's article.

A short spoiler: the model predicted the direction of about 75% of stock movements correctly, with a mean absolute error of under 5%.

Let's go and check the details!

PART I. THE LIST OF VARIABLES

We have about 30 variables in total and it may be a little overwhelming to remember all of them. I will list them with a proper name for future reference:

  • date: trading day of the observation; for financial KPIs, the latest reported values as of that day are shown. DATE
  • roe: Return on Equity. NUMBER
  • longTermDebtEquity: Debt to Equity Ratio for long-term debt. NUMBER
  • revenueQoQ: Revenue change, quarter over quarter. NUMBER
  • epsQoQ: Earnings per share change, quarter over quarter. NUMBER
  • piotroskiFScore: Piotroski F-Score, a financial health score in the range from 0 to 9. NUMBER
  • currentRatio: Company's short-term liquidity. NUMBER
  • roa: Return on Assets. NUMBER
  • profitMargin: how many cents of profit are generated for each dollar of sales. NUMBER
  • peRatio: Price to Earnings Ratio. NUMBER
  • pbRatio: Price to Book Ratio. NUMBER
  • trailingPEG1Y: Price/Earnings to Growth Ratio. NUMBER
  • VIX_high: Volatility index daily highest value. NUMBER
  • sector: company's sector. CATEGORICAL
  • industry: company's industry. CATEGORICAL
  • 10Y_bonds: 10-year US Treasury note daily yield. NUMBER
  • 10Y_bond_MoM: Month-over-month change of the 10-year US Treasury note yield. NUMBER
  • Debt-to-Equity_Ratio: Company's total liabilities divided by its equity. NUMBER
  • DividendsYield: The percentage of a company's share price that it pays out in dividends each year. NUMBER
  • PayoutRatio: The portion of earnings a company pays as dividends. NUMBER
  • Acc_Rec_Pay_Ration: Accounts Receivable Turnover Ratio. NUMBER
  • Earnings_per_stock: profit per outstanding share divided by the share price, expressed as a percentage. NUMBER
  • dividends_change: how much the dividend payment changed at the company's latest dividend announcement. NUMBER
  • prev_div_change: the dividend change at the announcement before the latest one. NUMBER
  • days_after_divid_report: how many days have passed since the latest announcement of a dividend change. NUMBER
  • surprise_%: the difference between the latest reported earnings per share and the market expectations. NUMBER
  • expected_growth: the earnings per share expected by market participants. NUMBER
  • previous_surprise: the difference between the reported earnings and the expectations for the previous reporting period. NUMBER
  • days_after_earn_report: how many days have passed since the latest report date, when the company published its official results. NUMBER
  • future_30dprice_change: Percent price change of a specific stock over the next 30 days. NUMBER

We use the last variable as our target. We will try to teach our model to predict the price change 30 days ahead of a given date.
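To make the target concrete, here is a minimal sketch of how such a column could be computed. It is not the original collection script: the column names `ticker`, `date` and `close` are assumptions, and so is the idea that each row is one trading day, so that about 21 rows ahead roughly corresponds to 30 calendar days.

```python
import pandas as pd


def add_future_change(prices: pd.DataFrame, periods: int = 21) -> pd.DataFrame:
    """Add the percent price change `periods` rows ahead for each ticker.

    Assumes hypothetical columns 'ticker', 'date' and 'close', with one row
    per trading day (21 trading days is roughly 30 calendar days).
    """
    prices = prices.sort_values(["ticker", "date"]).copy()
    # Look ahead within each ticker so prices of different stocks never mix.
    future_close = prices.groupby("ticker")["close"].shift(-periods)
    prices["future_30dprice_change"] = (future_close / prices["close"] - 1) * 100
    return prices
```

The most recent rows of every ticker have no future price yet, so their target is NaN and they would have to be dropped before training.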

In this case, we are going to use information about companies from the Dow 30 index. It's only thirty companies, but it should be sufficient to understand whether the collected data can provide acceptable results.

dates range

We have collected almost three years of historical data.

Photo by Markus Spiske on Unsplash

PART II. VARIABLES ANALYSIS

NUMERIC VARIABLES

Let's have a quick look at the data. First, we can check the correlations and hope for a simple answer right away.

Unfortunately, there are no strong correlations in the dataset. For the target variable in particular, no correlation is stronger than 0.2. This suggests that a solution to our prediction problem, if we find one, is likely to be complex.
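A check like this can be sketched in a few lines of pandas. The helper name is hypothetical, and it assumes the combined dataset is a DataFrame that contains the `future_30dprice_change` column.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame,
                        target: str = "future_30dprice_change") -> pd.Series:
    """Pearson correlation of every numeric column with the target,
    ordered from the strongest to the weakest absolute value."""
    # Categorical columns such as 'sector' are excluded from the matrix.
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

Pearson correlation only captures linear relationships, so weak values here do not rule out nonlinear effects that a tree-based model can still pick up.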

Now we can look at data distribution across variables and how strong the outliers are.

Here is the big picture we see from the charts above:
- Most DOW30 companies have Piotroski scores above five, which means they are in good financial health.
- Return on assets for most companies was above 0 and below 30%. It shows that most of the time, companies were profitable.
- Profit margin for the majority is under 100%. There is a notable spike of companies with a profit margin of 100%.
- The 10Y bond yield did not stay above 3% for long during the specified period.
- Companies usually announce a dividend change once a year, and in general, dividends grow.
- The future 30-day price change looks like a normally distributed curve with outliers on the left side and a mean value close to 0. It shows that our dataset is balanced between upward and downward stock movements.
- The average Return on Equity is 38%. A value in the range of 15–20% is considered good for a company. The standard deviation is 326%. It shows there are companies with periods of significant losses (negative values) and companies with very low shareholders' equity (very high values).
- Dividends, on average, grow 9.7% after an announcement. 50% of companies announce a dividend change every 213 days.
- On average, stock prices increased 0.2% every 30 days.

CATEGORICAL VARIABLES

I had to drop the "industry" column because it contains 26 distinct values. With only 30 companies, it would almost identify each company individually and could keep the future model from generalizing.

I will remove the “date” column for a similar reason: we are not building a time-series model here, and the algorithm doesn’t need to know the date of the transaction.

We have only one categorical column now, which is "Sector". Here is the number of companies per sector we have.
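As a rough illustration of these steps, the sketch below counts distinct values in the categorical columns, counts companies per sector, and drops the two columns. The `ticker` column is an assumed company identifier and the function name is hypothetical.

```python
import pandas as pd


def summarize_and_drop(df: pd.DataFrame):
    """Inspect the categorical columns, then drop 'industry' and 'date'.

    Assumes a hypothetical 'ticker' column that identifies the company.
    """
    # Distinct values per categorical column (26 for 'industry' in the article).
    cardinality = df[["sector", "industry"]].nunique()
    # Number of distinct companies in each sector.
    companies_per_sector = df.groupby("sector")["ticker"].nunique()
    # Columns excluded from modelling.
    trimmed = df.drop(columns=["industry", "date"])
    return cardinality, companies_per_sector, trimmed
```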

I won't include the boxplots of value distributions that I used to analyze the companies by sector; you can check them in my Jupyter Notebook.

A summary:
- All sectors except for Consumer Cyclical have mostly positive ROE;
- The Energy sector has the broadest range of quarter-over-quarter revenue change;
- The Basic Materials and Energy sectors have mostly negative quarter-over-quarter earnings per share growth;
- Companies in Healthcare, Consumer Defensive and Technology have the highest Piotroski scores, with a mean score of 6;
- Companies in the Basic Materials and Financial sectors have the best ratio of current assets to current liabilities;
- The highest return on assets is in the Technology and Consumer Cyclical sectors;
- Companies in Financial Services have the highest profit margin, with a mean of 100%;
- Consumer Cyclical has the highest P/E ratio, 30, while Basic Materials has the lowest, about 3;
- The Energy sector has the worst Debt to Equity ratio, with a mean of 2.3;
- Healthcare companies showed the fastest growth, about 1.3% a month, while Energy has been dropping by 1% a month.

Photo by Markus Winkler on Unsplash

PART III. MACHINE LEARNING TIME

To avoid overfitting, the first step I took was removing data points that are too close to each other in time. We have daily trading data, and many values barely change from one day to the next. If we keep all the records, nearly identical rows end up in both the train and the test sets, so the model looks very precise on both but will most likely fail in the real world.

We will keep only one day out of each week of data to make the prediction task harder.

data_trimming
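One possible way to do this trimming is sketched below. It keeps the first available trading day of each calendar week for every company; the `ticker` column and the datetime `date` column are assumptions, and the notebook may pick the day differently.

```python
import pandas as pd


def keep_one_day_per_week(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first available row of every calendar week for each ticker.

    Assumes hypothetical columns 'ticker' and 'date' (datetime dtype).
    """
    df = df.sort_values(["ticker", "date"])
    # Weekly periods run Monday to Sunday, so each one covers a trading week.
    week = df["date"].dt.to_period("W")
    trimmed = (
        df.assign(_week=week)
        .drop_duplicates(subset=["ticker", "_week"], keep="first")
        .drop(columns="_week")
    )
    return trimmed.reset_index(drop=True)
```

Taking the first available row instead of a fixed weekday avoids losing weeks in which that weekday was a market holiday.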

Since the data shows no strong correlations and contains many outliers, I decided to skip heavy data wrangling and go straight to a powerful boosting technique: LightGBM.

It is fast and as powerful as XGBoost. It doesn't require special handling of outliers or one-hot encoding of the categorical data. You can specify the categorical column when fitting the model (label encoding is still needed, as it works with numeric data only).

train-test split
initialize the model
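The two steps above could look roughly like the sketch below. It is not the notebook code: the random 80/20 split, the hyperparameters and the helper names are assumptions. `LGBMRegressor` and the `categorical_feature` argument of `fit` are part of LightGBM's scikit-learn interface.

```python
import pandas as pd

TARGET = "future_30dprice_change"


def encode_and_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Label-encode 'sector' and make a random train/test split (assumed 80/20)."""
    df = df.reset_index(drop=True).copy()
    # Integer codes, since the model works with numeric data only.
    df["sector"] = df["sector"].astype("category").cat.codes
    test_idx = df.sample(frac=test_size, random_state=seed).index
    train, test = df.drop(test_idx), df.loc[test_idx]
    return (train.drop(columns=TARGET), train[TARGET],
            test.drop(columns=TARGET), test[TARGET])


def fit_lightgbm(X_train: pd.DataFrame, y_train: pd.Series):
    """Fit a LightGBM regressor; the hyperparameters are illustrative only."""
    # Imported here so the data-preparation helper works without LightGBM.
    from lightgbm import LGBMRegressor

    model = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=seed_default())
    # Tell LightGBM to treat the encoded column as categorical, not ordinal.
    model.fit(X_train, y_train, categorical_feature=["sector"])
    return model


def seed_default() -> int:
    """Fixed seed so repeated runs of the sketch give the same model."""
    return 42
```

Usage would be `X_train, y_train, X_test, y_test = encode_and_split(df)` followed by `model = fit_lightgbm(X_train, y_train)` and `model.predict(X_test)`.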

After training the model and running it on the test data, I got a 5% mean absolute error.

Model accuracy

The error may look large. Since we are only peeking 30 days into the future, a 5% error can be a big difference: some stocks may not even move that much in 30 days.

But I would argue that this is not the only metric worth checking for the model we are trying to build. A better metric is how often the model is right or wrong about whether the stock will rise or fall.
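To put the error into perspective, the mean absolute error can be compared with a naive baseline that always predicts "no change". The sketch below shows the computation with NumPy; the baseline comparison is my suggestion for reading the number, not a result from the notebook.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute gap between true and predicted 30-day changes (in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def zero_baseline_mae(y_true) -> float:
    """MAE of a naive model that always predicts a 0% change."""
    y_true = np.asarray(y_true, dtype=float)
    return mean_absolute_error(y_true, np.zeros_like(y_true))
```

If the model's error is not clearly below the baseline, the regression output adds little beyond guessing that prices stay flat.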

Building a plot
True vs Predicted Values

As we can see, the outliers are captured pretty well, and our model covers most of the big drops and rises.

Let’s measure the accuracy of our model. We count a prediction as correct if the predicted and true values are both positive or both negative.

Accuracy measurement for the regression model
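The sign-agreement rule described above can be written as a short function. This is a sketch of the rule as stated in the text; treating an exact zero on either side as a miss is my assumption.

```python
import numpy as np


def directional_accuracy(y_true, y_pred) -> float:
    """Share of predictions whose sign matches the sign of the true change."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The product is positive only when both values are positive or both are
    # negative; an exact zero on either side is counted as a miss.
    same_sign = (y_true * y_pred) > 0
    return float(same_sign.mean())
```

For example, true changes of `[1, -2, 3, -4]` against predictions of `[0.5, 1, 2, -1]` agree in sign three times out of four, which gives 0.75.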

As you can see, 74.45% of the values are predicted correctly. It is a very good result that I did not expect when I started the project.

We can check the list of variables our model considers the most important:

Important Variables

The chart is hard to read, so I repeat the raw values below:

Feature Value %
VIX_high 9.27%
10Y_bonds 8.13%
10Y_bond_MoM 8.07%

peRatio 6.87%
days_after_earn_report 6.87%
expected_growth 5.53%
Earnings_per_stock 5.20%
pbRatio 5.13%
surprise_% 4.67%
DividendsYield 4.13%
days_after_divid_report 4.00%
trailingPEG1Y 3.73%
roa 3.60%
previous_surprise 3.33%
revenueQoQ 3.20%
profitMargin 2.60%
currentRatio 2.47%
roe 2.40%
epsQoQ 2.33%
Debt-to-Equity_Ratio 1.93%
PayoutRatio 1.67%
Acc_Rec_Pay_Ration 1.27%
prev_div_change 1.13%
piotroskiFScore 1.07%
dividends_change 0.93%
sector 0.47%
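Percentages like the ones above can be derived from raw importance scores by normalizing them to sum to 100. The sketch below assumes the scores come from a fitted LightGBM model, for example its `feature_importances_` and `feature_name_` attributes; the helper name is hypothetical.

```python
import pandas as pd


def importance_percent(feature_names, importances) -> pd.Series:
    """Convert raw importance scores into percentages, sorted descending."""
    scores = pd.Series(importances, index=feature_names, dtype=float)
    return (scores / scores.sum() * 100).sort_values(ascending=False).round(2)
```

With a fitted model this would be called as `importance_percent(model.feature_name_, model.feature_importances_)`. By default LightGBM reports how many times a feature is used in a split, so the percentages describe how often the model relies on a feature, not how much of the price movement it explains.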

Summary:

The factors that come from outside the company’s financial reports (market variables such as VIX_high and 10Y_bonds) account for 47% of the model's feature importance and are among the top predictors, which makes them significant for predicting the stock price over a 30-day horizon.

Out of the financial KPIs, those that are related to the company’s earnings are the most significant.

What surprised me is that even though the companies come from sectors with significant differences, this categorical variable did not show a big impact. This may be because the other numeric variables already describe the company well, so knowing which branch of the economy it belongs to did not help much.

WHAT’S NEXT

I am happy with the achieved results so far. At the same time, there are many improvements to be made in the data collection and the model itself.

As the next step, I will try to extend the dataset from the DOW30 to the S&P 500. I would also like to expand the date range to five years. This will help to capture periods of higher US bond yields, which were lacking in the current scenario.

I will first test the current model on the bigger dataset and then try to create an improved model. The amount of data in the next stage should be enough to have separate validation and testing stages.

Here is a link to the Jupyter Notebook for this project.

Please write a comment if you have any questions or ideas; I am happy to hear what you think.

Please push the “applause” button and subscribe for the next projects.


Oleg Kazanskyi

Master of Finance with a high passion for BI and Data Analysis