75% Accuracy Stock Prediction with Financial KPIs

Jun 27, 2022

FOREWORD

In this series of posts, I am collecting data from the financial markets and building a working ML model to predict future stock price fluctuations.

As a recap, I have already created scripts to collect the 15 most important financial KPIs for publicly traded companies including historical data.

I also have scripts on hand to scrape information about market conditions and future market expectations for the same stocks.

Finally, I combined the information above, and now I am ready to start building an ML model. Today's article covers the results.

A short spoiler: the model correctly predicted the direction of about 75% of stock movements, with a mean absolute error of about 5%.

Let's go and check the details!

PART I. THE LIST OF VARIABLES

We have about 30 variables in total and it may be a little overwhelming to remember all of them. I will list them with a proper name for future reference:

date — trading day for transaction data; for financial KPIs, the latest reported results as of that day. DATE
roe — Return on Equity. NUMBER
longTermDebtEquity — Debt to Equity Ratio for the long-term notes. NUMBER
revenueQoQ — Revenue Quarter over Quarter. NUMBER
epsQoQ — Earnings per Share Quarter over Quarter. NUMBER
piotroskiFScore — Financial health score in the range from 0 to 9. NUMBER
currentRatio — Company's short-term liquidity. NUMBER
roa — Return on Assets. NUMBER
profitMargin — how many cents of profit have been generated for each dollar of sale. NUMBER
peRatio — Price to Earnings Ratio. NUMBER
pbRatio — Price to Book Ratio. NUMBER
trailingPEG1Y — Price/Earnings to Growth Ratio. NUMBER
VIX_high — Volatility index daily highest value. NUMBER
sector — company's sector. CATEGORICAL
industry — company's industry. CATEGORICAL
10Y_bonds — 10-year US treasury notes daily yield value. NUMBER
10Y_bond_MoM — Month over Month changes of the 10 year US treasury note. NUMBER
Debt-to-Equity_Ratio — Total company's liabilities divided by the equity. NUMBER
DividendsYield — The percentage of a company's share price that it pays out in dividends each year. NUMBER
PayoutRatio — The portion of earnings a company pays as dividends. NUMBER
Acc_Rec_Pay_Ration — Accounts Receivable Turnover Ratio. NUMBER
Earnings_per_stock — earnings per outstanding share divided by the share price, expressed as a percentage. NUMBER
dividends_change — After the company's new dividends announcement, how strongly the payments changed. NUMBER
prev_div_change — the size of the previous dividend change, before the latest announcement. NUMBER
days_after_divid_report — how many days passed from the latest announcement of dividends change. NUMBER
surprise_% — the difference between the latest reported earnings per share and the market expectations. NUMBER
expected_growth — the earnings per share expected by market participants. NUMBER
previous_surprise — the difference between the reported earnings and expectations of the previous report period. NUMBER
days_after_earn_report — how many days passed after the latest report date when the company publishes the official results. NUMBER
future_30dprice_change — Percent of price change for a specific stock in 30 days. NUMBER

We use the last variable as our target: the model will learn to predict the percentage price change 30 days after the current date.
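The target can be derived from a plain price history. Here is a minimal sketch, assuming a hypothetical long-format table with `ticker`, `date` and `close` columns (these names are not from the original scripts) and assuming the 30-day horizon is counted in rows, that is, in trading days:

```python
import pandas as pd


def add_future_change(prices: pd.DataFrame, horizon: int = 30) -> pd.DataFrame:
    """Append the percent price change `horizon` rows ahead, per ticker."""
    out = prices.sort_values(["ticker", "date"]).copy()
    # Closing price `horizon` rows later within the same ticker.
    # The last `horizon` rows of each ticker have no future price and become NaN.
    future_close = out.groupby("ticker")["close"].shift(-horizon)
    out["future_30dprice_change"] = (future_close / out["close"] - 1.0) * 100.0
    return out
```

Rows with a missing target would need to be dropped before training.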

In this case, we are going to use information about companies from the DOW 30 index. It's only thirty companies, but that should be enough to tell whether the collected data can produce acceptable results.

We have collected almost three years of historical data.

PART II. VARIABLES ANALYSIS

NUMERIC VARIABLES

Let's have a quick look at the data. First, we can check the correlations and hope for a simple answer right away.

Unfortunately, there are no strong correlations in the dataset. For the target variable in particular, no correlation is stronger than 0.2. This means that a solution to our prediction problem, if we find one, is likely to be complex.
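A check like this can be reproduced in a few lines of pandas. The sketch below is an illustration, not the notebook's code: the helper name `target_correlations` is hypothetical, and it assumes the combined dataset is a DataFrame that contains the `future_30dprice_change` column.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame,
                        target: str = "future_30dprice_change") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = df.select_dtypes(include="number")   # ignore sector, industry, date
    corr = numeric.corr()[target].drop(target)     # drop the target's self-correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

If the largest absolute value in the result stays below 0.2, no single variable explains the target linearly.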

Now we can look at data distribution across variables and how strong the outliers are.

Here is the big picture we see from the charts above:
- Most DOW30 companies have Piotroski scores above five, which means they are in good financial health.
- Return on assets for most companies was above 0 and below 30%. It shows that most of the time, companies were profitable.
- The profit margin for the majority is under 100%. A spike of companies with a profit margin of 100% stands out.
- The 10-year bond yield rarely stayed above 3% during the period.
- Companies usually announce a dividend change once a year, and in general, dividends grow.
- The future 30-day price change looks like a normal distribution with outliers on the left side and a mean close to 0. It shows that our dataset is balanced between upward and downward stock movements.
- The average Return on Equity is 38%, while a value in the range of 15–20% is considered good for a company. The standard deviation is 326%. It shows that some companies had periods of significant losses (negative values) and some have very low shareholders' equity (very high values).
- Dividends, on average, grow 9.7% after an announcement. 50% of companies announce a dividend change every 213 days.
- On average, stock prices rose 0.2% every 30 days.
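Figures such as the mean, the standard deviation and the 50% value are the kind of output `DataFrame.describe()` gives. To put a number on "how strong the outliers are", one common option is the 1.5 × IQR rule. This is a sketch of that rule, not necessarily what the notebook does, and the helper name `outlier_share` is hypothetical:

```python
import pandas as pd


def outlier_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of values per numeric column outside the 1.5 * IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the DataFrame columns.
    is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return is_outlier.mean().sort_values(ascending=False)
```

Columns at the top of the result, which for this dataset would likely include `roe`, are the ones where extreme values dominate the mean.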

CATEGORICAL VARIABLES

I dropped the "Industry" column because it contains 26 distinct values. With only 30 companies, it would almost identify each company individually and keep the future model from generalizing.

I will also remove the "date" column: we are not building a time-series model here, and the algorithm doesn't need to know the date of the transaction.
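The cardinality argument can be turned into a simple rule. The sketch below is hypothetical: the helper name, the 0.5 threshold and the lowercase column names are assumptions for illustration, not values from the notebook.

```python
import pandas as pd


def drop_unhelpful_columns(df: pd.DataFrame,
                           n_companies: int = 30,
                           categorical=("sector", "industry"),
                           max_ratio: float = 0.5,
                           extra=("date",)) -> pd.DataFrame:
    """Drop categorical columns with too many distinct values, plus `extra` columns."""
    # A category with nearly as many values as companies acts like a company ID.
    too_many = [c for c in categorical
                if c in df.columns and df[c].nunique() > max_ratio * n_companies]
    also = [c for c in extra if c in df.columns and c not in too_many]
    return df.drop(columns=too_many + also)
```

With 30 companies and this threshold, a column with 26 distinct values is dropped, while a column with only a handful of distinct values is kept.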

We have only one categorical column now, which is "Sector". Here is the number of companies per sector we have.

I won't include the boxplots of value distributions that I used to analyze the companies by sector; you can find them in my Jupyter Notebook.

A summary:
- All sectors except Consumer Cyclical have mostly positive ROE;
- The Energy sector has the widest range of quarter-over-quarter revenue change;
- The Basic Materials and Energy sectors have mostly negative quarter-over-quarter growth in earnings per share;
- Companies in Healthcare, Consumer Defensive and Technology have the highest Piotroski scores, with a mean of 6 out of 9;
- Companies in the Basic Materials and Financial sectors have the best ratio of current assets to current liabilities;
- The highest return on assets is in the Technology and Consumer Cyclical sectors;
- Companies in Financial Services have the highest profit margin, with a mean of 100%;
- Consumer Cyclical has the highest PE ratio, about 30, while Basic Materials has the lowest, about 3;
- The Energy sector has the worst debt-to-equity ratio, with a mean of 2.3;
- Healthcare companies showed the fastest growth, about 1.3% a month, while Energy dropped by about 1% a month.
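Per-sector means like the ones quoted above come from a single group-by. This is a sketch with a hypothetical helper name, assuming the sector column is called `sector`:

```python
import pandas as pd


def sector_means(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Mean of the selected numeric columns for each sector."""
    return df.groupby("sector")[list(columns)].mean().round(2)
```

For example, `sector_means(df, ["roe", "peRatio", "future_30dprice_change"])` would return one row per sector.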

PART III. MACHINE LEARNING TIME

To avoid overfitting, the first step I took was removing data points that are too close together in time. We have daily trading data, and many values barely change from one day to the next. Keeping all the records would make our model very precise on our train and even test data, but it would almost certainly fail in the real world.

We will keep only one day out of each week of data, which makes the prediction task harder.
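One way to thin daily data to weekly data is to keep the first available trading day of each calendar week for each company. This is only a sketch: the `ticker` and `date` column names are assumptions, and the notebook may pick the day differently.

```python
import pandas as pd


def keep_one_day_per_week(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first available row per ticker in each Monday-to-Sunday week."""
    out = df.sort_values(["ticker", "date"])
    # Label each row with its calendar week (pandas weeks end on Sunday by default).
    week = out["date"].dt.to_period("W").rename("week")
    return out.groupby(["ticker", week]).head(1)
```

This cuts the dataset to roughly a fifth of its size and removes most near-duplicate neighbouring rows.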

Since the data show no strong correlations and contain many outliers, I decided to skip heavy data wrangling and go straight to a powerful boosting technique: LightGBM.

It is fast and comparable in power to XGBoost. It doesn't require special handling of outliers or one-hot encoding of categorical data. You can tell the model which column is categorical, but string labels still have to be encoded first (as integers or as a pandas category dtype), because it works with numeric data only.

Initialize the model

After training the model and running it on the test data, I got a 5% mean absolute error.

The error may look large. Since we are looking only 30 days into the future, a 5% error can be a big difference: some stocks may not even move that much in 30 days.

But I would argue that this is not the best metric for the model we are trying to build. A better metric is how often the model is right or wrong about whether the stock will rise or fall.

As we can see, the outliers are captured pretty well, and the model covers most of the big drops and rises.

Let's compute the accuracy of our model. We count a prediction as correct if the predicted and true values are both positive or both negative.

Accuracy measurement for the regression model
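The rule just described (both values positive or both negative) can be written as a small helper. The function name is hypothetical, and it assumes the predictions and true values are in the same units and order:

```python
import numpy as np


def directional_accuracy(y_true, y_pred) -> float:
    """Share of predictions whose sign matches the sign of the true value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    both_up = (y_true > 0) & (y_pred > 0)
    both_down = (y_true < 0) & (y_pred < 0)
    # A value of exactly 0 on either side is counted as a miss under this rule.
    return float((both_up | both_down).mean())
```

For example, `directional_accuracy([1, -2, 3, -4], [0.5, -1, -2, -3])` returns 0.75, because three of the four signs agree.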

As you can see, 74.45% of the values are correctly predicted. It is a very good result, one I did not expect to achieve when I started the project.

We can check the list of variables our model considers the most important.

The chart is hard to read, so I list the raw values below:

Feature Value %
VIX_high 9.27%
10Y_bonds 8.13%
10Y_bond_MoM 8.07%
peRatio 6.87%
days_after_earn_report 6.87%
expected_growth 5.53%
Earnings_per_stock 5.20%
pbRatio 5.13%
surprise_% 4.67%
DividendsYield 4.13%
days_after_divid_report 4.00%
trailingPEG1Y 3.73%
roa 3.60%
previous_surprise 3.33%
revenueQoQ 3.20%
profitMargin 2.60%
currentRatio 2.47%
roe 2.40%
epsQoQ 2.33%
Debt-to-Equity_Ratio 1.93%
PayoutRatio 1.67%
Acc_Rec_Pay_Ration 1.27%
prev_div_change 1.13%
piotroskiFScore 1.07%
dividends_change 0.93%
sector 0.47%
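A table like this can be built by normalizing raw importances so that they sum to 100%. The sketch below assumes the raw values come from the `feature_importances_` attribute of LightGBM's scikit-learn wrapper (split counts by default); the helper name is hypothetical.

```python
import pandas as pd


def importance_table(feature_names, importances) -> pd.Series:
    """Feature importances as percentages of the total, largest first."""
    raw = pd.Series(list(importances), index=list(feature_names), dtype=float)
    return (100.0 * raw / raw.sum()).sort_values(ascending=False).round(2)
```

With a fitted model, the call would be `importance_table(X_train.columns, model.feature_importances_)`, where `X_train` is whatever feature frame the model was trained on.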

Summary:

Market factors that come from outside the company's financial reports account for 47% of the model's feature importance and are among the top predictors, which makes them highly relevant for stock price prediction over a 30-day horizon.

Out of the financial KPIs, those that are related to the company's earnings are the most significant.

What surprised me is that even though the companies come from different sectors with some significant differences, this categorical variable did not show a big impact. This may be explained by the fact that the other numeric variables already describe the company well, so knowing which branch of the economy it belongs to did not help much.

WHAT’S NEXT

I am happy with the achieved results so far. At the same time, there are many improvements to be made in the data collection and the model itself.

As the next step, I will try to extend the dataset from the DOW30 to the S&P500. I would also like to expand the date range to five years. This will help capture periods of higher US bond yields, which were lacking in the current dataset.

I will first test the current model on the bigger dataset and then try to create an improved one. The amount of data in the next stage should be enough for separate validation and test sets.

Here is a link to the Jupyter Notebook for this project.

Please write a comment if you have any questions or ideas; I am happy to hear what you think.

Please push the “applause” button and subscribe for the next projects.