
Predicting Returns with Fundamental Data and Machine Learning in Python

Can ML algorithms predict winners and losers using fundamentals?

Nate Cibik · 50 min read · Nov 25, 2020

On May 8th, 2020, I deployed a web scraper on TD Ameritrade’s website to gather all of the fundamental data available on the securities in the S&P 500 at that time into a clean, Python-ready database of .csv files on my local machine. Now, six months later, I have brushed the dust off of that database and performed an in-depth study on the ability of various machine learning algorithms to utilize it in a variety of regression/classification tasks which might lead to a trading edge in the market, as well as insight into the relative importance of fundamental features in predicting performance. This blog post will summarize my findings, but interested readers can find the full technical workflow in the Jupyter notebooks in the repository, along with the dataset.

This article covers exploration of the dataset, with discussion of the meaning behind the fundamental financial features it contains, building the domain knowledge that aids in feature selection for modeling. After data exploration and cleaning, different methods of imputing missing data are compared, and hyperparameter tuning is demonstrated on the machine learning algorithms which showed the best performance on the regression/classification tasks in the study, all implemented in Python. After analyzing the results of the modeling processes, the article concludes with a demonstration of constructing a long/short equity strategy portfolio using the model predictions, and a simulation of the performance and risk management of such a portfolio. For details on the construction of the web scraper used to gather the data for this study, please see the upcoming blog post.

Please note that the contents of this article are intended only as a demonstration of a data science workflow in the context of quantitative finance, and are not to be taken as any form of personal investment advice.

Overview

Several target variables were modeled in order to see what kinds of questions the data could prove useful in answering. First, a value investing perspective was taken, in which the data were used to model the current trading price of the securities. The logic behind this was that since markets are mostly efficient in reflecting the underlying value of an asset, creating a model which regresses the current trading prices of securities could take the place of an asset pricing model in a value investing strategy, such as the Capital Asset Pricing Model (CAPM), in evaluating the intrinsic value of securities. Value investors like Warren Buffett develop asset pricing models to find discrepancies between current market price and their estimates of intrinsic value of securities, which they use to make directional bets on those that their model indicates are currently over/under valued. A so-called margin of safety represents the proportional distance between the estimate of intrinsic value and the current market price. If a margin of safety is sufficiently large, then the value investor will buy or short the stock accordingly, expecting the market price to adjust itself over time toward their calculated intrinsic value. The key benefit of this kind of analysis is that the model does not require future pricing data. The hypothesis of this experiment was that the residuals of the trained model could serve as a margin of safety and would show a linear relationship to the log returns of the securities in the six months since the scrape, but no such relationship could be confirmed.

The other three target variables were all related to the log returns of the securities in the S&P 500 since the date of the scrape. These returns were 1) regressed as a continuous variable, then 2) classified as a binary categorical variable representing winners and losers (above and below zero), and lastly 3) classified as a binary categorical variable representing over/under performers relative to the index as a whole (subtracting the index mean from all returns, then classifying above/below zero). The last of these targets proved to be the most useful. The reasoning behind subtracting the mean of the index from the returns of the individual securities within it is that the market will trend upward or downward as a whole over different observed time periods, affecting the returns of all securities within it, but these movements have more to do with the behavior of the whole market and less to do with differences in the fundamentals of the companies within it. In the time period studied, the market had an overall upward trend, partly due to the recovery after the Covid-19 related crash in late March. Since most of the companies in the index saw gains during this period, a model trained on the gains themselves, without adjustment for overall market movement, may falsely associate certain features with gains that were actually caused by macroeconomic processes, and when classifying securities based on returns, these broader market movements affect class membership. Adjusting for the movement of the index allows the model to focus on deviations in behavior between the securities within it that are driven by differences in their underlying fundamentals, and would likely make it more robust in different market conditions. Unfortunately, the performance of the models trained on these three targets could not be evaluated on different time periods, due to the one-off nature of how the data were collected. Tests on holdout sets within the time period of study indicate that the models may indeed give a trader an edge on the market, but this would need to be validated with future investigations. Regardless, the models have shown that fundamental data do contain predictive power for performance, and further research is warranted.

Data Exploration

The web scraper gathered a lot of information for each security in the S&P 500, not just strictly fundamental data. This study is concerned with two major forms of target variable: one being the price of securities on the date of the scrape, and the other being the returns since. The considerations for what features to choose are different for these two tasks. For developing an asset pricing model (the first task of modeling current price), it is important to leave out any features which contain information about the asset price, as this is a form of target leakage. Remember that the point of fundamental analysis is to estimate the intrinsic value of a security using fundamental information about the company in order to compare it to the current market price, so allowing the current market price to leak into the model will be detrimental to estimating the intrinsic value (which is actually a hidden variable, and subject to the investor’s calculations) through this process.

Let’s look at the complete list of features, and perform a separate round of feature selection for both of our major modeling tasks. If we call .info() on the raw data frame, we see the following.

<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, A to ZTS
Data columns (total 106 columns):
% Above Low 274 non-null float64
% Below High 227 non-null float64
% Held by Institutions 495 non-null float64
52-Wk Range 501 non-null object
5yr Avg Return 501 non-null float64
5yr High 501 non-null float64
5yr Low 501 non-null float64
Annual Dividend $ 394 non-null float64
Annual Dividend % 394 non-null float64
Annual Dividend Yield 389 non-null float64
Ask 494 non-null float64
Ask Size 492 non-null float64
Ask close 22 non-null float64
B/A Ratio 492 non-null float64
B/A Size 501 non-null object
Beta 488 non-null float64
Bid 499 non-null float64
Bid Size 492 non-null float64
Bid close 22 non-null float64
Change Since Close 112 non-null object
Change in Debt/Total Capital Quarter over Qua...473 non-null float64
Closing Price 500 non-null float64
Day Change $ 501 non-null float64
Day Change % 501 non-null float64
Day High 501 non-null float64
Day Low 501 non-null float64
Days to Cover 495 non-null float64
Dividend Change % 405 non-null float64
Dividend Growth 5yr 354 non-null float64
Dividend Growth Rate, 3 Years 392 non-null float64
Dividend Pay Date 501 non-null object
EPS (TTM, GAAP) 490 non-null float64
EPS Growth (MRQ) 487 non-null float64
EPS Growth (TTM) 489 non-null float64
EPS Growth 5yr 438 non-null float64
Ex-dividend 394 non-null object
Ex-dividend Date 501 non-null object
FCF Growth 5yr 485 non-null float64
Float 495 non-null float64
Gross Profit Margin (TTM) 422 non-null float64
Growth 1yr Consensus Est 475 non-null float64
Growth 1yr High Est 475 non-null float64
Growth 1yr Low Est 475 non-null float64
Growth 2yr Consensus Est 497 non-null float64
Growth 2yr High Est 497 non-null float64
Growth 2yr Low Est 497 non-null float64
Growth 3yr Historic 497 non-null float64
Growth 5yr Actual/Est 497 non-null float64
Growth 5yr Consensus Est 497 non-null float64
Growth 5yr High Est 497 non-null float64
Growth 5yr Low Est 497 non-null float64
Growth Analysts 497 non-null float64
Historical Volatility 501 non-null float64
Institutions Holding Shares 495 non-null float64
Interest Coverage (MRQ) 389 non-null float64
Last (size) 501 non-null float64
Last (time) 501 non-null object
Last Trade 112 non-null float64
Market Cap 501 non-null object
Market Edge Opinion: 485 non-null object
Net Profit Margin (TTM) 489 non-null float64
Next Earnings Announcement 489 non-null object
Operating Profit Margin (TTM) 489 non-null float64
P/E Ratio (TTM, GAAP) 451 non-null object
PEG Ratio (TTM, GAAP) 329 non-null float64
Prev Close 501 non-null float64
Price 1 non-null float64
Price/Book (MRQ) 455 non-null float64
Price/Cash Flow (TTM) 468 non-null float64
Price/Earnings (TTM) 448 non-null float64
Price/Earnings (TTM, GAAP) 448 non-null float64
Price/Sales (TTM) 490 non-null float64
Quick Ratio (MRQ) 323 non-null float64
Return On Assets (TTM) 479 non-null float64
Return On Equity (TTM) 453 non-null float64
Return On Investment (TTM) 437 non-null float64
Revenue Growth (MRQ) 487 non-null float64
Revenue Growth (TTM) 489 non-null float64
Revenue Growth 5yr 490 non-null float64
Revenue Per Employee (TTM) 459 non-null float64
Shares Outstanding 495 non-null float64
Short Int Current Month 495 non-null float64
Short Int Pct of Float 495 non-null float64
Short Int Prev Month 495 non-null float64
Short Interest 495 non-null float64
Today's Open 501 non-null float64
Total Debt/Total Capital (MRQ) 460 non-null float64
Volume 500 non-null float64
Volume 10-day Avg 500 non-null float64
Volume Past Day 501 non-null object
cfra 479 non-null float64
cfra since 479 non-null datetime64[ns]
creditSuisse 337 non-null object
creditSuisse since 337 non-null datetime64[ns]
ford 493 non-null float64
ford since 493 non-null datetime64[ns]
marketEdge 484 non-null float64
marketEdge opinion 484 non-null object
marketEdge opinion since 484 non-null datetime64[ns]
marketEdge since 484 non-null datetime64[ns]
newConstructs 494 non-null float64
newConstructs since 494 non-null datetime64[ns]
researchTeam 495 non-null object
researchTeam since 495 non-null datetime64[ns]
theStreet 496 non-null object
theStreet since 496 non-null datetime64[ns]
dtypes: datetime64[ns](8), float64(82), object(16)
memory usage: 418.8+ KB

Here we can see a nice rich dataset, complete with information about current/historic price, historical volatility, volume, dividends, various growth rates, fundamentals, price ratios, and even analyst rating information. We can see that there are 501 companies in the raw dataset, and a total of 106 columns. Notice that almost none of the columns have data for every company in the index, and an investigation into the nature of this sparsity reveals that there are only 77 rows for which there is data in every column, meaning that imputation will be necessary for making use of this dataset. As mentioned above, the process of feature selection will be different for the two main tasks in this project, so these will be investigated separately. First, we need to get pricing data with which we can develop our target variables.

Acquiring Pricing Data

The date of the scrape was May 8th, 2020, and the date of the analysis was October 29th, 2020, so we need to get pricing information for the S&P 500 over this period of time. One simple way to do this is using the yfinance package in Python, shown below. In the repository there is a .csv file containing all of the tickers used in the scrape. A couple of them did not get data during the scrape and aren't recognized by yfinance, so they will be removed. Two more tickers, AGN and ETFC, were acquired by other companies and no longer exist, so they will be dropped from our analysis.
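Below is a minimal sketch of this download step. The file name tickers.csv and the variable names are assumptions made here for illustration; the repository may organize this differently.

import pandas as pd
import yfinance as yf

# Load the ticker symbols saved during the scrape (file name assumed here)
tickers = pd.read_csv('tickers.csv', header=None)[0].tolist()

# Download daily prices from the date of the scrape through the date of the analysis
prices = yf.download(tickers, start='2020-05-08', end='2020-10-30')

# Keep only the closing prices; the columns are the ticker symbols
close = prices['Close']
close.head()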

We see that we have our closing prices, with the ticker symbols as column names.

While we are at it, let's go ahead and calculate our log returns. This is as simple as converting prices to log prices, then subtracting the earliest closing log price from the latest. The benefits of using log prices and returns rather than dollar and percent returns are discussed in detail here, but it suffices to say that it is common industry practice in financial analysis to work with log prices/returns due to the log normality of stock prices and the time additivity of log returns (as opposed to the multiplicative nature of percent returns). We will also go ahead and drop AGN and ETFC, which were both acquired by other companies over the time period of study.
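A sketch of the calculation, assuming the closing prices from above are in a DataFrame called close:

import numpy as np

# Convert closing prices to log prices
log_close = np.log(close)

# Log return over the holding period: latest log price minus the earliest
log_returns = log_close.iloc[-1] - log_close.iloc[0]

# Drop the two tickers that were acquired during the study period
log_returns = log_returns.drop(['AGN', 'ETFC'], errors='ignore')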

Let’s take a look at the distribution of these log returns.

We can see that the log returns take a symmetrical, normal-shaped distribution, with a mean well above zero (0.11), reflecting the strong overall upward movement of the market during the time period of study.

Keep in mind that our first task will be modeling the log prices on the date of the scrape, that is, the first row of log_close, and the second task will be modeling the log_returns with both regression and classification.

Feature Selection and Modeling

This section will be split into our two major modeling tasks: asset price modeling and modeling of returns. For each, we will perform feature engineering, then compare the efficacy of imputation techniques, train machine learning algorithms, and then interpret our modeling results.

Modeling Asset Price

The purpose of this task is to model the prices of the securities using the fundamental data related to the companies, in order to perform a pseudo-fundamental analysis of intrinsic value and develop a value investing-style trading strategy based on the residuals of our model. These residuals represent margin of safety, or in other words, how over- or undervalued a given security is compared to what the model estimates its value to be. The hypothesis is that the residuals will be correlated with the returns of the securities over the six-month period since the data were scraped, as overvalued stocks would see their prices move down as the market adjusts toward their true value, and undervalued stocks would move up.

In financial analysis, intrinsic value is at least partially subjective, as different analysts will arrive at different estimates through construction of their own individual proprietary pricing models, meaning that in this experiment we are actually trying to model a hidden variable using the current trading price. The logic behind why this may work is that markets are at least partially efficient, meaning that the current prices of assets reliably reflect their value, with some error and noise present. So, while we are using market price as a target variable, the hope is that the model we build finds the way that the features contribute to prices across the market, and that our residuals will reflect deviations from actual (intrinsic) value.

Feature Selection:

As discussed above, using features which contain information about the current price of the securities will cause target leakage, and undermine our goal of estimating intrinsic value. For example, the price to earnings ratio (P/E) is calculated as the price of a security divided by its earnings per share (EPS). Since we have both EPS and P/E present in our features, the model could easily factor out the current price of the security and end up reflecting current trading prices too closely, rather than intrinsic value. Therefore, we must be careful to remove or modify all features which would allow such target leakage.

The first step is to remove all features which are directly related to price, such as anything to do with periodic high/low, percent above/below these marks, and any open/close/last/bid/ask related features. The price ratios can have the prices factored out by dividing the current price of the securities by the ratio itself, leaving only the earnings/book/sales/cash flow behind, which represent fundamental information about the company. We will also remove any columns with highly incomplete data, and columns which are datetime, since none of these are useful in this task. Annual Dividend % and Annual Dividend Yield represent the same thing, and the latter is missing more values, so it will be dropped.

<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, A to ZTS
Data columns (total 65 columns):
% Held by Institutions 495 non-null float64
5yr Avg Return 501 non-null float64
Annual Dividend % 394 non-null float64
Beta 488 non-null float64
Change in Debt/Total Capital Quarter over Qua...473 non-null float64
Days to Cover 495 non-null float64
Dividend Change % 405 non-null float64
Dividend Growth 5yr 354 non-null float64
Dividend Growth Rate, 3 Years 392 non-null float64
EPS (TTM, GAAP) 490 non-null float64
EPS Growth (MRQ) 487 non-null float64
EPS Growth (TTM) 489 non-null float64
EPS Growth 5yr 438 non-null float64
FCF Growth 5yr 485 non-null float64
Float 495 non-null float64
Gross Profit Margin (TTM) 422 non-null float64
Growth 1yr Consensus Est 475 non-null float64
Growth 1yr High Est 475 non-null float64
Growth 1yr Low Est 475 non-null float64
Growth 2yr Consensus Est 497 non-null float64
Growth 2yr High Est 497 non-null float64
Growth 2yr Low Est 497 non-null float64
Growth 3yr Historic 497 non-null float64
Growth 5yr Actual/Est 497 non-null float64
Growth 5yr Consensus Est 497 non-null float64
Growth 5yr High Est 497 non-null float64
Growth 5yr Low Est 497 non-null float64
Growth Analysts 497 non-null float64
Historical Volatility 501 non-null float64
Institutions Holding Shares 495 non-null float64
Interest Coverage (MRQ) 389 non-null float64
Market Cap 501 non-null object
Market Edge Opinion: 485 non-null object
Net Profit Margin (TTM) 489 non-null float64
Operating Profit Margin (TTM) 489 non-null float64
P/E Ratio (TTM, GAAP) 451 non-null object
PEG Ratio (TTM, GAAP) 329 non-null float64
Price/Book (MRQ) 455 non-null float64
Price/Cash Flow (TTM) 468 non-null float64
Price/Earnings (TTM) 448 non-null float64
Price/Earnings (TTM, GAAP) 448 non-null float64
Price/Sales (TTM) 490 non-null float64
Quick Ratio (MRQ) 323 non-null float64
Return On Assets (TTM) 479 non-null float64
Return On Equity (TTM) 453 non-null float64
Return On Investment (TTM) 437 non-null float64
Revenue Growth (MRQ) 487 non-null float64
Revenue Growth (TTM) 489 non-null float64
Revenue Growth 5yr 490 non-null float64
Revenue Per Employee (TTM) 459 non-null float64
Shares Outstanding 495 non-null float64
Short Int Current Month 495 non-null float64
Short Int Pct of Float 495 non-null float64
Short Int Prev Month 495 non-null float64
Short Interest 495 non-null float64
Total Debt/Total Capital (MRQ) 460 non-null float64
Volume 10-day Avg 500 non-null float64
cfra 479 non-null float64
creditSuisse 337 non-null object
ford 493 non-null float64
marketEdge 484 non-null float64
marketEdge opinion 484 non-null object
newConstructs 494 non-null float64
researchTeam 495 non-null object
theStreet 496 non-null object
dtypes: float64(58), object(7)
memory usage: 258.3+ KB

Things are looking cleaner. The next step for this task is to get rid of the analyst ratings, since these are not really fundamental. Another issue to deal with is that we have duplicates of the P/E ratio. The column named 'P/E Ratio (TTM, GAAP)' is of object data type, and the same information appears below in 'Price/Earnings (TTM, GAAP)' with a numeric dtype. We also have 'Price/Earnings (TTM)', which incorporates what are called non-GAAP earnings; these purposefully leave out any large nonrecurrent expenses the company has had recently which might obfuscate financial analysis, and are considered favorable for purposes such as ours. Thus, we will drop the GAAP earnings and keep the non-GAAP earnings. The PEG ratio is the P/E ratio divided by the annual EPS growth rate, both of which we already have, so it can be dropped.

Things are getting cleaner, but we have more considerations to make. The Market Cap column is still encoded in string format, but it also contains information about the prices, and will cause target leakage. Market capitalization is calculated as the number of outstanding shares times the current price of the security, and since we have a Shares Outstanding column, we need to get rid of Market Cap altogether. Also, we have not checked our dataset for nan’s evil stepbrother: inf. Let’s drop Market Cap and perform this check.
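A sketch of this step, assuming the trimmed feature set from the steps above is stored in a DataFrame called features:

import numpy as np

# Drop Market Cap, since Shares Outstanding already carries its non-price information
features = features.drop(columns=['Market Cap'])

# Check whether any infinite values are hiding among the numeric columns
np.isinf(features.select_dtypes(include='number')).any().any()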

True

We have indeed located some culprits. Not to worry, since we were already planning to deal with missing data, we will just re-encode these as nans to be dealt with later.
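One way to do this re-encoding:

# Re-encode infinite values as NaN so they can be handled by the imputers later on
features = features.replace([np.inf, -np.inf], np.nan)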

And it is done. We need to check now to make sure that there are no companies in our feature set that aren’t present in our target data. We can perform a check with a list comprehension as follows:
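For example (assuming the returns Series from earlier is called log_returns):

# Tickers present in the feature set but missing from the returns data
[ticker for ticker in features.index if ticker not in log_returns.index]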

['AGN', 'ETFC']

These are familiar faces: the two companies which were acquired by others over the period of study. We need to drop them from our feature set.

Finally, we need to remove the pricing information from the pricing ratios by dividing the current price by each ratio. This can be done like so:
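A sketch of both steps, using the closing prices on the date of the scrape (variable names assumed):

# Drop the two acquired companies from the feature set
features = features.drop(index=['AGN', 'ETFC'])

# Closing prices on the date of the scrape, aligned to the feature index
price = close.iloc[0].reindex(features.index)

# Divide price by each ratio so that only the per-share fundamental remains
features['Book (MRQ)'] = price / features['Price/Book (MRQ)']
features['Cash Flow (TTM)'] = price / features['Price/Cash Flow (TTM)']
features['Earnings (TTM)'] = price / features['Price/Earnings (TTM)']
features['Sales (TTM)'] = price / features['Price/Sales (TTM)']

# The original price ratio columns are no longer needed
features = features.drop(columns=['Price/Book (MRQ)', 'Price/Cash Flow (TTM)',
                                  'Price/Earnings (TTM)', 'Price/Sales (TTM)'])
features.info()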

<class 'pandas.core.frame.DataFrame'>
Index: 499 entries, A to ZTS
Data columns (total 52 columns):
% Held by Institutions 493 non-null float64
5yr Avg Return 499 non-null float64
Annual Dividend % 392 non-null float64
Beta 486 non-null float64
Change in Debt/Total Capital Quarter over Qua...471 non-null float64
Days to Cover 493 non-null float64
Dividend Change % 403 non-null float64
Dividend Growth 5yr 354 non-null float64
Dividend Growth Rate, 3 Years 392 non-null float64
EPS (TTM, GAAP) 488 non-null float64
EPS Growth (MRQ) 485 non-null float64
EPS Growth (TTM) 487 non-null float64
EPS Growth 5yr 437 non-null float64
FCF Growth 5yr 483 non-null float64
Float 493 non-null float64
Gross Profit Margin (TTM) 420 non-null float64
Growth 1yr Consensus Est 473 non-null float64
Growth 1yr High Est 473 non-null float64
Growth 1yr Low Est 473 non-null float64
Growth 2yr Consensus Est 494 non-null float64
Growth 2yr High Est 495 non-null float64
Growth 2yr Low Est 494 non-null float64
Growth 3yr Historic 495 non-null float64
Growth 5yr Actual/Est 495 non-null float64
Growth 5yr Consensus Est 494 non-null float64
Growth 5yr High Est 495 non-null float64
Growth 5yr Low Est 494 non-null float64
Growth Analysts 495 non-null float64
Historical Volatility 499 non-null float64
Institutions Holding Shares 493 non-null float64
Interest Coverage (MRQ) 388 non-null float64
Net Profit Margin (TTM) 487 non-null float64
Operating Profit Margin (TTM) 487 non-null float64
Quick Ratio (MRQ) 322 non-null float64
Return On Assets (TTM) 477 non-null float64
Return On Equity (TTM) 451 non-null float64
Return On Investment (TTM) 435 non-null float64
Revenue Growth (MRQ) 485 non-null float64
Revenue Growth (TTM) 487 non-null float64
Revenue Growth 5yr 488 non-null float64
Revenue Per Employee (TTM) 457 non-null float64
Shares Outstanding 493 non-null float64
Short Int Current Month 493 non-null float64
Short Int Pct of Float 493 non-null float64
Short Int Prev Month 493 non-null float64
Short Interest 493 non-null float64
Total Debt/Total Capital (MRQ) 458 non-null float64
Volume 10-day Avg 498 non-null float64
Book (MRQ) 453 non-null float64
Cash Flow (TTM) 466 non-null float64
Earnings (TTM) 447 non-null float64
Sales (TTM) 488 non-null float64
dtypes: float64(52)
memory usage: 210.5+ KB

There we have it: a clean feature set with all continuous numeric variables. It should be noted at this point that multicollinearity is present among the features, but since we will not be concerning ourselves with feature importances in this task of modeling current price, we do not need to address it; the accuracy of the model will not be negatively impacted. We will, however, be dealing with multicollinearity in the next task of modeling returns, where feature importances will be examined.

Imputing Missing Data:

Scikit-learn offers a variety of effective methods of imputation between their SimpleImputer, KNNImputer, and IterativeImputer classes. The SimpleImputer can utilize multiple strategies to fill nans: using the mean, median, mode, or a constant value. The KNNImputer utilizes K Nearest Neighbor modeling to estimate the missing values using the other data columns as predictors. The IterativeImputer is experimental, and allows for any estimator to be passed into it, which it uses to estimate missing values in a round-robin fashion. It is highly recommended that the reader check out the documentation for these classes in the previous links.

The best choice between these options depends on the data, task, and algorithm at hand, which is why it is generally best practice to train a model instance with each imputation method and compare the results. Below, I will establish some helper functions which are modified versions of the code found in the scikit-learn documentation links above, which will help us compare these imputation methods in the context of our problem. Note that the IterativeImputer requires importing a special item called ‘enable_iterative_imputer’ in order to work. Let’s import what we need and make our functions, which will generate cross validation scores for each of our imputation methods, and give us results we can use to compare them.
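The helpers below are a sketch in that spirit, adapted from the scikit-learn examples; the exact cross-validation settings and function signatures in the notebooks may differ.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to use IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

N_SPLITS = 5

def get_scores_for_imputer(imputer, estimator, X, y, scoring='r2'):
    """Cross-validate an impute -> scale -> estimate pipeline for one imputer."""
    pipeline = make_pipeline(imputer, StandardScaler(), estimator)
    return cross_val_score(pipeline, X, y, scoring=scoring, cv=N_SPLITS)

def get_simple_scores(estimator, X, y, scoring='r2', strategy='mean'):
    return get_scores_for_imputer(SimpleImputer(strategy=strategy), estimator, X, y, scoring)

def get_knn_scores(estimator, X, y, scoring='r2', n_neighbors=5):
    return get_scores_for_imputer(KNNImputer(n_neighbors=n_neighbors), estimator, X, y, scoring)

def get_iterative_scores(estimator, X, y, scoring='r2', impute_estimator=None):
    imputer = IterativeImputer(estimator=impute_estimator, max_iter=10, random_state=0)
    return get_scores_for_imputer(imputer, estimator, X, y, scoring)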

Notice that scaling is built into this process. It is now helpful to combine all of the above functions into a wrapper function that will make testing and graphing the scores of all the imputation methods for various regressors and tasks easy without having to repeat any code.
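A sketch of such a wrapper, with illustrative plotting details:

import matplotlib.pyplot as plt

def compare_imputer_scores(estimator, X, y, estimators, scoring='r2'):
    """Score each imputation method for one estimator and return two bar-chart axes."""
    # Scores for the simpler imputers plus a default IterativeImputer
    basic_scores = {
        'SimpleImputer (mean)': get_simple_scores(estimator, X, y, scoring),
        'KNNImputer': get_knn_scores(estimator, X, y, scoring),
        'IterativeImputer (default)': get_iterative_scores(estimator, X, y, scoring),
    }
    # Scores for the IterativeImputer wrapped around each candidate estimator
    iterative_scores = {
        type(est).__name__: get_iterative_scores(estimator, X, y, scoring, impute_estimator=est)
        for est in estimators
    }
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
    for ax, scores in zip((ax1, ax2), (basic_scores, iterative_scores)):
        means = [np.mean(s) for s in scores.values()]
        errors = [np.std(s) for s in scores.values()]
        ax.barh(list(scores.keys()), means, xerr=errors)
        ax.set_xlabel(f'{scoring} (cross-validated)')
    fig.tight_layout()
    return ax1, ax2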

Excellent, now we have a framework for testing imputation methods for any estimator that we choose, and returning convenient axes objects to view them. The notebooks in the repository go into a comparison of many different regressors for the task of modeling current price, but to save space here, we will focus on the regressor which demonstrated the best performance for this task: scikit-learn's GradientBoostingRegressor. First, we instantiate an out-of-the-box regressor, and pass it into the wrapper function with our data to see how the imputation methods compare. Note that the compare_imputer_scores function takes in a list of estimators to be used with the IterativeImputer, which can have variable parameters, and we need to make this list to pass it into the function.
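For example (the estimator list here is illustrative, and y_price is the first row of log_close, aligned to the feature index):

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Candidate estimators for the IterativeImputer to use
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    KNeighborsRegressor(n_neighbors=15),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

# Target for this task: log closing prices on the date of the scrape
y_price = log_close.iloc[0].reindex(features.index)

gbr = GradientBoostingRegressor(random_state=0)
ax1, ax2 = compare_imputer_scores(gbr, features, y_price, estimators)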

The top chart represents the scores from the SimpleImputer, KNNImputer, and out-of-the-box IterativeImputer. The bottom chart compares the performance of the IterativeImputer used with each of the estimators in the estimators list. We can see some solid R squared scores all around, but the best performance is happening with the KNNImputer. It is always a relief to see a less computationally intense imputer (i.e., not the IterativeImputer) with the highest score before performing a grid search, because the IterativeImputer takes much longer, and when fitting thousands of models this makes a noticeable difference.

Modeling:

Now that we have a selection for the best regressor/imputer combo, we can do a grid search to find the optimal hyperparameters for our model, and then move on with our investigation of the residuals.
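A sketch of the grid search, with an illustrative parameter grid (the values searched in the notebooks may differ):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

price_pipe = Pipeline([
    ('imputer', KNNImputer()),
    ('scaler', StandardScaler()),
    ('regressor', GradientBoostingRegressor(random_state=0)),
])

param_grid = {
    'imputer__n_neighbors': [3, 5, 10],
    'regressor__learning_rate': [0.01, 0.1],
    'regressor__max_depth': [2, 3, 4],
    'regressor__n_estimators': [500, 1000],
    'regressor__subsample': [0.7, 1.0],
}

price_search = GridSearchCV(price_pipe, param_grid, scoring='r2', cv=5, n_jobs=-1)
price_search.fit(features, y_price)
# price_search.best_score_ and price_search.best_params_ hold the results shown below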

Awesome, after a while we have our best tuned model. Let’s see the optimal parameters and the predictive accuracy they produced.

Best Model: 
r-squared: 0.8738820298532683
{'imputer__n_neighbors': 5,
'regressor__learning_rate': 0.1,
'regressor__max_depth': 2,
'regressor__n_estimators': 1000,
'regressor__subsample': 0.7}

Nice, we can see that the model has a solid R squared score of .87, using 1000 weak learners with a max depth of only 2 levels per tree, subsampling 70% of the data with a learning rate of 0.1.

Now that we have an asset pricing model trained with our data, we can see how the actual current prices deviate from the model’s predictions in the residuals, then look to see if these residuals are correlated with the returns since the date of the scrape. As a reminder, the hope is that the higher above the model estimate an asset price is, the more it can be expected to move downwards, or the lower below the model estimate an asset price is, the more it would be expected to move upward toward the estimated value over time. Since residuals are calculated as actual minus predicted, the residuals should be negatively correlated with the returns, if our hypothesis is correct. Let’s first generate the residuals, and take a look at them.
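A sketch of generating and plotting the residuals from the tuned model:

# Residuals: actual log price minus the model's prediction
predictions = price_search.best_estimator_.predict(features)
residuals = y_price - predictions

residuals.hist(bins=50)
plt.title('Residuals of the Asset Pricing Model')
plt.show()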

We can see that the residuals have a thin, fairly symmetrical distribution around zero, with some big outliers. Let's see whether there is any linear relationship between the residuals and the subsequent returns. To do this, I will remove outliers beyond three standard deviations and fit a simple linear regression of the log returns using the residuals as the independent variable; a sketch is shown below, and the full version is coded out in the repository.
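Using statsmodels (a sketch; the notebook's implementation may differ slightly):

import statsmodels.api as sm

# Remove residual outliers beyond three standard deviations
mask = (residuals - residuals.mean()).abs() <= 3 * residuals.std()

# Regress the subsequent log returns on the residuals (the supposed margin of safety)
X_resid = sm.add_constant(residuals[mask])
ols_results = sm.OLS(log_returns.reindex(residuals[mask].index), X_resid).fit()
print(ols_results.summary())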

This linear regression, while showing a slight negative slope, offers an R squared of .004, and a p-value of .186 for the coefficient relating the residuals to the returns, so this is not indicating a powerful linear relationship. Thus, the hypothesis that the residuals of our asset pricing model would be correlated to the returns over the six month period since the scrape is not supported. Disappointing though this may be, this is the nature of science. Investigating this on longer time periods would likely be appropriate, since this hypothesis was derived from a value-investing perspective, and value investors typically plan to hold assets for much longer time periods, generally over a year at the very least. Incidentally, for shorter time periods such as the one used in this study, one could argue that overvalued or undervalued securities would be experiencing price momentum, attracting traders to trade with the current trend, thereby feeding it, and thus these securities may be likely to move more in the same direction in the short term before the market corrects itself toward their underlying value. Additionally, the residuals used above have a very tight distribution, because the regression was trained using all of the samples for which residuals were calculated. It may be worthwhile to repeat this process on a holdout set and compare the results.

Modeling Returns

Now we move on to our second task: modeling the returns over the six month period since the date of the scrape. This can be done a number of ways; here we will do three. The first will be to try to regress the continuous values of the returns since the scrape using the data as predictors, the second will be to create a binary classifier to predict gainers/losers (returns above or less than or equal to zero), and the third will be to classify stocks which over- or underperformed the market (returns above or less than or equal to the average of returns of the index). The reason for the two different classification tasks, as mentioned above, is that a bull (rising) market will cause all stocks within an index to move upward on average, just as a bear (falling) market will cause them all to move downward on average. By subtracting the mean of returns from the returns for this particular time period, the model may be better generalized to application on a different time period. Unfortunately, we do not have data to test another time period, so we will not be able to test that theory, nor will we be able to evaluate the performance of a model trained on one time period on predicting the returns of another. We will be able to test the model on holdout sets of securities, albeit in the same time period as the training. Despite this unfortunate detail, a model using our features which can successfully predict winners and losers over any time period is an indication that the features have predictive power, and that further study on different time periods in the future is merited. We will also be able to observe the feature importances of the models, which may provide insight into which of the features contribute the most in predicting returns.

Let’s keep our target variables in a convenient data frame:
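For example (column names assumed here):

# Continuous returns plus the two binary classification targets
targets = pd.DataFrame(index=log_returns.index)
targets['log_return'] = log_returns
targets['gainer'] = (targets['log_return'] > 0).astype(int)
targets['outperformer'] = (targets['log_return'] > targets['log_return'].mean()).astype(int)
targets.head()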

Feature Selection:

Since we are not predicting the prices of securities this time, having features which contain information about the price is no longer an issue, which means we can include features that we left out last time. However, since we want to get an accurate look at feature importances, we are going to need to deal with multicollinearity among the features to make that possible. Sometimes a model can get more accuracy by leaving multicollinearity alone, so it is generally best to train a separate model each way and compare them, using the most accurate for predictions, and using the one trained with multicollinearity managed for analysis. In this case, it was found that the classification models were even more accurate after multicollinearity was managed, but that the (already disappointing) performance of the regression models suffered as a result of removing multicollinearity. In order to save space here, I will briefly touch on the results of regression, which were not very impressive, and then move on to removing multicollinearity and performing classification.

The first thing to do for any of the upcoming tasks is to drop features we will definitely not be using, and see what next steps should be taken to clean the data. We are starting again from the totally raw data frame.
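A sketch of this first drop, reconstructed from the before/after column listings (assuming the raw scrape is loaded as df):

# Drop the datetime columns and the columns that are mostly empty or otherwise not useful here
returns_features = df.drop(columns=df.select_dtypes(include='datetime64[ns]').columns)
returns_features = returns_features.drop(columns=[
    '% Above Low', '% Below High', 'Annual Dividend $', 'Annual Dividend Yield',
    'Ask close', 'Bid close', 'Change Since Close', 'Day Change $', 'Dividend Pay Date',
    'Ex-dividend', 'Ex-dividend Date', 'Last (time)', 'Last Trade',
    'Next Earnings Announcement', 'Price',
])
returns_features.info()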

<class 'pandas.core.frame.DataFrame'>
Index: 501 entries, A to ZTS
Data columns (total 83 columns):
% Held by Institutions 495 non-null float64
52-Wk Range 501 non-null object
5yr Avg Return 501 non-null float64
5yr High 501 non-null float64
5yr Low 501 non-null float64
Annual Dividend % 394 non-null float64
Ask 494 non-null float64
Ask Size 492 non-null float64
B/A Ratio 492 non-null float64
B/A Size 501 non-null object
Beta 488 non-null float64
Bid 499 non-null float64
Bid Size 492 non-null float64
Change in Debt/Total Capital Quarter over Qua...473 non-null float64
Closing Price 500 non-null float64
Day Change % 501 non-null float64
Day High 501 non-null float64
Day Low 501 non-null float64
Days to Cover 495 non-null float64
Dividend Change % 405 non-null float64
Dividend Growth 5yr 354 non-null float64
Dividend Growth Rate, 3 Years 392 non-null float64
EPS (TTM, GAAP) 490 non-null float64
EPS Growth (MRQ) 487 non-null float64
EPS Growth (TTM) 489 non-null float64
EPS Growth 5yr 438 non-null float64
FCF Growth 5yr 485 non-null float64
Float 495 non-null float64
Gross Profit Margin (TTM) 422 non-null float64
Growth 1yr Consensus Est 475 non-null float64
Growth 1yr High Est 475 non-null float64
Growth 1yr Low Est 475 non-null float64
Growth 2yr Consensus Est 497 non-null float64
Growth 2yr High Est 497 non-null float64
Growth 2yr Low Est 497 non-null float64
Growth 3yr Historic 497 non-null float64
Growth 5yr Actual/Est 497 non-null float64
Growth 5yr Consensus Est 497 non-null float64
Growth 5yr High Est 497 non-null float64
Growth 5yr Low Est 497 non-null float64
Growth Analysts 497 non-null float64
Historical Volatility 501 non-null float64
Institutions Holding Shares 495 non-null float64
Interest Coverage (MRQ) 389 non-null float64
Last (size) 501 non-null float64
Market Cap 501 non-null object
Market Edge Opinion: 485 non-null object
Net Profit Margin (TTM) 489 non-null float64
Operating Profit Margin (TTM) 489 non-null float64
P/E Ratio (TTM, GAAP) 451 non-null object
PEG Ratio (TTM, GAAP) 329 non-null float64
Prev Close 501 non-null float64
Price/Book (MRQ) 455 non-null float64
Price/Cash Flow (TTM) 468 non-null float64
Price/Earnings (TTM) 448 non-null float64
Price/Earnings (TTM, GAAP) 448 non-null float64
Price/Sales (TTM) 490 non-null float64
Quick Ratio (MRQ) 323 non-null float64
Return On Assets (TTM) 479 non-null float64
Return On Equity (TTM) 453 non-null float64
Return On Investment (TTM) 437 non-null float64
Revenue Growth (MRQ) 487 non-null float64
Revenue Growth (TTM) 489 non-null float64
Revenue Growth 5yr 490 non-null float64
Revenue Per Employee (TTM) 459 non-null float64
Shares Outstanding 495 non-null float64
Short Int Current Month 495 non-null float64
Short Int Pct of Float 495 non-null float64
Short Int Prev Month 495 non-null float64
Short Interest 495 non-null float64
Today's Open 501 non-null float64
Total Debt/Total Capital (MRQ) 460 non-null float64
Volume 500 non-null float64
Volume 10-day Avg 500 non-null float64
Volume Past Day 501 non-null object
cfra 479 non-null float64
creditSuisse 337 non-null object
ford 493 non-null float64
marketEdge 484 non-null float64
marketEdge opinion 484 non-null object
newConstructs 494 non-null float64
researchTeam 495 non-null object
theStreet 496 non-null object
dtypes: float64(73), object(10)
memory usage: 348.8+ KB

B/A Size is encoded as strings, but we have numeric features for Bid Size and Ask Size, so we do not need the B/A Size feature at all, and it can be dropped. There is also the duplicate P/E Ratio (TTM, GAAP) feature in string format that we found earlier, which can be dropped again. The two columns Market Edge Opinion and marketEdge opinion are the same, and since there is a numeric counterpart to this analyst rating in the marketEdge column, we can drop both of the market edge opinion columns. The Volume Past Day feature is a categorical variable in string format that tells us whether the previous trading day had light, below average, average, above average, or heavy volume compared to a typical trading day for the security. This feature could be one hot encoded, but since there is a Volume 10-day Avg feature which is numeric and more descriptive of the recent volume trend, we will just drop Volume Past Day. That leaves us with five features of object dtype to manage before we begin the imputation step: 52-Wk Range, Market Cap, creditSuisse, researchTeam, and theStreet. Let's drop the unneeded columns, then take a closer look at these five columns.
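A sketch of those drops:

# Drop the redundant or less informative columns identified above
returns_features = returns_features.drop(columns=[
    'B/A Size',
    'P/E Ratio (TTM, GAAP)',
    'Market Edge Opinion:',
    'marketEdge opinion',
    'Volume Past Day',
])

# Take a closer look at the remaining object-dtype columns
returns_features.select_dtypes(include='object').head()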

It looks as though we can modify the strings in 52-Wk Range to give us two numeric columns: 52-Wk Low and 52-Wk High. Then, we can move on to converting the Market Cap column, but we will need to know all of the unique letters that it contains, which represent some order of magnitude for the number they are associated with. Let’s treat 52-Wk Range, and check our result.
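A sketch of that treatment, assuming the range strings look something like '110.50 - 180.30':

# Split the 52-Wk Range strings into separate low and high columns
wk_range = returns_features['52-Wk Range'].str.split('-', expand=True)
returns_features['52-Wk Low'] = wk_range[0].str.replace(',', '').astype(float)
returns_features['52-Wk High'] = wk_range[1].str.replace(',', '').astype(float)

returns_features[['52-Wk Range', '52-Wk Low', '52-Wk High']].head()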

Alright, this has worked. Now we need to determine what letter suffixes the numbers in the Market Cap column have, so we can build a function to convert this column into a numeric dtype. We can also drop the old 52-Wk Range column.
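For example:

# Drop the original range column now that it has been split
returns_features = returns_features.drop(columns=['52-Wk Range'])

# Unique suffix letters used in the Market Cap strings (e.g. '1.2T' or '450.3B')
returns_features['Market Cap'].str.strip().str[-1].unique()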

array(['B', 'T', 'M'], dtype=object)

Here we can see that the column has B, T, and M suffixes expressing amounts in billions, trillions, and millions, respectively. Knowing this, we can make a function to treat this column.
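A sketch of such a function, assuming the strings look like '450.3B':

# Convert Market Cap strings into floats using the suffix as a multiplier
def market_cap_to_float(value):
    multipliers = {'M': 1e6, 'B': 1e9, 'T': 1e12}
    value = value.strip()
    return float(value[:-1]) * multipliers[value[-1]]

returns_features['marketCap'] = returns_features['Market Cap'].apply(market_cap_to_float)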

Excellent, another success story. The old column can be dropped. Now we need to check if there are any companies in our feature set that aren’t present in the log_returns. Recalling from earlier, we should expect to find AGN and ETFC in this list.
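For example:

# Drop the original string column, then check for tickers missing from log_returns
returns_features = returns_features.drop(columns=['Market Cap'])
[ticker for ticker in returns_features.index if ticker not in log_returns.index]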

['AGN', 'ETFC']

Just as suspected. We need to drop these from the index of our feature set. Since we also know from earlier that we have infs that need to be re-encoded as nans, we will do that in this step as well.
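For example:

# Drop the acquired companies and re-encode infinite values as NaN
returns_features = returns_features.drop(index=['AGN', 'ETFC'])
returns_features = returns_features.replace([np.inf, -np.inf], np.nan)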

Now we have just one more chore before we can move on to dealing with multicollinearity: we need to one hot encode the categorical analyst ratings that were encoded as strings. Since we have fixed all of the other columns, we can simply call the pandas get_dummies method, and make sure that we set drop_first to True, in order to avoid the dummy variable trap.
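For example:

# One hot encode the remaining string-typed analyst ratings, dropping the first level of each
returns_features = pd.get_dummies(returns_features, drop_first=True)
returns_features.info()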

<class 'pandas.core.frame.DataFrame'>
Index: 499 entries, A to ZTS
Data columns (total 82 columns):
% Held by Institutions 493 non-null float64
5yr Avg Return 499 non-null float64
5yr High 499 non-null float64
5yr Low 499 non-null float64
Annual Dividend % 392 non-null float64
Ask 492 non-null float64
Ask Size 490 non-null float64
B/A Ratio 490 non-null float64
Beta 486 non-null float64
Bid 497 non-null float64
Bid Size 490 non-null float64
Change in Debt/Total Capital Quarter over Qua...471 non-null float64
Closing Price 498 non-null float64
Day Change % 499 non-null float64
Day High 499 non-null float64
Day Low 499 non-null float64
Days to Cover 493 non-null float64
Dividend Change % 403 non-null float64
Dividend Growth 5yr 354 non-null float64
Dividend Growth Rate, 3 Years 392 non-null float64
EPS (TTM, GAAP) 488 non-null float64
EPS Growth (MRQ) 485 non-null float64
EPS Growth (TTM) 487 non-null float64
EPS Growth 5yr 437 non-null float64
FCF Growth 5yr 483 non-null float64
Float 493 non-null float64
Gross Profit Margin (TTM) 420 non-null float64
Growth 1yr Consensus Est 473 non-null float64
Growth 1yr High Est 473 non-null float64
Growth 1yr Low Est 473 non-null float64
Growth 2yr Consensus Est 495 non-null float64
Growth 2yr High Est 495 non-null float64
Growth 2yr Low Est 495 non-null float64
Growth 3yr Historic 495 non-null float64
Growth 5yr Actual/Est 495 non-null float64
Growth 5yr Consensus Est 495 non-null float64
Growth 5yr High Est 495 non-null float64
Growth 5yr Low Est 495 non-null float64
Growth Analysts 495 non-null float64
Historical Volatility 499 non-null float64
Institutions Holding Shares 493 non-null float64
Interest Coverage (MRQ) 388 non-null float64
Last (size) 499 non-null float64
Net Profit Margin (TTM) 487 non-null float64
Operating Profit Margin (TTM) 487 non-null float64
PEG Ratio (TTM, GAAP) 329 non-null float64
Prev Close 499 non-null float64
Price/Book (MRQ) 453 non-null float64
Price/Cash Flow (TTM) 466 non-null float64
Price/Earnings (TTM) 447 non-null float64
Price/Earnings (TTM, GAAP) 447 non-null float64
Price/Sales (TTM) 488 non-null float64
Quick Ratio (MRQ) 322 non-null float64
Return On Assets (TTM) 477 non-null float64
Return On Equity (TTM) 451 non-null float64
Return On Investment (TTM) 435 non-null float64
Revenue Growth (MRQ) 485 non-null float64
Revenue Growth (TTM) 487 non-null float64
Revenue Growth 5yr 488 non-null float64
Revenue Per Employee (TTM) 457 non-null float64
Shares Outstanding 493 non-null float64
Short Int Current Month 493 non-null float64
Short Int Pct of Float 493 non-null float64
Short Int Prev Month 493 non-null float64
Short Interest 493 non-null float64
Today's Open 499 non-null float64
Total Debt/Total Capital (MRQ) 458 non-null float64
Volume 498 non-null float64
Volume 10-day Avg 498 non-null float64
cfra 478 non-null float64
ford 491 non-null float64
marketEdge 482 non-null float64
newConstructs 492 non-null float64
52-Wk Low 499 non-null float64
52-Wk High 499 non-null float64
marketCap 499 non-null float64
creditSuisse_outperform 499 non-null uint8
creditSuisse_underperform 499 non-null uint8
researchTeam_hold 499 non-null uint8
researchTeam_reduce 499 non-null uint8
theStreet_hold 499 non-null uint8
theStreet_sell 499 non-null uint8
dtypes: float64(76), uint8(6)
memory usage: 303.1+ KB

Regression

Excellent, we have a clean data frame to perform regression with. The study found that the variety of regression algorithms from scikit-learn and xgboost that were tested had consistently lackluster performance in the task of using these data to regress the returns since the date of the scrape. The best performance was provided by the RandomForestRegressor from scikit-learn, with an average r-squared score of just below .21 after either zero or mean imputation, with a wide variance of scores. Below we can see these results.

These scores are not particularly impressive, but they are high enough to indicate that the model is effective on some level. R-squared tells us the proportion of the variance of the target which is explained by the model, and even though a large proportion of this variance is not being explained, a significant portion of it is, meaning there is some predictive power being provided by our features. Although the results from regression are somewhat disappointing, the upcoming classification tasks were quite fruitful. As mentioned above, the performance of classification was actually aided by the removal of multicollinearity, so from here I will demonstrate that removal first, then move on to the classification model building.

Removing Multicollinearity:

Since we want to get meaningful insights from the feature importances of the models that we train, we need to now deal with multicollinearity among the features. To start this process, we need to generate a heatmap to visualize the correlation matrix of all of the features. We can do this with a combination of the pandas .corr method and seaborn’s heatmap.
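For example (figure size is a matter of taste):

import seaborn as sns

# Correlation matrix of all features, visualized as a heatmap
corr = returns_features.corr()
fig, ax = plt.subplots(figsize=(20, 16))
sns.heatmap(corr, cmap='coolwarm', center=0, ax=ax)
plt.show()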

The first thing to notice is that Annual Dividend % has a negative correlation with many price based features, especially with 5yr Avg Return. We can look at a scatter plot of this relationship to get a more detailed understanding.

We can see a strong negative correlation here. It can also be noticed that some of these annual dividends are quite large, all the way up to 17.5% of the price per share. Whenever a dividend is paid out to shareholders, it is subtracted from the price of the share on the date of payment. Thus, if an annual dividend is 5%, then the stock will need to have gained 5% that year in order to break even after all dividends are paid out. This explains the negative correlation seen above: the higher the dividend payouts, the more the share price is reduced as a result, and the smaller the gains in price per share will be for the year. We will remove the Annual Dividend % feature.

Next, we can see that all of the price related features are highly correlated, which is surprising to no one. Although features like 5yr High, 5yr Low, Ask, Bid, Closing Price, Day High, Day Low, etc. are all expressing distinct things about price, they are so closely correlated that there will be no way to determine their individual effects on the model, and since we will be looking at feature importances, we need to cull our price-related features. Additionally, the growth estimates have some strong multicollinearity, so we will drop some of these as well.

Another strong correlation exists between Float and Shares Outstanding. These are very similar features. Float represents the number of shares available for public trade, and Shares Outstanding is the total number of shares that a company has outstanding. These are bound to be highly related, but we can see just how related with another scatter plot.

The plot above verifies our understanding of these features: a company has a certain number of shares outstanding, but some of those shares may not be available for the public to trade. Thus we see that Float is never above Shares Outstanding, and the relationship mostly falls along the identity line (slope of 1). We can drop the Float feature in favor of Shares Outstanding.

Return on Assets and Return on Investment are very strongly correlated. As explained in this article, cross-industry comparison of Return on Assets may not be meaningful, and it is better to use Return on Investment in these cases, including the one we find ourselves in. Thus, we will drop Return on Assets.

Net Profit Margin is Operating Profit Margin minus taxes and interest, and therefore the two are highly correlated. Since the former contains more information, and they have the same number of missing values, we can drop Operating Profit Margin.

Volume is highly correlated with Volume 10-day Avg, and since the latter contains more information, we can drop the former. Historical Volatility has a strong correlation with Beta, which is a similar metric expressing volatility relative to the market. We will drop Historical Volatility in favor of Beta.

Price/Earnings (TTM) and Price/Earnings (TTM, GAAP) are highly correlated. As mentioned above, the non-GAAP metric is considered more useful in quantitative financial analysis, since it leaves out large non-recurrent costs that may have appeared on recent financial statements, so we will drop the feature calculated with GAAP earnings.

Dividend Growth Rate, 3 Years, and Dividend Growth 5yr are highly correlated, since they contain a lot of similar historical information. The 3 year growth rate should be informative enough to predict returns over a six month period, so we will drop Dividend Growth 5yr.

We can also see some collinearity occurring among the Short Interest related features. Let's view these features together to see what they look like.

We can see that the Short Int Pct of Float and Short Interest columns are the same, but with more resolution in the former, so we can drop the latter. Short Int Current Month and Short Int Prev Month are highly correlated with one another, and it stands to reason that the Current Month feature is more valuable looking forward, so we can drop Short Int Prev Month.

This is looking much better. Although there is still some correlation present between variables, we have dealt with most of the redundant features and multicollinearity. This is a good place to move on to our imputation and modeling phases.

Classification — Class 1: Gainers/Losers

Now we move on to our classification tasks, both of which have binary categorical targets. The first class will be gainers/losers, and the second will be over/under performers relative to the market.

Imputing Missing Data (Class 1: Gainers/Losers):

This study showed that the XGBClassifier from the xgboost package had superior predictive performance on this target, so we will investigate it here. For a full performance comparison of the various classifiers tested, see the notebooks.
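We can reuse the imputer-comparison wrapper from earlier with roc_auc scoring; the train/test split below is an assumption made here for illustration.

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Hold out a test set of securities for the gainer/loser target
X_train, X_test, y_train, y_test = train_test_split(
    returns_features, targets['gainer'], test_size=0.25,
    random_state=0, stratify=targets['gainer'])

xgb_clf = XGBClassifier(random_state=0)
ax1, ax2 = compare_imputer_scores(xgb_clf, X_train, y_train, estimators, scoring='roc_auc')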

We can see that the SimpleImputer is giving us decent performance, though with a wide variance. The IterativeImputer with the DecisionTreeRegressor is doing the best, and with the KNeighborsRegressor it is doing nicely with a tighter variance, which sometimes can be preferable for robustness. One thing to always keep in mind when preparing to do a grid search using these imputers is that the IterativeImputer makes the process take a lot longer, for reasons implicit in the name. In this study, both were compared, and it turned out that the SimpleImputer was both faster and produced a better model, so we will demonstrate with this imputation method here.

Modeling (Class 1: Gainers/Losers):

We can now construct a pipeline and a grid search to build an optimal model for this task, which we can then analyze.
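A sketch of that pipeline and grid search, with an illustrative parameter grid:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

gainer_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', XGBClassifier(random_state=0)),
])

param_grid = {
    'clf__n_estimators': [100, 500, 1000],
    'clf__learning_rate': [0.001, 0.01, 0.1],
    'clf__max_depth': [3, 5, 7],
    'clf__subsample': [0.8, 1],
    'clf__colsample_bytree': [0.8, 1],
    'clf__colsample_bylevel': [0.8, 1],
    'clf__reg_lambda': [0.5, 1],
}

gainer_search = GridSearchCV(gainer_pipe, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
gainer_search.fit(X_train, y_train)
# gainer_search.best_score_ and gainer_search.best_params_ hold the results shown below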

After a while, we have our completed grid search. Let’s look at the best cross validation score and parameters that it produced.

Best CV roc_auc score: 0.7014277550220106
{'clf__colsample_bylevel': 1,
'clf__colsample_bytree': 1,
'clf__learning_rate': 0.001,
'clf__max_depth': 5,
'clf__n_estimators': 1000,
'clf__reg_lambda': 0.5,
'clf__subsample': 1}

The best auc score from the cross validation was a .701, which is neither awesome nor terrible. Let’s see how the predicted probabilities for the test set relate to the log returns.

The r-squared for this simple linear regression between the model's predicted probabilities and log returns is very low at 0.059, but the coefficient for the probabilities is significant with a p-value of 0.015. We can see the imbalance of the classes in the scatter plot above: since the market was uptrending during the period of study, the majority of stocks in the index had gains, and are therefore members of the gainer class. Let's look now at what the probability of selecting a gainer at random would have been by dividing the number of securities in the gainer class by the total number of securities in the index.
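For example:

# Baseline: the share of securities in the index that were gainers over the period
targets['gainer'].sum() / len(targets)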

0.7294589178356713

Here we can see that an investor would have had a 73% chance of randomly picking a gainer during the time period of study due to the uptrending market. We can now start to see why it does not make much sense to model gainers/losers if we intend to gain insights or make a predictive model which will be useful during different time periods, because the proportion of gainers/losers (class membership) will change depending on what the behavior of the market is, and this model may be attributing gains to the predictive features that had less to do with those features and more to do with the movement of the market. In a moment, we will remove the average return of the market to adjust for this, and model over/under performers relative to the market, but first let’s take a quick look at the accuracy and roc curve for this model.
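A sketch of those two checks on the holdout set:

from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Accuracy of the class predictions on the test set
y_pred = gainer_search.best_estimator_.predict(X_test)
print('Accuracy Score (test set):', round(accuracy_score(y_test, y_pred), 2))

# ROC curve from the predicted probabilities
y_proba = gainer_search.best_estimator_.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()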

Accuracy Score (test set): 0.81

We can see that the model predicted the correct class 81% of the time, which sounds impressive, but knowing that the probability of randomly selecting a gainer was 73%, it isn’t quite so, and we know that this accuracy could vary widely in another time period where the market trended differently. The auc score on the test set is 0.68, which is not super impressive. Let’s move on to modeling our last target, now that we know that it will give us the most robust insight into the performance of stocks relative to each other.

Classification — Class 2: Over/Under Performers

As we can see from the modeling and analysis above, there is a significant disadvantage to modeling gainers/losers over a given time period, because the category that a stock falls into is highly dependent on the behavior of the overall market during that time period. This can lead the model to be biased in how it evaluates the contributions of features to stock performance because some of that performance can be attributed to the market’s behavior, and not to the underlying features of the stock itself. In order to adjust for this, we can subtract the average return of the market over the time period from the return of each security, thereby excluding the influence of the overall market movement from the target, and focusing solely on the differences in performance among the securities in the index. By modeling relative performance, an investor can construct a hedged portfolio using a long/short equity strategy based off of the model predictions, which should then be robust to varying market conditions.

Imputing Missing Data (Class 2: Over/Under Performers):

The XGBRFClassifier from xgboost was found to be the most effective classifier for this task. Let’s take a look at how the imputation methods get along with this classifier.
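As before, we can reuse the comparison wrapper; the holdout split and settings below are assumptions for illustration.

from xgboost import XGBRFClassifier

# New holdout split on the over/under performer target
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    returns_features, targets['outperformer'], test_size=0.25,
    random_state=0, stratify=targets['outperformer'])

xgbrf_clf = XGBRFClassifier(random_state=0)
ax1, ax2 = compare_imputer_scores(xgbrf_clf, X_train2, y_train2, estimators, scoring='roc_auc')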

We can see that the best performance is being achieved using the SimpleImputer, and although these results are not particularly exciting, some hyperparameter tuning of the classifier will help. Let’s move on to a grid search.

Modeling (Class 2: Over/Under Performers):
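A sketch of the grid search cell referenced below (the parameter grid is illustrative):

performer_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', XGBRFClassifier(random_state=0)),
])

param_grid = {
    'clf__n_estimators': [100, 500],
    'clf__learning_rate': [0.001, 0.01],
    'clf__max_depth': [5, 7],
    'clf__subsample': [0.8, 1.0],
    'clf__colsample_bytree': [0.6, 0.8, 1.0],
    'clf__colsample_bylevel': [0.8, 1.0],
    'clf__colsample_bynode': [0.8, 1.0],
    'clf__reg_lambda': [0.5, 0.75, 1.0],
}

performer_search = GridSearchCV(performer_pipe, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
performer_search.fit(X_train2, y_train2)
# performer_search.best_score_ and performer_search.best_params_ hold the results shown below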

After running the cell above, we have a fitted grid search object that will have our optimal estimator fit to the training set. Let’s inspect some of the features of this estimator.

Best CV training score: 0.690298076923077
{'clf__colsample_bylevel': 0.8,
'clf__colsample_bynode': 1.0,
'clf__colsample_bytree': 0.6,
'clf__learning_rate': 0.001,
'clf__max_depth': 7,
'clf__n_estimators': 100,
'clf__reg_lambda': 0.75,
'clf__subsample': 0.8}

The best cross-validation score in the grid search on the training set was an roc_auc of 0.69, which is neither great nor terrible. We can now use the best estimator from the grid search to generate predicted probabilities for the holdout set, then see if there is a linear relationship between these probabilities and the actual log returns of the securities.

We can see that the numeric range of these predicted probabilities is extremely slim, but that the classifier is still effective. The r-squared for this simple linear regression of the log returns using the predicted probabilities is very low at .088, but the coefficient for the probability is significant with a p-value of .003, meaning that there is a linear relationship between the log returns and the predicted probabilities generated by the model. What this means for us is that rather than simply choosing securities above/below a threshold probability, one could create a more conservative portfolio by creating a window around a threshold probability and only selecting stocks to long or short that are outside of that window of probabilities. For simplicity, here we will look at just longing or shorting on either side of our chosen threshold, but first we need to figure out what the optimal threshold for this model is by looking at the roc curve.

We can see that the roc_auc score for the holdout set is .70, which isn’t too bad. We can also see that there appears to be a sweet spot right around a True Positive Rate of just under 0.8 and a False Positive Rate around 0.4. We can create a data frame of these rates with their thresholds using the roc_curve function, and use it to determine what the best threshold to use for our model is.
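
A sketch of building that data frame with scikit-learn's roc_curve, reusing y_test (the holdout labels, name assumed) and the predicted probabilities from above:

```python
import pandas as pd
from sklearn.metrics import roc_curve

# Each row pairs a probability threshold with the tpr/fpr it would produce.
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_df = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'threshold': thresholds})
roc_df.head()
```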

Great, now we can slice into this data frame to find the probability threshold that corresponds with the sweet spot we saw on the roc curve above.

And there we have it: the sweet spot between tpr and fpr appears to be at index 21. We can retrieve the full-precision threshold by indexing the column, and use it to manually generate predictions. It will be useful to compare the predictive accuracy using our chosen threshold with the accuracy attained using the standard probability threshold of 0.5.
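
A sketch of that comparison, pulling the full-precision threshold out of the roc_df built above (index 21, as identified here) and scoring both thresholds against the holdout labels:

```python
from sklearn.metrics import accuracy_score

# Full-precision threshold at the sweet spot found above.
best_threshold = roc_df.loc[21, 'threshold']

# Manually convert probabilities to class predictions at each threshold.
preds_default = (probs >= 0.5).astype(int)
preds_tuned = (probs >= best_threshold).astype(int)

print('Accuracy Score with Standard Threshold:',
      round(accuracy_score(y_test, preds_default), 2))
print('Accuracy Score with Selected Threshold:',
      round(accuracy_score(y_test, preds_tuned), 2))
```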

Accuracy Score with Standard Threshold: 0.62
Accuracy Score with Selected Threshold: 0.69

As we can see, fine-tuning our classification threshold has gained us another 7 percentage points of overall predictive accuracy, bringing us up to 69%. This is considerably better than a random guess, which for this target had only a 53% chance of picking an overperformer. We can see what that baseline probability of correctly picking an overperforming stock is by dividing the number of overperforming stocks by the total number of stocks in the index.

0.5270541082164328

We can see that the model is indeed giving us a predictive advantage over randomly selecting an overperforming stock. Now that we have our predictions, we can set about constructing a portfolio using a long/short equity strategy, and see how the returns of this portfolio would compare to buying and holding the market index over the same time period.

Before we move on to portfolio construction, however, let us take a look at the information on feature importances provided to us by the model. There are two ways to look at this: the first is to look at the feature importances from the training process, and the second is to look at permutation importances on the holdout set. Comparing the two can provide interesting insights into how the model behaves with the two sets of data. First, let’s look at the feature importances of the training process.
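
A sketch of pulling those importances out of the fitted pipeline and labeling them with the training columns (variable names follow the earlier sketches):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Importances learned by the classifier during training.
clf = grid.best_estimator_.named_steps['clf']
train_importances = (pd.Series(clf.feature_importances_, index=X_train.columns)
                       .sort_values())

train_importances.plot(kind='barh', figsize=(8, 12),
                       title='Feature importances from training')
plt.show()
```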

Above we can see the relative importances of the features in the model fit, which it uses to make predictions. Interestingly, the leading feature is the analyst rating from Ford Equity Research; perhaps they would be pleased to know. Beneath this we see our fundamental features, nearly all of which play an important role in estimation, apart from some of the dummy variables. This is very informative, but it only shows how the model built itself to best fit the training set. In order to see how these features contribute to predictive accuracy on the holdout set, we need to use permutation importances. Permutation importances are generated by iteratively shuffling each feature (breaking its relationship to the target) and measuring the loss of predictive performance that results. If shuffling a certain feature leads to a major loss in performance, then it can be said to be an important feature in prediction. Let's look at this below.
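
A sketch using scikit-learn's permutation_importance on the holdout set, with 10 shuffles per feature (names again follow the earlier sketches):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance

# Shuffle each feature 10 times and record the resulting drop in roc_auc.
perm = permutation_importance(grid.best_estimator_, X_test, y_test,
                              scoring='roc_auc', n_repeats=10, random_state=42)

# Box plot of the score drops per feature, ordered by mean importance.
order = perm.importances_mean.argsort()
perm_df = pd.DataFrame(perm.importances[order].T, columns=X_test.columns[order])
perm_df.plot(kind='box', vert=False, figsize=(8, 12),
             title='Permutation importances on the holdout set')
plt.show()
```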

Above we can see the distributions of how much the model performance was affected by 10 shuffles of each feature. Notice that the ordering of the importances has changed somewhat, although there are some consistencies. This is because what is important in predicting the target for the securities of the training set may differ from what is important for the securities in the testing set. Features found toward the top of both lists, such as Growth 1yr High Est, Revenue Per Employee (TTM), ford, marketCap, and Beta, can be assumed to be important overall. For some of the permutation importances toward the bottom of the chart, random shuffling of the features actually improved the predictive accuracy on the holdout set!

Although we know that certain features are important to the model, their relationships to the target variable may not be totally clear to a human being, as we can see below by plotting the log returns against the (apparently highly valuable) ford ratings.

Ford is an important feature with an unclear relationship to the target.

Constructing a Portfolio

To demonstrate how our model can be useful to an investor, we will construct a basic long/short equity portfolio using the model's predictions of over/under performers. The benefit of the long/short equity strategy is that the investor is hedged against the market by taking an equal amount of long and short positions. This way, the portfolio is largely insulated from overall upward or downward movement of the market, and is instead driven by how well the investor has predicted the relative performance of the securities within it. Our model from above correctly identifies over and under performers with an accuracy of 69%, which isn't perfect, but when this predictive power is combined with a long/short equity strategy, it can support a consistently profitable approach: taking many positions diversifies the portfolio's risk, while the balanced long and short sides hedge against the market. If the market moves up, the investor gains from their long positions while losing on their short positions, and conversely, if the market moves down, they gain on their short positions while losing on their long positions. Because we predicted with better-than-random accuracy which securities were set to over or underperform the market, the gains from the winning side of the portfolio should, on average, exceed the losses from the losing side, no matter which way the market moves. This is why it was so important to subtract the average return of the market from our target variable: it led to a model focused only on relative performance, ignoring the impact of the market.

We can simulate how such a portfolio would have performed using the securities of the test set by using the class predictions for each security to alter the sign of their respective log returns, and averaging all of these adjusted returns together. This would be the equivalent of the investor making equal dollar investments into each security in the test set, going short in any stock predicted by the model to underperform, and going long in any stock predicted to overperform. Since being short in a stock turns losses into gains and vice-versa, we will take all securities with predicted class 0 and reverse the signs.

Now we are ready to determine the return of a portfolio that took equal dollar value positions in each stock of the holdout set, going short in those predicted to underperform and long in those predicted to overperform. We can do this as follows:
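
A sketch of that calculation, reusing log_returns_test and the tuned class predictions preds_tuned from the earlier sketches (names assumed):

```python
import numpy as np

# Long positions keep their sign; predicted underperformers (class 0) are
# shorted, which flips the sign of their returns for the portfolio.
signs = np.where(preds_tuned == 1, 1.0, -1.0)
portfolio_returns = signs * log_returns_test

# Equal dollar positions in every holdout stock: the portfolio return is the
# average of the sign-adjusted returns; the market return is the plain average.
print('Portfolio Return (test set):', portfolio_returns.mean())
print('Market Return (test set):', log_returns_test.mean())
```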

Portfolio Return (test set): 0.10186008839757786
Market Return (test set): 0.14362567579115593

We can see that the portfolio has underperformed the market considerably, but this is the nature of a hedging strategy. The market itself did very well over the time period studied, so the short side of the portfolio produced losses that subtracted from the potential gains of simply buying and holding the market index. This sacrifice in performance relative to a booming market is made in order to protect the portfolio in the event that the market behaves poorly. To see this in action, let's repeat the comparison with a simulated bear (falling) market, which we can create by subtracting the average market return twice from each security, making the overall movement of the market the opposite of what actually happened. We can then see how the portfolio would fare against buying and holding the market index under these adverse circumstances.

Now we can repeat the process above to see how our portfolio performance would have compared to buying and holding this simulated bear market.
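
A sketch of that simulation, where index_mean_return stands for the whole index's average log return over the period (an assumed variable) and signs carries over from the portfolio calculation above:

```python
# Subtract the market's average return twice from each security, mirroring
# the market's overall move while preserving relative performance.
bear_returns = log_returns_test - 2 * index_mean_return
bear_portfolio_returns = signs * bear_returns

print('Portfolio Return (test set):', bear_portfolio_returns.mean())
print('Market Return (test set):', bear_returns.mean())
```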

Portfolio Return (test set): 0.050753389653326986
Market Return (test set): -0.06931890230988941

Here we can see where the hedged portfolio design truly shines: where someone who had bought and held the market would have suffered losses, the hedged portfolio actually saw gains due to the short positions, thus demonstrating how the long/short equity portfolio strategy reduces market risk, and leads to more consistent profitability.

Visualizing the Portfolio Performance:

We can get a deeper understanding of what is happening by visualizing the performance of the portfolio vs the market. In this case, we are actually looking at the portion of the market which is within the holdout set. We can make a quick visual to see how the overall market differs from our test set, after we create a data frame representing the daily returns of all of the securities in the index by calling the pandas .diff() method on the log_close data frame, which calculates the daily change in log price, also known as log returns.

We can use the cumulative sum to represent the cumulative returns over time. Beneath we will look at how the test set compares to the entire index.
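
A sketch of that comparison, assuming log_close holds the daily log closing prices of every security in the index (columns = tickers) and test_tickers lists the holdout securities:

```python
import matplotlib.pyplot as plt

# Daily log returns for every security in the index.
log_returns_daily = log_close.diff().dropna()

# Equal-weight average each day, then cumulate to get cumulative return paths.
index_cumulative = log_returns_daily.mean(axis=1).cumsum()
test_cumulative = log_returns_daily[test_tickers].mean(axis=1).cumsum()

index_cumulative.plot(label='S&P 500 (full index)', figsize=(10, 5))
test_cumulative.plot(label='Test set')
plt.legend()
plt.title('Cumulative log returns: full index vs. test set')
plt.show()
```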

We can see above that the portion of the S&P 500 that is in the test set noticeably outperformed the index, but that the overall shape is almost exactly the same. Recall from above that the average return for the holdout set was .14, while the index had an overall return of .11. Now, we can use the model's predictions of under and over performers to split the securities in the test set into long and short positions. The portfolio will go long in all companies predicted to outperform, and short in all that were predicted to underperform. To construct this portfolio, we will split the log_returns_full frame into the long side and the short side, reverse the sign of the short side, and recombine the two. Then, we will plot the returns of a buy & hold strategy on the test set against the long/short portfolio we have made.
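
A sketch of that construction, assuming log_returns_full holds the daily log returns of the test-set securities and long_tickers / short_tickers split them according to the model's predictions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Shorting flips the sign of a stock's daily returns.
long_side = log_returns_full[long_tickers]
short_side = -log_returns_full[short_tickers]

# Equal dollar positions: average the sign-adjusted daily returns.
portfolio_daily = pd.concat([long_side, short_side], axis=1).mean(axis=1)
buy_hold_daily = log_returns_full.mean(axis=1)

buy_hold_daily.cumsum().plot(label='Buy & hold (test set)', figsize=(10, 5))
portfolio_daily.cumsum().plot(label='Long/short portfolio')
plt.legend()
plt.title('Buy & hold vs. long/short portfolio (cumulative log returns)')
plt.show()
```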

Now we can see much more detail of what is going on. The final return for the test set is .14 and the total return for the portfolio is .10, just as we calculated earlier, but now we can see the history leading up to that point. Notice that the portfolio is much less volatile than the buy & hold strategy, with far gentler peaks and troughs. This is due to the hedging. To get an even better idea of what is happening, we can look at the returns of the long side and the short side of the portfolio separately.

Here we see the mechanics behind our strategy most clearly. The green line corresponds to the collective returns from our long positions, and the red line shows those of our short positions. Notice how they look almost like reflections of each other, but that the green line climbs further than the red line falls. This is the entire goal of our strategy in action! The mirrored peaks and troughs combine to create the smoother line of the portfolio compared to the buy & hold strategy, since the market movements experienced by all of the stocks have been mostly canceled out by the hedging. The fact that the red line doesn't lose as much as the green line gains is thanks to our model, which successfully picked stocks that were more likely to over or underperform the market. Since short positions lose money when prices rise, the fact that the shorted stocks collectively underperformed the market kept those losses small, while the stocks held long were positioned to maximize gains.

To explain why we are willing to sacrifice profits for our hedging strategy, let’s create the simulated bear market, and see how the same portfolio would have behaved in this environment. To create the bear market, we can subtract the mean of each day’s log returns from each company’s log return that day twice, effectively reversing the flow of the market over this time period. Let’s do this, and create a visualization to verify that it has worked as expected.
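
A sketch of that transformation and the sanity-check plot, reusing log_returns_daily (the full-index daily returns) and log_returns_full from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Subtract twice the cross-sectional mean of each day's returns from every
# security, flipping the market's daily drift while keeping relative moves.
daily_market_mean = log_returns_daily.mean(axis=1)
bear_returns_daily = log_returns_full.sub(2 * daily_market_mean, axis=0)

# The simulated bear market should mirror the test set's actual path.
log_returns_full.mean(axis=1).cumsum().plot(label='Actual test set', figsize=(10, 5))
bear_returns_daily.mean(axis=1).cumsum().plot(label='Simulated bear market')
plt.legend()
plt.title('Actual vs. simulated bear market (cumulative log returns)')
plt.show()
```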

We can see that this math has mirrored the cumulative returns of the index, and given us a simulation of a market downturn. Let’s see how the same portfolio positions, as determined by the predictions of our model, would have performed vs the test set in these market conditions.

Here we can see the beauty of our strategy. In these adverse conditions, the profits generated from the short positions have outweighed the losses from the long positions, so the portfolio actually made gains where a buy & hold investor would have suffered losses, and without all of the dramatic swings to boot. Let's look again at the two halves of the portfolio separately.

We can see that the red line representing the short positions is driving the profits, and that the predictive power of the model has again led to an overall win in these conditions by allocating the portfolio toward the securities most likely to over or underperform the market. Notice that the green line ends up just about breaking even, whereas the test set went down to -.06, meaning that our long positions have indeed outperformed the market. The short positions contributed a cumulative log return of .16 to the portfolio, which means the underlying stocks actually fell by .16 and thus drastically underperformed the market during this simulated downturn, exactly according to plan.

Conclusions

In this study, the data obtained from the web scraper were used in a number of investigations. First, we unsuccessfully attempted to create an asset pricing model that could aid a value investing strategy over a six-month period, by regressing the prices of the stocks on the day of the scrape and attempting to use the residuals as margins of safety. We then moved on to modeling the returns since the scrape in a number of ways. Regressing the continuous values of the log returns led to lackluster performance, with r-squared around .20, though further testing may show that these models were not completely useless in providing a trader with an edge. Models were then trained to classify gainers and losers, but these were found to be biased toward the circumstances of the bull market since the date of the scrape, with class membership influenced by market movement rather than only by differences in fundamentals among the companies. The most fruitful investigation came from training models to determine which stocks performed better or worse relative to the market. Such models use the fundamental features to determine which stocks over or under perform the others in the index, a comparison that is far more robust to varying market conditions, since the market has periods of both upward and downward trends. Once a model was trained to classify the over and under performing stocks within the index, a simulated portfolio was constructed using a long/short equity strategy to show how the model's predictions could give an investor a trading edge by guiding the construction of a hedged portfolio. This simulation showed that the model was indeed useful in creating a consistently profitable strategy with diversified position risk that was hedged against market risk.

Further study in this domain would be highly appropriate. Although the web scraper gave us a wonderful dataset to work with, there are things that could be improved in future work. Firstly, the web scraping process was extraordinarily time consuming, both in development and deployment, and led to a limited dataset of only 500 companies to study. This path was taken to avoid the cost of paying for such in-depth fundamental data, but now that this data has been shown to pair well with machine learning algorithms in providing a trading edge, paying for it becomes easier to justify. Further, with a data subscription, multiple time periods of various lengths could be studied, rather than just one, which would greatly help both in training models and in testing their deployment on out-of-sample data. Such data would also likely have fewer missing values, so a smaller portion would need to be imputed, leading to better model performance.
