【Data Analysis(9)】Stock Selection by Random Forest Algorithm

Backtesting and stock-picking strategy with machine learning

TEJ 台灣經濟新報
TEJ-API Financial Data Analysis
8 min read · Dec 14, 2021


Photo by Jay Mantri on Unsplash

Highlights

  • Difficulty:★★★☆☆
  • Random forest algorithm
  • Advice: This article uses the random forest algorithm to predict the direction of stock price movement, explores which features affect stock returns the most, and uses those features as our stock selection criteria. We previously introduced how to predict stock returns with XGBoost. This week we backtest and optimize our stock-picking strategy, and show the latest stock watchlist at the end.

Preface

Simply put, a random forest is an algorithm made up of many decision trees, built with bagging and random sampling. Since it is based on the CART algorithm, it can handle both categorical and continuous data. It also works well with high-dimensional data, tolerates noise, and fits with high accuracy, so it is commonly used in data science competitions such as Kaggle.
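To make the idea of bagging concrete, here is a minimal sketch that builds a small forest by hand. It is only an illustration, not the code used in this article: the synthetic data, the number of trees and the use of scikit-learn's DecisionTreeClassifier are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the label depends on two of five features.
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

n_trees = 25  # odd, so the majority vote below cannot tie
trees = []
for i in range(n_trees):
    # Bagging: draw a bootstrap sample (with replacement) of the training rows.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Random feature sampling: each split only considers a random subset of the features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregate the trees by majority vote.
votes = np.stack([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (forest_pred == y_test).mean()
print(f"bagged-tree accuracy: {accuracy:.3f}")
```

In practice a library class such as scikit-learn's RandomForestClassifier wraps these steps, so the loop above is only meant to show what happens inside.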

In this article, we treat financial data as features for predicting the direction of stock price movement, which makes this a binary classification problem. We then check whether the fitted model yields valuable information that can improve our stock-picking strategy.

The Editing Environment and Modules Required

Windows OS and Jupyter Notebook

Database Used

Data Processing

Step 1. Obtain industry code, financial and return data

First, obtain the codes of TSE-listed stocks and their corresponding industry codes. We save the latter as a dictionary.
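A minimal sketch of building the two objects might look like this. The table below is a hand-made stand-in for the API result, and the column names coid and industry, as well as the industry code values, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical listing table; in the article this would come from the TEJ API.
# Column names and industry code values are made up for this sketch.
security_info = pd.DataFrame({
    "coid": ["1101", "2330", "2454", "2882"],
    "industry": ["IND_A", "IND_B", "IND_B", "IND_C"],
})

# List of stock codes, used later to request data group by group.
security_list = security_info["coid"].tolist()

# Dictionary mapping each stock code to its industry code.
industry_code = dict(zip(security_info["coid"], security_info["industry"]))
print(industry_code)
```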

security_list & industry_code

Here we split the firms into groups of 50 for the next step, because requesting a huge amount of data at one time may cause the request to fail.
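The grouping can be sketched with a simple list slice. The helper name split_into_groups and the generated stock codes are hypothetical.

```python
def split_into_groups(codes, group_size=50):
    """Split a list of stock codes into consecutive groups of at most group_size."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]

# Made-up codes standing in for the real security list.
security_list = [str(1000 + i) for i in range(120)]
groups = split_into_groups(security_list)
print([len(g) for g in groups])  # [50, 50, 20]
```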

Then use a loop to get each group's financial data and seasonal (quarterly) stock return data. Note that the seasonal stock return has a daily frequency, because it is the cumulative seasonal return up to that date. It behaves like a rolling stock return, so it has a value for each trading day. date_data is a table provided by TEJ, and it is very useful for combining return and financial data.
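The loop itself can be sketched as follows. fetch_group is a hypothetical stand-in for the actual TEJ API request, which is not reproduced here; it only returns dummy rows so that the loop and the final concatenation can run.

```python
import pandas as pd

def fetch_group(codes):
    """Hypothetical stand-in for one API request covering a group of stock codes."""
    return pd.DataFrame({"coid": codes, "value": list(range(len(codes)))})

# Tiny made-up groups; in the article each group holds 50 stock codes.
groups = [["1101", "1102"], ["2330", "2454"], ["2882"]]

frames = []
failed = []
for group in groups:
    try:
        frames.append(fetch_group(group))
    except Exception as exc:
        # Remember the failed group so it can be requested again later.
        failed.append(group)
        print(f"request failed for {group}: {exc}")

# Stack the per-group results into one table.
data = pd.concat(frames, ignore_index=True)
print(data.shape)
```

Keeping a list of failed groups is a design choice of this sketch: a single failed request then does not discard the data already downloaded.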

Step 2. Merge data

Obtain the last trading date before the next financial statement announcement date, because we will match this date against the date of the seasonal return. That is to say, the seasonal return will represent the cumulative return after the financial statement announcement date. Finally, we rename the columns to prepare for merging the data.
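One way to find that date is pandas' merge_asof, sketched below under several assumptions: the column names, the single made-up stock "A", its values, and a trading calendar approximated by weekdays (real exchange holidays are ignored).

```python
import pandas as pd

# Made-up financial data with announcement dates (illustration only).
fin = pd.DataFrame({
    "coid": ["A", "A", "A"],
    "announce_date": pd.to_datetime(["2021-03-15", "2021-05-17", "2021-08-16"]),
    "roe": [5.0, 6.0, 7.0],
})
fin["announce_date"] = fin["announce_date"].astype("datetime64[ns]")
fin = fin.sort_values(["coid", "announce_date"])

# Next announcement date of the same stock (NaT for the latest statement).
fin["next_announce"] = fin.groupby("coid")["announce_date"].shift(-1)

# Assumed trading calendar: every weekday of 2021.
trading_days = pd.DataFrame({"last_trade_date": pd.bdate_range("2021-01-01", "2021-12-31")})
trading_days["last_trade_date"] = trading_days["last_trade_date"].astype("datetime64[ns]")

# For each row, take the last trading day strictly before the next announcement.
left = fin.dropna(subset=["next_announce"]).sort_values("next_announce")
with_dates = pd.merge_asof(
    left, trading_days,
    left_on="next_announce", right_on="last_trade_date",
    direction="backward", allow_exact_matches=False,
)

# Made-up seasonal returns observed on those trading days.
returns = pd.DataFrame({
    "coid": ["A", "A"],
    "last_trade_date": pd.to_datetime(["2021-05-14", "2021-08-13"]).astype("datetime64[ns]"),
    "seasonal_return": [3.2, -1.5],
})
merged = with_dates.merge(returns, on=["coid", "last_trade_date"], how="inner")
print(merged[["coid", "announce_date", "last_trade_date", "roe", "seasonal_return"]])
```

Each statement is thereby paired with the return accumulated after its own announcement and before the next one.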

date_data

Combine all the data and set the stock code and date as the new index. Then keep only the numeric columns as features for predicting the return movement.
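A short sketch of this step, using made-up rows and assumed column names (coid, mdate, currency, roe, seasonal_return):

```python
import pandas as pd

# Made-up merged table with one non-numeric column.
merge = pd.DataFrame({
    "coid": ["A", "B"],
    "mdate": pd.to_datetime(["2021-05-14", "2021-05-14"]),
    "currency": ["TWD", "TWD"],
    "roe": [5.0, 6.0],
    "seasonal_return": [3.2, -1.5],
})

# Stock code and date become the index; only numeric columns are kept.
merge = merge.set_index(["coid", "mdate"]).select_dtypes(include="number")
print(merge)
```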

merge

Model Training

Step 1. Split the dataset into training and testing data and train the model

Data before 2020 is used as training data, and the rest as testing data. We fill missing values with zero and then fit the random forest model with the features and boolean labels of the training dataset.
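The split and fit can be sketched as below. The data is random and the column names, the label rule (seasonal return above zero) and the hyperparameters are assumptions, since the original settings are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Random stand-in for the merged table, indexed by stock code and date.
dates = pd.to_datetime(["2018-05-15", "2019-05-15", "2020-05-15", "2021-05-14"])
index = pd.MultiIndex.from_product(
    [[f"S{i}" for i in range(50)], dates], names=["coid", "mdate"]
)
merge = pd.DataFrame(
    rng.normal(size=(len(index), 3)), index=index, columns=["roe", "margin", "growth"]
)
merge.iloc[::17, 0] = np.nan  # inject some missing values
merge["seasonal_return"] = rng.normal(size=len(index))

# Features with missing values filled by zero; boolean label = positive return (assumed rule).
features = merge.drop(columns="seasonal_return").fillna(0)
labels = merge["seasonal_return"] > 0

# Rows dated before 2020 form the training set, the rest the testing set.
is_train = merge.index.get_level_values("mdate") < pd.Timestamp("2020-01-01")
X_train, y_train = features[is_train], labels[is_train]
X_test, y_test = features[~is_train], labels[~is_train]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(X_train.shape, X_test.shape)
```

Splitting by date rather than at random keeps future information out of the training set, which matters for a backtest.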

Step 2. Model performance
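The original output for this step is not reproduced here. As a rough sketch, performance on the testing set could be checked with accuracy and a confusion matrix. The data below is synthetic, so the numbers it prints say nothing about the model in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(1)

# Synthetic features and a noisy label driven by the first feature (illustration only).
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=400)) > 0
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)  # share of correct predictions
cm = confusion_matrix(y_test, y_pred)      # rows: actual class, columns: predicted class
print(f"accuracy: {accuracy:.4f}")
print(cm)
```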

Backtesting and Visualization

Use the predicted outcome to filter stocks, so that we hold the stocks predicted to perform well in the next season.

test_data[selected]

We draw three lines here. The blue line is the cumulative return of the portfolio based on the model's prediction. The orange line is the cumulative return of a portfolio formed from the stocks whose returns are predicted to fall, which serves as our benchmark. The red line represents holding all of the stocks without any filtering. The blue line performs best of the three.
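The three curves can be computed along these lines. This is a sketch on random data, assuming equal weighting within each period, returns expressed in percent, and a boolean selected column that stands in for the model's prediction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Random stand-in for the testing data.
dates = pd.to_datetime(["2020-05-15", "2020-08-14", "2020-11-13", "2021-03-31"])
index = pd.MultiIndex.from_product(
    [[f"S{i}" for i in range(30)], dates], names=["coid", "mdate"]
)
test_data = pd.DataFrame({
    "seasonal_return": rng.normal(2.0, 10.0, size=len(index)),  # in percent
    "selected": rng.random(len(index)) > 0.5,                    # stand-in for the prediction
}, index=index)

def cumulative_return(frame):
    """Equal-weighted mean return per period, compounded over time."""
    period_mean = frame.groupby(level="mdate")["seasonal_return"].mean()
    return (1 + period_mean / 100).cumprod() - 1

curves = pd.DataFrame({
    "predicted_up": cumulative_return(test_data[test_data["selected"]]),
    "predicted_down": cumulative_return(test_data[~test_data["selected"]]),
    "all_stocks": cumulative_return(test_data),
})
print(curves)
```

Calling curves.plot() with matplotlib installed would draw the three lines on one chart.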

Optimize Stock-Picking Strategy

Step 1. Choose important features

Take the 20 most important features as the criteria for selecting stocks.

The next step is to subjectively choose the positive features from those 20, meaning features for which a higher value indicates a better company. Since we will convert the values into percentiles and rank them, a higher value should signify better performance.
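With scikit-learn, the fitted forest exposes feature_importances_, so the top 20 can be listed as sketched below. The data and feature names are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Synthetic features; only feature_0 and feature_1 drive the label.
feature_names = [f"feature_{i}" for i in range(30)]
X = pd.DataFrame(rng.normal(size=(500, 30)), columns=feature_names)
y = (X["feature_0"] + X["feature_1"]) > 0

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; nlargest returns them sorted in descending order.
importances = pd.Series(model.feature_importances_, index=feature_names)
top20 = importances.nlargest(20)
print(top20)
```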

Step 2. Set up same-industry comparison

Here we map the industry code, the value of the dictionary industry_code stored in the data processing step, into a new column in merge. We then set this column, the security code and the date as the new index.
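A minimal sketch of the mapping, with a made-up dictionary and table (the column name industry is an assumption):

```python
import pandas as pd

# Made-up dictionary: stock code -> industry code.
industry_code = {"A": "IND_1", "B": "IND_1", "C": "IND_2"}

index = pd.MultiIndex.from_product(
    [["A", "B", "C"], pd.to_datetime(["2021-05-14"])], names=["coid", "mdate"]
)
merge = pd.DataFrame({"roe": [5.0, 6.0, 7.0]}, index=index)

# Look up each stock's industry, then rebuild the index with industry included.
merge = merge.reset_index()
merge["industry"] = merge["coid"].map(industry_code)
merge = merge.set_index(["industry", "coid", "mdate"])
print(merge)
```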

merge

Step 3. Calculate important-feature scores to select stocks

Then we group the data by date and industry, and within each group rank the values and convert them into percentiles by using rank(pct=True). Next, we sum the values horizontally and rank again. Finally, we choose the data above the 97th percentile of all data and backtest it. The following picture shows that the cumulative return of this strategy (blue line) is superior to the other two benchmarks.
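The scoring can be sketched as follows. The data is random, the three feature names are placeholders for the chosen positive features, and the second ranking is taken over all rows, which is one reading of the description above; ranking within each group again would be an alternative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Random stand-in: 200 stocks, 2 dates, 4 made-up industries.
dates = pd.to_datetime(["2021-05-14", "2021-08-13"])
stocks = [f"S{i:03d}" for i in range(200)]
index = pd.MultiIndex.from_product([stocks, dates], names=["coid", "mdate"])
merge = pd.DataFrame(
    rng.normal(size=(len(index), 3)), index=index, columns=["roe", "margin", "growth"]
)
merge["industry"] = [f"IND_{int(s[1:]) % 4}" for s in index.get_level_values("coid")]
merge = merge.set_index("industry", append=True)

positive_features = ["roe", "margin", "growth"]

# Percentile rank of each feature within its (date, industry) group.
pct = merge[positive_features].groupby(level=["mdate", "industry"]).rank(pct=True)

# Sum the percentiles across features, then rank the total score over all rows.
score = pct.sum(axis=1)
score_pct = score.rank(pct=True)

# Keep the rows above the 97th percentile.
selected = merge[score_pct > 0.97]
print(len(selected), "rows selected out of", len(merge))
```

Ranking within industry first means a stock is compared with its peers, so industries with structurally high or low ratios do not dominate the selection.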

Stock Watchlist Based on 2021 Q3 Data

Choose the data whose date equals '2021-09-01', and again choose the best 3% of each industry. Then use the TWN/AIND database to look up the names of the companies.

firms

The cumulative return of the newly-formed portfolio

Conclusion

Because the TEJ API database has comprehensive, high-quality data, the data processing step is easier to handle. We only need to merge the data and split the dataset before we can start building the model. Even though the accuracy is only 54.88%, we can still extract valuable information from the fitted model. Readers can try adjusting parameters while training the model, picking different combinations of important features, using different databases, or considering industry trends as a second filter. Finally, build the portfolio with the best performance and see whether that performance persists.

The content of this webpage is not investment advice and does not constitute any offer, solicitation to offer, or recommendation of any investment product. It is for learning purposes only and does not take into account your individual needs, investment objectives and specific financial circumstances. Investment involves risk. Past performance is not indicative of future performance. Readers are requested to think independently and make their own investment decisions. The author is not responsible for any losses incurred by acting on this content.


