I. A turnkey Python code for creating a stock forecasting dataset

Sébastien Cararo · Published in Analytics Vidhya · Dec 22, 2020 · 8 min read

Creating a ready-to-train dataset from basic historical information.

This article is part of a series of articles that describe the step-by-step process to create, backtest, and deploy a trading bot. This first article focuses on computing predictive variables, while the second focuses on building the Deep Learning models, the third on identifying profitable trading strategies with the models, and the last one on the real-life implementation in a program (deployment). Links are given at the end of this article.

The pipeline presented below can be applied to any sort of financial market (stock market, crypto, electricity, etc.). The trading bot that we build only concerns bullish moves, i.e. we only bet on increasing trends. I decided not to study bearish trading because short-side mechanics differ significantly across exchanges.

All you’ll need to provide as input:

  • A historical dataset containing the following variables for each candle: ‘close’, ‘date’, ‘high’, ‘low’, ‘open’, ‘volume’
  • A stoploss and a takeprofit value
  • Two splitting dates to divide your dataset into three distinct parts

What you’ll have after applying the pipeline:

  • A set of Deep Learning models, along with their performance over the validation set
  • A simple trading method: follow the best model’s instructions
  • A more advanced trading method: follow the best model in the most profitable cases only
  • An operating trading bot code for Poloniex (can be adapted to any platform)

I tried to make this series as intelligible as possible for non-financial readers: apart from a few technical points, the articles are well illustrated and the underlying notions are usually easy to understand. This first article has the following outline:

0 — Import libraries and constants

I — Define standard predictive variables

II — Build the function to compute all variables

III — Compute the output

IV — Apply PCA and save results

V — Conclusion

0 — Import libraries and constants

The very first step consists of initializing and inputting the required data. After importing the libraries, the user should provide a historical dataset containing the date and five basic pieces of information for each candle: the opening and closing prices, the minimum and maximum values within the candle, and the volume exchanged. The whole pipeline will be built upon these features.

The user also needs to set the values of the stoploss and the takeprofit, plus the dates at which the dataset should be split into a training set, a validation set and a testing set. The training set is used to train the Deep Learning models, while the validation set is used to evaluate the accuracy of the different models. Finally, the test set is used to check that our final trading strategy actually works on totally new data.

In my example, I used a cryptocurrency dataset that I downloaded from the Poloniex API: it consists of 7200-second (2-hour) candles for the BTC_USDT pair, from 20/02/2015 to 21/12/2020. The dataset I used, along with the whole code, is available in a dedicated GitHub repository [2].

Figure 1: Overview of my historical dataset (relevant variables in green)

For this example, I set a training period of 45 months, a validation period of 12 months and a testing period of 12 months (roughly 65%/17.5%/17.5%). I also chose stoploss = 5% and takeprofit = 10%, and my splitting yields the following sets:

Figure 2: Splitting between train set, validation set and test set.

The initialization code is written below; the user should just change the two file-path lines to fit their own setup:

Figure 3: Library imports and initialization
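Since the original snippet is only available as a screenshot, here is a minimal sketch of what the initialization could look like. The file paths, constant names and exact split dates are placeholders to adapt to your own setup (the dates below simply approximate the 45/12/12-month split described above):

```python
# Minimal initialization sketch (illustrative, not the author's exact script).
import numpy as np
import pandas as pd

DATA_PATH = "data/BTC_USDT_7200.csv"   # historical candles: close, date, high, low, open, volume
OUTPUT_DIR = "data/"                   # where the processed datasets will be saved

STOPLOSS = 0.05      # 5%: a position is considered lost below this threshold
TAKEPROFIT = 0.10    # 10%: a position is considered won above this threshold

SPLIT_DATE_1 = "2018-12-21"   # end of the training period (~45 months) -- illustrative
SPLIT_DATE_2 = "2019-12-21"   # end of the validation period (~12 months) -- illustrative

# Load the candles and make sure they are sorted chronologically
df = pd.read_csv(DATA_PATH, parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)
```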

I — Define standard predictive variables

The next step is to define the variables which we want to compute for forecasting the tendencies. I chose to compute only a couple of these variables in my example; however, we will see that this is sufficient to build accurate models and generate profits. In the next step, these variables will be computed over various window sizes in order to account for both short-term and long-term movements.

The code for computing Simple Moving Averages [3] and the RSI [4] is given below:

Figure 4: SMA and RSI definition
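As the screenshot is not reproduced here, below is one possible implementation of the two indicators using pandas rolling windows; the function names (sma, rsi) and the default 14-candle RSI window are my own choices, not necessarily the author’s:

```python
import pandas as pd

def sma(close: pd.Series, window: int) -> pd.Series:
    """Simple Moving Average of the closing price over `window` candles."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index: 100 - 100 / (1 + average gain / average loss)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()     # average upward move
    loss = (-delta.clip(upper=0)).rolling(window).mean()  # average downward move (positive)
    rs = gain / loss
    return 100 - 100 / (1 + rs)
```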

II — Build the function to compute all variables

Once we have defined our basic predictive variables, it is now time to compute them over the dataset. We will also compute some usual variables such as the min/max over rolling windows, the moving average of the volume, modulos, etc. You can choose which variables you want to add or remove in your own trading bot. Usual forecasting variables include metrics such as Bollinger Bands [5], MACD [6] or even support/resistance trend lines [7].

The code for computing the variables on a given dataset is given below:

Figure 5: Computing the variables
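As a sketch of what such a function could look like, the snippet below applies the two indicators plus rolling min/max, a volume moving average and simple calendar (“modulo”) features over a handful of window sizes; the exact list of windows and extra features is an assumption, not the author’s exact choice:

```python
# Feature-building sketch: indicators computed over several window sizes
# to capture both short-term and long-term movements.
WINDOWS = [5, 10, 20, 50, 100]

def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for w in WINDOWS:
        out[f"sma_{w}"] = sma(out["close"], w)
        out[f"rsi_{w}"] = rsi(out["close"], w)
        out[f"roll_min_{w}"] = out["low"].rolling(w).min()
        out[f"roll_max_{w}"] = out["high"].rolling(w).max()
        out[f"vol_sma_{w}"] = out["volume"].rolling(w).mean()
    # Calendar ("modulo") features derived from the date column
    out["hour"] = out["date"].dt.hour
    out["dayofweek"] = out["date"].dt.dayofweek
    # Drop the warm-up rows where the largest rolling window is not yet defined
    return out.dropna().reset_index(drop=True)

df = compute_features(df)
```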

III — Compute the output

Now that we have defined our predictive variables X, it is time to define the output of our dataset, that is the result Y that we seek to predict. As trading bots operate in a binary way (either we buy or we sell the asset), I decided to encode my output in a binary way. Concretely, given my stoploss and my takeprofit, I defined Y as follows:

Figure 6: Computing Y

The observations where Y = -1 are then removed (these are only the most recent timesteps), because we wouldn’t be able to exploit them anyway.

To compute Y, for each timestep we scan forward through the dataset from that point until the price reaches either the stoploss or the takeprofit. As we only have a limited set of information (close, open, min, max, volume), I assume that within a single candle it is not possible to reach both the stoploss and the takeprofit. This assumption is realistic given the candle length and the stoploss/takeprofit values; however, it is worth keeping in mind.

The code for computing the output Y is given below:

Figure 7: Computing the output Y
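Below is one possible implementation of this labelling rule; it scans forward from each candle and relies on the single-candle assumption stated above (the function name compute_output is mine, not the author’s):

```python
# Y = 1 if the takeprofit level is reached first, 0 if the stoploss is reached
# first, -1 if neither is reached before the end of the dataset.
def compute_output(df: pd.DataFrame, stoploss: float, takeprofit: float) -> pd.Series:
    close, high, low = df["close"].values, df["high"].values, df["low"].values
    n = len(df)
    result = np.full(n, -1, dtype=int)
    for i in range(n):
        tp_level = close[i] * (1 + takeprofit)
        sl_level = close[i] * (1 - stoploss)
        for j in range(i + 1, n):
            # Assumption from the text: a single candle never hits both levels
            if high[j] >= tp_level:
                result[i] = 1
                break
            if low[j] <= sl_level:
                result[i] = 0
                break
    return pd.Series(result, index=df.index, name="result")

df["result"] = compute_output(df, STOPLOSS, TAKEPROFIT)
df = df[df["result"] != -1].reset_index(drop=True)  # drop the most recent, unlabelled rows
```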

IV — Apply PCA and save results

Once we have computed X and Y, the final step in building a ready-to-train financial dataset is to apply Principal Components Analysis (PCA, [8]). This technique is very useful for reducing redundancy in the dataset before training DL models, especially in the case of highly correlated predictive variables.

Indeed, many of the variables that we computed are highly correlated, mainly because they are all derived from the same price signal. Such correlation can be a problem for Deep Learning forecasting: besides increasing training times, the redundancy it introduces often results in poor forecasting performance.

Figure 8: Correlation matrix of the training set (Pearson)

To deal with this, we apply PCA to create a new set of orthogonal components that contain exactly the same information overall. This will be useful in the training process, because we will be able to compare models built with different sets of Principal Components (PCs) via cross-validation, in order to pick the very best model.

We scale the variables both before and after applying the PCA process. This speeds up both training and forecasting, which is useful when you need the trading bot to take decisions as quickly as possible.

NB: We take care to fit the scalers and the PCA on the training set only; otherwise we would introduce causality (look-ahead) issues in the next sections, i.e. we would be predicting past observations using future information.

We also take care to remove the output (‘result’) and ‘date’ variables before fitting the scalers and the PCA. For illustration purposes, we can display what the datasets look like before and after applying this section (scaler n° 1 + PCA + scaler n° 2):

Figure 9: Overview of the training set variables before applying this block
Figure 10: Overview of the trainset_final variables (we can observe that the variables are well scaled, with mean 0 and std 1)

The code to scale, apply PCA, and then scale the variables again is written below:

Figure 11: Scaling, applying PCA, then scaling the variables
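As the original code is only shown as a screenshot, here is a scikit-learn sketch of the scaler → PCA → scaler block; the split uses the two dates defined earlier, everything is fitted on the training set only, and the variable and output file names are illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Split the dataset according to the two dates chosen earlier
train = df[df["date"] < SPLIT_DATE_1]
valid = df[(df["date"] >= SPLIT_DATE_1) & (df["date"] < SPLIT_DATE_2)]
test = df[df["date"] >= SPLIT_DATE_2]

# Remove the output ('result') and 'date' columns before fitting
feature_cols = [c for c in df.columns if c not in ("date", "result")]

# Fit scaler 1, PCA and scaler 2 on the training set only (no look-ahead)
scaler_1 = StandardScaler().fit(train[feature_cols])
pca = PCA().fit(scaler_1.transform(train[feature_cols]))
scaler_2 = StandardScaler().fit(pca.transform(scaler_1.transform(train[feature_cols])))

def to_final(dataset: pd.DataFrame) -> pd.DataFrame:
    """Apply scaler 1 -> PCA -> scaler 2 and reattach the output and date columns."""
    pcs = scaler_2.transform(pca.transform(scaler_1.transform(dataset[feature_cols])))
    out = pd.DataFrame(pcs, columns=[f"PC{i+1}" for i in range(pcs.shape[1])],
                       index=dataset.index)
    out["result"] = dataset["result"].values
    out["date"] = dataset["date"].values
    return out

trainset_final = to_final(train)
validset_final = to_final(valid)
testset_final = to_final(test)

# Save the ready-to-train datasets
trainset_final.to_csv(OUTPUT_DIR + "trainset_final.csv", index=False)
validset_final.to_csv(OUTPUT_DIR + "validset_final.csv", index=False)
testset_final.to_csv(OUTPUT_DIR + "testset_final.csv", index=False)
```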

V — Conclusion

First, recall that the full code is available at [2]. As a conclusion, we can be satisfied with having transformed our basic historical dataset into a dataset that is ready to train on. To apply this pipeline, you just need to copy/paste the code and adapt the input to your needs.

What we have created over the course of this article :

  • Train set, validation set, test set with X and Y
  • 2 scalers and a PCA function
  • The 3 datasets after applying the scalers and the PCA: these are ready-to-train (ready-to-predict) datasets

The next article will explain how we can use these datasets to build and train several Deep Learning models. The final goal will be to compare the models, pick the best one and create profitable and robust trading strategies (article III). We will conclude this series by deploying the bot in real-life conditions (article IV).

Don’t hesitate to leave any feedback/questions/claps or to contact me for more information.

Other articles of the series:

Contact : sebcararo@hotmail.fr

Another article (building a sports-betting algorithm with Machine Learning): How Covid-19 prevented me from being a millionaire in 2020 | by Sébastien Cararo | Analytics Vidhya | Dec 2020 | Medium

Sources:

[1] Full paper

https://seb943.github.io/Data/Paper_CreatingATradingBot.pdf

[2] GitHub repository

https://github.com/Seb943/TBpolo

[3] Simple Moving Averages

https://www.wallstreetmojo.com/moving-average-formula/

[4] RSI (Relative Strength Index)

https://www.investopedia.com/terms/r/rsi.asp

[5] Bollinger bands

https://www.bollingerbands.com/bollinger-bands

[6] MACD (Moving Average Convergence Divergence)

https://www.investopedia.com/terms/m/macd.asp

[7] Support/resistance trendlines

https://trading.info/support-resistance

[8] PCA (Principal Components Analysis — EigenVectors decomposition)

https://blog.clairvoyantsoft.com/eigen-decomposition-and-pca-c50f4ca15501

Software: Python 3.8.5 on Pyzo IDE


