I. A turnkey Python pipeline for creating a stock forecasting dataset
Creating a ready-to-train dataset from basic historical information.
This article is the first in a series describing the step-by-step process of creating, backtesting, and deploying a trading bot. It focuses on computing predictive variables, while the second article focuses on building the Deep Learning models, the third on identifying profitable trading strategies with the models, and the last one on the real-life implementation (deployment). Links are given at the end of this article.
The pipeline presented below can be applied to any sort of financial market (stock market, crypto, electricity, etc.). The trading bot that we build only concerns bullish moves, i.e. we only bet on rising trends. I decided not to study bearish trading because short-selling mechanics differ widely across exchanges.
All you’ll need to provide:
- A historical dataset containing the following variables for each candle: ‘close’, ‘date’, ‘high’, ‘low’, ‘open’, ‘volume’
- A stoploss and a takeprofit value
- Two splitting dates to divide your dataset into three distinct parts
What you’ll have after applying the pipeline:
- A set of Deep Learning models, along with their performance over the validation set
- A simple trading method: follow the best model’s instructions
- A more advanced trading method: follow the best model in the most profitable cases only
- An operating trading bot code for Poloniex (can be adapted to any platform)
I tried to make this series as accessible as possible to non-financial readers: apart from a few technical points, the articles are heavily illustrated and the underlying notions are usually easy to understand. This first article has the following outline:
0 — Import libraries and constants
I — Define standard predictive variables
II — Build the function to compute all variables
III — Compute the output
IV — Apply PCA and save results
V — Conclusion
0 — Import libraries and constants
The very first step consists of initializing and inputting the required data. After importing the libraries, the user should provide a historical dataset containing the date and five basic pieces of information for each candle: the opening and closing prices, the minimum and maximum prices within the candle, and the volume exchanged. The whole pipeline will be built upon these features.
Also, the user just needs to set the values of the stoploss and the takeprofit, plus the dates at which the dataset should be split into training, validation and testing sets. The training set is used to train the Deep Learning models, while the validation set is used to evaluate the accuracy of the different models. Finally, the test set is used to check that our final trading strategy actually works on completely new data.
In my example, I used a cryptocurrency dataset that I downloaded from the Poloniex API: it consists of 7200-second (2-hour) candles for the pair BTC_USDT, from 20/02/2015 to 21/12/2020. The dataset I used, along with the whole code, is available in a dedicated GitHub repository [2].
For this example, I set a training period of 45 months, a validation period of 12 months and a testing period of 12 months (roughly 65% / 17.5% / 17.5%). I also chose stoploss = 5% and takeprofit = 10%; these choices yield the three sets used throughout the series.
The initialization code is written below; you should just change the file paths to match your own setup:
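A minimal sketch of what this initialization can look like (the variable names, split dates and CSV path are illustrative assumptions; the exact code is in the repository [2]):

```python
import pandas as pd

# Trading parameters
STOPLOSS = 0.05    # exit a losing position after a 5% drop
TAKEPROFIT = 0.10  # exit a winning position after a 10% rise

# Splitting dates (illustrative): end of training set, end of validation set
SPLIT_DATE_1 = "2018-11-20"
SPLIT_DATE_2 = "2019-11-20"

# Historical candles with 'close', 'date', 'high', 'low', 'open', 'volume'
DATA_PATH = "data/BTC_USDT_7200.csv"  # change this to your own file path
df = pd.read_csv(DATA_PATH)
df["date"] = pd.to_datetime(df["date"])
```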
I — Define standard predictive variables
The next step is to define the variables that we want to compute for forecasting tendencies. I chose to compute only a handful of these variables in my example; however, we will see that this is sufficient to build accurate models and generate profits. In the next step, these variables will be computed over various window sizes in order to account for both short-term and long-term movements.
The code for computing Simple Moving Averages [3] and the RSI [4] is given below:
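A minimal sketch of these two indicators (the function names are assumptions, and this RSI uses a plain rolling mean rather than Wilder’s smoothing; see the repository [2] for the exact implementation):

```python
import pandas as pd

def sma(close: pd.Series, window: int) -> pd.Series:
    # Simple Moving Average: mean of the last `window` closing prices
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    # RSI = 100 - 100 / (1 + average gain / average loss) over `window` candles
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + avg_gain / avg_loss)
```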
II — Build the function to compute all variables
Once we have defined our basic predictive variables, it is now time to compute them over the dataset. We will also compute some usual variables such as the min/max over rolling windows, the moving average of the volume, modulos, etc. You can choose which variables to add or remove in your own trading bot. Usual forecasting variables include metrics such as Bollinger Bands [5], MACD [6] or even support/resistance trend lines [7].
The code for computing the variables on a given dataset is given below:
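A sketch of such a feature-computation function, reusing the sma and rsi helpers above (the window sizes and feature names are assumptions):

```python
import pandas as pd

def compute_features(df: pd.DataFrame,
                     windows=(5, 10, 20, 50, 100)) -> pd.DataFrame:
    # Compute each indicator over several window sizes to capture
    # both short-term and long-term movements
    out = df.copy()
    for w in windows:
        out[f"sma_{w}"] = sma(out["close"], w)
        out[f"rsi_{w}"] = rsi(out["close"], w)
        out[f"min_{w}"] = out["low"].rolling(w).min()
        out[f"max_{w}"] = out["high"].rolling(w).max()
        out[f"vol_sma_{w}"] = out["volume"].rolling(w).mean()
    # Simple "modulo" features derived from the timestamp
    out["hour"] = out["date"].dt.hour
    out["dayofweek"] = out["date"].dt.dayofweek
    # The longest window leaves NaNs at the start of the dataset
    return out.dropna().reset_index(drop=True)
```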
III — Compute the output
Now that we have defined our predictive variables X, it is time to define the output of our dataset, that is the result Y that we seek to predict. As trading bots operate in a binary way (either we buy or we sell the asset), I decided to encode the output in a binary way as well. Concretely, given my stoploss and my takeprofit, Y = 1 if the price reaches the takeprofit before the stoploss, Y = 0 if it reaches the stoploss first, and Y = -1 if neither level is reached before the end of the dataset.
The observations where Y = -1 (these are only the most recent timesteps) are then removed, because we wouldn’t be able to exploit them anyway.
To compute Y, for each timestep we scan forward through the dataset until the price reaches either the stoploss or the takeprofit. As we only have a limited set of information per candle (open, close, min, max, volume), I assume that within a single candle it is not possible to reach both the stoploss and the takeprofit. This assumption is realistic given the candle length and the stoploss/takeprofit values, but it is worth keeping in mind.
The code for computing the output Y is the following:
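A straightforward (unoptimized) sketch of this labelling loop; the function and column names are assumptions:

```python
import numpy as np
import pandas as pd

def compute_output(df: pd.DataFrame,
                   stoploss: float = 0.05,
                   takeprofit: float = 0.10) -> pd.Series:
    close, high, low = df["close"].values, df["high"].values, df["low"].values
    n = len(df)
    y = np.full(n, -1)  # -1 = neither level reached before the end of the data
    for i in range(n):
        up = close[i] * (1 + takeprofit)
        down = close[i] * (1 - stoploss)
        for j in range(i + 1, n):
            # By assumption, a single candle never hits both levels,
            # so the order of these two checks does not matter
            if high[j] >= up:    # takeprofit reached first -> winning trade
                y[i] = 1
                break
            if low[j] <= down:   # stoploss reached first -> losing trade
                y[i] = 0
                break
    return pd.Series(y, index=df.index, name="result")
```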
IV — Apply PCA and save results
Once we have computed X and Y, the final step in building a ready-to-train financial dataset is to apply Principal Components Analysis (PCA — [8]). This technique is very useful for removing redundancy from the dataset before training DL models, especially in the case of highly correlated predictive variables.
Indeed, many of the variables that we computed are highly correlated, mainly because they derive from the same underlying signal. Such correlation can be a problem for Deep Learning forecasting: besides increasing training times, it adds redundancy that often results in poor forecasting performance.
To deal with this, we apply PCA to create a new set of orthogonal components that contain exactly the same information overall. This will be useful in the training process, because we will be able to compare models trained on different sets of Principal Components (PCs) via cross-validation, in order to pick the very best model.
We scale the variables both before and after applying PCA. This yields a gain in both training and forecasting times, which is useful when you need the trading bot to take decisions as quickly as possible.
NB: We take care to fit the scalers and the PCA on the training set only, otherwise we would create causality issues in the next sections (i.e. we would be predicting past observations while holding future information).
We also take care to remove the output (‘result’) and ‘date’ variables before fitting the scalers and the PCA. For illustration purposes, you can display the datasets before and after applying this section (scaler n°1 + PCA + scaler n°2).
The code to scale, apply PCA, and re-scale the variables is written below:
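A sketch of this scale → PCA → scale step, fitted on the training set only (the function name and the choice of StandardScaler with n_components=0.99 are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def scale_pca_scale(train, valid, test, n_components=0.99):
    aside = ["date", "result"]  # kept out of the scalers and the PCA
    x_cols = [c for c in train.columns if c not in aside]

    # Fit everything on the training set only, to avoid look-ahead bias
    scaler1 = StandardScaler().fit(train[x_cols])
    pca = PCA(n_components=n_components)  # keep 99% of the variance
    pca.fit(scaler1.transform(train[x_cols]))
    scaler2 = StandardScaler().fit(pca.transform(scaler1.transform(train[x_cols])))

    def transform(df):
        pcs = scaler2.transform(pca.transform(scaler1.transform(df[x_cols])))
        cols = [f"PC{i + 1}" for i in range(pcs.shape[1])]
        pcs = pd.DataFrame(pcs, columns=cols, index=df.index)
        return pd.concat([df[aside], pcs], axis=1)

    return transform(train), transform(valid), transform(test)
```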
V — Conclusion
First, recall that the full code is available at [2]. In conclusion, we can be satisfied with having transformed our basic historical dataset into a dataset that is ready to train on. To apply this pipeline, you just need to copy/paste the code and modify the input according to your needs.
What we have created over the course of this article:
- Train set, validation set, test set with X and Y
- 2 scalers and a PCA function
- The 3 datasets after applying the scalers and the PCA: these datasets are ready-to-train (ready-to-predict) datasets
The next article will explain how we can use these datasets to build and train several Deep Learning models. The final goal will be to compare the models, pick the best one and create profitable and robust trading strategies (article III). We will conclude this series by deploying the bot in real-life conditions (article IV).
Don’t hesitate to leave any feedback/questions/claps or to contact me for more information.
Other articles of the series:
- II. Forecasting crypto tendencies with Deep Learning in Python | by Sébastien Cararo | Analytics Vidhya | Dec, 2020 | Medium
- III. Creating profitable trading strategies | by Sébastien Cararo | Analytics Vidhya | Dec, 2020 | Medium
- IV. Deploy a Poloniex trading bot | by Sébastien Cararo | Analytics Vidhya | Dec, 2020 | Medium
Contact : sebcararo@hotmail.fr
Another article (building a sports-betting algorithm with Machine Learning): How Covid-19 prevented me from being a millionnaire in 2020 | by Sébastien Cararo | Analytics Vidhya | Dec, 2020 | Medium
Sources :
[1] Full paper
https://seb943.github.io/Data/Paper_CreatingATradingBot.pdf
[2] GitHub repository
https://github.com/Seb943/TBpolo
[3] Simple Moving Averages
https://www.wallstreetmojo.com/moving-average-formula/
[4] RSI (Relative Strength Index)
https://www.investopedia.com/terms/r/rsi.asp
[5] Bollinger bands
https://www.bollingerbands.com/bollinger-bands
[6] MACD (Moving Average Convergence Divergence)
https://www.investopedia.com/terms/m/macd.asp
[7] Support/resistance trendlines
https://trading.info/support-resistance
[8] PCA (Principal Components Analysis — EigenVectors decomposition)
https://blog.clairvoyantsoft.com/eigen-decomposition-and-pca-c50f4ca15501
Software: Python 3.8.5 on Pyzo IDE