Demonstrating TSFRESH by predicting the price of a crypto asset.
With this project, I demonstrate how the tsfresh library can be used to build a regression model on market data. The aim is to predict the value of the next data point in a given timeseries.
As a newcomer to the field of data science, I was inspired by this great presentation by author Nils Braun. It shows how feature extraction from timeseries can be done fast, accurately and automatically with this library.
Please note that this library was built to automate feature extraction and has no built-in prediction functionality. For that we will need a machine learning model to which we feed the extracted features from the timeseries data.
Rolling window
It is my understanding that tsfresh can be applied for predictions thanks to its built-in functionality to create a rolling timeseries window. For each rolling window, a target price can be predicted for the next data point using a machine learning model, for instance regression.
The script will select the best model from all available regressors in the scikit-learn library and use this to fit on the training data.
You will see that this technique of feature extraction leads to impressive results on timeseries data as unpredictable as market prices.
Method
The project will predict prices for all data points in the current month, with a regression model that is trained on a number of days prior to this month.
We will see how increasing the size of the training data results in better predictions.
Content
1. Setting variables
2. Import libraries and estimators
3. Loading data
4. Feature extraction
5. Selecting the best regression method
6. Train the model with the best regression model
7. Plot the graph
8. Evaluate the performance
Preparation
# 1. Setting Variables
For this project I picked a random ticker from Binance (Quant, QNT/USDT) and chose a 4-hour interval for each data point.
To speed things up, I can choose to not calculate the best regressor, and use the last one that was saved in a .csv file from previous runs. Alternatively I can declare a regression method myself.
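As a rough sketch, the settings described above could be collected at the top of the script as follows. Only min_window_size, max_window_size and init are names that appear elsewhere in this article; ticker, interval, train_days and regressor_name are hypothetical names chosen for illustration, and the meaning of init == 0 is an assumption.

```python
# Hypothetical settings block. Names other than min_window_size,
# max_window_size and init are assumptions made for this sketch.
ticker = "QNTUSDT"        # Quant against USDT (assumed symbol format)
interval = "4h"           # one data point every 4 hours
min_window_size = 5       # lower bound passed to roll_time_series
max_window_size = 60      # upper bound passed to roll_time_series
train_days = 30           # days of history before the current month
init = 1                  # 1: evaluate all regressors; 0 (assumed): reuse a saved choice
regressor_name = None     # optionally name a regressor yourself, e.g. "ARDRegression"

# With a 4-hour interval every day contributes 6 data points.
points_per_day = 24 // 4
train_points = train_days * points_per_day
print(train_points)
```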
# 2. Import libraries and estimators
To choose the best regression method, we import all available regressors from the scikit-learn library so that we can calculate their performance on the training data. We see that there are currently 203 regressors available.
# 3. Import data from Binance
A function will load the data for our ticker according to the selected interval from Binance. For this project I will only use the “close” price.
We load the data into a pandas dataframe and prepare it for use in tsfresh.
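To illustrate the shape tsfresh expects, here is a small sketch that uses synthetic prices instead of a real Binance download. The column names Symbols, timestamp and close match the ones used later in this article; the wide-to-long melt step is an assumption inferred from the name df_melted, and the prices are random numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded candles: 4-hour close prices.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2023-01-01", periods=12, freq=pd.Timedelta(hours=4))
closes = 100 + rng.normal(0, 1, size=len(timestamps)).cumsum()

# Wide frame: one column per ticker, indexed by timestamp.
df = pd.DataFrame({"QNTUSDT": closes}, index=timestamps)
df.index.name = "timestamp"

# Long ("melted") frame: one row per (timestamp, symbol) pair.
df_melted = df.reset_index().melt(
    id_vars="timestamp", var_name="Symbols", value_name="close"
)
print(df_melted.head())
```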
TSFRESH
# 4. Feature extraction
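We roll a window over the timeseries data to obtain sub-series whose length is bounded by ‘min_window_size’ and ‘max_window_size’. This can be easily done with the built-in roll_time_series function.
I have chosen ‘5’ and ‘60’ respectively. Each window in our ‘df_rolled’ dataframe gets its own id and holds the most recent data points up to a given timestamp. Note that tsfresh counts time shifts rather than points: as I read the documentation, windows with ‘min_timeshift’ or fewer points are discarded and a window holds at most ‘max_timeshift’ + 1 points, so here each window contains between 6 and 61 data points.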
from tsfresh.utilities.dataframe_functions import roll_time_series

df_rolled = roll_time_series(df_melted,
                             column_id="Symbols",
                             column_sort="timestamp",
                             max_timeshift=max_window_size,
                             min_timeshift=min_window_size
                             )
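To make the structure of a rolled dataframe concrete, here is a small pandas-only sketch of the idea. It is not the tsfresh implementation: roll_windows, min_len and max_len are hypothetical names, and the two bounds here count data points directly, unlike the timeshift parameters of roll_time_series.

```python
import pandas as pd

def roll_windows(df, min_len, max_len):
    """Stack trailing windows into one frame, each tagged by its last timestamp."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    windows = []
    for end in range(len(df)):
        start = max(0, end - max_len + 1)
        window = df.iloc[start:end + 1].copy()
        if len(window) < min_len:
            continue  # skip windows that are too short
        # Tag every row of the window with the last timestamp it has seen.
        window["last_timestamp"] = df.loc[end, "timestamp"]
        windows.append(window)
    return pd.concat(windows, ignore_index=True)

toy = pd.DataFrame({
    "Symbols": ["QNTUSDT"] * 5,
    "timestamp": pd.date_range("2023-01-01", periods=5, freq=pd.Timedelta(hours=4)),
    "close": [100.0, 101.0, 99.0, 102.0, 103.0],
})
rolled = roll_windows(toy, min_len=2, max_len=3)

# Number of data points in each window, in chronological order.
window_sizes = rolled.groupby("last_timestamp").size().tolist()
print(window_sizes)
```

Every data point shows up in several windows, which is why the rolled dataframe is much longer than the original timeseries.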
Now that we have created a rolling window dataframe we can extract the features of each of the windows with tsfresh’s extract_features function.
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

X = extract_features(df_rolled.drop("Symbols", axis=1),
                     column_id="id", column_sort="timestamp", column_value="close",
                     impute_function=impute, show_warnings=False)
The resulting dataframe will have as many rows as there are rolling windows, and its columns hold the extracted features for each of these windows.
Note: I have not used the extract_relevant_features method, which would keep only the most relevant features. Therefore every row has all the features that tsfresh calculates, 783 in total.
Now here comes the magic: each rolling window has a last data point that it has seen, so we index each row by the timestamp of that last data point.
X = X.set_index(X.index.map(lambda x: x[1]), drop=True)
X.index.name = "last_timestamp"
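The effect of these two lines can be seen on a toy feature table. The sketch below assumes the feature rows are indexed by (series id, last timestamp) tuples, which is what the x[1] lookup above implies; the feature values are made up.

```python
import pandas as pd

# Feature rows indexed by (series id, last timestamp of the window).
index = pd.MultiIndex.from_tuples([
    ("QNTUSDT", pd.Timestamp("2023-01-01 04:00")),
    ("QNTUSDT", pd.Timestamp("2023-01-01 08:00")),
])
X_toy = pd.DataFrame({"close__mean": [100.5, 100.0]}, index=index)

# Keep only the second element of each tuple: the last timestamp.
X_toy = X_toy.set_index(X_toy.index.map(lambda x: x[1]), drop=True)
X_toy.index.name = "last_timestamp"
print(X_toy)
```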
Note: If we use only 1 data point per day, this would effectively mean “last_date”. Because we are using a 4 hour interval in this project, the name of the index is “last_timestamp”.
Now that we have indexed our new timeseries correctly we can link it with the index of the dependent variables.
Setting the target
We are doing supervised learning here, so we need to declare the target if we want to train our model.
To create a target for any given data point in our rolling window timeseries, we simply shift the values in the actual prices column of the original timeseries by 1 data point.
This will set the target value to be predicted for each data point of our rolling window timeseries.
Note: If we want to train our model to predict the price for a data point further ahead in time, we could simply shift by a larger number of data points.
y = df_melted.set_index("timestamp").sort_index().close.shift(-1)
Now each row in our rolling window timeseries X will have the true price of a next data point as target in our target dataframe y.
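A tiny example shows what the shift does. The variable horizon is a hypothetical name for the number of data points to look ahead; the prices are made up.

```python
import pandas as pd

prices = pd.Series(
    [100.0, 101.0, 99.0, 102.0],
    index=pd.date_range("2023-01-01", periods=4, freq=pd.Timedelta(hours=4)),
    name="close",
)

horizon = 1  # hypothetical name: how many data points ahead we want to predict
y_toy = prices.shift(-horizon)

# Each row now pairs the current close with the close that follows it.
print(pd.DataFrame({"close": prices, "target": y_toy}))
```

The last row has no target (NaN) because its next price is not known yet. That row is the one we will eventually want a forecast for.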
Reducing the shape
To make model training work, both dataframes must have the same length, i.e. the same number of rows. We ensure this by keeping only the rows whose index appears in both:
y = y[y.index.isin(X.index)]
X = X[X.index.isin(y.index)]
We now have a functional X and y dataframe to start the training!
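The alignment step can be checked on toy data. In this sketch the first two timestamps have no feature row, as would happen when their windows are too short; all values are made up.

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=6, freq=pd.Timedelta(hours=4))
y_toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Pretend the first two timestamps produced no feature row.
X_toy = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4]}, index=idx[2:])

# Keep only the timestamps that exist on both sides.
y_toy = y_toy[y_toy.index.isin(X_toy.index)]
X_toy = X_toy[X_toy.index.isin(y_toy.index)]
print(X_toy.shape, y_toy.shape)
```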
Regression
# 5. Selection of the best regression method
With two functions we run over all the sklearn regression models and evaluate their metrics on the training data.
To speed things up we can add the worst performing models to a list and save them to a file. On the next run these are excluded from the evaluation.
When init==1, our script will calculate the metrics for each of the 203 regression models available through the scikit-learn library.
To be in line with the scope of this project, the regression model which produces the smallest mean absolute error (MAE) is chosen.
We can see that ARDRegression shows promising metrics: a mean absolute error of $1.64 and a root mean squared error of $2.14 on the training data.
After calculating the metrics for all available regression estimators, ARDRegression turns out to perform best. We will therefore use it as the model to predict the price in our test data.
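The selection loop can be sketched as follows. This is a simplified illustration rather than the project's code: it tries three hand-picked regressors instead of every one scikit-learn offers, and it uses synthetic data in place of the tsfresh features.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic training data standing in for the extracted features.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))
y_train = X_train @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# A small, hand-picked candidate list (the article evaluates many more).
candidates = {
    "ARDRegression": ARDRegression(),
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
}

scores = {}
for name, model in candidates.items():
    try:
        model.fit(X_train, y_train)
        pred = model.predict(X_train)
    except Exception:
        continue  # skip estimators that cannot handle this data
    mae = mean_absolute_error(y_train, pred)
    rmse = np.sqrt(mean_squared_error(y_train, pred))
    scores[name] = (mae, rmse)

# Pick the regressor with the smallest MAE on the training data.
best_name = min(scores, key=lambda n: scores[n][0])
print(best_name, scores[best_name])
```

Scoring on the training data, as done here, tends to favour models that fit it closely; scoring on held-out data would be a stricter comparison.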
Training
# 6. Train the model with the best regression method
We train our model on the training dataset using the best regressor. Then we use this model to predict the prices on the test data.
Remember that our script will test the model, or predict the price, only for the current month. The last value will be the actual price prediction for the next data point. In our project, this will be the next 4 hour close price.
As seen earlier, we could also train this model to predict the price of later data points by shifting the target values by more than 1.
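The split into training days and the current month can be sketched like this. It is a simplified illustration on synthetic data: train_days, the two toy features and the date range are assumptions, and ARDRegression is used because it was the regressor selected above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ARDRegression

# Synthetic 4-hour prices; the "current month" here is March 2023.
rng = np.random.default_rng(1)
idx = pd.date_range("2023-01-01", "2023-03-15 20:00", freq=pd.Timedelta(hours=4))
close = pd.Series(100 + rng.normal(size=len(idx)).cumsum(), index=idx)

# Two toy features standing in for the tsfresh feature matrix.
X = pd.DataFrame({"last_close": close,
                  "mean_3": close.rolling(3, min_periods=1).mean()})
y = close.shift(-1)  # target: the next close

train_days = 30  # hypothetical setting
month_start = idx[-1].normalize().replace(day=1)
train_start = month_start - pd.Timedelta(days=train_days)

train_mask = (X.index >= train_start) & (X.index < month_start)
test_mask = X.index >= month_start

model = ARDRegression()
model.fit(X[train_mask], y[train_mask])
y_pred = pd.Series(model.predict(X[test_mask]), index=X.index[test_mask])

# The last test row has no known target yet: its prediction is the forecast.
next_close_forecast = y_pred.iloc[-1]
print(next_close_forecast)
```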
Results
# 7. Plot the graph
For a first view we will show how the model performs if it has only been trained on 30 days prior to the current month.
The following graph plots the actual price in blue, versus the price predicted by the model in orange. We see that even our best regression method cannot predict prices very accurately.
However, if we change the size of the training data to 90 days, with each day holding values for 6 data points with a 4 hour interval, we can see that the prediction drastically improves.
The model performs better with more training data!
Now if we change the size of the training data from 30 to 60, 90, 180, 360 and 720 days respectively, we can see how much better the model starts performing.
This graph shows the evolution of the price prediction by increasing the size of the training data from 30 to 720 days.
Evaluation
# 8. Evaluation of the performance
To visually inspect the performance, the following graph plots the predicted prices for this month's test data against the actual prices.
On the top right we see how the metrics improve dramatically when using more training data.
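The two metrics used in this article can be computed with scikit-learn as shown in this minimal sketch. The arrays are made-up numbers, not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted prices for four data points.
actual = np.array([100.0, 102.0, 101.0, 105.0])
predicted = np.array([101.0, 101.0, 103.0, 104.0])

# MAE: average absolute difference; RMSE: square root of the mean squared difference.
mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

Both metrics are expressed in the same unit as the price, which makes them easy to read as an average error in dollars. RMSE penalises large misses more heavily than MAE.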
Conclusion
In this article, I have shown that tsfresh can be a powerful tool to extract features for a regression model, even on data such as market prices that should be very hard for any regression method to predict.
This project has helped me to understand what the library is doing and how powerful it is. I could then apply it in a real use-case with satisfying results.
You can visit this project’s GitHub page here: https://github.com/Francode77/TSFRESH_price_prediction