Demonstrating TSFRESH by predicting the price of a crypto asset.

Feb 20, 2023

With this project, I demonstrate how the tsfresh library can be applied for building a regression model on market data. The aim is to predict the value of the next data point in a given timeseries.

As a newcomer to the field of data science, I was inspired by a great presentation by Nils Braun, one of the library's authors. It shows how feature extraction from timeseries can be done quickly, accurately and automatically with this library.

Please note that this library was built to automate feature extraction and has no built-in prediction functionality. For that we will need a machine learning model to which we feed the extracted features from the timeseries data.

Rolling window

It is my understanding that tsfresh can be applied for predictions thanks to its built-in functionality to create a rolling timeseries window. For each rolling window, a target price can be predicted for the next data point using a machine learning model, for instance regression.

The script selects the best-performing model from the regressors available in the scikit-learn library and fits it to the training data.

You will see that this technique of feature extraction leads to impressive results on timeseries data as unpredictable as market prices.

Method

The project will predict prices for all data points in the current month, with a regression model that is trained on a number of days prior to this month.

We will see how increasing the size of the training data results in better predictions.

Content

1. Setting variables

2. Import libraries and estimators

3. Loading data

4. Feature extraction

5. Selecting the best regression method

6. Train the model with the best regression method

7. Plot the graph

8. Evaluate the performance

Preparation

# 1. Setting Variables

For this project I picked a random ticker from Binance (Quant, QNT/USDT) and chose a 4-hour interval for each data point.

To speed things up, I can choose to not calculate the best regressor, and use the last one that was saved in a .csv file from previous runs. Alternatively I can declare a regression method myself.
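A minimal sketch of what such a settings block could look like. Only init, min_window_size and max_window_size are names that appear in this article; every other name and value below is a hypothetical placeholder for illustration, not the project's actual configuration.

```python
# Hypothetical settings block. Names other than init, min_window_size and
# max_window_size are illustrative assumptions.
ticker = "QNTUSDT"          # Quant against USDT (symbol spelling assumed)
interval_hours = 4          # one data point every 4 hours
training_days = 30          # days of history before the current month
min_window_size = 5         # passed to tsfresh as min_timeshift
max_window_size = 60        # passed to tsfresh as max_timeshift
init = 0                    # 1 = evaluate all regressors (0 = reuse saved choice, assumed)
manual_regressor = None     # e.g. "ARDRegression" to skip the selection step

# With a 4-hour interval there are six data points per day.
points_per_day = 24 // interval_hours
training_points = training_days * points_per_day
print(training_points)
```

Keeping the interval and the number of training days in one place makes it easy to rerun the whole script with a larger training set later on.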

# 2. Import libraries and estimators

To choose a best regression method, we will import all available regressors from the scikit-learn library so that we can calculate their performance on the training data. We see that there are currently 203 regressors available.

# 3. Import data from Binance

A function will load the data for our ticker according to the selected interval from Binance. For this project I will only use the “close” price.

We load the data into a pandas dataframe and prepare it for use in tsfresh.
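A minimal sketch of the shape the dataframe needs before rolling, with made-up close prices standing in for the Binance download. The column names timestamp, close and Symbols match the tsfresh calls used later in this article; the ticker string and the way the frame is built here are assumptions for illustration.

```python
import pandas as pd

# Synthetic close prices stand in for the Binance download (values are made up).
timestamps = pd.date_range("2023-01-01", periods=6, freq=pd.Timedelta(hours=4))
closes = [130.0, 131.5, 129.8, 132.2, 133.0, 131.1]

df = pd.DataFrame({"timestamp": timestamps, "close": closes})

# tsfresh needs an id column to tell series apart; here there is only one ticker.
df_melted = df.copy()
df_melted["Symbols"] = "QNTUSDT"

print(df_melted)
```

The result is a long-format table: one row per data point, with an id column, a sort column and a value column.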

TSFRESH

# 4. Feature extraction

We roll a window over the timeseries data with the built-in roll_time_series function. The window size is controlled by ‘min_window_size’ and ‘max_window_size’, which are passed as min_timeshift and max_timeshift.

I have chosen ‘5’ and ‘60’ respectively. Each window in our ‘df_rolled’ dataframe ends at one timestamp and reaches back at most 60 steps, while windows with 5 or fewer data points are discarded. In other words, a window holds between 6 and 61 data points from our timeseries.

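from tsfresh.utilities.dataframe_functions import roll_time_series

# Cut the series into trailing windows, one window per timestamp.
df_rolled = roll_time_series(
    df_melted,
    column_id="Symbols",
    column_sort="timestamp",
    max_timeshift=max_window_size,
    min_timeshift=min_window_size,
)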
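To make the rolling idea concrete, here is a conceptual illustration in plain Python. It mimics forward rolling as I understand tsfresh's behaviour; roll_windows is a hypothetical helper written for this article and is not part of tsfresh.

```python
def roll_windows(values, max_timeshift, min_timeshift):
    """Return the trailing windows ending at each position of `values`."""
    windows = []
    for end in range(1, len(values) + 1):
        # Reach back at most max_timeshift steps from the last point.
        start = max(end - max_timeshift - 1, 0)
        window = values[start:end]
        # Windows with min_timeshift or fewer points are discarded.
        if len(window) > min_timeshift:
            windows.append(window)
    return windows


prices = [10, 11, 12, 13, 14, 15]
windows = roll_windows(prices, max_timeshift=3, min_timeshift=1)
for w in windows:
    print(w)
```

Every window ends at a different data point, which is what later lets us pair each window with the price that follows it.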

Now that we have created a rolling window dataframe we can extract the features of each of the windows with tsfresh’s extract_features function.

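from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute
# One row of features per rolling window; impute replaces NaN and inf values.
X = extract_features(df_rolled.drop("Symbols", axis=1),
                     column_id="id", column_sort="timestamp", column_value="close",
                     impute_function=impute, show_warnings=False)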

The resulting dataframe will have as many rows as there are rolling windows, and its columns hold the extracted features for each of these windows.

Note: I have not used the extract_relevant_features function, which would keep only the most relevant features. Every row therefore has all the features that tsfresh calculates, 783 in total.

Now here comes the magic: each rolling window has a last data point that it has seen, so we set the index of each row to the timestamp of that last data point.

X = X.set_index(X.index.map(lambda x: x[1]), drop=True)
X.index.name = "last_timestamp"

Note: If we use only 1 data point per day, this would effectively mean “last_date”. Because we are using a 4 hour interval in this project, the name of the index is “last_timestamp”.

Now that we have indexed our new timeseries correctly we can link it with the index of the dependent variables.

Setting the target

We are doing supervised learning here, so we need to declare the target if we want to train our model.

To create a target for any given data point in our rolling window timeseries, we simply shift the values in the actual prices column of the original timeseries by 1 data point.

This will set the target value to be predicted for each data point of our rolling window timeseries.

Note: If we want to train our model to predict the price further ahead in time, we could shift by a larger number of data points, for example shift(-2) for two steps ahead.

y = df_melted.set_index("timestamp").sort_index().close.shift(-1)

Now each row in our rolling window timeseries X has the true price of the next data point as its target in y.

Reducing the shape

To make model training work, X and y must have the same length, i.e. the same number of rows. We achieve this by keeping only the timestamps that appear in both.

y = y[y.index.isin(X.index)]
X = X[X.index.isin(y.index)]

We now have a functional X and y dataframe to start the training!
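The target shift and the alignment step can be tried out on a toy example. The sketch below uses made-up prices and a single dummy feature. The final dropna step is my own addition to show that the newest row has no known target yet; it is not necessarily what the project script does.

```python
import pandas as pd

timestamps = pd.date_range("2023-01-01", periods=5, freq=pd.Timedelta(hours=4))
close = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0], index=timestamps)

# Pretend the first window was dropped for being too short, so the
# feature rows start at the second timestamp.
X = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0]}, index=timestamps[1:])

# Target: the close price one step after each timestamp.
y = close.shift(-1)

# Keep only the timestamps present in both objects.
y = y[y.index.isin(X.index)]
X = X[X.index.isin(y.index)]
print(y)

# The newest row has no known next price yet, so its target is NaN.
y_known = y.dropna()
X_known = X.loc[y_known.index]
```

The row with the missing target is the one we eventually want a forecast for, so it belongs in the prediction step rather than in model fitting.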

Regression

# 5. Selection of the best regression method

Two functions loop over all the scikit-learn regression models and evaluate their metrics on the training data.

To speed things up, we can add the worst-performing models to a list and save them to a file. On the next run these are excluded from the evaluation.

When init==1, our script calculates the metrics for each of the 203 regression models available through the scikit-learn library.

To be in line with the scope of this project the regression model which produces the smallest mean absolute error (MAE) is chosen.
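A minimal sketch of selecting a regressor by MAE. It uses three hand-picked candidates and synthetic data instead of the full list of estimators and the real feature matrix. Scoring on a held-out tail of the data is an assumption of this sketch and may differ from how the project script scores the models.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic data standing in for the tsfresh feature matrix and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Chronological split: fit on the first part, score on the held-out tail.
X_fit, X_val = X[:90], X[90:]
y_fit, y_val = y[:90], y[90:]

candidates = {
    "ARDRegression": ARDRegression(),
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
}

scores = {}
for name, model in candidates.items():
    try:
        model.fit(X_fit, y_fit)
        scores[name] = mean_absolute_error(y_val, model.predict(X_val))
    except Exception as exc:  # some estimators cannot handle every dataset
        print(f"skipping {name}: {exc}")

# The candidate with the smallest mean absolute error wins.
best_name = min(scores, key=scores.get)
print(best_name, round(scores[best_name], 4))
```

The try/except matters more when looping over every available estimator, because some of them need constructor arguments or a different target shape and will fail on this kind of data.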

Calculating the metrics for choosing the optimal regression method (with verbose=1)

We can see that ARDRegression shows promising metrics: a mean absolute error of $1.64 and a root mean squared error of $2.14 on the training data.

After calculating the metrics for all available regression estimators, this method turns out to perform best. We will therefore use ARDRegression as the model to predict the price in our test data.

Training

# 6. Train the model with the best regression method

We train our model on the training dataset using the best regressor. Then we use this model to predict the prices on the test data.

Remember that our script will test the model, or predict the price, only for the current month. The last value will be the actual price prediction for the next data point. In our project, this will be the next 4 hour close price.

As seen earlier, we could also train this model to predict the price of later data points by shifting the target values by more than 1.
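A sketch of the split described above, assuming the test set is everything from the start of the current month and the training set is a fixed number of days before it. The data is synthetic, the month start is hard-coded, and training_days is a hypothetical setting name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ARDRegression

# Synthetic 4-hour series covering the previous and the "current" month.
index = pd.date_range("2023-01-01", "2023-02-20", freq=pd.Timedelta(hours=4))
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(len(index), 3)), index=index,
                 columns=["f1", "f2", "f3"])
y = X["f1"] * 2.0 + 100.0 + rng.normal(scale=0.1, size=len(index))

training_days = 30                           # hypothetical setting
month_start = pd.Timestamp("2023-02-01")     # start of the current month
train_start = month_start - pd.Timedelta(days=training_days)

# Train on the days before the month, test on the month itself.
train_mask = (X.index >= train_start) & (X.index < month_start)
test_mask = X.index >= month_start

model = ARDRegression()
model.fit(X[train_mask], y[train_mask])
y_pred = pd.Series(model.predict(X[test_mask]), index=X.index[test_mask])

print(y_pred.tail(1))   # prediction made from the most recent window
```

Changing training_days is all it takes to compare a 30-day training set with a longer one.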

Results

# 7. Plot the graph

For a first view we will show how the model performs if it has only been trained on 30 days prior to the current month.

The following graph plots the actual price in blue and the price predicted by the model in orange. We see that even our best regression method cannot predict prices very accurately.

Price prediction plotted on the whole dataset (30 days training data)

However, if we change the size of the training data to 90 days, with each day holding values for 6 data points with a 4 hour interval, we can see that the prediction drastically improves.

Price prediction plotted on the whole dataset (90 days training data)

The model performs better with more training data!

Now if we change the size of the training data from 30 to 60, 90, 180, 360 and 720 days respectively, we can see how much better the model performs. The following graph shows how the price prediction evolves as the training set grows.

Price prediction with 30, 60, 90, 180, 360 and 720 days of training data

Evaluation

# 8. Evaluation of the performance

To visually inspect the performance, the following graph plots the predicted prices for this month's test data against the actual prices.

On the top right we see how the metrics improve dramatically when using more training data.

Graphical representation of the error between the predicted and the actual values for a 4H interval and training sets containing 30, 60, 90, 180, 360 and 720 days
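For reference, the two error metrics used in this article can be computed by hand. The numbers below are made up purely to show the arithmetic and are not results from the project.

```python
import math

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 104.0]

errors = [p - a for p, a in zip(predicted, actual)]

# Mean absolute error: the average size of the miss, in price units.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalises large misses more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Because both metrics are expressed in the same unit as the price, they can be read directly as dollar amounts.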

Conclusion

In this article, I have shown that tsfresh can be a powerful tool to extract features for a regression model, even on data as hard to predict as market prices.

This project has helped me to understand what the library is doing and how powerful it is. I could then apply it in a real use-case with satisfying results.

You can visit this project’s GitHub page here: https://github.com/Francode77/TSFRESH_price_prediction

References

tsfresh documentation (version 0.18.1.dev39+g611e04f), tsfresh.readthedocs.io

Rolling/Time series forecasting, tsfresh documentation (version 0.18.1.dev39+g611e04f), tsfresh.readthedocs.io