Bitcoin Price Prediction — How to Compare Models

Leonardo Kanashiro Felizardo
Coinmonks
6 min read · Feb 21, 2020


How might we choose the best method for Bitcoin price prediction? In this piece, we walk through our article, published in 2019, which compares different machine learning methods for Bitcoin price prediction. The techniques explored are LSTM, WaveNet, ARIMA, SVM, and Random Forest.

How can we be sure that the chosen model is the best one? The model comparison problem is probably one of the main concerns of the data scientist or machine learning engineer. In the field of price forecasting, there are several methods that researchers usually follow. Below we list some of these methods and give an intuition for how they work.

Rolling Window

In machine learning experiments, it is standard procedure to divide the data set into training, validation, and test parts. With this division, the model can be trained and, during the training phase, evaluated to find the model that generalizes best while avoiding overfitting. Finally, the model is tested, and error metrics are generated for comparison with other models.

It is also common to use the k-fold cross-validation technique to produce a distribution of errors, making it possible to infer whether models actually differ from one another.

For instance, if we train a model to predict the future price of a time series, we generate a series of predicted values that are compared with the original values. An error is then computed with one of the many available error metrics (e.g., MAPE, MPE, RMSE, MSE).
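As a minimal sketch (not the code from our paper), the snippet below splits a synthetic price series into training, validation, and test parts and scores a naive "tomorrow equals today" baseline with MAPE and RMSE; the series and the split fractions are illustrative assumptions:

```python
# Sketch: train/validation/test split plus two common error metrics.
# The price series is synthetic and the baseline "model" is a placeholder.
import numpy as np

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, in percent
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def rmse(y_true, y_pred):
    # Root Mean Squared Error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

prices = np.cumsum(np.random.randn(1000)) + 100       # synthetic price series
n = len(prices)
train = prices[: int(0.7 * n)]                         # fit the model here
val = prices[int(0.7 * n): int(0.85 * n)]              # tune it here
test = prices[int(0.85 * n):]                          # report metrics here

# Placeholder prediction: tomorrow's price equals today's price.
y_true, y_pred = test[1:], test[:-1]
print(f"MAPE: {mape(y_true, y_pred):.2f}%  RMSE: {rmse(y_true, y_pred):.2f}")
```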

However, one problem is the bias introduced by the choice of the test set. It is possible that the chosen test set happened to suit the model particularly well, causing it to show exceptional performance.

We can avoid this test bias by reporting not the error metric on a single test set, but a distribution of the same error metric over different test sets. The k-fold approach can generate this distribution.

k-fold representation — Wikipedia Figure

What k-fold essentially does is divide the data set into many training and test sets so that we obtain a distribution of the chosen error metric. With that distribution, we can compute the mean and the standard deviation, giving a better picture of model behavior under different time-series regimes.
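A short sketch of this idea, assuming synthetic lagged features and an SVR model as stand-ins for the real data and algorithms, is:

```python
# Sketch: k-fold cross-validation yields a distribution of errors rather than
# a single score. Data and model below are placeholders, not the paper's setup.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

X = np.random.randn(500, 5)     # e.g., lagged price features
y = np.random.randn(500)        # e.g., next-step returns

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    model = SVR().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(mean_squared_error(y[test_idx], pred))

print(f"MSE: mean={np.mean(errors):.4f}, std={np.std(errors):.4f}")
```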

The standard form of k-fold, however, is not necessarily adequate for the time-series prediction problem. The underlying assumption is that the training set must be contiguous and must precede the test set to maximize performance. Therefore, the rolling window approach emerges as a time-series-friendly variant of k-fold.

By moving the training, validation, and test sets forward by a predetermined step, we obtain different sets while maintaining continuity and keeping the training set before the test set. This technique is also known as walk-forward testing.

Rolling Window visualization
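A minimal sketch of a rolling split, with window and step sizes chosen only for illustration, could look like this:

```python
# Sketch of a rolling (walk-forward) split: the window slides forward by a
# fixed step, the training block always precedes the test block, and
# continuity of the series is preserved.
import numpy as np

def rolling_windows(n_samples, train_size, test_size, step):
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step

prices = np.cumsum(np.random.randn(1000)) + 100   # synthetic price series

for train_idx, test_idx in rolling_windows(len(prices), train_size=300,
                                            test_size=50, step=50):
    # Fit the model on prices[train_idx], evaluate on prices[test_idx],
    # and append the error metric to a list to build its distribution.
    pass
```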

Hyperparameter search

An important aspect that is not usually examined, and that introduces bias across publications, is the influence of hyperparameters on the comparison of models.

First, how might we define a hyperparameter? A straightforward definition is that a hyperparameter is a parameter whose value is set before the learning process begins. During the training phase, only the model parameters are adjusted to fit the data, not the hyperparameters.

In our published work, we use random search optimization (check here an argument for the use of random search) of the hyperparameters for all the compared algorithms. This helps avoid hyperparameter bias: for a small sample of hyperparameters, a new algorithm may outperform the others, yet over the whole space of possible configurations it may not be better at all.

Many different hyperparameter optimization methods can be applied, but they should be simple enough to be used with all the compared models.

Image Source: Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 281–305 (2012)
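As a hedged illustration of random search (not our exact search space or budget), the snippet below samples SVR hyperparameters at random and scores them with a time-series-aware split; the distributions, the number of iterations, and the model are assumptions made for the example:

```python
# Sketch: random search over SVR hyperparameters with a time-series split.
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

X = np.random.randn(500, 5)     # placeholder features
y = np.random.randn(500)        # placeholder targets

search = RandomizedSearchCV(
    SVR(),
    param_distributions={"C": loguniform(1e-2, 1e3),
                         "gamma": loguniform(1e-4, 1e0)},
    n_iter=50,                          # number of random samples
    cv=TimeSeriesSplit(n_splits=5),     # training always precedes testing
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```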

Statistical significance of the differences

Now that the distribution of errors for the best hyperparameter combination has been established for each machine learning algorithm, how can we be sure that the means of these distributions are different?

There are several statistical tests that can be used to compare two distributions. The following figure offers a useful summary to follow:

Source: Waning B, Montagne M: Pharmacoepidemiology: Principles and Practice: http://acesspharmacy.com

From the figure above, we can see that there are some fundamental questions, such as:

Is my data continuous or discrete? Usually, for Bitcoin time series, we are analyzing price histories, which are continuous data, and the prediction can also be continuous, such as the future price or the future return. Continuous data can be turned into a discrete prediction by classifying whether the price will go up or down. The input can also be discrete, such as sentiment from NLP analysis of news or any categorical data associated with the asset.

For continuous data with only one variable, one essential test is the normality test. The Kolmogorov-Smirnov or the Shapiro-Wilk test can verify normality. If there is more than one independent variable, you should use the Chi-square, G test, Fisher's exact test, or binomial test. The figure below shows another decision flow to follow (sorry for the language).

Source: Marco Mello (http://marcoarmello.wordpress.com)
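As a small sketch of this decision flow, assuming two synthetic arrays of per-window errors standing in for two models, we can first check normality and then pick a paired parametric or non-parametric test:

```python
# Sketch: test whether two error distributions differ significantly.
# The error arrays are synthetic placeholders, not results from the paper.
import numpy as np
from scipy import stats

errors_a = np.random.normal(0.05, 0.01, size=30)   # model A errors per window
errors_b = np.random.normal(0.06, 0.01, size=30)   # model B errors per window

# Shapiro-Wilk normality check on each sample
normal_a = stats.shapiro(errors_a).pvalue > 0.05
normal_b = stats.shapiro(errors_b).pvalue > 0.05

if normal_a and normal_b:
    stat, p = stats.ttest_rel(errors_a, errors_b)   # paired t-test
else:
    stat, p = stats.wilcoxon(errors_a, errors_b)    # Wilcoxon signed-rank test

print(f"p-value: {p:.4f} -> "
      f"{'significantly different' if p < 0.05 else 'not distinguishable'}")
```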

Final Comments and Conclusion

There are many possible methods for comparing models fairly, and here we explored only a few of them without going into detail. Combining them and dealing with all the potential biases and problems can sometimes be computationally expensive. For academic purposes, the methods used depend heavily on where you intend to publish your paper and on the type of problem.

After applying all the methods, we can generate results such as those shown in our article.

Table from our published article

These results are for Bitcoin time-series prediction. ARIMA and SVR had the best performance even after applying all the methods described above. One main limitation of this study is the use of absolute price values rather than returns, which can affect model fitting.

For future reference and more details, you can access the original publication: https://ieeexplore.ieee.org/abstract/document/8963009

Also, for testing and reproduction, all the code is available on GitHub.

