Bitcoin price forecasting with deep learning algorithms

Disclaimer: All the information in this article, including the algorithm, is provided and published for educational purposes only; it is neither a solicitation for investment nor investment advice. Any reliance you place on this information is strictly at your own risk.

Bitcoin is the first decentralized digital currency, meaning that it is not governed by a central bank or any other authority. It was created in 2009 and became extremely popular in 2017.

Some experts call bitcoin “the currency of the future” or even cite it as an example of a social revolution. The bitcoin price rose several-fold during 2017, yet it remains very volatile. Many economic actors are interested in tools for predicting the bitcoin price, in particular existing and potential investors and government bodies. The latter need to be ready for significant price movements in order to prepare a consistent economic policy. So the demand for a Bitcoin price prediction mechanism is high.

This notebook demonstrates predicting the bitcoin price with a neural network model. We use two recurrent neural network (RNN) architectures: a 2-layer long short-term memory (LSTM) network and a gated recurrent unit (GRU) network. You can read more about these types of NN here:

The dataset we are using is available here: Bitcoin Historical Data

The first thing we do is import all the necessary Python libraries.

Now we load the dataset into memory and check it for null values:
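The loading cell did not survive in this copy of the article. A minimal sketch of the null check, assuming pandas and using a tiny hypothetical in-memory frame in place of the real CSV (the column names `Timestamp` and `Weighted_Price` are assumptions about the dataset, not taken from the article):

```python
import pandas as pd

# Hypothetical stand-in for the real file; with the actual dataset you would
# load the downloaded CSV with pd.read_csv instead of building a frame by hand.
df = pd.DataFrame({
    "Timestamp": [1451606400, 1451606460, 1451692800],  # Unix seconds (UTC)
    "Weighted_Price": [430.0, 432.0, 434.0],
})

# True if at least one cell in the frame is missing (NaN/None).
has_nulls = bool(df.isnull().values.any())

# Number of missing values in each column.
null_counts = df.isnull().sum()
```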



We can see that there are no null values in the dataset. Now we preview the head of the dataset to see the structure of the data:


We want to transform the data to get the average price per day and to show dates in the usual datetime format (not as a timestamp, as above).
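One possible way to do this transformation with pandas is sketched here. The frame and the column names `Timestamp` and `Weighted_Price` are hypothetical placeholders, and the timestamps are assumed to be Unix seconds:

```python
import pandas as pd

# Hypothetical minute-level rows: two on 2016-01-01 and one on 2016-01-02.
df = pd.DataFrame({
    "Timestamp": [1451606400, 1451606460, 1451692800],
    "Weighted_Price": [430.0, 432.0, 434.0],
})

# Convert Unix seconds to calendar dates, then average the price per day.
df["date"] = pd.to_datetime(df["Timestamp"], unit="s").dt.date
daily = df.groupby("date")["Weighted_Price"].mean()
```

Here `daily` is a Series indexed by date, with one averaged price per day.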



We need to split our dataset because we want to train and test the model only on part of the data. So, in the next cell, we compute the parameters needed for splitting (the number of days between the chosen dates). We train our model on the data from January 1, 2016, until August 21, 2017, and test it on the data from August 21, 2017, until October 20, 2017.
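The day counts can be sketched with the standard library. The variable names are illustrative, and counting both end points of each period is an assumption; it does, however, give the same sizes as the split reported below:

```python
from datetime import date

train_start = date(2016, 1, 1)
split_date = date(2017, 8, 21)
test_end = date(2017, 10, 20)

# Inclusive day counts for each period (both end points are counted).
n_train_days = (split_date - train_start).days + 1
n_test_days = (test_end - split_date).days + 1
```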


Now we split our data into the train and test sets:

599 61

Exploratory Data Analysis

We want to estimate some properties of our data because this can be useful when designing the model. The first important thing when forecasting a time series is to check whether the data is stationary. A stationary series has statistical properties, such as its mean and variance, that do not change over time; a non-stationary series is influenced by factors such as trend or seasonality.

In the next cell, we concatenate the train and test data so that we can analyze and transform them together.

In the next couple of cells, we perform a seasonal decomposition of the data to estimate its trend and seasonality. You can see the actual price movements on the plot below (“observed”) as well as the trend and seasonality in our data.
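To make the idea of a decomposition concrete, here is a hand-rolled additive sketch on toy data. It is a simplification written for illustration, not the routine used in the original notebook; the function name `simple_additive_decompose` is hypothetical, and the centered moving average assumes an odd period:

```python
import numpy as np
import pandas as pd

def simple_additive_decompose(series, period):
    """Split a series into trend + seasonal + residual (additive model)."""
    # Trend: centered moving average over one full period (odd period assumed).
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the period.
    position = np.arange(len(series)) % period
    seasonal = detrended.groupby(position).transform("mean")
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Toy data: a linear trend plus a repeating 7-step pattern that sums to zero.
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])
values = 100.0 + 0.5 * np.arange(56) + np.tile(pattern, 8)
trend, seasonal, residual = simple_additive_decompose(pd.Series(values), period=7)
```

On this toy series the seasonal component recovers the 7-step pattern and the residual is zero wherever the trend is defined; the first and last three trend values are NaN because the moving-average window is incomplete there.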

The next thing we do is examine the autocorrelation, which is the similarity between observations as a function of the time lag between them. It is important for finding repeating patterns in the data.
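As a small illustration of the concept, pandas can compute the autocorrelation of a series at a given lag. The toy series here is made up; it simply repeats every 7 steps:

```python
import numpy as np
import pandas as pd

# Toy series with a pattern that repeats every 7 steps.
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])
series = pd.Series(np.tile(pattern, 8))

# Pearson correlation between the series and a copy of itself shifted by `lag`.
acf_lag7 = series.autocorr(lag=7)  # shift equal to the period
acf_lag3 = series.autocorr(lag=3)  # shift that does not align with the period
```

A lag that matches the repeating pattern gives a correlation close to 1, while a lag that does not align with it gives a lower value.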

Now we need to recover our df_train and df_test datasets:

Data preparation

We need to prepare our dataset according to the requirements of the model, as well as split it into train and test parts. In the next cell, we define a function that creates the X inputs and Y labels for our model. In sequential forecasting, we predict a future value based on some previous and current values. So our Y label is the value at the next (future) point in time, while the X inputs are one or several values from the past. We can set the number of these values by tuning the parameter look_back in our function. If we set it to 1, we predict the value at time t based on the previous value (t-1).
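A minimal sketch of such a function follows. The name `create_dataset` is an assumption (the article only names the `look_back` parameter), and the original implementation may differ in details such as indexing:

```python
import numpy as np

def create_dataset(values, look_back=1):
    """Build (X, Y) pairs: each X row holds `look_back` consecutive values,
    and Y is the value that immediately follows them."""
    values = np.asarray(values, dtype=float)
    X, Y = [], []
    for i in range(len(values) - look_back):
        X.append(values[i:i + look_back])
        Y.append(values[i + look_back])
    return np.array(X), np.array(Y)

# With look_back=2, every label is predicted from the two values before it.
X, Y = create_dataset([10, 20, 30, 40, 50], look_back=2)
# X -> [[10, 20], [20, 30], [30, 40]], Y -> [30, 40, 50]
```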

Now we perform final data preparation:

  1. Reshape the train and test datasets according to the requirements of the model.
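The reshaping step can be sketched with NumPy. The min-max scaling shown here is an assumption, inferred only from the later remark about returning to the original scale; the numbers are toy values:

```python
import numpy as np

train = np.array([400.0, 500.0, 600.0, 800.0])
test = np.array([700.0, 900.0])

# Min-max scaling fitted on the training data only (assumed preprocessing),
# so no information from the test period leaks into it.
lo, hi = train.min(), train.max()
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

# Recurrent layers in Keras expect 3-D input: (samples, timesteps, features).
look_back = 1
X_train = train_scaled[:-1].reshape(-1, look_back, 1)
X_test = test_scaled[:-1].reshape(-1, look_back, 1)

# Undo the scaling to get back to the original units.
restored = train_scaled * (hi - lo) + lo
```

Note that scaled test values can fall outside the range [0, 1] when test prices exceed the training maximum, as 900 does here.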

We trained several different models and compared their results. You can find them in the table below. These results were obtained on the following hardware: a 4-core CPU and 16 GB of RAM, training each model ten times with different random states. As we can see, the best result is obtained with the 2-layer stacked LSTM. Nevertheless, this model is much slower than the GRU or the 1-layer LSTM. The autoregressive integrated moving average (ARIMA) model shows the worst results in both prediction quality and training time. We can also see that the 1-layer LSTM model is not able to recognize patterns in the data, so we need more complex models. We are going to demonstrate the 2-layer LSTM neural network in more detail.

Training the 2-layer LSTM Neural Network

Finally, we can build and train our model. We use the Keras framework for deep learning. Our model consists of two stacked LSTM layers with 256 units each and a densely connected output layer with one neuron. We use the Adam optimizer and MSE as the loss. We also use early stopping if the result doesn’t improve for 20 training iterations (epochs). We performed several experiments and found that the optimal number of epochs and batch_size are 100 and 16, respectively. It is also important to set shuffle=False because we don’t want to shuffle time series data.
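The architecture described above might look roughly like this in Keras. This is a sketch, not the original cell: the helper name `build_lstm`, the import path `tensorflow.keras`, the input shape, and the use of the test set as validation data are all assumptions:

```python
from tensorflow import keras

def build_lstm(look_back=1, units=256):
    """Two stacked LSTM layers followed by a single-neuron dense output."""
    model = keras.Sequential([
        keras.Input(shape=(look_back, 1)),
        # return_sequences=True passes the full sequence to the next LSTM.
        keras.layers.LSTM(units, return_sequences=True),
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])
    model.compile(loss="mse", optimizer="adam")
    return model

model = build_lstm()

# Stop when val_loss has not improved for 20 consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)

# Training call for already prepared arrays (not executed in this sketch;
# validating on the test set is an assumption based on the log below):
# model.fit(X_train, Y_train, epochs=100, batch_size=16, shuffle=False,
#           validation_data=(X_test, Y_test), callbacks=[early_stop])
```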

Train on 599 samples, validate on 59 samples
Epoch 1/100
599/599 [==============================] - 2s 3ms/step - loss: 0.0074 - val_loss: 0.1025
Epoch 2/100
599/599 [==============================] - 1s 2ms/step - loss: 0.0644 - val_loss: 0.2629
Epoch 3/100
599/599 [==============================] - 1s 2ms/step - loss: 0.0107 - val_loss: 0.0181
Epoch 4/100
599/599 [==============================] - 1s 2ms/step - loss: 0.0019 - val_loss: 0.0070
Epoch 5/100
599/599 [==============================] - 1s 2ms/step - loss: 5.3863e-04 - val_loss: 0.0017
Epoch 6/100
599/599 [==============================] - 1s 2ms/step - loss: 4.1020e-04 - val_loss: 0.0027
Epoch 7/100
599/599 [==============================] - 1s 2ms/step - loss: 2.1977e-04 - val_loss: 0.0022
Epoch 8/100
599/599 [==============================] - 1s 2ms/step - loss: 2.5272e-04 - val_loss: 0.0022
Epoch 9/100
599/599 [==============================] - 1s 2ms/step - loss: 2.4554e-04 - val_loss: 0.0020
Epoch 10/100
599/599 [==============================] - 1s 2ms/step - loss: 2.6365e-04 - val_loss: 0.0019
Epoch 11/100
599/599 [==============================] - 1s 2ms/step - loss: 2.5525e-04 - val_loss: 0.0018
Epoch 12/100
599/599 [==============================] - 1s 2ms/step - loss: 2.6679e-04 - val_loss: 0.0018
Epoch 13/100
599/599 [==============================] - 1s 2ms/step - loss: 2.5337e-04 - val_loss: 0.0017
Epoch 14/100
599/599 [==============================] - 1s 2ms/step - loss: 2.5953e-04 - val_loss: 0.0017
Epoch 15/100
599/599 [==============================] - 1s 2ms/step - loss: 2.4082e-04 - val_loss: 0.0016
Epoch 16/100
599/599 [==============================] - 1s 2ms/step - loss: 2.4312e-04 - val_loss: 0.0016
Epoch 17/100
599/599 [==============================] - 1s 2ms/step - loss: 2.2189e-04 - val_loss: 0.0016
Epoch 18/100
599/599 [==============================] - 1s 2ms/step - loss: 2.2231e-04 - val_loss: 0.0016
Epoch 19/100
599/599 [==============================] - 1s 2ms/step - loss: 2.0289e-04 - val_loss: 0.0016
Epoch 20/100
599/599 [==============================] - 1s 2ms/step - loss: 2.0255e-04 - val_loss: 0.0016
Epoch 21/100
599/599 [==============================] - 1s 2ms/step - loss: 1.8815e-04 - val_loss: 0.0016
Epoch 22/100
599/599 [==============================] - 1s 2ms/step - loss: 1.8700e-04 - val_loss: 0.0016
Epoch 23/100
599/599 [==============================] - 1s 2ms/step - loss: 1.7834e-04 - val_loss: 0.0016
Epoch 24/100
599/599 [==============================] - 1s 2ms/step - loss: 1.7617e-04 - val_loss: 0.0016
Epoch 25/100
599/599 [==============================] - 1s 2ms/step - loss: 1.7182e-04 - val_loss: 0.0016
Epoch 26/100
599/599 [==============================] - 1s 2ms/step - loss: 1.6926e-04 - val_loss: 0.0016
Epoch 27/100
599/599 [==============================] - 1s 2ms/step - loss: 1.6698e-04 - val_loss: 0.0016
Epoch 28/100
599/599 [==============================] - 1s 2ms/step - loss: 1.6496e-04 - val_loss: 0.0016
Epoch 29/100
599/599 [==============================] - 1s 2ms/step - loss: 1.6336e-04 - val_loss: 0.0016
Epoch 30/100
599/599 [==============================] - 1s 2ms/step - loss: 1.6200e-04 - val_loss: 0.0016
Epoch 31/100
599/599 [==============================] - 1s 2ms/step - loss: 1.6081e-04 - val_loss: 0.0016
Epoch 32/100
599/599 [==============================] - 1s 2ms/step - loss: 1.5982e-04 - val_loss: 0.0016
Epoch 33/100
599/599 [==============================] - 1s 2ms/step - loss: 1.5899e-04 - val_loss: 0.0016
Epoch 34/100
599/599 [==============================] - 1s 2ms/step - loss: 1.5830e-04 - val_loss: 0.0016
Epoch 35/100
599/599 [==============================] - 1s 2ms/step - loss: 1.5775e-04 - val_loss: 0.0016
Epoch 36/100
599/599 [==============================] - 1s 2ms/step - loss: 1.5735e-04 - val_loss: 0.0016
Epoch 00036: early stopping

We have trained our model. You can see that it performs well after only a few iterations. On the plot above, we compare the train and test loss at each iteration of the training process. After some iterations the train and test loss become very similar, which is a good sign (it suggests that we are not overfitting the train set). Below, we use our model to predict labels for the test set. Then we invert the scaling to return to the original scale of our data. You can see a comparison of the true and predicted labels on the chart below. It looks like our model gives good results (the lines are very similar)!

Below we calculate the root mean squared error (RMSE): the square root of the mean squared difference between the predicted points on the test set and the actual (true) labels. In other words, it shows the typical size of our error, in the same units as the price. The lower this number, the better. We can see that our model’s RMSE is not very large (consider that the price in our dataset is in thousands of USD, and we are off only by tens of USD).
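For reference, the RMSE can be written in a few lines of NumPy. The helper name `rmse` and the sample numbers are illustrative only:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices in USD; the individual errors are -10, 20 and -30.
example = rmse([4000.0, 4100.0, 4200.0], [4010.0, 4080.0, 4230.0])
```

Because the errors are squared before averaging, larger misses weigh more heavily than they would in a plain average of absolute errors.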

Test RMSE: 18.724

Below we extract the convenient format of dates and plot the same chart as above, but with these dates on the X-axis.

The results we obtained can be improved. To do this, we try the following. We take 10 different train and test datasets, train the model on each train set, and then test it on the corresponding test set. After this, we calculate the RMSE for each train/test pair. Then we find the average RMSE over all these datasets and subtract this value from each prediction obtained from our current model. This can improve the performance.

We demonstrate this approach on the GRU model, just to show a different model.

First, we define three functions that act as consecutive elements of the pipeline. These functions are very similar to what we did when preparing the data and training our previous 2-layer LSTM model.

The function below uses all three previous functions to build the calculation workflow and returns the RMSE and the predictions of the model.
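The correction step itself is simple arithmetic and can be sketched on toy numbers. All values and names here are hypothetical; in the article the per-split RMSE values come from ten trained models:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Hypothetical RMSE values obtained on several train/test splits.
rmse_per_split = [10.0, 20.0, 30.0]
mean_rmse = float(np.mean(rmse_per_split))

# Toy predictions that overshoot the true values.
y_true = np.array([4000.0, 4100.0, 4200.0])
y_pred = np.array([4025.0, 4120.0, 4215.0])

# Shift every prediction down by the average RMSE.
y_pred_shifted = y_pred - mean_rmse

rmse_before = rmse(y_true, y_pred)
rmse_after = rmse(y_true, y_pred_shifted)
```

In this toy case the error shrinks because the predictions were too high to begin with. Since an RMSE is never negative, the shift always lowers the predictions, so it can only help when the model tends to overshoot.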

Now we can run a workflow function to calculate RMSE for a single GRU model:

Test GRU model RMSE: 32.764

Now we can run a cross_validate function to trigger calculations:

Iteration: 1
Test RMSE: 9.233
Iteration: 2
Test RMSE: 16.251
Iteration: 3
Test RMSE: 12.337
Iteration: 4
Test RMSE: 38.239
Iteration: 5
Test RMSE: 49.088
Iteration: 6
Test RMSE: 3.908
Iteration: 7
Test RMSE: 7.206
Iteration: 8
Test RMSE: 38.290
Iteration: 9
Test RMSE: 4.388
Iteration: 10
Test RMSE: 9.347
Average RMSE: 18.8287473079
RMSE list: [9.233330864072622, 16.25122406236244, 12.3374370718704, 38.2387143974303, 49.08764082707623, 3.908100289970251, 7.206358361324355, 38.29018303096499, 4.387561412580847, 9.346922761677483]

Next, we subtract the mean RMSE from each prediction our model produced. Then, we recalculate the RMSE for the model.

Test GRU model RMSE_new: 14.223

We can see that the RMSE dropped from 32.764 to 14.223, which means that our experiment was successful. On the plot below you can see the difference between the predicted and true test labels.

Let’s calculate the symmetric mean absolute percentage error (SMAPE). It shows how good our predictions are in percentage terms. We define the function symmetric_mean_absolute_percentage_error, which performs all the necessary calculations.
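Several variants of SMAPE are in use. The sketch below assumes the common form that divides each absolute error by the mean of the absolute true and predicted values; the original notebook may use a different variant:

```python
import numpy as np

def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    """SMAPE in percent: 100 * mean(|F - A| / ((|A| + |F|) / 2)).

    Assumes no pair of values is zero at the same time; otherwise the
    denominator would be zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_pred - y_true) / denom))

# Toy values: both predictions are off by 10 units.
example = symmetric_mean_absolute_percentage_error([100.0, 200.0], [110.0, 190.0])
```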

Test SMAPE (percentage): 0.304

We can see that our SMAPE is about 0.3%, well below 1%, which means that the error of our model is very small.

In this notebook, we trained a 2-layer long short-term memory neural network as well as a gated recurrent unit neural network using the Bitcoin Historical Data. These models can be used to predict future price movements of bitcoin. The performance of the models is quite good: on average, both models considered here make an error of only tens of USD.


  1. Original blog post
  2. iPython Notebook Source code



Igor Bobriakov

Data Scientist and Entrepreneur, Founder of Data Science School & Machine Learning for Startups