Predicting Cryptocurrency Price With Tensorflow and Keras

Cryptocurrencies, especially Bitcoin, have been among the hottest topics on social media and search engines recently. Their high volatility offers the potential for high profit if intelligent investing strategies are used. It seems that everyone in the world has suddenly started talking about cryptocurrencies. Unfortunately, due to their lack of indexes, cryptocurrencies are relatively unpredictable compared to traditional financial instruments. This article aims to teach you how to predict the price of these cryptocurrencies with deep learning, using Bitcoin as an example, so as to provide insight into the future trend of Bitcoin.

Getting Started

To run the code below, make sure you have installed the following environment and libraries:

  1. Python 2.7
  2. Tensorflow=1.2.0
  3. Keras=2.1.1
  4. Pandas=0.20.3
  5. Numpy=1.13.3
  6. h5py=2.7.0
  7. sklearn=0.19.1

Data Collection

Data for prediction can be collected from either Kaggle or Poloniex. To ensure consistency, the column names of the data collected from Poloniex are changed to match Kaggle’s.
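The renaming step can be sketched as a simple key mapping. The column names on both sides below are assumptions for illustration (check them against your own files), and `rename_columns` is a hypothetical helper, not part of any library. With pandas, the same mapping could be passed to `DataFrame.rename(columns=...)`.

```python
# Hypothetical mapping from Poloniex-style field names to Kaggle-style
# column names. Both sets of names are assumptions for illustration.
POLONIEX_TO_KAGGLE = {
    "date": "Timestamp",
    "open": "Open",
    "high": "High",
    "low": "Low",
    "close": "Close",
    "volume": "Volume_(Currency)",
    "quoteVolume": "Volume_(BTC)",
    "weightedAverage": "Weighted_Price",
}


def rename_columns(row, mapping):
    """Return a copy of `row` with keys renamed according to `mapping`.

    Keys that are not in the mapping are kept unchanged.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# One made-up 5-minute candle using the Poloniex-style names.
poloniex_row = {
    "date": 1501545600, "open": 2855.0, "high": 2860.5, "low": 2850.1,
    "close": 2858.3, "volume": 120345.6, "quoteVolume": 42.1,
    "weightedAverage": 2857.2,
}

kaggle_style_row = rename_columns(poloniex_row, POLONIEX_TO_KAGGLE)
print(sorted(kaggle_style_row))
```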

Data Preparation

Data collected from the source needs to be parsed before it can be sent to the model for prediction. The PastSampler class was referenced from this blog for splitting the data into a list of inputs and labels. The input size (N) is 256, while the output size (K) is 16. Note that the data collected from Poloniex is ticked on a 5-minute basis. This means that the input spans 1280 minutes, while the output covers 80 minutes.
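To make the splitting concrete, here is a minimal sliding-window sketch with NumPy. It is an illustrative stand-in and not the PastSampler class from the referenced blog; the function name `make_windows` and the placeholder price series are assumptions.

```python
import numpy as np


def make_windows(series, n_in, n_out):
    """Split a 1-D series into overlapping (input, label) pairs.

    Each sample uses `n_in` consecutive points as input and the
    following `n_out` points as the label.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - (n_in + n_out) + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    # Row i of `idx` holds the indices i, i+1, ..., i+n_in+n_out-1.
    idx = np.arange(n_in + n_out)[None, :] + np.arange(n_samples)[:, None]
    windows = series[idx]
    return windows[:, :n_in], windows[:, n_in:]


N, K = 256, 16            # input and output sizes from the text
TICK_MINUTES = 5          # Poloniex tick interval from the text

prices = np.arange(1000)  # placeholder series standing in for real prices
inputs, labels = make_windows(prices, N, K)

print(inputs.shape, labels.shape)          # (729, 256) (729, 16)
print(N * TICK_MINUTES, K * TICK_MINUTES)  # 1280 80
```

Each label row starts exactly where its input row ends, so the model always predicts the 16 ticks that follow the 256 ticks it has seen.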

After creating the PastSampler class, I applied it to the collected data. Since the original data ranges from 0 to over 10000, data scaling is needed to make the data easier for the neural network to learn from.
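As a rough illustration of the scaling step, the sketch below maps prices onto [0, 1] and back by hand with NumPy. The helper names and sample prices are made up; scikit-learn's MinMaxScaler offers the same kind of transform through `fit_transform` and `inverse_transform`.

```python
import numpy as np


def fit_min_max(values):
    """Return the (min, max) pair used to scale `values` into [0, 1]."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()


def scale(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def unscale(scaled, lo, hi):
    """Invert `scale`, mapping [0, 1] back to the original price range."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


prices = np.array([0.0, 2500.0, 5000.0, 10000.0])  # made-up prices
lo, hi = fit_min_max(prices)
scaled = scale(prices, lo, hi)       # values 0, 0.25, 0.5 and 1
restored = unscale(scaled, lo, hi)   # back to the original prices

print(scaled)
print(restored)
```

Keeping `lo` and `hi` around matters because the same pair is needed later to turn the model's predictions back into prices.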

Building Models


A 1D Convolutional Neural Network is expected to capture the locality of the data well, with the kernel sliding across the input data as shown in the following figure.

CNN Illustration (retrieved from

The first model I built is a Convolutional Neural Network. The following code selects GPU number “1” (since I have 4, you might set it to any GPU you prefer). Since TensorFlow does not seem to do well when running on multiple GPUs, it is wiser to restrict it to only 1 GPU. Don’t worry if you do not have a GPU. Simply ignore these lines.

import os
# Set these before TensorFlow is imported so that only GPU 1 is visible to it.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

The code for constructing the CNN model is very simple. The dropout layer helps prevent overfitting. The loss function is Mean Squared Error (MSE), while the optimizer is Adam.

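from keras.models import Sequential
from keras.layers import Conv1D

# Two stacked 1D convolutions; the second maps back to nb_features channels.
model = Sequential()
model.add(Conv1D(activation='relu', input_shape=(step_size, nb_features), strides=3, filters=8, kernel_size=20))
model.add(Conv1D(strides=4, filters=nb_features, kernel_size=16))
model.compile(loss='mse', optimizer='adam')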

The only thing you need to worry about is the dimensions of the input and output of each layer. The equation for computing the output length of a convolutional layer is:

Output time steps = floor((Input time steps - Kernel size) / Strides) + 1
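As a quick check, the formula can be applied to the two Conv1D layers above with the input size of 256. This sketch assumes no padding, which is the Keras default for Conv1D, and uses integer division because a partial window at the end of the sequence is dropped. The function name is hypothetical.

```python
def conv1d_output_steps(input_steps, kernel_size, strides):
    """Output length of a 1D convolution without padding.

    Integer division reflects that a partial window at the end is dropped.
    """
    return (input_steps - kernel_size) // strides + 1


step_size = 256  # input size N from the text

after_first = conv1d_output_steps(step_size, kernel_size=20, strides=3)
after_second = conv1d_output_steps(after_first, kernel_size=16, strides=4)

print(after_first, after_second)  # 79 16
```

The second layer ends up with 16 time steps, which matches the output size K of 16.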

At the end of the file, I added two callback functions, CSVLogger and ModelCheckpoint. The former helps me track the training and validation progress, while the latter stores the model’s weights after each epoch.


A Long Short-Term Memory (LSTM) network is a variation of the Recurrent Neural Network (RNN). It was invented to address the vanishing gradient problem of vanilla RNNs. It is claimed that LSTMs are capable of remembering inputs over longer time spans.

LSTM Illustration (retrieved from

An LSTM is relatively easier to implement than a CNN, as you don’t even need to care about the relationship among kernel size, strides, input size and output size. Just make sure the dimensions of the input and output are defined correctly in the network.

from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(units=units, activation='tanh', input_shape=(step_size, nb_features), return_sequences=False))
model.compile(loss='mse', optimizer='adam')


The Gated Recurrent Unit (GRU) is another variation of the RNN. Its structure is less sophisticated than the LSTM's: it has a reset gate and an update gate, and it gets rid of the separate memory unit. It is claimed that GRU’s performance is on par with LSTM while being more efficient (which is also true in this blog, as LSTM takes around 45 seconds per epoch, while GRU takes less than 40 seconds per epoch).
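One rough way to see why a GRU can be cheaper is to count trainable weights. The sketch below assumes the textbook formulations, with four weight blocks in an LSTM (input, forget and output gates plus the candidate) and three in a GRU (update and reset gates plus the candidate). Some implementations add extra bias terms, so exact counts may differ. The layer sizes are hypothetical and not taken from this article.

```python
def recurrent_param_count(units, nb_features, n_blocks):
    """Weights in a recurrent layer built from `n_blocks` gate-like blocks.

    Each block has an input kernel (nb_features x units), a recurrent
    kernel (units x units) and a bias vector (units).
    """
    per_block = units * nb_features + units * units + units
    return n_blocks * per_block


units, nb_features = 50, 4  # hypothetical sizes, not taken from the article

lstm_params = recurrent_param_count(units, nb_features, n_blocks=4)
gru_params = recurrent_param_count(units, nb_features, n_blocks=3)

print(lstm_params, gru_params)  # 11000 8250
```

Under these assumptions the GRU has three quarters of the LSTM's weights, which is in line with the somewhat shorter epoch time reported above.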

GRU Illustration (retrieved from

Simply replace the second line of the LSTM model-building code

model.add(LSTM(units=units, activation='tanh', input_shape=(step_size, nb_features), return_sequences=False))

with

model.add(GRU(units=units, activation='tanh', input_shape=(step_size, nb_features), return_sequences=False))

Result Plotting

Since the result plotting is similar for the three models, I will only show the CNN version. First, we need to reconstruct the model and load the trained weights into it.

Then, we need to inverse-scale the predicted data, which lies in the range [0,1] because of the MinMaxScaler used previously.

DataFrames are set up for both the ground truth (actual price) and the predicted price of Bitcoin. For visualization purposes, the plotted figure only shows the data from August 2017 onward.

Plot the figure with pyplot. Since the price is predicted in chunks of 16 time steps (80 minutes), leaving the predicted points unconnected makes the result easier to read. As a result, the predicted data is plotted as red dots, as “ro” in the third line indicates. The blue line in the graph below represents the ground truth (actual data), whereas the red dots represent the predicted Bitcoin price.

Best Result Plot for Bitcoin Price Prediction With 2-Layered CNN

As you can see from the above figure, the prediction closely resembles the actual price of Bitcoin. To select the best model, I tested several network configurations, yielding the table below.

Prediction Results for Different Models

Each row of the above table is the model that achieves the best validation loss over the 100 training epochs. From the above results, we can observe that LeakyReLU almost always seems to yield a better loss than regular ReLU. The exception is the 4-layered CNN with Leaky ReLU as activation function, which produces a large validation loss. This may be due to a mistake in how the model was set up and might require re-validation. The CNN models can be trained very fast (2 seconds per epoch with a GPU), with slightly worse performance than LSTM and GRU. The best model seems to be the LSTM with tanh and Leaky ReLU as activation functions, though the 3-layered CNN seems to be better at capturing the local temporal dependency of the data.
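For readers unfamiliar with the two activations, here is a small NumPy sketch of how they differ. The slope `alpha=0.3` for negative inputs is an assumption, chosen to match the default of the Keras 2.x LeakyReLU layer. The article does not state which value was used.

```python
import numpy as np


def relu(x):
    """Regular ReLU: negative inputs are clipped to zero."""
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.3):
    """Leaky ReLU: negative inputs are scaled by `alpha` instead of zeroed."""
    return np.where(x > 0, x, alpha * x)


x = np.array([-2.0, -0.5, 0.0, 1.5])  # made-up pre-activation values

relu_out = relu(x)         # 0, 0, 0, 1.5
leaky_out = leaky_relu(x)  # -0.6, -0.15, 0, 1.5

print(relu_out)
print(leaky_out)
```

Because Leaky ReLU keeps a small slope for negative inputs, units that receive negative values still pass a gradient during training.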

LSTM with tanh and Leaky ReLU as activation function
3-layered CNN with Leaky ReLU as activation function

Although the prediction seems pretty good, there is a concern about overfitting. There is a gap between training and validation loss (5.97E-06 vs 3.92E-05) when training the LSTM with LeakyReLU, so regularization should be applied in order to reduce the variance.


To find the best regularization strategy, I ran several experiments with different L1 and L2 values. First, we need to define a new function that facilitates fitting the data to the LSTM. Here, I’ll use the bias regularizer, which regularizes the bias vector, as an example.
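To show what an L1 or L2 regularizer adds to the loss, the sketch below computes both penalties for a made-up bias vector with NumPy. The helper names and the bias values are hypothetical. The penalties follow the usual definitions: the coefficient times the sum of absolute values for L1, and times the sum of squares for L2.

```python
import numpy as np


def l1_penalty(weights, coef):
    """L1 penalty: coefficient times the sum of absolute values."""
    return coef * np.sum(np.abs(weights))


def l2_penalty(weights, coef):
    """L2 penalty: coefficient times the sum of squared values."""
    return coef * np.sum(np.square(weights))


bias = np.array([0.5, -1.0, 2.0])  # made-up bias vector

l1 = l1_penalty(bias, coef=0.01)   # 0.01 * 3.5  = 0.035
l2 = l2_penalty(bias, coef=0.01)   # 0.01 * 5.25 = 0.0525

print(l1, l2)
```

The penalty is added to the training loss, so larger bias values cost more and the optimizer is nudged toward smaller ones.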

Each experiment repeats the training of the model 30 times, with 30 epochs each time.
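The structure of such a repeated experiment can be sketched as follows. The `fake_train_once` function is a placeholder that returns a made-up loss. In a real run it would build, fit and evaluate the model. All names here are hypothetical.

```python
import random
import statistics


def run_experiment(train_once, repeats=30, epochs=30):
    """Call `train_once(epochs)` `repeats` times and collect the losses."""
    return [train_once(epochs) for _ in range(repeats)]


def summarize(losses):
    """Return the mean, standard deviation, minimum and maximum loss."""
    return {
        "mean": statistics.mean(losses),
        "std": statistics.stdev(losses),
        "min": min(losses),
        "max": max(losses),
    }


rng = random.Random(0)  # fixed seed so the placeholder is reproducible


def fake_train_once(epochs):
    """Placeholder for building, fitting and evaluating a model.

    Returns a made-up validation loss; replace with real training code.
    """
    return 1e-4 + rng.random() * 1e-5


losses = run_experiment(fake_train_once)
summary = summarize(losses)

print(len(losses))  # 30
print(summary)
```

Repeating the training matters because a single run depends on the random weight initialization. Comparing the spread of the 30 losses gives a fairer picture of each regularizer setting.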

If you are using a Jupyter notebook, you can see the table below directly in the output.

Result of Running Bias Regularizer

To visualize the comparison, we can use a boxplot:

According to the comparison, it seems that an L2 regularizer with a coefficient of 0.01 on the bias vector yields the best outcome.

To find the best combination among all the regularizers, including activation, bias, kernel, and recurrent matrix, it would be necessary to test all of them one by one, which does not seem practical with my current hardware configuration. As a consequence, I will leave it as future work.


You have learned:

  1. How to gather real-time Bitcoin data.
  2. How to prepare data for training and testing.
  3. How to predict the price of Bitcoin using Deep Learning.
  4. How to visualize the prediction result.
  5. How to apply regularization on the model.

Future work for this blog would be finding the best hyper-parameters for the best model, and possibly using social media to help predict the trend more accurately. This is my first post on Medium. Should there be any mistakes or questions, please do not hesitate to leave a comment below.

For more information, please refer to my GitHub.