Predicting Ethereum (ETH) Prices With RNN-LSTM in Keras (TensorFlow)

Adrien Borderon
Oct 12 · 10 min read

The idea of this article is to present a simple way of predicting future prices of the Ethereum cryptocurrency using exploratory analysis and recurrent neural networks (RNNs), primarily LSTMs.

I will not go into detail about cryptocurrencies or how LSTMs work; many articles already cover these subjects.

I invite you to look at these articles if these topics are still unclear to you:

How does Ethereum work, anyway?
https://medium.com/@preethikasireddy/how-does-ethereum-work-anyway-22d1df506369

Understanding LSTM and its diagrams
https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714

The first step in any project is to obtain a dataset; in our case we need historical Ethereum data.

The CryptoDataDownload platform offers this type of data. I used the Kraken market and the ETH/USD dataset with hourly granularity, available on their website.

Here is the CSV file imported with pandas:

[Figure: the df_ethusd dataframe]
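A minimal sketch of the import; the exact file name is an assumption, and CryptoDataDownload files usually ship with a one-line credit header that needs to be skipped:

```python
import pandas as pd

# File name is an assumption; adjust it to your download.
# CryptoDataDownload CSVs typically start with a one-line credit header, hence skiprows=1.
df_ethusd = pd.read_csv("Kraken_ETHUSD_1h.csv", skiprows=1)
print(df_ethusd.head())
```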

Before continuing, it is important to carry out some data structuring operations:

[Figure: df_ethusd dtypes]
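Roughly, the structuring step looks like this; the "Date" column name and the chronological sort are assumptions based on the later plots:

```python
# Parse the date column and sort the rows chronologically ("Date" is an assumed column name).
df_ethusd["Date"] = pd.to_datetime(df_ethusd["Date"])
df_ethusd = df_ethusd.sort_values("Date").set_index("Date")
print(df_ethusd.dtypes)
```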

A simple check for null values:

[Figure: null value check]
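The check itself is a one-liner with pandas:

```python
# Count missing values per column; all counts should be zero here.
print(df_ethusd.isnull().sum())
```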

Before starting the correlation analysis, it is important to make sure the quantitative variables use numeric types:

[Figure: df_ethusd dtypes after conversion]
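Something along these lines, assuming the usual OHLCV column names:

```python
# Coerce the quantitative columns to numeric dtypes (column names are assumptions).
quantitative_cols = ["Open", "High", "Low", "Close", "Volume"]
df_ethusd[quantitative_cols] = df_ethusd[quantitative_cols].apply(pd.to_numeric, errors="coerce")
print(df_ethusd.dtypes)
```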

Correlations

With the data preparation complete, the next step is to analyze the different correlations to identify the most interesting variables.

The idea is to predict future values of the “Close” price. It is therefore important to know whether the other variables can explain its variability.

Here we will first plot the Pearson correlation heatmap and look at the correlation of the independent variables with the output variable Close. We will only keep features that have a correlation above 0.5 (in absolute value) with the output variable; a sketch of this step follows the list below.

The correlation coefficient takes values between -1 and 1:
— A value closer to 0 implies weaker correlation (exact 0 implying no correlation)
— A value closer to 1 implies stronger positive correlation
— A value closer to -1 implies stronger negative correlation
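A sketch of the heatmap and the 0.5 filter, assuming seaborn and matplotlib are available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix and heatmap.
corr = df_ethusd.corr(method="pearson", numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Keep only the features with |correlation| > 0.5 against Close.
close_corr = corr["Close"].abs()
selected_features = close_corr[close_corr > 0.5].index.tolist()
print(selected_features)
```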

[Figure: Pearson correlation heatmap on df_ethusd]

I then recover the different correlation pairs to check their p-values, keeping only the significant correlations with a significance threshold of 0.05.

[Figure: df_ethusd correlation pairs and p-values]
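A possible way to compute those p-values with scipy; the 0.05 threshold comes from the text above:

```python
from itertools import combinations
from scipy.stats import pearsonr

# Test every pair of numeric columns and keep only the significant correlations.
numeric_cols = df_ethusd.select_dtypes("number").columns
for col_a, col_b in combinations(numeric_cols, 2):
    r, p_value = pearsonr(df_ethusd[col_a], df_ethusd[col_b])
    if p_value < 0.05:
        print(f"{col_a} / {col_b}: r = {r:.3f}, p = {p_value:.2e}")
```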

As we can see, only the High, Low and Open features are highly correlated with the output variable Close.

To confirm this first hypothesis, a second analysis is performed with the Recursive Feature Elimination (RFE) method. The RFE method works by recursively removing attributes and building a model on the attributes that remain. It uses a model accuracy metric to rank the features according to their importance.

Here is the result:

[Figure: Recursive Feature Elimination rankings]
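A sketch of the RFE step; the estimator (a plain linear regression) and the number of features kept are assumptions, since the article does not specify them:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Rank the candidate features by recursively eliminating the weakest ones.
X = df_ethusd.select_dtypes("number").drop(columns=["Close"])
y = df_ethusd["Close"]
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(dict(zip(X.columns, rfe.ranking_)))  # rank 1 = selected feature
```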

The RFE method also recommends the same variables as the first method with the Pearson coefficient.

We will therefore keep the Open, High and Low variables to predict the Close price.

Data pre-processing

We now come to the pre-processing stage. It is important to select the relevant variables and to define the number of time steps.

The number of time steps defines how many past observations (X) are used to predict the next time step (y). As a consequence, that many rows are lost at the beginning of the dataset, since they have no complete history behind them.

In our case, I set the number of time steps to 24. This means that the model will use the last 24 hours each time to predict the next hour.
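In code, this boils down to keeping the selected columns plus the target and fixing the look-back window:

```python
# Keep the predictors plus the target, and define the look-back window.
features = ["Open", "High", "Low", "Close"]
dataset = df_ethusd[features].copy()
time_steps = 24  # use the last 24 hours to predict the next hour
```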

[Figure: feature selection]

Here is the dataset at this point:

[Figure: dataset after feature selection]

Evaluate normal distribution of selected features

The next step is to normalize the data so that the LSTM model is not affected by variations in scales. Differences in the scales across input variables may increase the difficulty of the problem being modeled.

There are several methods; the two main ones are offered by scikit-learn:

StandardScaler: removes the mean and scales the data to unit variance.
MinMaxScaler: rescales the dataset so that all feature values are in the range [0, 1].

In order to use the best method, it is important to know beforehand whether our variables follow a normal distribution.

To do this, we must look at the distributions of the variables and compare them with the normal density.
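A sketch of that comparison, overlaying a fitted normal density on each histogram:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

fig, axes = plt.subplots(1, len(features), figsize=(16, 4))
for ax, col in zip(axes, features):
    values = dataset[col]
    ax.hist(values, bins=50, density=True, alpha=0.6)
    # Overlay the normal density with the same mean and standard deviation.
    x = np.linspace(values.min(), values.max(), 200)
    ax.plot(x, norm.pdf(x, values.mean(), values.std()), "r")
    ax.set_title(col)
plt.show()
```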

[Figure: distributions of the selected features vs. normal density]

We can conclude that the variables do not seem to follow a normal distribution.

To confirm this, a Kolmogorov–Smirnov test is applied to each of the variables. The Kolmogorov–Smirnov test is a goodness-of-fit method that measures the maximum distance between the empirical cumulative distribution function and the theoretical cumulative distribution function.
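A sketch of the test with scipy, standardizing each variable before comparing it to the standard normal distribution:

```python
from scipy.stats import kstest, zscore

# Kolmogorov-Smirnov test of each standardized feature against the normal distribution.
for col in features:
    statistic, p_value = kstest(zscore(dataset[col]), "norm")
    print(f"{col}: KS statistic = {statistic:.4f}, p-value = {p_value:.2e}")
```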

Here are the results:

[Figure: Kolmogorov-Smirnov test results]

None of the variables has a p-value greater than the significance threshold of 0.05, which leads us to reject the hypothesis of normality. This is probably linked to the large number of outliers in the time series. StandardScaler normalization seems well suited in this context, since MinMaxScaler is more sensitive to those outliers.

Train-test split

Now it’s time to split our dataset into two parts: one part for training the model and a second part for validation.

I used a standard 80/20 split, which gives us 22,999 samples for training and 5,749 for validation, which is sufficient in our case.
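A chronological split (no shuffling for time series) that reproduces those sizes:

```python
# 80/20 chronological split: the most recent 20% is held out for validation.
train_size = int(len(dataset) * 0.8)
df_train = dataset.iloc[:train_size]
df_test = dataset.iloc[train_size:]
print(len(df_train), len(df_test))
```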

[Figure: train/test sizes]

Visualization of training and test data:

[Figure: train-test split visualization]

Feature Scaling Normalization

We can then normalize the two datasets with the StandardScaler method.
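A sketch of the scaling; fitting the scaler on the training data only is an assumption, but it avoids leaking test statistics into training:

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training set only, then apply the same transformation to both sets.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(df_train)
test_scaled = scaler.transform(df_test)
print(train_scaled.shape, test_scaled.shape)
```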

[Figure: scaled training/testing shapes]

With the data now normalized, it becomes important to transform it into the input structure expected by an LSTM model.

You always have to give a three-dimensional array as input to your LSTM network, where the first dimension represents the batch size, the second the time steps, and the third the number of features in one input sequence. In other words, the input shape looks like (batch_size, time_steps, features).

There are many ways to transform the data; here is one that achieves the 3 desired dimensions:
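For example, a simple sliding window over the scaled arrays (this helper is a sketch, not necessarily the exact code used in the article):

```python
import numpy as np

def create_sequences(data, time_steps=24):
    # Slice a 2D array (samples, features) into overlapping windows of
    # shape (time_steps, features), each paired with the next row as target.
    X, y = [], []
    for i in range(time_steps, len(data)):
        X.append(data[i - time_steps:i])  # the previous 24 hours
        y.append(data[i])                 # the next hour
    return np.array(X), np.array(y)

X_train, y_train = create_sequences(train_scaled, time_steps)
X_test, y_test = create_sequences(test_scaled, time_steps)
print(X_train.shape, y_train.shape)  # e.g. (samples, 24, 4) and (samples, 4)
```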

[Figure: input shapes]

Check the shapes (again) before starting training:

[Figure: X_train, y_train, X_test and y_test shapes]

We have 24 time steps in the X_train and X_test datasets, with 4 features, and the same 4 features in y_train and y_test.

Build LSTM network

It is now time to prepare the LSTM model. I define a function which takes the training and test data as input, as well as some hyperparameters.

The model is then formed with two LSTM hidden layers, each with 50 units.

25% dropout layers are also used between each LSTM hidden layer.

A dropout on the input means that for a given probability, the data on the input connection to each LSTM block will be excluded from node activation and weight updates.

In Keras, this is specified with a dropout argument when creating an LSTM layer. The dropout value is a fraction between 0 (no dropout) and 1 (all connections dropped).

It is important to specify the input shape on the first LSTM hidden layer so that it matches the shape of the training data.

Linear activation is then used on the Dense output layer.

The training can begin. I used 30 epochs with a batch size of 256; these values seem to make the model converge quickly.
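A sketch of the architecture and training loop described above; the optimizer and loss (Adam, MSE) are assumptions, since the article only fixes the layer sizes, dropout, epochs and batch size:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_and_train(X_train, y_train, X_test, y_test,
                    units=50, dropout=0.25, epochs=30, batch_size=256):
    model = Sequential([
        # The first LSTM layer needs the (time_steps, features) input shape.
        LSTM(units, return_sequences=True,
             input_shape=(X_train.shape[1], X_train.shape[2])),
        Dropout(dropout),
        LSTM(units),
        Dropout(dropout),
        # One output per feature, linear activation for regression.
        Dense(y_train.shape[1], activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")  # assumed optimizer and loss
    history = model.fit(X_train, y_train,
                        validation_data=(X_test, y_test),
                        epochs=epochs, batch_size=batch_size,
                        shuffle=False)
    return model, history

model, history = build_and_train(X_train, y_train, X_test, y_test)
```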

Here are the training and validation loss curves:

[Figure: model loss]

The loss seems to converge quickly toward 0 on both the training and validation data.

Performance visualization using the Test Set

It becomes important to check the performance of the model with different metrics and not just with loss curves.

The most interesting metrics for this are: MAE, MAPE, MSE, RMSE, R-squared and adjusted R-squared.

They must be calculated from the predicted values on the validation data:

It is then possible to apply different metrics on y_pred_test and y_actual_test.
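A sketch of those calculations; MAPE and adjusted R-squared are computed by hand, and Close is assumed to be the last column of the feature array:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict on the test set and undo the scaling to get back to price units.
y_pred_test = scaler.inverse_transform(model.predict(X_test))
y_actual_test = scaler.inverse_transform(y_test)

# Compare on the Close column (assumed to be the last feature).
actual, pred = y_actual_test[:, -1], y_pred_test[:, -1]

mae = mean_absolute_error(actual, pred)
mape = np.mean(np.abs((actual - pred) / actual)) * 100
mse = mean_squared_error(actual, pred)
rmse = np.sqrt(mse)
r2 = r2_score(actual, pred)
n, k = len(actual), X_test.shape[2]
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(mae, mape, mse, rmse, r2, adjusted_r2)
```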

Here are the results:

[Figure: metrics on the test set]

We obtain a MAPE of 3.43%, which means a very low average error between the actual values and the values predicted by the model.

In addition, the adjusted R² and R² coefficients are very close to 1, which means that the predicted values are strongly correlated with the real values and therefore explain much of the variance in the real values.

We can visualize the performance of the model with a graph. For this, I define two time series, one with the validation data and another with the predicted data:
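For example (the index offset accounts for the 24 rows consumed by the look-back window):

```python
import pandas as pd

# Rebuild two time-indexed series over the test period.
test_index = df_test.index[time_steps:]
actual_series = pd.Series(y_actual_test[:, -1], index=test_index)
predicted_series = pd.Series(y_pred_test[:, -1], index=test_index)
```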

Then we just plot the two series:

[Figure: testing predictions vs. actual]

We can focus on the last month:

[Figure: testing predictions vs. actual over the last month]

Although there are still improvements to be made to the data processing and the model parameters to improve the quality of the predictions, the predictions made by the model already follow the main trend of the test data.

Predicting the future

It is then possible to use the model to predict the future: the prices of Ethereum for the next few hours.

There are several approaches to predicting the future: direct prediction or recursive prediction.

I used recursive prediction to predict the next 12 hours. The model predicts the 4 features one time step at a time.

So I took the predicted values and fed them back as input variables into the last window, shifting it by one step each time.

This approach is only suitable for short horizons, because it compounds the prediction error at each step, which can considerably degrade the quality of the predictions over long periods.
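A sketch of that recursive loop; the variable names follow the earlier snippets and are assumptions rather than the article's exact code:

```python
import numpy as np

# Recursive 12-hour forecast: predict one step, append it to the window,
# slide the window forward by one hour, and repeat.
last_window = test_scaled[-time_steps:].copy()
future_scaled = []
for _ in range(12):
    x = last_window.reshape(1, time_steps, last_window.shape[1])
    next_step = model.predict(x)[0]
    future_scaled.append(next_step)
    last_window = np.vstack([last_window[1:], next_step])

# Back to price scale.
future_prices = scaler.inverse_transform(np.array(future_scaled))
print(future_prices[:, -1])  # predicted Close for the next 12 hours
```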

Here is the result of the next 12 hours:

[Figure: predicted Close prices for the next 12 hours]

Summary

The Ethereum time series was used with the Open, High, and Low variables to measure network performance on the Close variable.

The estimation results were compared to the actual data graphically, and the MSE, MAPE, and R² values were examined as predictive success criteria.

However, it is possible to achieve more successful results with more data points and by modifying the hyperparameters of the LSTM network.

That’s it! I hope this article provides a good understanding of using LSTMs to forecast time series.

References:

https://machinelearningmastery.com/use-dropout-lstm-networks-time-series-forecasting/
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
https://machinelearningmastery.com/use-timesteps-lstm-networks-time-series-forecasting/
https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b

