Crypto Price Forecasting with GRU

Avneesh Singh Saini
Published in Analytics Vidhya · 14 min read · Aug 19, 2021
Photo by André François McKenzie on Unsplash

Note: Not to be taken as financial advice.

Cryptocurrencies are notoriously volatile, which makes it challenging to predict even their general direction, let alone daily movements. In this blog post, I’ll attempt to address this very challenge with the help of some deep learning techniques.

Contents:

  1. Problem formulation
  2. Gathering data
  3. Preprocessing data
  4. Exploratory Data Analysis
  5. Generating sequence data
  6. Baseline model
  7. Intro to Sequence models
  8. LSTM model
  9. Bi-GRU model
  10. Key learnings

1. Problem formulation

The task at hand is to forecast a single time-series value given a previous sequence of fixed length. The cryptocurrency chosen for the problem will be Satoshi Nakamoto’s creation, the one and only Bitcoin. But this can easily be extended to any token as you’ll find out in the upcoming section.

Type: Multi-variate, single-step forecasting, which means multiple features are used in each input sequence instead of just one feature (univariate forecasting). Single-step denotes predicting one value ahead of the sequence; alternatively, multiple future values could be predicted.

Performance metric: Mean Absolute Error (MAE), the average of the absolute prediction errors on the test data.
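For concreteness, here’s a minimal sketch of how MAE can be computed with numpy (the arrays below are made-up illustrations, not results from this project):

import numpy as np

# Mean Absolute Error: average magnitude of the prediction errors
def mae(y_true, y_pred):
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

print(mae([3.1, 2.9, 4.0], [3.0, 3.0, 3.8]))  # 0.1333...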

2. Gathering data

I’ll be making use of the LunarCRUSH API to fetch historical Bitcoin data. Additionally, you can fetch data for almost any token in real time and analyze it, which is what makes this resource so powerful. We’ll be using the Assets REST API endpoint to load time-series data and select any number of metrics of our choice.

LunarCRUSH API
Assets Endpoint

The reason for choosing LunarCrush is that, apart from returning standard time-series data like opening/closing prices, it also includes insightful metrics like social dominance, sentiment scores, tweet frequency, Reddit activity and some of its own metrics like the Galaxy Score for all supported cryptocurrency assets. The free tier plan allows us to request 720 data points at a time.

Let me break down the request:

&key={api_key}: using the loaded API key (unique for each account).

&symbol=BTC: requesting time-series data for Bitcoin. Can also pass a comma-separated list to fetch data for multiple cryptocurrencies.

&interval=day: an interval string value of either “hour” or “day”. Defaults to “hour” if omitted.

&time_series_indicators={features}: a comma-separated list of metrics to include in the time series values.

&data_points=720: number of time series data points to include for the asset. Defaults to 24. Maximum of 720 data points accepted.
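Putting these parameters together, the request looks roughly like the sketch below. The query parameters mirror the breakdown above, while the base URL and the placeholder key are assumptions on my part, so double-check them against the current LunarCRUSH docs:

import json
import requests

api_key = 'YOUR_API_KEY'                      # hypothetical placeholder
features = 'close,tweets,social_score,news'   # metrics used later in this post

# Assumed v2 assets endpoint; parameters follow the breakdown above
url = ('https://api.lunarcrush.com/v2?data=assets'
       f'&key={api_key}'
       '&symbol=BTC'
       '&interval=day'
       f'&time_series_indicators={features}'
       '&data_points=720')

obj = requests.get(url).text                          # JSON string response
print(json.loads(obj)['data'][0]['timeSeries'][:3])   # first three data points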

We receive a JSON string containing the time-series data, which can be parsed into a Python dictionary using the json.loads() method. As you can see above, we get the Bitcoin series data (only the first three data points shown) for the indicators we passed, as well as other metrics for the day the request was sent.

3. Preprocessing data

It’s necessary to transform raw data into a form that is ready for the advanced stages of data analysis and can be fed to machine learning models. Let’s check out the steps in the process:

Converting to dataframe: First, we reference the ‘timeSeries’ key of the parsed response to get the time-series records and transform them into a pandas dataframe for further processing.

import json
import pandas as pd

series_dict = json.loads(obj)['data'][0]['timeSeries']
df = pd.DataFrame(series_dict)
df

Dealing with null values: There are many imputation techniques for time-series data, such as next observation carried backward (NOCB), LOCF and linear/spline interpolation, but these methods rely on the assumption that adjacent observations are similar. As only the first 20 points contain null values, I’ve decided to simply remove those rows.

# removing rows with null values
df = df.dropna()
df = df.reset_index(drop=True)

Extracting dates and months: The time column contains Unix timestamps (seconds since epoch), which can be converted into a more interpretable format using the datetime module’s utcfromtimestamp() method. We’ll also extract months from the timestamps for some data analysis in the coming section.
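Here’s a minimal sketch of that conversion; the new column names (‘date’ and ‘month’) are my own choices, with ‘month’ matching the grouping used later:

from datetime import datetime

# Convert Unix timestamps (seconds) to readable dates and extract the month name
df['date'] = df['time'].apply(lambda ts: datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d'))
df['month'] = df['time'].apply(lambda ts: datetime.utcfromtimestamp(ts).strftime('%B'))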

Final dataframe

4. Exploratory Data Analysis

Multi-collinearity

Neural networks with more than one hidden layer are generally robust to multi-collinearity. Regardless, it’s good practice to check for features which may be highly linearly related to each other. We’ll check for collinearity in the dataset by displaying the correlation matrix using seaborn’s heatmap on top of pandas’ corr() method.

import seaborn as sns

data = df.iloc[:, 2:].copy()
ax = sns.heatmap(data.corr(), cmap='inferno', annot=True)

The close price is highly correlated with both tweets and social score (>0.7). In case you’re not familiar, correlation between a ‘predictor’ and the ‘target’ is a good indication of predictability. tweets and social_score are extremely correlated with each other, as expected, while news and 24h_% aren’t correlated with any other feature, which is underwhelming but expected given how volatile Bitcoin can be. Now let’s plot close and social_score to observe the nature of the relationship between the two more closely.

The plots above re-affirm the earlier observation: closing price and social activity move almost inversely to each other. As the Bitcoin price goes up, tweet activity and social score normally go down (unless there’s a massive rise), while social activity tends to increase when the market is down. Why could this be the case? The world feeds off fear and negativity, ‘FUD’ as it’s termed in the crypto space.

Monthly analysis

Grouping our data by month and exploring for trends can give us some interesting insights for the particular token.

monthly = df.groupby('month').mean()
monthly

Let’s look at how social score and daily price change (in %) fare on a monthly basis.

We can again indirectly observe an inverse relation between the market and social activity: June and May saw the highest social activity while recording the biggest dips among the months. September stands out, with the 2nd largest average dip and yet the least social engagement. Keep in mind that the dataset covers just 700 days (almost 2 years), so these insights are limited to this time-frame and cannot be generalized on a monthly or seasonal basis for future data points.

Candlestick chart

A candlestick chart is a style of financial chart used to describe price movements of a security, derivative, or in our case currency. It makes use of four price points (open, close, high, and low) throughout the period of time the trader specifies. Trading is often dictated by emotion, which can be read in candlestick charts.

Candlestick bar. Credits: dummies.com

It has three basic features:

  1. Body which represents the open-to-close range.
  2. Wick, or shadow, that indicates the intra-day high and low.
  3. Color, which reveals the direction of market movement — a green (or white/blue) body indicates a price increase, while a red (or black) body shows a price decrease.

We’ll be using Plotly, an interactive graphing library, and Cufflinks, a library which lets us create Plotly visualizations directly from pandas. Its QuantFig class plots a candlestick chart easily when fed the required open, close, high, and low columns for a given asset, roughly as sketched below.
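A minimal sketch of the candlestick plot, assuming the dataframe also carries open, high and low columns alongside close (the moving-average period matches the 7-day SMA shown next):

import cufflinks as cf
cf.go_offline()

# Candlestick chart with a 7-day simple moving average overlaid in red
qf = cf.QuantFig(df[['open', 'high', 'low', 'close']], title='BTC/USD', name='BTC')
qf.add_sma(periods=7, color='red')
qf.iplot()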

This is a snapshot of the candlestick plot with a simple moving average (red) over 7 days for all the data points, 28 August 2019 to 27 July 2021. The plot is interactive in nature (in the notebook) and can be zoomed in or out over a specific time period, as shown below for the last 2 months.

Expert traders leverage candlestick charts as one of the most important components of technical analysis, using them to identify different types of patterns (bullish, bearish and neutral) and make the relevant decision to buy, sell or hold the asset. For further exploration, you can find the most popular candlestick patterns here.

5. Generating sequence data

This is the most delicate and critical part of this post. All the forecasting models ahead will crumble if the sequence data isn’t appropriate, or if there’s even a slight mistake. First, let’s look at how to split the dataset and scale the features.

Time-based splitting:

Standardizing the features in the dataset, i.e. bringing them onto the same scale, is necessary for neural networks to perform well and converge faster. Instead of randomized splitting, we prefer time-based splitting into train and test sets for time-series data. In our case, the training data points will be used to predict the prices of the last 30 days. First we fit the scaler on the data up to the split index, then scale the dataset using sklearn’s StandardScaler implementation.

Apart from scaling the features, it’s crucial to preserve the scaler (mean and variance) fitted on the labels (close prices). This scaler will be used to transform the predictions back into their original scale via the inverse_transform method, as demonstrated later in the plotting function.
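A sketch of the time-based split and scaling, under a few assumptions: the feature column names listed below, the close price sitting in the first column, and the last 30 days being held out for testing:

from sklearn.preprocessing import StandardScaler

feature_cols = ['close', 'tweets', 'social_score', 'news']   # assumed column names
data = df[feature_cols].values

split_index = len(df) - 30            # last 30 labels are reserved for testing

# Fit only on the training portion to avoid leaking test statistics
scaler = StandardScaler().fit(data[:split_index])
scaler_labels = StandardScaler().fit(data[:split_index, 0:1])   # close prices only

data_scaled = scaler.transform(data)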

Windowing into sequence data

Let me start off by giving you an intuition on how to perform windowing of data points and generate sequence data which can be fed to the models for prediction.

1 --> 2 --> 3 --> 4 --> 5 --> 6 --> 7 --> 8 --> 9 --> 10

Let’s assume a simple time series: the first ten natural numbers. Now we’ll convert it into sequences given a window size. The window size determines the length of each sequence and is a hyper-parameter; choosing the right size leads to optimal results. Assuming a window size of 4, the sequences will look as follows:

features --> label
__________________
1, 2, 3, 4 --> 5
2, 3, 4, 5 --> 6
3, 4, 5, 6 --> 7
4, 5, 6, 7 --> 8

The model will be trained to predict the respective label given the sequence on the left. Similarly, in our case both training and test sets will be formed in this format. The major difference is that we’re dealing with a multi-variate problem, so each time step in a sequence carries 4 features instead of one.

Let’s define a windowingData function which takes in the starting/ending indices of the dataset, the window size (14) and the dataset itself, and returns sequences and their respective labels for both training and test data, roughly as sketched below.
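A sketch of how windowingData might be implemented, consistent with the shapes reported below (the argument order and the close price being the first column are assumptions):

import numpy as np

def windowingData(start, end, window, data):
    # Slide a window over `data`; each label is the next day's close price
    end = len(data) if end is None else end
    X, y = [], []
    for i in range(start + window, end):
        X.append(data[i - window:i])   # previous `window` days, all features
        y.append(data[i, 0])           # close price of the following day
    return np.array(X), np.array(y)

X_train, y_train = windowingData(0, split_index, 14, data_scaled)
X_test, y_test = windowingData(split_index - 14, None, 14, data_scaled)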

Train and test data shape.

The train data contains 656 sequences, each input sequence of size 14x4, i.e. 14 days and 4 features: close, tweets, social score and news. The input sequence shown below is expected to output the value -0.5761 (in scaled units).

Sample input sequence and label.

Plotting function:

Defining this utility plotting function will let us conveniently plot either a single future prediction or multiple future predictions on the test data.

The function takes in the data (history, true and predicted future), the previously saved scaler, the plot title and a boolean variable as arguments. multi decides the type of plot, either single or multiple future. The points are converted back to their original scale using the saved scaler and plotted with relevant markers, as shown in the sample plot below.
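Here’s a rough sketch of such a plot function, assuming matplotlib and the argument layouts used in the calls below (an input sequence plus its label in single-step mode, and past labels plus true/predicted futures in multi mode):

import numpy as np
import matplotlib.pyplot as plt

def plot(data, scaler, title, multi=False):
    # data is (history, true_future) or (history, true_future, prediction)
    history, true_future = data[0], data[1]
    prediction = data[2] if len(data) > 2 else None

    if history.ndim > 1:               # an input sequence: keep the close column
        history = history[:, 0]

    # Convert everything back to the original price scale
    inv = lambda a: scaler.inverse_transform(np.array(a).reshape(-1, 1)).ravel()
    history, true_future = inv(history), inv(true_future)
    if prediction is not None:
        prediction = inv(prediction)

    past = np.arange(-len(history), 0)
    future = np.arange(len(true_future)) if multi else np.array([0])

    plt.figure(figsize=(10, 5))
    plt.plot(past, history, label='History')
    plt.plot(future, true_future, 'bo-' if multi else 'bo', label='True future')
    if prediction is not None:
        plt.plot(future, prediction, 'ro-' if multi else 'ro', label='Prediction')
    plt.title(title)
    plt.legend()
    return plt.gca()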

i = 0
day = df.index[i+14]
title = f'Sample plot: {day}'
ax = plot((X_train[i], y_train[i]), scaler_labels, title)
First input sequence and label, in original scale.

6. Baseline model

Baseline models are typically the most basic models you could implement, and their performance sets the bare minimum bar for the more complex models ahead. In our case, we’ll implement a simple moving-window average model which returns the mean of the last 14 observations (closing prices) as the prediction.
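A minimal sketch of what base_ma might look like, assuming the close price is the first feature of each sequence:

def base_ma(sequences):
    # Moving-average baseline: predict the mean close price of each 14-day window
    return [seq[:, 0].mean() for seq in sequences]

ma_values = base_ma(X_test)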

The function returns a list of values (the mean of each sequence) when passed the test data, which is a 3-dimensional numpy array. Let’s look at a couple of predictions from the baseline model:

Not bad! We will now visualize all single-step predictions for the whole of the test data by passing the test sequences, labels and the moving-average values from base_ma to the plotting function.

The prediction curve looks far too simplistic and smoother than the true future prices, but it still does a decent job. The models ahead are therefore expected to generalize better; before implementing them, I’ll give a brief intro to sequence models and their advantages in the upcoming section.

7. Intro to Sequence models

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Above is a passage from Christopher Olah’s brilliant explanation of RNNs and LSTMs. Traditional techniques and plain neural networks simply don’t care about the concept of a sequence and can’t preserve previously useful information for solving the present task. Recurrent Neural Networks come to the rescue: an RNN can be thought of as multiple copies of the same neural network through the passage of time, carrying information from one time-step to the next.

Unfolded RNN. Credits: Wikipedia

But RNNs suffer from short-term memory. If a sequence is long enough, they’ll have a hard time carrying information from earlier time-steps to later ones. Consider trying to predict the last word in “Lionel Messi is Argentine, he speaks Spanish.” Recent information suggests that the last word should be the name of a language. Here the gap between the context word “Argentine” and the last word is small, but as this gap widens RNNs struggle to perform well. During backpropagation, RNNs may suffer from the vanishing-gradient problem: when a gradient becomes too small, it doesn’t contribute much to learning.

Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) take care of long-term dependencies. They are special kinds of RNN with internal mechanisms called gates which regulate the flow of information.

LSTM and GRU architectures. Credits

The gates have different functionalities, such as deciding what fraction of the accumulated information from the previous cell (time-step) to preserve, or determining the output of the current cell. GRUs have a less complex architecture with fewer equations, which in turn speeds up backpropagation. Hence GRUs offer a faster and more efficient alternative and are often found to perform better on smaller amounts of data. So arguably GRUs suit our problem better than LSTMs.

Bidirectional networks:

Bidirectional RNN. Credits

In bidirectional RNNs, the input sequence is fed in normal time order to one network and in reverse time order to another. The outputs of the two networks are usually concatenated at each time step. This allows the network to have both backward and forward information about the sequence at every time step, i.e. to look at future context as well.

8. LSTM model

Now it’s time for the code implementation and testing out some models. I’ll be making use of TensorFlow 2.x/Keras. First we’ll define some callbacks, which are a powerful way to customize the model during the training or evaluation phase:

ModelCheckpoint: Saving the model or model weights at some frequency, in our case whenever the validation loss is least (the best weights).

EarlyStopping: Stop the training when validation loss doesn’t improve for 10 epochs.

LearningRateScheduler: Reduce the learning rate by 5% on every 2nd epoch. It’s typically good practice to reduce the learning rate gradually for better convergence. A sketch of these callbacks follows.
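A sketch of these three callbacks; the checkpoint filename and the exact “every 2nd epoch” schedule are my interpretation of the description above:

import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_weights.h5', monitor='val_loss',
    save_best_only=True, save_weights_only=True)

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

def scheduler(epoch, lr):
    # cut the learning rate by 5% on every 2nd epoch
    return lr * 0.95 if epoch > 0 and epoch % 2 == 0 else lr

lr_schedule = tf.keras.callbacks.LearningRateScheduler(scheduler)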

It’s a stacked LSTM architecture (multiple LSTM layers) with ReLU activations and the Adam optimizer. With return_sequences=True, the output passed to the next layer is a sequence of the same length instead of just the final vector.
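A sketch of a stacked LSTM along these lines; the layer sizes, dropout rate, loss and training settings below are assumptions rather than the exact configuration used in the post:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

lstm = Sequential([
    LSTM(64, activation='relu', return_sequences=True, input_shape=(14, 4)),
    Dropout(0.2),
    LSTM(32, activation='relu'),
    Dense(1)                      # single-step close price prediction
])
lstm.compile(optimizer='adam', loss='mae')

lstm.fit(X_train, y_train, epochs=50, batch_size=32,
         validation_data=(X_test, y_test),
         callbacks=[checkpoint, early_stop, lr_schedule])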

Consolidated predictions:

day = df.index[670]
title = f'LSTM model: {day} onwards'
ax = plot((y_train[-30:], y_test, lstm.predict(X_test)), scaler_labels, title, multi=True)

The prediction curve is slightly more accurate and less smooth than the baseline moving average. Yet it fails to capture the sudden rise in Bitcoin prices towards the end, and it also starts off rather low at the beginning. Let’s try and experiment with another architecture.

9. Bi-GRU model

As mentioned earlier, it’s reasonable to assume that a GRU might be a better fit for this problem. Additionally, we’ll make use of stacked bidirectional GRUs with a densely connected layer at the end instead of another GRU layer. The rest stays the same: dropout to control overfitting, ReLU activations and the Adam optimizer. A sketch of the architecture follows.
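A sketch of the bidirectional GRU stack described above; again, the layer widths and dropout rate are assumptions:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional, GRU, Dense, Dropout

BiGRU = Sequential([
    Bidirectional(GRU(64, activation='relu', return_sequences=True),
                  input_shape=(14, 4)),
    Dropout(0.2),
    Bidirectional(GRU(32, activation='relu')),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)                      # single-step close price prediction
])
BiGRU.compile(optimizer='adam', loss='mae')
# trained with the same callbacks and fit settings as the LSTM above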

BiGRU training phase, last 10 epochs.

We load the model weights from the 30th epoch (#29 in the plot), which gives us the best test loss (it doesn’t improve in subsequent epochs). Note that the training loss continues to decrease right until the last epoch, and as the gap between the two losses widens, overfitting increases. Let’s look at some single-step predictions and the consolidated predictions on the test sequences.

Consolidated predictions:

day = df.index[670]
title = f'BiGRU model: {day} onwards'
ax = plot((y_train[-30:], y_test, BiGRU.predict(X_test)), scaler_labels, title, multi=True)

Definitely the best generalization of Bitcoin prices among the three models. When consolidated, it’s able to predict the sudden rise towards the end to a much greater degree than the LSTM. It’s almost impossible to predict exact prices, but getting to know the asset’s price direction in advance using features like social score or news activity is a major positive.

Model comparison.

10. Key learnings

I learnt the lifecycle of building a multi-variate time-series predictive model from scratch for financial data, with steps including: fetching live cryptocurrency data, cleaning and converting it to a suitable format, discovering key patterns through data analysis, generating sequence data, and implementing and tuning sequence models.

Future work

From a feature engineering perspective, some more complex features could be added (e.g. the difference or mean of the intra-day high and low prices). Hyperparameters like the window size (14 days in our case) could be increased if more data points become available, along with variations in the model architecture.

References

[1] Applied AI Course, https://www.appliedaicourse.com/

[2] LunarCRUSH API, https://lunarcrush.com/developers/docs

[3] Tensorflow tutorials, time-series forecasting, https://www.tensorflow.org/tutorials/structured_data/time_series

[4] Christopher Olah, 2015, https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[5] Michael Phi, 2018, https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

[6] Jason Long, 2016, https://senseful.github.io/text-table/
