Achieving Success as a Data Scientist Part 4: Bitcoin Price Prediction

Byte Brilliance
13 min read · Feb 7, 2024

Note: This article is part of a series designed to guide aspiring Data Scientists. The series is structured so that Part 1 is aimed at complete beginners, with later parts increasing in complexity. Feel free to explore the different parts of the series according to your experience level.

Introduction

In the ever-evolving landscape of finance and technology, the allure of cryptocurrencies has captivated the imagination of investors and enthusiasts alike. At the forefront of this digital revolution stands Bitcoin, a decentralized and boundary-defying digital currency that has not only disrupted traditional financial norms but has also sparked a new era of exploration into predictive analytics.

As we delve into the intricate world of predicting Bitcoin prices, we embark on a journey through the realms of time-series analysis. Time-series analysis, a powerful tool in the data scientist’s arsenal, allows us to dissect historical data trends, unveiling patterns and insights that can help us anticipate future movements. In this blog post, we aim to demystify the process by employing Long Short-Term Memory (LSTM) networks, a sophisticated form of artificial intelligence designed to handle sequential data.

Before we immerse ourselves in the intricacies of LSTM and Bitcoin price predictions, let’s take a step back to understand the context. Cryptocurrencies, born out of the need for a decentralised and transparent financial system, have emerged as a disruptive force challenging the conventional notions of currency and value exchange.

Bitcoin, the pioneer and poster child of cryptocurrencies, has faced its fair share of triumphs and tribulations. In the dawn of 2024, the cryptocurrency landscape witnessed a monumental event — the approval of a Bitcoin Exchange-Traded Fund (ETF) in January, a development that further legitimised the digital asset in the eyes of traditional investors.

Adding to the intrigue, a significant event looms on the horizon — the Bitcoin halving scheduled for April 2024. This predetermined event, encoded in the Bitcoin protocol, reduces the reward for miners by half approximately every four years. As the supply of new Bitcoin diminishes, the halving has historically been associated with notable price fluctuations, making it a compelling catalyst for our exploration into predictive analytics.

With the Bitcoin ETF approval and the impending halving as our motivation, we embark on a tutorial in time-series analysis, leveraging historical data to uncover potential patterns and trends in Bitcoin prices. However, it's essential to emphasise that the information presented in this tutorial is not financial advice. Rather, it serves as an educational exploration into the fascinating world of time-series analysis and its application in predicting cryptocurrency prices.

So, fasten your seatbelts as we navigate the tumultuous waves of Bitcoin’s historical data, guided by the analytical prowess of LSTM networks, on our quest to unravel the mysteries of cryptocurrency price prediction.

Data preprocessing

As always, our project starts with the dataset. Investing.com provides Bitcoin price history data. At the time of writing, the last day of data available was 3rd February 2024. I set the start of the date filter to 1st January 2014 (however, you can go as far back as you'd like!).

Some of the information provided in the dataset includes:

  • Date: The day of the sample
  • Price: The closing price for that day
  • Open: The opening price for that day
  • High: The highest price for that day
  • Low: The lowest price for that day
  • Vol.: The volume of Bitcoin traded that day
  • Change %: The percentage change between the current day's price and the previous day's price

The Vol. column is provided as a float followed by a multiplier suffix, e.g. 32.15K. To get this into numerical format, we will extract the numerical part and multiply it by the given multiplier. So for the example 32.15K we'll return 32150. The three possible multipliers are K (thousands), M (millions) and B (billions, yes billions!).

The Change % is provided as a percentage, for example -0.29%. To get this in a numerical format, we’ll simply extract the numerical part. For the example -0.29% we’ll return -0.29.

From the Date column, we can extract the Day, Month, and Year of the sample.

The Price, Open, High, and Low columns are provided in text-based format, e.g. 43,070.1. So we’ll remove the comma and convert the column type to float.

The code for the above transformations is provided below.

import pandas as pd

# Load the raw CSV exported from Investing.com
raw_data = pd.read_csv('./Data/bitcoin-03Feb2024.csv')

# Derive the volume multiplier from the suffix in the Vol. column (K, M or B)
raw_data['Multiplier'] = raw_data['Vol.'].apply(
    lambda x: 1000 if 'K' in x else (1000000 if 'M' in x else 1000000000))

def preprocessing(raw_data):
    data = raw_data
    # Parse the date and extract its components
    data['Date'] = pd.to_datetime(data['Date'])
    data['Day'] = data['Date'].dt.day
    data['Month'] = data['Date'].dt.month
    data['Month'] = data['Month'].astype(str).str.zfill(2)
    data['Year'] = data['Date'].dt.year
    # Remove thousands separators and convert the price columns to floats
    data['Price'] = data['Price'].apply(lambda x: x.replace(',', '')).astype(float).round(2)
    data['Open'] = data['Open'].apply(lambda x: x.replace(',', '')).astype(float).round(2)
    data['High'] = data['High'].apply(lambda x: x.replace(',', '')).astype(float).round(2)
    data['Low'] = data['Low'].apply(lambda x: x.replace(',', '')).astype(float).round(2)
    # Strip the % sign from Change %
    data['Change %'] = data['Change %'].apply(lambda x: x.replace('%', '')).astype(float)
    # Strip the suffix from Vol. and apply the multiplier
    data['Vol.'] = data['Vol.'].apply(lambda x: x.replace('K', '').replace('M', '').replace('B', '')).astype(float)
    data['Vol.'] = data['Vol.'] * data['Multiplier']
    # Sort chronologically
    data = data.sort_values(by=['Date']).reset_index(drop=True)
    return data[['Date', 'Price']]

processed = preprocessing(raw_data)
processed

Although we have been provided with multiple features, for the sake of simplicity we'll treat this as a univariate problem. Univariate data refers to a single variable measured over a period of time. In the context of time-series analysis, this variable represents the main focus of the study. In our use-case, we consider the daily closing prices of Bitcoin: the variable of interest is the Bitcoin price, and each data point corresponds to the closing price at the end of a trading day.

In contrast, multivariate data involves multiple variables measured over the same period of time. Each variable may influence, or be influenced by, one or more other variables. In our use-case, the Open, High, Low, Vol., and Change % features would all be used to predict the Bitcoin closing price.

The defining difference between time-distributed data and cross-sectional data is that cross-sectional data captures information at a single point in time (think back to our Fraud Predictor, where we captured fraudulent transactions at a single point in time), whereas time-distributed data captures information over a specified period of time (for example, we can use 5 days of Bitcoin closing prices to predict the 6th day's price).

As such, we have to prepare our time-distributed data a little differently. We need to pick a window period (i.e. how many days in the past we want to look) and a predicting period (i.e. how many days in the future we want to predict). For simplicity, we will use a window period of 2 days and a predicting period of 1 day; in simple terms, we will use 2 days' history of Bitcoin prices to predict the 3rd day's price (yes, we are predicting the future!).
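To make the windowing idea concrete, here is a tiny illustration on made-up prices (purely for intuition; the real implementation follows below):

# Toy example: turn a series of 5 prices into (2-day window, next-day target) pairs
prices = [100, 102, 101, 105, 107]

window = 2
for i in range(len(prices) - window):
    inputs = prices[i:i + window]   # the 2-day history
    target = prices[i + window]     # the price we want to predict
    print(inputs, '->', target)

# [100, 102] -> 101
# [102, 101] -> 105
# [101, 105] -> 107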

To accomplish this, let’s define a global variable to represent our window period:

# GLOBAL VARS
n_time_steps = 2 # Our window period

Next, we need a function to create our input features (i.e. 2 days' Bitcoin closing prices) and our output (i.e. the 3rd day's Bitcoin price):

import numpy as np
import pickle
from sklearn.preprocessing import MinMaxScaler

def prepare_lstm_data(df, input_features=['Price'],
                      target_feature='Price', time_steps=10):
    """
    Prepare data for LSTM training.

    Parameters:
    - df: DataFrame with daily-level bitcoin prices.
    - input_features: List of feature columns to be used as input for the LSTM model.
    - target_feature: The target feature column to be predicted.
    - time_steps: Number of past days to use as input for predicting the next day.

    Returns:
    - X: Input data in the shape (NUMBER_OF_SAMPLES, TIME_PERIOD, FEATURES).
    - y: Target data for prediction.
    """

    # Normalise the data to the range 0-1
    scaler = MinMaxScaler(feature_range=(0, 1))
    df['Price'] = scaler.fit_transform(df['Price'].values.reshape(-1, 1))

    # Select relevant columns
    selected_columns = input_features
    df_selected = df[selected_columns].copy()

    # Persist the scaler so we can reverse the normalisation later
    pickle.dump(scaler, open('./scaler', 'wb'))

    # Create sequences of input data and target values
    sequences, targets = [], []
    for i in range(len(df_selected) - time_steps):
        seq = df_selected.iloc[i:i + time_steps][input_features].values
        target = df_selected.iloc[i + time_steps][target_feature]
        sequences.append(seq)
        targets.append(target)

    X = np.array(sequences)
    y = np.array(targets)

    return X, y

# Example usage with the DataFrame produced by the preprocessing step above
# Adjust the parameters as needed, e.g. time_steps, input_features, target_feature
X, y = prepare_lstm_data(processed, time_steps=n_time_steps)
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

The shape of our input (X) is (3684, 2, 1). Simply put, this means that we have 3 684 samples, each containing 2 days of history and 1 variable.

The shape of our output (y) is (3684,), meaning that for each of the 3 684 samples we have 1 output.

Notice the use of the MinMaxScaler to normalise our data. Normalisation scales the data to a common range (0–1 in our case), preventing the model from being influenced by the magnitude of prices (think about a $500 Bitcoin price vs a $45,000 price!). Normalisation enables the model to focus on learning patterns and relationships in the data rather than being biased by the scale of the input features.
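As a quick illustration of what the scaler is doing (a standalone sketch, separate from the pipeline above):

from sklearn.preprocessing import MinMaxScaler
import numpy as np

prices = np.array([[500.0], [20000.0], [45000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(prices)
print(scaled.ravel())                            # [0.    0.438 1.   ]

# inverse_transform recovers the original prices, which is exactly
# what we'll do later to turn predictions back into dollar values
print(scaler.inverse_transform(scaled).ravel())  # [  500. 20000. 45000.]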

Don’t feel disheartened if you don’t immediately grasp this concept of a 3-dimensional input! It took me some time to fully understand it as well. To illustrate it further let’s look at an example:

The first row of X is [[0.01044703], [0.01105519]], which represents the (normalised) Bitcoin prices for 2 days (1 and 2 Jan 2014), and the first row of y is 0.011461619815979223, which represents the (normalised) Bitcoin price for the 3rd day (3 Jan 2014).

The next step is to create a function that will split our data into training and testing subsets. This is a bit different from how we might do it for cross-sectional data (see the Fraud Predictor tutorial for a comparison). When splitting time-distributed data, we need to take the order of the data into account too. For example, we might want our training subset to be all dates before an arbitrarily chosen date (e.g. 23 Dec 2023). In our case though, I want to follow a more traditional split, where the training subset contains the first 80% of the data and the testing subset contains the remaining 20%. As such, we can create a function as follows:

def time_based_train_test_split(X, y, test_size=0.2):
    """
    Time-based train-test split for LSTM data.

    Parameters:
    - X: Input data.
    - y: Target data.
    - test_size: Proportion of the dataset to include in the test split.

    Returns:
    - X_train, X_test, y_train, y_test: Train and test sets for X and y.
    """

    # Calculate the split index
    split_index = int(len(X) * (1 - test_size))

    # Split the data: the earliest 80% becomes training, the latest 20% testing
    X_train, X_test = X[:split_index], X[split_index:]
    y_train, y_test = y[:split_index], y[split_index:]

    return X_train, X_test, y_train, y_test
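
Calling the function is then a one-liner (a quick sketch; the variable names follow on from the earlier steps):

# Chronological 80/20 split of the windowed data
X_train, X_test, y_train, y_test = time_based_train_test_split(X, y, test_size=0.2)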

If we print the shapes of the subsets we get the following:

Shape of X_train: (2947, 2, 1)
Shape of y_train: (2947,)
Shape of X_test: (737, 2, 1)
Shape of y_test: (737,)

Excellent! Our training and testing subsets are the correct dimensions and are in the correct order! We’re now ready to train a model!

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to analyse sequences of data, making them particularly useful for tasks involving time-series, text, audio, and more. What sets RNNs apart is their ability to retain memory of previous inputs through recurrent connections, allowing them to process sequential information effectively. Imagine reading a sentence word by word; an RNN processes each word while maintaining a memory of the words it has seen so far. This memory enables RNNs to capture dependencies and patterns in sequential data, making them suitable for tasks like language translation, sentiment analysis, and time-series prediction.

However, traditional RNNs can struggle with long-term dependencies due to vanishing or exploding gradient issues. This led to the development of more advanced variants like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), which address these problems by selectively remembering and forgetting information over long sequences, enhancing their ability to learn and retain patterns in sequential data.

The specific network I’ve chosen for this tutorial is the Long Short-Term Memory (LSTM) network. LSTMs have a special architecture with gates that control the flow of information, allowing them to selectively remember or forget information over time. This enables LSTMs to effectively process and learn from sequences of data, making them well-suited for tasks like natural language processing, time-series prediction, and speech recognition. For more information on LSTMs please consult the following article.

TensorFlow allows us to create and train our very own LSTMs (amongst many other neural networks!). Specifically, the architecture used for training is as follows:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

# Assuming you have already prepared and split your data into X_train and y_train

# Define the LSTM model
model = Sequential()
model.add(LSTM(units=300, activation='relu',
               input_shape=(X_train.shape[1], X_train.shape[2]),
               return_sequences=True))
model.add(LSTM(units=200, activation='relu', return_sequences=True))
model.add(LSTM(units=100, activation='relu'))
model.add(Dense(units=50, activation='relu'))
model.add(Dense(units=1, activation='softplus'))  # Output layer with one neuron for regression

custom_optimizer = Adam(learning_rate=0.0001)

# Compile the model
model.compile(optimizer=custom_optimizer, loss='logcosh', metrics=['mean_absolute_percentage_error'])

# Train the model
history = model.fit(X_train, y_train, epochs=25, batch_size=64, verbose=1)

The model has multiple LSTM layers followed by Dense layers, which are fully connected layers. These layers progressively reduce the size of the data while extracting important features. The model is then compiled with an optimiser (Adam) and a loss function (‘logcosh’) to measure how well it’s performing. Finally, the model is trained on the provided data (X_train and y_train) for 25 epochs (training iterations) with a batch size of 64, and the training history is stored for later analysis.
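
If you want to inspect the resulting architecture for yourself, Keras provides a handy summary method:

# Print a layer-by-layer overview of the network, including output shapes
# and the number of trainable parameters in each layer
model.summary()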

Let’s look at the model performance during training:

The graph above shows that the training loss is decreasing over time, which means that our model is learning! However, the loss plateaus around epoch 10, which means we could probably stop training then. Please refer to TensorFlow's documentation to understand the concept of Early Stopping, and try to implement it yourself as practice.
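
As a starting point, a minimal sketch of early stopping with Keras might look like this (monitoring the training loss here, since we haven't set aside a validation split; the patience value is just an illustrative choice):

from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the loss hasn't improved for 3 consecutive epochs,
# and restore the weights from the best epoch seen so far
early_stop = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=25, batch_size=64,
                    verbose=1, callbacks=[early_stop])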

While we can see that the loss is decreasing, I wonder what’s happening to the mean absolute percentage error?

Again, the MAPE is decreasing over time (meaning that the model is getting better). However, the MAPE also starts to plateau around epoch 10. I wonder what might happen if we increase or decrease the learning rate; try it out!

Evaluating the model on the testing subset

Let’s now have a look at the predictions on the testing subset. Remember, because we normalised our data previously, we have to un-normalise our predictions to get the actual Bitcoin price:

import pickle
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# Load scaler object
scaler = pickle.load(open('./scaler', 'rb'))

# Evaluate the model on the test set
test_pred = model.predict(X_test)
train_pred = model.predict(X_train)

# Reverse the normalisation so predictions are back in dollar terms
test_pred = scaler.inverse_transform(test_pred)
train_pred = scaler.inverse_transform(train_pred)

y_test_inv = scaler.inverse_transform(y_test.reshape((-1, 1)))
y_train_inv = scaler.inverse_transform(y_train.reshape((-1, 1)))

# Calculate error metrics on the test set
mse = mean_squared_error(y_test_inv, test_pred)
mape = mean_absolute_percentage_error(y_test_inv, test_pred)
print("Mean Squared Error on Test Set:", mse)
print(f"Mean Absolute Percentage Error: {round(mape, 2)}%")

Which returns the output:

Mean Squared Error on Test Set: 3574302.90
Mean Absolute Percentage Error: 0.06%

Although the MSE looks high in absolute terms, the MAPE is very decent! (Note that scikit-learn's mean_absolute_percentage_error returns a fraction rather than a percentage, so the 0.06 above corresponds to an error of roughly 6% of the actual price.) Let's now plot the predicted values vs the actual values to see how they compare:
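
One way to produce this comparison plot (a quick matplotlib sketch; the exact plotting code used for the figures in this article is in the GitHub repo):

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(test_pred, label='Predicted price')   # drawn in blue by default
plt.plot(y_test_inv, label='Actual price')     # drawn in orange by default
plt.xlabel('Days in the testing subset')
plt.ylabel('Bitcoin price (USD)')
plt.legend()
plt.show()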

Wow! The plot above shows that our predicted values (blue) are quite close to our actual values (orange)!

Let’s now plot our predicted values with the training data, to see if our predicted values follow the historical trends!

Fantastic! The above two charts show that the LSTM model has learned the patterns between the sequences quite well!

Finally, let’s try to use our model to predict Bitcoin’s price in the future!

Predicting the future

First, we need to determine how many days into the future we want to predict. I will choose 2 days and 7 days. However, there is one caveat: when predicting the future price, we will have to use our own predictions as inputs into the model, which means that our error will compound the further into the future we predict!

First, let’s define a function that will allow us to predict into the future. This function uses the last two prices we have to predict the 3rd price. When predicting the 4th price, the function replaces the second-last price with the last price, and the last price with the predicted 3rd price. This is a bit complex, so let’s illustrate with an example:

The last 2 (normalised) prices we have are [0.639061, 0.637212]. The predicted (normalised) 3rd price is [0.64]. To predict the 4th price, our input now becomes [0.637212, 0.64]. As you can see, the predicted price is now used as an input to the model.
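
A minimal sketch of such a rolling-forecast function might look like the following (the function name and details here are my own illustration of the approach described above; the article's exact implementation is on GitHub):

import numpy as np

def predict_future(model, last_window, n_days, scaler):
    """
    Roll the model forward n_days into the future, feeding each
    prediction back in as part of the next input window.

    - last_window: array of the most recent n_time_steps normalised prices,
      shaped (n_time_steps, 1)
    - Returns the predicted prices in the original (un-normalised) scale.
    """
    window = last_window.copy()
    predictions = []

    for _ in range(n_days):
        # Predict the next (normalised) price from the current window
        next_price = model.predict(window.reshape(1, *window.shape), verbose=0)[0, 0]
        predictions.append(next_price)

        # Slide the window: drop the oldest price, append the prediction
        window = np.append(window[1:], [[next_price]], axis=0)

    # Convert the normalised predictions back to dollar values
    return scaler.inverse_transform(np.array(predictions).reshape(-1, 1))

# Example: build the most recent 2-day window from the price series
# (prepare_lstm_data normalised the Price column in place above)
last_window = processed['Price'].values[-n_time_steps:].reshape(-1, 1)
future_prices = predict_future(model, last_window, n_days=2, scaler=scaler)
print(future_prices)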

Remember, the last day of data we have is 3 Feb 2024. So for the 2-day prediction, we will predict the prices for 4 and 5 Feb 2024:

Date       | PredictedPrice
2024-02-04 | 45172.585938
2024-02-05 | 45732.656250

For the 7-day prediction, we’ll predict the prices for 4–10 Feb 2024:

Date       | PredictedPrice
2024-02-04 | 45172.585938
2024-02-05 | 45732.656250
2024-02-06 | 47189.441406
2024-02-07 | 47867.480469
2024-02-08 | 48990.777344
2024-02-09 | 49704.421875
2024-02-10 | 50590.484375

Interesting! Remember, because the predictions of the model are used as an input for further predictions, the error compounds and therefore the predictions start to be less reliable the further into the future we predict!

All code for this tutorial is available on GitHub.

Conclusion

In this article, we explored the fascinating realm of time-series analysis, using Bitcoin price prediction as a use-case.

We explored the various nuances that come with pre-processing time-distributed data, such as working with 3-dimensional arrays and creating ordered training and testing subsets.

We introduced the concept of Recurrent Neural Networks and specifically looked at the LSTM network for the problem of predicting Bitcoin prices. We saw that our model learned the historical trends quite well, but that when we try to predict more than 2 days into the future, the error compounds and the predictions become less reliable!

I hope you enjoyed this topic as much as I did! As always, please follow and join me in exploring the boundless possibilities that a career in Data Science can offer. The journey is challenging, but the destination is worth every step.
