The hidden secrets of the bitcoin price

Exploring the bitcoin price from a statistical point of view.

Loic Tetrel
Analytics Vidhya
10 min read · Jan 16, 2021


Photo by Pixabay from Pexels

Bitcoin is a digital currency created in 2009 by Satoshi Nakamoto, who described it as a “peer-to-peer version of electronic cash”. One big advantage of bitcoin (and other cryptocurrencies) is that all the data is open and immutable, residing inside the blockchain. This openness and immutability have made blockchain research really active, mostly around price forecasting. Many, rightfully, rush into the blockchain data (addresses, transactions, etc.), but I will show in this post that the bitcoin price itself is already really informative. Understanding how the price behaves will make a substantial difference in the choice of models and parameters for predicting it.

The behaviour of the price is best understood via two main properties of time series: stationarity and seasonality. For example, a stationary time series can be much easier to model than a non-stationary one.

In what follows, I will share with you my thought process in looking at the price, using statistical tools and Python programming.

Note
The original and interactive version of this post (via Binder) is available on my website; please check it if you want more information.
This post was rendered using the jupyter_to_medium tool.

tl;dr

  1. Two important properties of a time series: stationarity (the distribution does not depend on time) and seasonality (recurrent patterns in the data).
  2. Auto-correlation to check whether the data is non-stationary; the derivative, or filtering and subtraction, to remove the non-stationary component.
  3. FFT and short-time FFT to analyse the seasonality.

1. A quick primer on time series

As I said, there are two important properties attached to time series: seasonality and stationarity.
A stationary process means that the distribution (statistical properties) of the data does not change over time, which is why it is much easier to model. Seasonality represents how frequently the data change (for the bitcoin price, we can express it in cycles per day), and also when the pattern starts.

We will first focus on the analysis of the stationarity, and then on the seasonality.

1.1. Stationarity

One way to detect whether data is stationary is to compute its autocorrelation: if it decays quickly, the data is stationary.
There are many different types of non-stationary data in the literature, so I suggest reading the following post if you want to learn more about it. Check also this figure and try to guess which time series is stationary!
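To build intuition, here is a minimal sketch on synthetic data (my own toy example, not the bitcoin series): the autocorrelation of white noise collapses immediately, while that of a random walk decays very slowly.

# toy example: autocorrelation decay of a stationary vs a non-stationary series
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)   # white noise: stationary
walk = np.cumsum(noise)         # random walk: non-stationary

def autocorr_at(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for lag in (1, 10, 50):
    print(lag, round(autocorr_at(noise, lag), 3), round(autocorr_at(walk, lag), 3))
# the noise autocorrelation is ~0 at every lag,
# while the random walk stays close to 1 even at lag 50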

1.2. Seasonality

To analyse the seasonality of the bitcoin price, we can perform a Fourier analysis to extract the most prominent frequencies.
The magnitude of the FFT informs us how much a given frequency component affects the price. On the other hand, the phase of the FFT is interesting to check when the dynamic of the price starts. If the magnitude or phase has a random white-noise trend, then there is no evidence of a principal component.
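As a quick illustration (again a toy example of my own, not the bitcoin data), the FFT recovers a known seasonal frequency buried in noise:

# toy example: recovering a known seasonal frequency with the FFT
import numpy as np

t = np.arange(0, 30, 1/24)  # 30 days of hourly samples, in days
signal = np.sin(2*np.pi*2*t) + 0.5*np.random.randn(len(t))  # 2 cycles/day + noise

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(t), d=1/24)  # frequencies in cycles per day
peak = freqs[np.argmax(np.abs(spectrum[1:])) + 1]  # skip the DC component
print("dominant frequency: {:.2f} cycles/day".format(peak))  # ~2.00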

Check this nice blog post if you want to learn more about seasonality.

2. Code

2.1. Loading the data

The hourly USD price for bitcoin can be collected using Glassnode, with their advanced subscription. If you don’t want to pay for it, the 24-hour data comes free of charge. Here we will use hourly data to get a more precise analysis.

### imports
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import warnings
warnings.filterwarnings("ignore")

First, we want to create a function to load the data.

# function to read the data
def read_data(filepath):
    price = []
    time = []

    with open(filepath) as f:
        lines = f.readlines()[1:]  # skip the csv header

    for data in lines:
        curr_time = float(data.split(",")[0])
        curr_price = -1  # missing prices are marked with -1
        if data.split(",")[1][:-1]:
            curr_price = float(data.split(",")[1][:-1])
        time += [curr_time]
        price += [curr_price]

    return np.array(price, dtype=np.float32), np.array(time, dtype=int)

Now we will load the data by skipping the first year.

# define paths
filepath = "../data/market/price_usd_close_BTC_1h"
figure_dir = ""

# loading the hourly data; to avoid uninformative data, we skip the first year (8760 h)
price, time = read_data(filepath)
time_shifted = time - time[0]
price = price[8760:]
time_shifted = time_shifted[8760:]

Let’s look at the bitcoin price over time:

### plot
plt.figure()
plt.plot(time_shifted, price)
plt.title("Bitcoin price over time (USD)")
plt.ylabel("price (USD)")
plt.xlabel("time (h)")
if figure_dir:
    plt.savefig(os.path.join(figure_dir, "price.png"))
plt.show()
plt.close()

The non-stationary behaviour of the data is obvious when looking at the bitcoin price. We can also clearly see the big rises of Dec 2017 and Dec 2020.

2.2. Stationarity

One way to remove the non-stationary component of the data is to compute its derivative. Another way is to filter the data with a Gaussian kernel, and subtract the result from the original price data.

# derivative
price_dt = price[1:] - price[:-1]

# filter
filter_width = 12
def gaussian_kernel_1d(filter_width):
    # 99% of the kernel mass lies within the filter width
    sigma = filter_width / 2.33
    norm = 1.0 / (np.sqrt(2*np.pi) * sigma)
    kernel = [norm * np.exp((-1)*(x**2)/(2 * sigma**2)) for x in range(-filter_width, filter_width + 1)]
    return np.float32(kernel / np.sum(kernel))

f = tf.reshape(gaussian_kernel_1d(filter_width), [-1, 1, 1])
tf_price = tf.reshape(tf.constant(price, dtype=tf.float32), [1, -1, 1])
tf_price = tf.reshape(tf.nn.conv1d(tf_price, filters=f, stride=1, padding='VALID'), [-1])
# padding is necessary to keep the same dimension as the input
tf_price = tf.concat([tf.constant(tf_price[0].numpy(), shape=[filter_width]), tf_price], axis=0)
filt_price = tf.concat([tf_price, tf.constant(tf_price[-1].numpy(), shape=[filter_width])], axis=0).numpy()
price_centered = price - filt_price

By comparing the two methods (derivative and filtering), we see that the resulting prices are now zero-centered. They are shown in orange in the charts below:

### plot
fig, axes = plt.subplots(2, figsize=(12, 8))
axes[0].plot(time_shifted, price, label="non-stationary bitcoin price")
axes[0].plot(time_shifted[:-1], price_dt, label="stationary bitcoin price")
axes[0].set_title('Derivative method')
axes[0].legend(loc="upper left")
axes[1].plot(time_shifted, price, label="non-stationary bitcoin price")
axes[1].plot(time_shifted, price_centered, label="stationary bitcoin price")
axes[1].plot(time_shifted, filt_price, label="filtered bitcoin price")
axes[1].set_title('Filtering and subtraction method')
axes[1].legend(loc="upper left")
if figure_dir:
    plt.savefig(os.path.join(figure_dir, "price_stationnarity.png"))
plt.show()
plt.close()

To verify the quality of the process, one can check the auto-correlation of both the raw price data and the stationary price data obtained with the filtering method. This will tell us how stationary the data really is after the process.

We will compute the auto-correlations with delays of up to 2 days, in one-hour steps.

### auto-correlation function
def autocorr(signal, delay):
    signal = tf.constant(signal, dtype=tf.float32)
    # circular shift of the signal by `delay` samples
    signal_delayed = tf.roll(signal, shift=delay, axis=0)

    x1 = tf.reshape(signal, [1, -1, 1])
    x2 = tf.reshape(signal_delayed, [-1, 1, 1])
    # correlate the signal with its delayed version (the output is a scalar)
    return tf.reshape(tf.nn.conv1d(x1, filters=x2, stride=1, padding='VALID'), [-1])

# autocorrelation of the price for different delays
delays = np.arange(0, 48)
# raw price data
autocorr_price = []
for hour in delays:
    autocorr_price += [autocorr(price, hour)]
# stationary data
autocorr_centered_price = []
for hour in delays:
    autocorr_centered_price += [autocorr(price_centered, hour)]

Looking at the plot, it is clear that the auto-correlation of the stationary data decays much faster than that of the raw price data. This means that we successfully removed the non-stationary component from the price!

### plot
fig, axes = plt.subplots(2, figsize=(12, 8))
axes[0].stem(delays, autocorr_centered_price, linefmt='b-', markerfmt='bo', basefmt='', use_line_collection=True)
axes[0].set_title('stationary bitcoin price auto-correlation')
axes[1].stem(delays, autocorr_price, linefmt='b-', markerfmt='bo', basefmt='', use_line_collection=True)
axes[1].set_title('raw bitcoin price auto-correlation')
axes[1].set(xlabel='delay (h)', ylabel='amplitude')
if figure_dir:
    plt.savefig(os.path.join(figure_dir, "check_stationnarity.png"))
plt.show()
plt.close()

Looking into the stationarity also allows us to determine the prediction window that is most suitable for the data, for example by checking how fast, for a given timestamp, the distribution of the raw price differs from that of its neighbours.

By comparing the histogram of each timestamp with those of its neighbours (i.e. computing their correlation), one can get an overview of the acceptable range for a prediction. The idea is that if the distributions are close to each other, the price is easier to predict.

### histogram function
def data_distribution(inp):
    return np.histogram(inp, range=(0, 20000), bins=500, density=True)[0]

win_size = 2*24  # distribution of the data is calculated within 2 days (in hours)
slide = 5*24  # we slide up to -/+ 5 days
corr = []

# loop through all timestamps, sampling one every 72 h (3 days)
timestamps_range = np.arange(slide + int(win_size/2), len(price) - slide - int(win_size/2), 72)
sliding_range = np.arange(-slide, slide + 1)
for i in timestamps_range:
    idx = i - int(win_size/2)
    # distribution of the price over a 2-day window centred on the timestamp: the fixed distribution
    fixed_price = price[idx:int(idx + win_size)]
    fixed_distrib = data_distribution(fixed_price)
    curr_corr = []
    # compare to the distribution at different timestamps (sliding from -120 h to +120 h): the moving distribution
    for offset in sliding_range:
        idx = offset + i - int(win_size/2)
        moving_price = price[idx:(idx + win_size)]
        moving_distrib = data_distribution(moving_price)
        curr_corr += [np.correlate(fixed_distrib, moving_distrib)]
    curr_corr = curr_corr / np.max(curr_corr)
    corr += [curr_corr]
    if i % 7992 == 0:
        print("day {}/{}".format(i/24, len(price)/24))

output = np.array(corr)[:, :, 0]
day 333.0/3439.5
day 666.0/3439.5
day 999.0/3439.5
day 1332.0/3439.5
day 1665.0/3439.5
day 1998.0/3439.5
day 2331.0/3439.5
day 2664.0/3439.5
day 2997.0/3439.5
day 3330.0/3439.5

In the following plot, the y-axis describes samples taken at different timestamps of the bitcoin price. From top to bottom, they follow chronological order, but this is not important since each sample can be taken independently. The x-axis describes the different offsets used to compute the histograms (from -120 hours to +120 hours). The colour is the resulting correlation between these distributions and the distribution at timestamp h0 (the current timestamp of the sample).

### plot
plt.imshow(output, cmap="gray")
plt.axis("tight")
idx_sliding_range = np.arange(0, len(sliding_range), 30)
plt.xticks([i for i in idx_sliding_range], ["h{}".format(sliding_range[i]) for i in idx_sliding_range])
plt.xlabel("time offset (h)")
plt.ylabel("samples")
cbar = plt.colorbar()
cbar.set_label('correlation value')
if figure_dir:
    plt.imsave(os.path.join(figure_dir, "range_accuracy.png"), output, cmap="gray")
plt.show()
plt.close()

Looking at it, we can say that the acceptable range for prediction is around +/-15 hours.

Note
The colour range is very granular, and sometimes constant. This comes from the number of bins in the histogram (500) and price values ranging from $0 to $20k, meaning the precision is about ~$40. So if the price stays inside a $40 range for a certain period, the histograms will match perfectly.
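A quick check of this bin-width arithmetic (a toy example with made-up price values):

# the bin width is 20000 / 500 = 40 dollars, so two windows whose prices
# stay inside the same $40 bin yield identical histograms
import numpy as np

edges = np.histogram_bin_edges([], range=(0, 20000), bins=500)
print(edges[1] - edges[0])  # 40.0

a = np.histogram([10005, 10010, 10035], range=(0, 20000), bins=500, density=True)[0]
b = np.histogram([10001, 10020, 10039], range=(0, 20000), bins=500, density=True)[0]
print(np.array_equal(a, b))  # True: a perfect match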

2.3. Seasonality

Let’s now switch to the seasonality analysis by computing the FFT, and extracting its magnitude and phase components.
As explained before, the FFT will be used here to understand the redundant patterns in the data. Because the FFT works better on LTI (linear and time-invariant) systems, it cannot be applied to the raw bitcoin price (which is not stationary!). Therefore we will apply it to the stationary bitcoin price.

# fft (tf.signal.fft expects a complex input)
price_fouried = tf.signal.fft(tf.cast(price_centered, tf.complex64))
T = 1/24  # sampling interval in days
N = price_fouried.shape[0]
frequencies = np.linspace(0, 1 / T, N)

In the figure below, there is no clear evidence of a pattern, although we do see an important frequency band ranging from 1 to 1.9 cycles per day, with a little peak at 1.52. This means that the bitcoin price can “generally” be explained by a sinusoid with a period of ~15.8 hours (24 h / 1.52 cycles per day ≈ 15.8 h).

### plot
fig, axes = plt.subplots(2, figsize=(12, 8))
axes[0].plot(frequencies[:N // 2], tf.abs(price_fouried)[:N // 2] * 1 / N)
axes[0].set_title('FFT magnitude')
axes[1].plot(frequencies[:N // 2], tf.math.angle(price_fouried)[:N // 2])
axes[1].set_title('FFT phase')
axes[1].set(xlabel='cycles per day', ylabel='amplitude')
if figure_dir:
    plt.savefig(os.path.join(figure_dir, "fft.png"))
plt.show()
plt.close()

Another way to analyse seasonality on non-stationary data is to compute its spectrogram (derived from a time-frequency analysis).
A spectrogram is a visual representation of a signal’s spectrum of frequencies over time. It is commonly used (for example by spleeter) to extract voice from audio signals. The spectrogram can be computed using a short-time Fourier transform, which basically runs a Fourier transform on a short window sliding through all the data.

Here, we will use a window size of 48 samples (hours), with a step of 1 and an FFT length of 125, giving 63 frequency components.

# tensorflow provides a fast implementation of the short-time fourier transform.
stft = tf.signal.stft(price, frame_length=48, frame_step=1, fft_length=125, pad_end=True)
spectrogram = tf.abs(stft).numpy()

### plot
# inspired from https://www.tensorflow.org/tutorials/audio/simple_audio
# convert to log scale and transpose so that the time is represented on the x-axis (columns).
fig, axes = plt.subplots(2, figsize=(12, 8))
max_time = np.max(time_shifted)
axes[0].plot(time_shifted, price)
axes[0].set_xlim([0, max_time])
axes[0].set_title('non-stationary bitcoin price')
log_spec = np.log(spectrogram.T)
axes[1].pcolormesh(time_shifted, np.arange(log_spec.shape[0]), log_spec)
axes[1].set_xlim([0, max_time])
axes[1].set_title('Spectrogram (short-time FFT)')
axes[1].set(xlabel='time (h)', ylabel='frequencies')
if figure_dir:
    plt.savefig(os.path.join(figure_dir, "spectrogram.png"))
plt.show()
plt.close()

Looking at the figure, whenever there are big changes in the data (for example Dec 2017), there is a much higher magnitude response. Generally speaking, the spectrum looks like white noise at every point in time.

Conclusion

In light of the properties that we saw above, one thing can be said with certainty: predicting the bitcoin price is no easy task because of its time dependency.

Happily, we found a way to simplify the process by removing the non-stationary component of the data (so it no longer depends on time). This allowed us to analyse redundant patterns in the data, and we found that such a pattern exists. Recurrent patterns are interesting because they can later be used as a new feature in a predictive model (think of adding the time of day to a weather prediction model, for example).
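As a sketch of that idea (my own assumption of how one might use the pattern, not something done in this post), the ~15.8 h cycle could be encoded as sin/cos features for a downstream model:

# hypothetical feature engineering: encode the ~15.8 h cycle as sin/cos features
import numpy as np

period_h = 15.8  # period found in the FFT analysis above
hours = np.arange(len(price))  # hourly timestamps, as in the post
cycle_sin = np.sin(2*np.pi*hours/period_h)
cycle_cos = np.cos(2*np.pi*hours/period_h)
features = np.stack([price, cycle_sin, cycle_cos], axis=1)  # input for a model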

These findings open up new ways to build an accurate predictive model for the bitcoin price, but that is another story…

To go further

I highly suggest reading the book by Hyndman; it covers all the best practices for time-series forecasting, as well as coding examples. The online version is available here.

Acknowledgement

Thanks to Vahid Zarifpayam for the review of this post.
Credits go to Bitprobe.
