# Using Quandl Bitcoin Data to Build a Time Series Forecast in Python

Anyone interested in trading Bitcoin would love to know its future price. There are a number of statistical approaches one could take to predict the price in an hour, a day, or a week. For example, one could build a predictive model that takes in a set of variables and outputs a predicted price. However, achieving high accuracy with such a model can be extremely challenging: there are so many candidate variables and models that picking the best one usually requires a strong quantitative and statistical background. A good first step for anyone interested in trading Bitcoin quantitatively is therefore to examine the overall trend of the price data. This is where time series analysis is useful.

Many time series forecasting methods require stationarity. This means the series has a constant mean, variance, and autocorrelation over time. Unsurprisingly, this is extremely rare in practice, and if you have looked at any Bitcoin charts it is clear that the raw data does not satisfy this assumption. But that’s O.K. There are a number of transforms we can apply to the data to produce a series that does satisfy the stationarity assumption.
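To make the idea of a constant mean and variance concrete, here is a minimal sketch on synthetic data (not Bitcoin prices). It compares white noise, which is stationary, with a random walk, which is not, by computing the mean and variance of each half of the series. The helper name `half_stats` and the seed are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# White noise: independent draws, so mean and variance do not depend on time.
white_noise = rng.normal(loc=0.0, scale=1.0, size=n)

# Random walk: cumulative sum of noise, so its spread grows as time passes.
random_walk = np.cumsum(rng.normal(loc=0.0, scale=1.0, size=n))


def half_stats(x):
    """Return (mean, variance) for the first half and the second half of x."""
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())


print("white noise:", half_stats(white_noise))
print("random walk:", half_stats(random_walk))
```

For the white noise, both halves should report nearly the same mean and variance. For the random walk, the two halves will usually disagree, which is the kind of behavior the raw price data shows.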

So for this tutorial, we will examine Bitcoin low prices provided by the Quandl API. To inspect stationarity, we will produce data visualizations as well as test statistics which will provide a more concrete answer to whether or not the data is stationary.

Let’s fire up Python now. First, we need to import the necessary libraries. I like to put all of the libraries I will need at the top of the script for organizational value.

import quandl

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from statsmodels.tsa.stattools import adfuller

Next, we will need to set up our API key so we can actually access data from Quandl. If you have not done so already, sign up for a free account at Quandl.com. This will generate an API key for you to use below.

quandl.ApiConfig.api_key = "YOUR KEY HERE"

Now pull the data and check out the first three rows.

data = quandl.get("BITSTAMP/USD")

data.head(3)

One of my favorite parts about pulling data from Quandl is that the data frames are indexed by the actual dates. This means the Date index already holds datetime values, which makes it much easier to create a time series.
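As a rough illustration of why a datetime index is convenient, the sketch below builds a small stand-in frame (the values are made up, and `frame` is a hypothetical substitute for the Quandl result) and selects rows by date strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Quandl frame: daily rows indexed by date.
dates = pd.date_range("2017-01-01", periods=90, freq="D")
frame = pd.DataFrame({"Low": np.linspace(900.0, 1200.0, len(dates))}, index=dates)
frame.index.name = "Date"

# Because the index is a DatetimeIndex, rows can be selected with date strings.
february = frame.loc["2017-02"]
first_week = frame.loc["2017-01-01":"2017-01-07"]

# Resampling to weekly lows also relies on the datetime index.
weekly_low = frame["Low"].resample("W").min()

print(type(frame.index).__name__)
print(len(february), len(first_week))
print(weekly_low.head(3))
```

None of this would work as directly if the dates were stored as plain strings in an ordinary column.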

For this time series model, I will only be examining the Low column. So if you decide to plot another variable, the transforms I use may not be appropriate for your data. But I hope by the end of this I will have given you the tools to choose the best transform for whichever variable you decide to plot.

plt.plot(data['Low'])

plt.title('Lows')

plt.show()

This does not look very stationary. Let’s explore further by plotting the rolling mean and standard deviation. We will use the rolling method built into pandas, which replaces the older pd.rolling_mean and pd.rolling_std functions that have been removed from current pandas releases.

rollmean = data['Low'].rolling(window=12).mean()

rollstd = data['Low'].rolling(window=12).std()

plt.plot(data['Low'], color='blue',label='Original')

plt.plot(rollmean, color='red', label='Rolling Mean')

plt.plot(rollstd, color='black', label = 'Rolling Std')

plt.legend(loc='best')

plt.title('Rolling Mean & Standard Deviation')

plt.show()
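If the rolling statistics are unfamiliar, this small sketch on a toy series (the numbers 1 through 24, chosen only for illustration) shows what a 12-observation window computes and why the first few values are missing.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(1.0, 25.0))  # 1, 2, ..., 24

window = 12
rollmean = series.rolling(window=window).mean()
rollstd = series.rolling(window=window).std()

# The first window - 1 entries are NaN because a full window is not yet available.
print(rollmean.isna().sum())

# Each later entry summarizes only the most recent `window` observations.
manual_mean = series.iloc[0:window].mean()
print(rollmean.iloc[window - 1], manual_mean)
print(rollstd.iloc[window - 1])
```

On a stationary series, both rolling curves would hover around a fixed level; a rolling mean that drifts is a visual hint of non-stationarity.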

This looks like it violates our stationarity assumption because there is not a constant mean over time. However, simply looking at visualizations is not enough to determine stationarity. We can acquire more concrete results with a little statistical analysis.

We will use the Dickey-Fuller test. Without going into too much detail about the underlying statistical machinery, it tests the null hypothesis that the data has a unit root, which means it is not stationary. If the test statistic is less than the critical value, we can reject the null hypothesis and conclude that the data is stationary.
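For readers who want a feel for the machinery, here is a simplified sketch of the idea behind the test, written with numpy only. It regresses the change in the series on its previous value and reports the t-statistic of that coefficient. This is the plain (non-augmented) Dickey-Fuller regression with a constant, not the statsmodels implementation, which also adds lagged differences. The function name `simple_df_statistic` and the rounded critical value of -2.86 are assumptions for illustration.

```python
import numpy as np


def simple_df_statistic(y):
    """t-statistic on gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    # Design matrix: a constant column and the lagged level of the series.
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    dof = len(dy) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])


rng = np.random.default_rng(42)
noise = rng.normal(size=500)            # stationary by construction
walk = np.cumsum(rng.normal(size=500))  # has a unit root by construction

CRITICAL_5PCT = -2.86  # approximate 5% critical value (assumed, rounded)
for name, s in [("noise", noise), ("walk", walk)]:
    stat = simple_df_statistic(s)
    verdict = "reject null" if stat < CRITICAL_5PCT else "cannot reject null"
    print(name, round(stat, 2), verdict)
```

A strongly negative statistic means the series is pulled back toward its mean, which is evidence against a unit root. The statistic must be compared against Dickey-Fuller critical values rather than the usual t-distribution.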

I will be testing at the 5% significance level (95% confidence).

print('Results of Dickey-Fuller Test for Raw Low Data:')

dftest = adfuller(data['Low'], autolag='AIC')

dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

print(dfoutput)

print('Critical Value at 5%', dftest[4]['5%'])

And we get:

Results of Dickey-Fuller Test for Raw Low Data:

Test Statistic 5.757827

p-value 1.000000

#Lags Used 23.000000

Number of Observations Used 1286.000000

dtype: float64

Critical Value at 5% -2.86379009066

The test statistic is not less than the critical value at the 5% significance level, so we cannot reject the null hypothesis. This means we have no evidence that the series is stationary, and we need to apply some sort of transformation to make it stationary.

I’m going to try a log transform using numpy’s built-in log function.

log_low = np.log(data['Low'])

plt.plot(log_low)

plt.title('Log Transform of Lows')

plt.show()
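As a side note on what the log transform does and does not do, the sketch below uses a made-up price path that grows 1% per step. In raw terms the step-to-step changes keep getting larger, while in log terms every change is the same size. The trend itself is still there, just straightened out.

```python
import numpy as np

# Hypothetical price path growing 1% per step (illustrative, not Bitcoin data).
steps = np.arange(200)
price = 100.0 * 1.01 ** steps

raw_changes = np.diff(price)
log_changes = np.diff(np.log(price))

# Raw changes grow with the price level; log changes stay constant.
print(raw_changes[0], raw_changes[-1])
print(log_changes[0], log_changes[-1])
```

This is why a log transform alone evens out the scale of the movements but leaves an upward drift in place.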

This is slightly better than using raw Low data, but it still does not look stationary. Just to make sure, let’s run the Dickey-Fuller test again.

print('Results of Dickey-Fuller Test for Log Transform of Lows:')

dftest = adfuller(log_low, autolag='AIC')

dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

print(dfoutput)

print('Critical Value at 5%', dftest[4]['5%'])

Test Statistic 2.903331

p-value 1.000000

#Lags Used 8.000000

Number of Observations Used 1301.000000

dtype: float64

Critical Value at 5% -2.863764118428576

Again, the test statistic is not less than the critical value, so we still cannot treat the series as stationary.

Since Bitcoin trades much like a stock, let’s try an exponentially weighted moving average (EWMA) on our log transform. This gives more recent data higher weight. Then let’s overlay this EWMA onto the log-transformed data.

expwighted_avg = log_low.ewm(halflife=12).mean()

plt.plot(log_low)

plt.plot(expwighted_avg, color='red')

plt.title('EWMA of Log Transform')

plt.show()
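To unpack what a half-life of 12 means, here is a small sketch on random data (the series `x` and the seed are arbitrary). It converts the half-life into the per-step decay, builds the weights by hand, and checks the result against the pandas `ewm` calculation with its default settings.

```python
import numpy as np
import pandas as pd

halflife = 12
# Smoothing factor implied by the half-life.
alpha = 1.0 - np.exp(-np.log(2.0) / halflife)

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=100))

# The observation k steps back gets weight (1 - alpha) ** k,
# so the weight halves every `halflife` steps.
k = np.arange(len(x))
weights = (1.0 - alpha) ** k

# The newest observation has k = 0, so reverse the series to line it up.
manual_last = np.sum(weights * x.values[::-1]) / np.sum(weights)
pandas_last = x.ewm(halflife=halflife).mean().iloc[-1]

print(weights[halflife] / weights[0])
print(manual_last, pandas_last)
```

In other words, with a half-life of 12 an observation from 12 periods ago counts half as much as the most recent one.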

Now see what happens when we subtract the weighted moving average from the log transform.

ts_log_ewma_diff = log_low - expwighted_avg

plt.plot(ts_log_ewma_diff)

plt.show()
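To see why this subtraction helps, the sketch below applies the same step to a synthetic log series with a steady upward drift plus noise (made-up numbers, with `half_means` as a hypothetical helper). It compares the mean of the first and second halves before and after subtracting the EWMA.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 600

# Hypothetical log-price: steady upward drift plus noise.
log_series = pd.Series(0.01 * np.arange(n) + rng.normal(scale=0.05, size=n))

ewma = log_series.ewm(halflife=12).mean()
diff = log_series - ewma


def half_means(s):
    """Mean of the first half and of the second half of a Series."""
    mid = len(s) // 2
    return s.iloc[:mid].mean(), s.iloc[mid:].mean()


print("before:", half_means(log_series))
print("after: ", half_means(diff))
```

Before the subtraction the two halves sit at very different levels. Afterwards they are close, because the moving average tracks the drift and only the short-term deviations from it remain.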

This definitely looks like it could have constant mean and variance over time! Let’s plot the rolling mean and standard deviation too, to get a more thorough look.

rollmean = ts_log_ewma_diff.rolling(window=12).mean()

rollstd = ts_log_ewma_diff.rolling(window=12).std()

plt.plot(ts_log_ewma_diff, color='blue',label='Original')

plt.plot(rollmean, color='red', label='Rolling Mean')

plt.plot(rollstd, color='black', label = 'Rolling Std')

plt.legend(loc='best')

plt.title('Rolling Mean & Standard Deviation of Difference Between Log and EWMA of Log')

plt.show()

And look at that, the mean and standard deviation appear to be roughly constant.

Of course, we want to run the Dickey-Fuller test to be sure.

print('Results of Dickey-Fuller Test for Difference of Log Transform and EWMA:')

dftest = adfuller(ts_log_ewma_diff, autolag='AIC')

dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

print(dfoutput)

print('Critical Value at 5%', dftest[4]['5%'])

Results of Dickey-Fuller Test for Difference of Log Transform and EWMA:

Test Statistic -4.639663

p-value 0.000109

#Lags Used 8.000000

Number of Observations Used 1301.000000

dtype: float64

Critical Value at 5% -2.86376411843

And there we have it: the test statistic is less than the critical value at the 5% significance level. This means we can reject the null hypothesis that the series is not stationary and conclude that it is stationary. We can now run time series forecasts on this transformed series because it fulfills a major assumption required by many time series models.