Can we predict Bitcoin price using Google Trends?

QFX Research · Mar 18, 2019 · 8 min read

We want to use hourly search trend data to see whether the Bitcoin price 30 minutes (t+30), 60 minutes (t+60), 120 minutes (t+120) and 240 minutes (t+240) ahead can be predicted using only the data known at time t+0.

We downloaded Google Trends data for various keywords such as “Bitcoin”, “BTC”, “Crypto” and “Cryptocurrency” and want to see whether there is any correlation with future prices and, if so, whether that informational advantage can be turned into trading profit.
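The download step itself is not shown in the post. Below is a minimal sketch of how such data could be fetched, assuming the third-party pytrends library; the keyword, date range and CSV output are illustrative, not necessarily the authors’ actual pipeline.

from pytrends.request import TrendReq

# Hypothetical fetch of hourly interest for "Bitcoin"; pytrends stitches
# together week-long hourly windows behind the scenes, so long spans are slow
pytrends = TrendReq(hl='en-US', tz=0)
hourly = pytrends.get_historical_interest(
    ['Bitcoin'],
    year_start=2016, month_start=1, day_start=1, hour_start=0,
    year_end=2017, month_end=12, day_end=31, hour_end=23)
hourly = hourly.rename(columns={'Bitcoin': 'value'})
hourly[['value']].to_csv('btc_google_trends.csv')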

import numpy as np
import pandas as pd

# Load data
quotes = pd.read_csv('resampled_minute.csv.gz',
                     parse_dates=[0],
                     index_col=[0],
                     compression='gzip')['bid_close']
g_trends = pd.read_csv('btc_google_trends.csv',
                       parse_dates=[0],
                       index_col=[0])['value']

We will split our data into two parts. Data until January 1st, 2018 will be used for exploration and model selection and data after this date will be used to check the generalization of our techniques. Let’s plot hourly data for XBTUSD and Google Trends for Bitcoin.

# Plot Prices
quotes.resample('1h').ohlc()['close'][:'2017-12-31'].plot(title='XBTUSD 1 hour')
# Plot Google trends
g_trends['2016-01-02':'2017-12-31'].plot(title='Google Trends for Bitcoin')

Our first impression is that there is an interesting relationship between XBTUSD prices and Google Trends for Bitcoin. The first step in analyzing this relationship is to make both time series stationary. We will achieve this by computing their log returns.

def calculate_returns(prices):
    returns = np.log(prices).diff()
    return returns[1:]

# Calculate hourly returns
price_hourly_returns = calculate_returns(quotes.resample('1h').ohlc()['close'])
trend_hourly_returns = calculate_returns(g_trends)
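To verify that differencing actually produced stationary series, one could run an Augmented Dickey-Fuller test on the returns (a quick sketch assuming statsmodels is available; this check is not in the original post). A p-value near zero rejects the unit-root null, i.e. supports stationarity.

from statsmodels.tsa.stattools import adfuller

# Null hypothesis of the ADF test: the series contains a unit root
for name, series in [('price returns', price_hourly_returns),
                     ('trend returns', trend_hourly_returns)]:
    stat, pvalue = adfuller(series.dropna())[:2]
    print('{}: ADF statistic = {:.2f}, p-value = {:.4f}'.format(name, stat, pvalue))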

Now let’s plot the price and trend returns.

price_hourly_returns[:'2017-12-31'].plot(title='XBTUSD hourly returns')
trend_hourly_returns['2016-01-02':'2017-12-31'].plot(title='Google trend returns')

We see different volatility patterns for prices and Google Trends. While price volatility peaks during the period in which the XBTUSD price itself peaked, Google Trends volatility decreased once prices began to skyrocket and Bitcoin became a household name.

Our aim is to determine whether search trend data can be used as a signal to predict Bitcoin prices 30 minutes, 1 hour, 2 hours and 4 hours ahead in time. We will try two approaches:

  • Examine the correlation between search trend data and XBTUSD prices
  • Build a model which predicts XBTUSD prices given search trend data

We will start with the first approach, by calculating correlations for several rolling window sizes. Comparing correlations for different rolling windows will also help us with feature engineering for the second approach.

We will prepare a separate dataset for each prediction horizon.

# Generate a separate dataset for each prediction horizon
PREDICTION_HORIZONS = [30, 60, 120, 240]  # in minutes

data = dict()
price_minute_returns = calculate_returns(quotes)
for ph in PREDICTION_HORIZONS:

    # Calculate rolling returns
    ph_rolling_returns = price_minute_returns.rolling(ph, min_periods=ph).sum()

    # Shift rolling returns time index backward in time and match with search trend returns
    ph_rolling_returns.index = ph_rolling_returns.index - pd.Timedelta(str(ph) + 'm')
    ph_data = ph_rolling_returns.to_frame().join(trend_hourly_returns, how='inner')
    ph_data.columns = ['price_return', 'trend_return']
    data[str(ph)] = ph_data.dropna()
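The index shift above is what turns trailing returns into forward returns: the return realized over (t, t+ph] is re-labelled with timestamp t, so each row pairs the information known at t with what the price did afterwards. A toy illustration with hypothetical data and a 3-minute horizon:

import numpy as np
import pandas as pd

# Fake minute "returns" 0..7 on a 1-minute grid
idx = pd.date_range('2017-01-01', periods=8, freq='1min')
r = pd.Series(np.arange(8, dtype=float), index=idx)

fwd = r.rolling(3).sum()                    # return over the trailing 3 minutes
fwd.index = fwd.index - pd.Timedelta('3m')  # relabel: now the NEXT 3 minutes
# At 00:00 the forward return is r(00:01) + r(00:02) + r(00:03) = 6.0
print(pd.concat([r, fwd], axis=1, keys=['r_t', 'next_3m']))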

Below, we plot rolling correlations between XBTUSD and Google Trends returns for each prediction horizon and each rolling window size. The dotted lines mark the confidence interval for the null hypothesis that the correlation equals 0: via the Fisher z-transformation, a sample correlation computed over n observations has a standard error of roughly 1/sqrt(n - 3), so ±1.96/sqrt(n - 3) gives approximate 95% bounds. Correlations above or below these lines are statistically significant.

import matplotlib.pyplot as plt

ROLLING_WINDOWS = [72, 168, 336, 720, 2160]  # in hours

for ph, df in data.items():
    for window in ROLLING_WINDOWS:

        conf_interval = 1.96 / np.sqrt(window - 3)

        plt.figure()
        df['price_return'][:'2017-12-31'].rolling(window).corr(df['trend_return'][:'2017-12-31']).plot()
        plt.axhline(y=0, color='r', linestyle='-')
        plt.axhline(y=conf_interval, color='r', linestyle='--')
        plt.axhline(y=-conf_interval, color='r', linestyle='--')
        plt.title('Rolling {} hour correlation between Google search trends '
                  'and {} minute returns'.format(window, ph))
        plt.show()

We see that short rolling window correlations are noisy, but the 2160-hour (3-month) rolling window correlations have been negative for almost the entire period since May 2017.

We will try 2 approaches to feature engineering:

  • Simply use lagged Google trend returns for the past 24 hours. In this approach we try to directly exploit the correlation between Google search trends and XBTUSD
  • Generate summary features (rolling mean, standard deviation, skewness and kurtosis) from Google trend returns over the past 24 hours

from collections import OrderedDict

def generate_lags(trend_returns):

    lags = 24

    trend_lags = OrderedDict()
    trend_lags['lag' + str(0)] = trend_returns
    for lag in range(1, lags):
        trend_lags['lag' + str(lag)] = trend_returns.shift(lag)

    lag_df = pd.DataFrame(trend_lags)

    return lag_df

def generate_features(trend_returns):

    window = 24

    roll_mean = trend_returns.rolling(window).mean().values
    roll_std = trend_returns.rolling(window).std().values
    roll_skew = trend_returns.rolling(window).skew().values
    roll_kurt = trend_returns.rolling(window).kurt().values

    feature_df = pd.DataFrame(dict(mean=roll_mean,
                                   std=roll_std,
                                   skew=roll_skew,
                                   kurt=roll_kurt),
                              index=trend_returns.index)

    return feature_df

Financial prices are characterised by time dependence and regime changes. Data that is 2–3 years old might not be relevant to predicting prices today as conditions might have changed fundamentally. It does not make much sense to split the data randomly into train, validation and test parts.

We have chosen the following scheme to validate the results of the models: we set rolling windows of different lengths, iterate through each observation in the dataset forward in time, and on each iteration train a model on the data reaching back as far as the rolling window allows. We then predict the price return for the following period and move one step forward.

In this scheme all data points (except the historical data used for the first prediction) are used for both training and validation/testing. We reserve data before 2018-01-01 for exploration and model selection, and data after this date for checking the performance of the selected model.
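The same walk-forward idea can be expressed with scikit-learn’s TimeSeriesSplit, whose max_train_size argument caps the training window just like our rolling windows do. This is only an illustration of the scheme with a hypothetical dataset length, not the code used below.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One-step-ahead walk-forward splits with a fixed 720-observation train window
# (test_size requires scikit-learn >= 0.24; the lengths here are illustrative)
n_obs, window = 1000, 720
tscv = TimeSeriesSplit(n_splits=n_obs - window - 1, max_train_size=window, test_size=1)
for train_idx, test_idx in tscv.split(np.arange(n_obs)):
    pass  # fit on train_idx, forecast the single observation at test_idx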

We will use two algorithms to model returns on Google search trend data: XGBoost and OLS regression.

from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

MODELS = dict(xgb=XGBRegressor, ols=LinearRegression)

def run_models():

    results = dict()
    for prediction_horizon, df in data.items():

        # Lag data
        X_lag = generate_lags(df['trend_return'])
        y_lag = df['price_return'][~X_lag.isnull().any(axis=1)]
        X_lag = X_lag[~X_lag.isnull().any(axis=1)]

        # Features data
        X_feat = generate_features(df['trend_return'])
        y_feat = df['price_return'][~X_feat.isnull().any(axis=1)]
        X_feat = X_feat[~X_feat.isnull().any(axis=1)]

        for algo, model in MODELS.items():

            for window in ROLLING_WINDOWS:

                forecast_lag = np.zeros(X_lag.shape[0])
                forecast_feat = np.zeros(X_feat.shape[0])
                # Start iterating from the end of the longest rolling window
                for i in range(max(ROLLING_WINDOWS), X_lag.shape[0]):

                    x_train_lag = X_lag.iloc[(i-window):i]
                    y_train_lag = y_lag[(i-window):i]
                    x_new_lag = X_lag.iloc[[i]]

                    x_train_feat = X_feat.iloc[(i-window):i]
                    y_train_feat = y_feat[(i-window):i]
                    x_new_feat = X_feat.iloc[[i]]

                    if algo == 'xgb':
                        regr_lag = model(n_estimators=60, n_jobs=6).fit(x_train_lag, y_train_lag)
                        regr_feat = model(n_estimators=60, n_jobs=6).fit(x_train_feat, y_train_feat)
                    else:
                        regr_lag = model().fit(x_train_lag, y_train_lag)
                        regr_feat = model().fit(x_train_feat, y_train_feat)

                    prediction_lag = regr_lag.predict(x_new_lag)
                    prediction_feat = regr_feat.predict(x_new_feat)

                    forecast_lag[i] = prediction_lag[0]
                    forecast_feat[i] = prediction_feat[0]

                forecast_lag = forecast_lag[max(ROLLING_WINDOWS):]
                forecast_feat = forecast_feat[max(ROLLING_WINDOWS):]

                forecast_df_lag = pd.DataFrame(dict(returns=y_lag[max(ROLLING_WINDOWS):],
                                                    forecast_returns=forecast_lag),
                                               index=X_lag.index[max(ROLLING_WINDOWS):])
                forecast_df_feat = pd.DataFrame(dict(returns=y_feat[max(ROLLING_WINDOWS):],
                                                     forecast_returns=forecast_feat),
                                                index=X_feat.index[max(ROLLING_WINDOWS):])
                results[algo + '_lag_' + str(prediction_horizon) + '_' + str(window)] = forecast_df_lag
                results[algo + '_feat_' + str(prediction_horizon) + '_' + str(window)] = forecast_df_feat
                print('{} model with rolling window {} completed forecasting for {} minute returns'
                      .format(algo, window, prediction_horizon))
    return results

import pickle

# res = run_models()
# Use the saved forecast instead of running the forecasting function; it takes long
with open('result.pickle', 'rb') as handle:
    res = pickle.load(handle)

def calculate_loss(y, y_pred):
    # Mean absolute error, normalized by the mean absolute return
    mae = np.mean(np.abs(y - y_pred)) / np.mean(np.abs(y))
    return mae

Let’s calculate and plot performance on the train data for each algorithm/feature-type combination. Shorter-term returns are smaller in magnitude than longer-term returns and are bound to have smaller prediction errors. To enable an unbiased comparison of model performance across prediction horizons, we divide the mean absolute error by the mean absolute return.

xgb_lag_train_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
xgb_feat_train_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
xgb_lag_test_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
xgb_feat_test_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
ols_lag_train_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
ols_feat_train_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
ols_lag_test_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
ols_feat_test_loss = np.zeros((len(ROLLING_WINDOWS), len(PREDICTION_HORIZONS)))
for k, v in res.items():

    algo, meth, ph, wind = k.split('_')
    row_ind = ROLLING_WINDOWS.index(int(wind))
    col_ind = PREDICTION_HORIZONS.index(int(ph))

    if algo == 'xgb':
        if meth == 'lag':
            xgb_lag_train_loss[row_ind, col_ind] = calculate_loss(v['returns'][:'2017-12-31'],
                                                                  v['forecast_returns'][:'2017-12-31'])
            xgb_lag_test_loss[row_ind, col_ind] = calculate_loss(v['returns']['2018-01-01':],
                                                                 v['forecast_returns']['2018-01-01':])
        elif meth == 'feat':
            xgb_feat_train_loss[row_ind, col_ind] = calculate_loss(v['returns'][:'2017-12-31'],
                                                                   v['forecast_returns'][:'2017-12-31'])
            xgb_feat_test_loss[row_ind, col_ind] = calculate_loss(v['returns']['2018-01-01':],
                                                                  v['forecast_returns']['2018-01-01':])
    elif algo == 'ols':
        if meth == 'lag':
            ols_lag_train_loss[row_ind, col_ind] = calculate_loss(v['returns'][:'2017-12-31'],
                                                                  v['forecast_returns'][:'2017-12-31'])
            ols_lag_test_loss[row_ind, col_ind] = calculate_loss(v['returns']['2018-01-01':],
                                                                 v['forecast_returns']['2018-01-01':])
        elif meth == 'feat':
            ols_feat_train_loss[row_ind, col_ind] = calculate_loss(v['returns'][:'2017-12-31'],
                                                                   v['forecast_returns'][:'2017-12-31'])
            ols_feat_test_loss[row_ind, col_ind] = calculate_loss(v['returns']['2018-01-01':],
                                                                  v['forecast_returns']['2018-01-01':])

from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D

train_loss = [xgb_lag_train_loss, xgb_feat_train_loss, ols_lag_train_loss, ols_feat_train_loss]
titles = ['XGB lags train loss', 'XGB features train loss', 'OLS lags train loss', 'OLS features train loss']
X = ROLLING_WINDOWS.copy()
Y = PREDICTION_HORIZONS.copy()
Y, X = np.meshgrid(Y, X)
for i in range(len(train_loss)):

    Z = train_loss[i]
    fig = plt.figure(figsize=(15, 10))
    ax = Axes3D(fig)
    surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
                           linewidth=0, antialiased=False)
    ax.set_zlim(np.min(Z), np.max(Z))
    ax.zaxis.set_major_locator(LinearLocator(10))
    ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))
    ax.set_xlabel('Rolling window size (hours)')
    ax.set_ylabel('Prediction horizon (minutes)')
    fig.colorbar(surf, shrink=0.5, aspect=5)
    plt.title(titles[i])
    plt.show()

Looking at the plots above, there are several things to note:

  • All algorithm/feature combinations achieve broadly similar performance
  • Across feature sets, the best performance comes from XGBoost with calculated features
  • Across windows and horizons, the best results come from long rolling windows and 4-hour returns

Let’s now take the best-performing configuration and try to create a simple profitable strategy. We will use the results from predicting 4-hour returns with XGBoost, calculated features and a 2160-hour (3-month) rolling training window. Let’s plot forecast returns against actual returns.

import seaborn as sns

use_res_train = res['xgb_feat_240_2160'][:'2017-12-31']
sns.set(rc={'figure.figsize': (15, 15)})
sns.regplot(use_res_train['forecast_returns'], use_res_train['returns'], lowess=True)

We have a positive correlation between forecast returns and actual returns. Looking at the plot, if we set a threshold of approximately 0.01 in both directions, we shouldn’t suffer huge losses while still capturing substantial upside. That is, we will go long when the forecast return is >= 1% and go short when it is <= -1%. Let’s try it.

# Build positions: +1 when forecast >= 1%, -1 when forecast <= -1%, else flat
strategy_returns_train = np.zeros(use_res_train.shape[0])
long_train = (use_res_train['forecast_returns'] >= 0.01).astype(int)
short_train = -(use_res_train['forecast_returns'] <= -0.01).astype(int)
strategy_returns_train += long_train * use_res_train['returns']
strategy_returns_train += short_train * use_res_train['returns']
plt.figure(figsize=(15, 10))
plt.plot(use_res_train.index, np.cumsum(strategy_returns_train))
plt.title('Cumulative Strategy Returns')
plt.show()
plt.figure(figsize=(15, 10))
plt.boxplot(strategy_returns_train)
plt.title('Strategy Returns')
plt.show()

Returns are not huge, but we see an upward-sloping equity curve, especially after May 2017, just as our correlation results suggested.

And now for the most important part: we need to check how robust our strategy is to regime changes. We know that cryptocurrencies went into a bear market after the start of 2018.

use_res_test = res['xgb_feat_240_2160']['2018-01-01':]
strategy_returns_test = np.zeros(use_res_test.shape[0])
long_test = (use_res_test['forecast_returns'] >= 0.01).astype(int)
short_test = -(use_res_test['forecast_returns'] <= -0.01).astype(int)
strategy_returns_test += long_test * use_res_test['returns']
strategy_returns_test += short_test * use_res_test['returns']
plt.figure(figsize=(15, 10))
plt.plot(use_res_test.index, np.cumsum(strategy_returns_test))
plt.title('Cumulative Strategy Returns')
plt.show()

Results are not much different from the period before January 2018, which gives us some confidence that this strategy might continue to perform well in the future. Let’s see some statistics on the performance of this strategy for the period after January 2018.

print('Number of long trades: %d' % sum(long_test))
print('Number of short trades: %d' % sum(np.abs(short_test)))
print('Total return from long trades: %.3f' % np.sum(long_test * use_res_test['returns']))
print('Total return from short trades: %.3f' % np.sum(short_test * use_res_test['returns']))
print('Total strategy return: %.3f' % np.sum(strategy_returns_test))
daily_returns = pd.DataFrame(dict(strategy_returns=strategy_returns_test), index=use_res_test.index).resample('1d').sum()
print('Annualized Sharpe ratio: %.2f' % (np.sqrt(252) * np.mean(daily_returns.values) / np.std(daily_returns.values)))
Number of long trades: 138
Number of short trades: 176
Total return from long trades: 1.902
Total return from short trades: 2.959
Total strategy return: 4.861
Annualized Sharpe ratio: 3.38
