Predicting Bitcoin Price with News using Python

Simplicity is key.

Federico Riveroll
The Startup
7 min read · Nov 6, 2019


Goal

In this tutorial, we’ll use a Bitcoin vs USD dataset hosted on OpenBlender.

The dataset contains a daily summary of prices, where the CHANGE column is the percentage change of the last price of the day (PRICE) with respect to the first (OPEN).
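In other words, using the column names as they appear in the dataset:

# Percentage change of the last price of the day relative to the first.
change = 100 * (PRICE - OPEN) / OPEN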

Goal: To keep things simple, we’ll focus on predicting whether the price will rise (change > 0) or fall (change ≤ 0) on the following day, so we could potentially use the predictions ‘in real life’.

Prerequisites

  • Have Python 3 installed
  • Install pandas, sklearn and openblender (with pip)
$ pip install pandas OpenBlender scikit-learn

Step 1. Get the Bitcoin data

Let’s import the libraries we’ll use:

import OpenBlender
import pandas as pd
import json

Now let’s pull the data through the OpenBlender API.

First, we’ll define the parameters (in this case it’s just the id of the Bitcoin dataset):

# For now the parameters only contain the dataset id; we'll add more later.
parameters = {
    'id_dataset': '5d4c3af79516290b01c83f51'
}

Note: You need to create an account on openblender.io (it’s free) and add your token as shown below (you’ll find it in the ‘Account’ tab):

parameters = {
    'token': 'your_token',
    'id_dataset': '5d4c3af79516290b01c83f51'
}

Now let’s pull the data into a Dataframe ‘df’:

# This function pulls the data and sorts it by timestamp (descending).
def pullObservationsToDF(parameters):
    action = 'API_getObservationsFromDataset'
    df = pd.read_json(json.dumps(OpenBlender.call(action, parameters)['sample']),
                      convert_dates=False,
                      convert_axes=False).sort_values('timestamp', ascending=False)
    df.reset_index(drop=True, inplace=True)
    return df

df = pullObservationsToDF(parameters)

Let’s take a look:

df.head()

Note: The values may vary as this dataset is updated daily.

Step 2. Prep the data

First, we need to create our prediction target: whether ‘change’ will be positive or not. To do this, let’s add a target threshold for success over zero to our parameters:

parameters = {
    'token': 'your_token',
    'id_dataset': '5d4c3af79516290b01c83f51',
    'target_threshold': {'feature': 'change', 'success_thr_over': 0}
}

If we pull the data from the API again:

df = pullObservationsToDF(parameters)
df.head()

The ‘change’ feature was replaced by a new feature, ‘change_over_0’, which is 1 if ‘change’ was positive and 0 otherwise. This will be our target for machine learning.
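Locally, this is equivalent to thresholding the column ourselves (a sketch of what the API call does, assuming we still had the raw ‘change’ column):

# 1 if the day's change was positive, else 0.
df['change_over_0'] = (df['change'] > 0).astype(int)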

If we want to predict observations for ‘tomorrow’, we can’t use information from that same day, so let’s add a one-period lag to the target.

parameters = {
    'token': 'your_token',
    'id_dataset': '5d4c3af79516290b01c83f51',
    'target_threshold': {'feature': 'change', 'success_thr_over': 0},
    'lag_target_feature': {'feature': 'change_over_0', 'periods': 1}
}
df = pullObservationsToDF(parameters)
df.head()

This simply aligned ‘change_over_0’ with the data from the previous period (day) and renamed it ‘TARGET_change_over_0’.
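In pandas terms, this is roughly equivalent to shifting the label by one day (a sketch on a chronologically ascending copy; remember our df is sorted descending):

# Label each row with the *next* day's outcome.
df_asc = df.sort_values('timestamp')
df_asc['TARGET_change_over_0'] = df_asc['change_over_0'].shift(-1)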

Let’s take a look at correlations:

target_variable = 'TARGET_change_over_0'
df = df.dropna()
df.corr()[target_variable].sort_values()

The price features are essentially uncorrelated with the target, so on their own they are very unlikely to be useful.

Step 3. Get Business News data

After searching for correlated datasets on OpenBlender, I found a Fox Business News dataset that helps generate good predictions for our specific target.

What we want is a way to convert the news titles into numerical features by counting repetitions of words and groups of words per news item, and then time-blend those features into our Bitcoin dataset. This is simpler than it sounds (there’s a sketch of the idea after the parameter list below).

First, we need to create a TextVectorizer for the ‘title’ feature of the news:

action = 'API_createTextVectorizer'
vectorizer_parameters = {
    'token': 'your_token',
    'name': 'Fox Business TextVectorizer',
    'sources': [{'id_dataset': '5defce899516296bfe37c366',
                 'features': ['headline', 'title']}],
    'ngram_range': {'min': 1, 'max': 2},
    'language': 'en',
    'remove_stop_words': 'on',
    'min_count_limit': 2
}

We’ll create a vectorizer so we can get all the features as numerical word-token counts. Above, we specified the following:

  • name: We’ll name it ‘Fox Business TextVectorizer’
  • sources: The id of the dataset and the names of the features to include as source (in this case ‘headline’ and ‘title’)
  • ngram_range: The min and max length of the word groups that will be tokenized
  • language: English
  • remove_stop_words: Eliminates stop-words from the source
  • min_count_limit: The minimum number of occurrences for a token to be counted (one-time occurrences rarely help)
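For intuition, the configuration above is conceptually similar to sklearn’s CountVectorizer (a local sketch with made-up headlines, not what OpenBlender runs server-side):

from sklearn.feature_extraction.text import CountVectorizer

# Count 1- and 2-grams, drop English stop-words, ignore tokens seen once.
cv = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=2)
counts = cv.fit_transform(['Bitcoin rallies on ETF news',
                           'Markets fall as Bitcoin slides'])
print(cv.get_feature_names_out()[:10])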

Let’s run it:

res = OpenBlender.call(action, vectorizer_parameters)
res

Response:

{
    'message': 'TextVectorizer created successfully.',
    'id_textVectorizer': '5e4db6ce95162919d5b59a71',
    'num_ngrams': 4270
}

The TextVectorizer was created, and it generated 4270 n-grams with our configuration. We’ll use the returned id later: 5e4db6ce95162919d5b59a71

Step 4. Blend the news to the Bitcoin dataset

Now we want to time-blend the news features into our Bitcoin data. This basically means joining the two datasets using the timestamp as the key (there’s a conceptual pandas sketch after the parameter list below). Let’s add the blend to our original parameters for pulling data:

parameters = {
    'token': 'your_token',
    'id_dataset': '5d4c3af79516290b01c83f51',
    'target_threshold': {'feature': 'change', 'success_thr_over': 0},
    'lag_target_feature': {'feature': 'change_over_0', 'periods': 1},
    'blends': [{'id_blend': '5e4db6ce95162919d5b59a71',
                'blend_type': 'text_ts',
                'restriction': 'predictive',
                'blend_class': 'closest_observation',
                'specifications': {'time_interval_size': 3600 * 12}}]
}

What we’re specifying above is the following:

  • id_blend: The id of our TextVectorizer
  • blend_type: ‘text_ts’, so it knows it’s a text and timestamp blend
  • restriction: ‘predictive’, so it doesn’t blend news from the future into each observation, only news that happened earlier
  • blend_class: ‘closest_observation’, so it blends the observations closest in time
  • specifications: the maximum time span into the past from which to bring observations, in seconds; in this case 12 hours (3600*12). This means every Bitcoin price observation is blended with news from the preceding 12 hours
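Conceptually, this ‘predictive’ time-blend behaves like a backward-looking nearest-timestamp join, similar to pandas’ merge_asof (a sketch with made-up frames, not OpenBlender’s actual implementation):

import pandas as pd

btc = pd.DataFrame({'timestamp': [1000, 2000, 3000], 'price': [1, 2, 3]})
news = pd.DataFrame({'timestamp': [900, 1900, 2900], 'tokens': [5, 7, 2]})

# For each price row, take the closest earlier news row within 12 hours.
blended = pd.merge_asof(btc.sort_values('timestamp'),
                        news.sort_values('timestamp'),
                        on='timestamp',
                        direction='backward',
                        tolerance=3600 * 12)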

Finally, we’ll just add a date filter starting on the 20th of August, because that’s when the Fox Business News dataset begins, and ‘drop_non_numeric’ so we only keep numeric columns:

parameters = {
    'token': 'your_token',
    'id_dataset': '5d4c3af79516290b01c83f51',
    'target_threshold': {'feature': 'change', 'success_thr_over': 0},
    'lag_target_feature': {'feature': 'change_over_0', 'periods': 1},
    'blends': [{'id_blend': '5e4db6ce95162919d5b59a71',
                'blend_type': 'text_ts',
                'restriction': 'predictive',
                'blend_class': 'closest_observation',
                'specifications': {'time_interval_size': 3600 * 12}}],
    'date_filter': {'start_date': '2019-08-20T16:59:35.825Z',
                    'end_date': '2019-11-04T17:59:35.825Z'},
    'drop_non_numeric': 1
}

Note: I specified the 4th of November as ‘end_date’ because that’s the day I wrote this, but you can change it to the date you’re reading it.

Let's pull the data again:

df = pullObservationsToDF(parameters)
print(df.shape)
df.head()

(57, 2115)

Now we have over 2,000 features, one per token, and 57 observations.

Step 5. Apply ML to predict target

Now we finally have the cleansed dataset exactly as we need it with the lagged target and the blended numerical data.

Let’s take a look at the top correlations with ‘TARGET_change_over_0’:
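A quick way to list them (using the df and target_variable we already defined):

# Features sorted by absolute correlation with the target.
corr = df.corr()[target_variable].drop(target_variable)
print(corr.reindex(corr.abs().sort_values(ascending=False).index).head(20))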

There are several correlated features now. Let’s separate our train and test sets chronologically, so we can train on earlier observations and test on later ones.

X = df.loc[:, df.columns != target_variable].values
y = df.loc[:,[target_variable]].values
div = int(round(len(X) * 0.29))
# We take the first observations as test and the last as train because the dataset is ordered by timestamp descending.
X_test = X[:div]
y_test = y[:div]
print(X_test.shape)
print(y_test.shape)
X_train = X[div:]
y_train = y[div:]
print(X_train.shape)
print(y_train.shape)

We have 40 observations to train with and 17 to test with.

Now, we’ll import the needed libraries:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics

Now, let’s fit the RandomForest and make the predictions:

rf = RandomForestRegressor(n_estimators=1000)
# ravel() flattens y_train to the 1-d array sklearn expects.
rf.fit(X_train, y_train.ravel())
y_pred = rf.predict(X_test)

To make it easier to understand, let’s put the predictions and the y_test into a Dataframe:

df_res = pd.DataFrame({'y_test':y_test[:,0], 'y_pred':y_pred})
df_res.head()

Our real ‘y_test’ values are binary but our predictions are floats, so let’s round them: if a prediction is above 0.5 we’ll predict a price increase, otherwise a decrease.

threshold = 0.5
preds = [1 if val > threshold else 0 for val in df_res['y_pred']]
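Since the target is binary, one could equally use a classifier and get probabilities directly (an alternative sketch, not what we do above):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, y_train.ravel())
# predict_proba[:, 1] is the estimated probability of an increase.
proba_up = clf.predict_proba(X_test)[:, 1]
preds_clf = (proba_up > 0.5).astype(int)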

Now, for better understanding of our results, let’s get the AUC, the confusion matrix and the accuracy score:

# Note: sklearn metrics take (y_true, y_pred) in that order.
print(roc_auc_score(df_res['y_test'], preds))
print(metrics.confusion_matrix(df_res['y_test'], preds))
print(accuracy_score(df_res['y_test'], preds))

We got 64.7% of the predictions right, with a 0.65 AUC to back it up:

  • 9 times we predicted a decrease and it decreased (correct)
  • 5 times we predicted a decrease and it increased (wrong)
  • 1 time we predicted an increase and it decreased (wrong)
  • 2 times we predicted an increase and it increased (correct)
