LSTM Model predicting Bitcoin with Tweet Volume & Sentiment

Paul Simpson
Nov 19, 2018 · 5 min read

Keras, sklearn, LSTM, Pandas, Seaborn, Matplotlib, re

The goal of the project was to explore the options available for building a model that could predict Bitcoin price action over a select period of time. The variables I decided to use relate to sentiment on Twitter. Sentiment is based on two elements, polarity and sensitivity. An often overlooked metric that I have also included is tweet volume, which either plays a major role or acts as a negative indicator, as I will explain further in this post.

0. Pre-Requisites

Python packages: Keras (which provides the LSTM layer), sklearn, Pandas, Seaborn, Matplotlib, re

IDE or Notebook

Data: Twitter Data Stream for x period, BTC price

1. Data Cleaning & Modelling

1.1 Cleaning twitter dataset

Drop the ‘text’ column, as it is no longer useful once the sentiment analysis has been carried out, and drop the ‘name’ column as well. The next important step is truncating the dateTime with the pandas ‘floor’ method, which rounds every timestamp down to the hour.

Next, and just as important, create a new feature, ‘Tweet vol’, by grouping the tweets by hour and counting them.
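A minimal sketch of these cleaning steps on a toy frame. The column names (dateTime, Polarity, Sensitivity) and the choice to average the sentiment scores within each hour are assumptions made for illustration, not details confirmed by the post.

```python
import pandas as pd

# Toy stand-in for the scored tweet data; column names are assumed.
tweets = pd.DataFrame({
    "dateTime": pd.to_datetime([
        "2018-07-11 20:05:12", "2018-07-11 20:47:30",
        "2018-07-11 21:15:00", "2018-07-11 21:30:10",
        "2018-07-11 21:59:59",
    ]),
    "name": ["a", "b", "c", "d", "e"],
    "text": ["t1", "t2", "t3", "t4", "t5"],
    "Polarity": [0.10, 0.30, -0.20, 0.50, 0.00],
    "Sensitivity": [0.40, 0.60, 0.50, 0.90, 0.10],
})

# Drop the columns that are no longer needed once sentiment has been scored.
tweets = tweets.drop(columns=["text", "name"])

# Round every timestamp down to the start of its hour.
tweets["dateTime"] = tweets["dateTime"].dt.floor("h")

# One row per hour: mean sentiment (an assumption) plus the tweet count.
hourly = tweets.groupby("dateTime").agg(
    Polarity=("Polarity", "mean"),
    Sensitivity=("Sensitivity", "mean"),
    Tweet_vol=("Polarity", "count"),
)
print(hourly)
```

Flooring before grouping is what makes the count line up with hourly price bars later on.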

Finally ending up with the Twitter dataset:

Twitter dataset.head()

1.2 Cleaning BTC dataset

Drop Columns (High, Open, Low, Volume Currency, Weighted Price)

Final_df is created with an inner join using the pandas merge function, merging the two datasets on the dateTime column.

fig 0.1 Final-df
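The drop and merge could look roughly like the sketch below. The dropped column names follow the list above; "Timestamp", "Close", "Volume BTC" and all the values are assumed for illustration.

```python
import pandas as pd

# Toy hourly BTC data. "Timestamp", "Close" and "Volume BTC" are assumed names.
btc = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2018-07-11 20:00", "2018-07-11 21:00", "2018-07-11 22:00"]
    ),
    "Open": [6350.0, 6360.0, 6340.0],
    "High": [6370.0, 6380.0, 6365.0],
    "Low": [6330.0, 6335.0, 6320.0],
    "Close": [6360.0, 6340.0, 6355.0],
    "Volume BTC": [120.5, 98.2, 143.9],
    "Volume Currency": [766000.0, 622000.0, 914000.0],
    "Weighted Price": [6355.0, 6352.0, 6348.0],
})

# Keep only the timestamp, the close price and the traded BTC volume.
btc = btc.drop(columns=["High", "Open", "Low", "Volume Currency", "Weighted Price"])
btc = btc.rename(columns={"Timestamp": "dateTime"})

# Toy hourly Twitter data that only partly overlaps the BTC hours.
tweets_hourly = pd.DataFrame({
    "dateTime": pd.to_datetime(
        ["2018-07-11 19:00", "2018-07-11 20:00", "2018-07-11 21:00"]
    ),
    "Polarity": [0.12, 0.10, 0.09],
    "Sensitivity": [0.21, 0.22, 0.20],
    "Tweet_vol": [4200, 4354, 4432],
})

# The inner join keeps only the hours present in both datasets.
final_df = pd.merge(tweets_hourly, btc, on="dateTime", how="inner")
print(final_df)
```

With an inner join, hours missing from either source are silently dropped, so it is worth checking the row count afterwards.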

…Very Important Stage

Finally, the step that follows any analysis: converting the data into a form the LSTM model can digest. This means reframing the time series as a supervised learning sequence, a 3D array with normalized variables. I also added 3 hours of lagged features, with the target variable of course being the Close price.
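One way to sketch this reframing is shown below. The helper make_supervised, the feature names and the random data are hypothetical. It scales every column to [0, 1], builds the t-3, t-2 and t-1 copies with shift, and reshapes the result to (samples, time steps, features).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def make_supervised(df, target, n_lags=3):
    """Return X shaped (samples, n_lags, features), y shaped (samples,), and the scaler."""
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

    # Lagged copies of every feature, ordered oldest first: t-3, t-2, t-1.
    lagged = [scaled.shift(lag).add_suffix(f"(t-{lag})") for lag in range(n_lags, 0, -1)]
    target_now = scaled[[target]].add_suffix("(t)")

    # The first n_lags rows have no complete history, so they are dropped.
    frame = pd.concat(lagged + [target_now], axis=1).dropna()

    n_features = df.shape[1]
    X = frame.iloc[:, :-1].to_numpy().reshape(len(frame), n_lags, n_features)
    y = frame.iloc[:, -1].to_numpy()
    return X, y, scaler


# Random stand-in for the merged dataset: 294 hourly rows, 5 assumed features.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((294, 5)),
    columns=["Polarity", "Sensitivity", "Tweet_vol", "Volume_BTC", "Close_Price"],
)
X, y, scaler = make_supervised(demo, target="Close_Price", n_lags=3)
print(X.shape, y.shape)
```

Note that this sketch fits the scaler on the full series for simplicity. Fitting it on the training portion only would avoid leaking information from the test period.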

Fitting Model Parameters

The train/test split used 200 hours for training out of 294 in total.

The model was trained for 50 epochs with a validation split of 0.2 and a batch size of 12.

It was compiled with MAE as the loss function and Adam as the optimizer.
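A hedged sketch of a model with these settings, using the Keras API bundled with TensorFlow. The single LSTM layer with 50 units and the random stand-in data are assumptions. Only the split size, epochs, batch size, validation split, loss and optimizer come from the text above.

```python
import numpy as np
from tensorflow import keras

# Random stand-in data: 291 samples is assumed here (294 hours minus the
# 3 lost to lagging), with 3 time steps and 5 features per sample.
rng = np.random.default_rng(0)
X = rng.random((291, 3, 5)).astype("float32")
y = rng.random(291).astype("float32")

# Chronological split: the first 200 samples train the model, the rest test it.
n_train = 200
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]


def build_model(timesteps, n_features, units=50):
    # units=50 is an assumed layer size, not a value taken from the post.
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])
    model.compile(loss="mae", optimizer="adam")
    return model


model = build_model(X_train.shape[1], X_train.shape[2])
history = model.fit(
    X_train, y_train,
    epochs=50, batch_size=12, validation_split=0.2, verbose=0,
)
pred = model.predict(X_test, verbose=0)
print(pred.shape)
```

Keras takes the validation split from the end of the training arrays, so with time-ordered data the validation set is the most recent 20% of the training window.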


2. Exploratory Data analysis, Key findings & Results

I begin with a correlation plot of all the features that went into the final dataset for the model. This gives a clearer indication of which features may be more important than others, which in turn gives some insight into how the LSTM predicts.

fig 1. Correlation plot of final features
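A correlation plot of this kind can be produced along the lines of the sketch below, which assumes a DataFrame of the final features. The column names and random data are placeholders, so the values will not resemble fig 1.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in for the final dataset; column names are assumed.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((294, 5)),
    columns=["Polarity", "Sensitivity", "Tweet_vol", "Volume_BTC", "Close_Price"],
)

# Pairwise Pearson correlation between all features.
corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation of final features")
fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to view the plot.
```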

What is quite clear from the plot is that tweet volume has a very high correlation with BTC trade volume. The other two features with a noticeable correlation are tweet volume and polarity. I look into this further below with two charts that map these features together.

fig 2. Polarity x BTC Trade volume by hour

Fig 2 demonstrates the correlation between the polarity of tweets and the BTC volume being traded: the higher the polarity, the lower the trade volume, and vice versa. Trade volumes are high between 18:00 and 20:00 GMT.

fig 3. BTC trade volume x tweet volume by hour

Fig 3 shows the hourly correlation between tweet volume and BTC trade volume, which speaks for itself, reflecting the main working hours in Europe. The highest tweet and trade volumes fall between 15:00 and 20:00 GMT.
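Hour-of-day profiles like those in fig 2 and fig 3 can be derived by grouping on the hour component of the timestamp. The sketch below assumes an hourly DatetimeIndex in GMT and uses random placeholder values with assumed column names.

```python
import numpy as np
import pandas as pd

# Random stand-in for the merged hourly data, indexed by timestamp (GMT assumed).
idx = pd.date_range("2018-07-11 20:00", periods=294, freq="h")
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "Polarity": rng.random(294),
        "Tweet_vol": rng.integers(2000, 6000, 294),
        "Volume_BTC": rng.random(294) * 500,
    },
    index=idx,
)

# Average each feature over all days, separately for every hour of the day.
by_hour = demo.groupby(demo.index.hour).mean()
by_hour.index.name = "hour"

# Hour of the day with the highest average traded volume.
busiest = by_hour["Volume_BTC"].idxmax()
print(by_hour.head())
print("Busiest hour:", busiest)
```

Plotting two columns of by_hour against the hour index gives the kind of paired chart described above.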

fig 4. Time series data of all components

Fig 4 gives an insight into the whole time series, where you can eyeball some small patterns in the data.

fig 5. Real Price v predicted price

Fig 5 is the important part, showing how the model performed in predicting the final 90 hours of the data with a 3 hour lag introduced. Note that the blue line represents the real price over this period and the green line represents the predicted price.

Model performance on the test set was as follows:

Test MSE: 5498.255
Test RMSE: 74.150
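RMSE is simply the square root of MSE, so the two reported figures can be checked against each other. The sketch below shows how such metrics might be computed with sklearn. The prices are made up, and it assumes predictions have already been converted back from the normalized scale to price units.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def report_errors(y_true, y_pred):
    """Return (MSE, RMSE) for two arrays of prices on the same scale."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = float(np.sqrt(mse))
    return mse, rmse


# Made-up prices purely for illustration.
y_true = np.array([6500.0, 6520.0, 6480.0, 6550.0])
y_pred = np.array([6450.0, 6600.0, 6500.0, 6500.0])
mse, rmse = report_errors(y_true, y_pred)
print(f"Test MSE: {mse:.3f}")
print(f"Test RMSE: {rmse:.3f}")

# Consistency check on the reported figures: sqrt(5498.255) is about 74.150.
reported_rmse = float(np.sqrt(5498.255))
print(f"{reported_rmse:.3f}")
```

Because RMSE is in the same units as the target, it is the easier of the two to read against the price level.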

3. Conclusion

What do my results tell me?

That I have been able to create a reasonable model. There are definitely correlations out there, but as many know, correlation does not mean causation. Is it possible that tweets are simply a reaction to the price? Yes. However, this could still happen in patterns that the LSTM can pick up to help make predictions.

Is it possible to predict Bitcoin price action?

In some manner, yes: the direction it takes over a few hours, given enough data. Unfortunately I was only able to obtain 1.4M tweets over a 2 week period, and I would like to see the results from someone with a much larger dataset for these features, which should in turn give much more conclusive results.

Anything I would have done differently?

  1. Have more Data!

For anyone with questions, queries or ideas, please reach out to me. The Kaggle kernel for this project can also be found here.

Written by Paul Simpson, MSc Data Science, Consultant @ EY
