The goal of the project was to explore the options available for building a model that could predict Bitcoin price action over a selected period of time. The variables I decided to use relate to sentiment on Twitter. Sentiment is based on two elements: polarity and subjectivity. An often overlooked metric I have also included is Tweet Volume, which plays a major role (or acts as a negative indicator), as I will explain a little further on in this post.
Python packages: Keras (LSTM layer), scikit-learn, pandas, Seaborn, Matplotlib, re
IDE or Notebook
Data: Twitter Data Stream for x period, BTC price
1. Data Cleaning & Modelling
1.1 Cleaning twitter dataset
Dropping the ‘text’ column, as it is no longer useful after carrying out the sentiment analysis, along with the ‘name’ column. The next important step is truncating the dateTime using pandas’ ‘floor’ method, which rounds all the timestamps down to the hour.
Next, and importantly, creating a new feature, ‘Tweet vol’, by grouping all the tweets by hour and aggregating.
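The cleaning steps above can be sketched roughly as follows. The sample frame and its column names are assumptions standing in for the streamed tweet data, not the actual dataset:

```python
import pandas as pd

# Hypothetical sample standing in for the streamed tweets (column names assumed)
tweets = pd.DataFrame({
    "dateTime": pd.to_datetime([
        "2019-03-01 14:05", "2019-03-01 14:40", "2019-03-01 15:10",
    ]),
    "name": ["user_a", "user_b", "user_c"],
    "text": ["...", "...", "..."],
    "polarity": [0.3, -0.1, 0.5],
})

# Drop columns no longer needed after sentiment analysis
tweets = tweets.drop(columns=["text", "name"])

# Truncate timestamps down to the hour with pandas' floor method
tweets["dateTime"] = tweets["dateTime"].dt.floor("H")

# Aggregate per hour: mean sentiment plus the Tweet vol feature (tweet count)
hourly = (
    tweets.groupby("dateTime")
          .agg(polarity=("polarity", "mean"),
               tweet_vol=("polarity", "size"))
          .reset_index()
)
print(hourly)
```

Flooring before grouping is what makes the hourly aggregation line up with the hourly BTC price data later on.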
This leaves the final Twitter dataset:
1.2 Cleaning BTC dataset
Drop Columns (High, Open, Low, Volume Currency, Weighted Price)
final_df is created with an inner join using pandas’ merge function, joining on the dateTime column.
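A minimal sketch of the drop-and-merge step, with illustrative frames standing in for the real hourly Twitter and BTC data (column names assumed):

```python
import pandas as pd

# Hypothetical hourly Twitter features
twitter_df = pd.DataFrame({
    "dateTime": pd.to_datetime(["2019-03-01 14:00", "2019-03-01 15:00"]),
    "polarity": [0.1, 0.4],
    "tweet_vol": [120, 95],
})

# Hypothetical BTC price data after dropping High, Open, Low, etc.
btc_df = pd.DataFrame({
    "dateTime": pd.to_datetime(["2019-03-01 15:00", "2019-03-01 16:00"]),
    "Close": [3900.5, 3912.0],
    "Volume_BTC": [410.2, 388.7],
})

# Inner join on dateTime keeps only hours present in BOTH datasets
final_df = pd.merge(twitter_df, btc_df, on="dateTime", how="inner")
print(final_df)
```

The inner join matters here: any hour missing from either stream is dropped, so the model only ever sees rows with complete features.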
…Very Important Stage
Finally, the step after any analysis: converting the data to make it digestible for the LSTM model. This means reshaping the time series into a supervised sequence, a 3D array with normalized variables. I have additionally added 3 hours of lagged features, with the target variable of course being the Close price.
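One way this normalize-lag-reshape step can be sketched is below. The toy frame and its values are purely illustrative; only the 3-hour lag and the Close target come from the write-up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy hourly frame standing in for final_df (values are illustrative only)
df = pd.DataFrame({
    "polarity":  np.linspace(0.0, 1.0, 10),
    "tweet_vol": np.arange(100, 110, dtype=float),
    "Close":     np.linspace(3900, 3990, 10),
})

# Normalize all variables to [0, 1]
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Build 3 hours of lagged features; the target is the current Close price
n_lag = 3
frames = [scaled.shift(i).add_suffix(f"(t-{i})") for i in range(n_lag, 0, -1)]
supervised = pd.concat(frames + [scaled[["Close"]]], axis=1).dropna()

# Reshape inputs to the 3D array an LSTM expects: (samples, timesteps, features)
X = supervised.drop(columns=["Close"]).to_numpy().reshape(-1, n_lag, df.shape[1])
y = supervised["Close"].to_numpy()
print(X.shape, y.shape)
```

The shift-and-concat trick is the standard way of turning a time series into a supervised learning frame; dropna discards the first n_lag rows, which have incomplete history.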
Fitting Model Parameters
The train/test split was 200 hours of training data out of 294 total.
The model was trained for 50 epochs with a validation split of 0.2 and a batch size of 12.
It was compiled with the MAE loss function and the Adam optimizer.
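A minimal sketch of a model matching those parameters. The layer sizes and the random stand-in data are assumptions; only the epoch count, batch size, validation split, loss, and optimizer come from the write-up:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_lag, n_features = 3, 3
X_train = np.random.rand(200, n_lag, n_features)  # 200 training hours (stand-in data)
y_train = np.random.rand(200)

model = Sequential([
    LSTM(50, input_shape=(n_lag, n_features)),  # hidden size is an assumption
    Dense(1),                                   # single output: Close price
])

# Loss and optimizer as stated in the write-up
model.compile(loss="mae", optimizer="adam")

model.fit(X_train, y_train, epochs=50, batch_size=12,
          validation_split=0.2, verbose=0)
```

MAE is a common choice over MSE here because it is less sensitive to the occasional large price swing in the training window.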
2. Exploratory Data analysis, Key findings & Results
Beginning with a correlation plot of all the features included in the final dataset for the model. This gives a clearer indication of which features may be more important than others, which in turn helps derive insight into how the LSTM predicts.
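A correlation plot like this one is typically produced with a Seaborn heatmap; the frame below is illustrative random data, and the real column names in final_df may differ:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative frame; the real final_df columns may be named differently
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "polarity":   rng.random(50),
    "tweet_vol":  rng.integers(50, 200, 50).astype(float),
    "Volume_BTC": rng.random(50) * 400,
    "Close":      3900 + rng.random(50) * 100,
})

# Pairwise Pearson correlations between all features
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("correlation_plot.png")
```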
What is quite clear from the plot is that Tweet Volume has a very high correlation with BTC trade volume. The other two features with a noticeable correlation are Tweet Volume and Polarity. I look into this further below, producing two charts mapping these features together for a better insight.
Fig 2 demonstrates the correlation between the polarity of tweets and the BTC volume being traded: the higher the polarity, the lower the trade volume, and vice versa, with high trade volumes between 18:00 and 20:00 GMT.
Fig 3 shows the hourly correlation between tweet volume and BTC trade volume, which speaks for itself, displaying the main working hours within Europe. The highest tweet and trade volumes fall anywhere between 15:00 and 20:00 GMT.
Fig 4 gives an insight into the whole time-series data, where you can eyeball some small patterns.
Fig 5 is the important part, showing how the model performed in predicting the final 90 hours of data with a 3-hour lag introduced. To note: the blue line represents the real price over this period and the green line represents the predicted price.
Model performance on the test set was as follows:
Test MSE: 5498.255
Test RMSE: 74.150
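These metrics are straightforward to compute from the test predictions; a sketch with illustrative arrays (the real values come from the model's predictions on the held-out 90 hours):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative arrays; the real ones are the test-set actuals and predictions
y_true = np.array([3900.0, 3910.0, 3895.0])
y_pred = np.array([3880.0, 3920.0, 3900.0])

# RMSE is just the square root of MSE, putting the error back in price units
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"Test MSE: {mse:.3f}  Test RMSE: {rmse:.3f}")
```

An RMSE of ~74 means the model's predictions were, roughly speaking, within about $74 of the real price on average over the test window.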
What do my results tell me?
That I have been able to create a reasonable model. There are definitely correlations out there, but as many know, correlation does not mean causation. Is it possible that tweets are a reaction to the price? Yes. However, this could happen in patterns that the LSTM could then use to help make predictions.
Is it possible to predict Bitcoin price action?
In some manner, yes: with enough data, the direction it takes over a few hours. Unfortunately I was only able to obtain 1.4M tweets over a two-week period, and I would like to see the results where someone has a much larger dataset for these features, which in turn would give much more conclusive results.
Anything I would have done differently?
- Have more Data!
- Carry out some statistical hypothesis tests on the data.
- Create a volatility index and add it as a feature
For anyone with any questions, queries or ideas please reach out to me. Also find the Kaggle kernel for this project here.