CRYPTOIntel

Digging Deep Into The Crypto World

SFU Professional Computer Science
Apr 14, 2019


The Team: Tushar Chand Kapoor, Syed Ikram, Mehak Parashar

1. Motivation and Background

The economy is moving steadily towards a digital ecosystem, and cryptocurrency, the newest and most volatile form of digital currency, continues to gain ground in the world of finance. We analyzed Google Trends data alongside the prices of various cryptocurrencies over several years and found that the two move closely together, which suggests that public interest is one of numerous factors driving the growth of cryptocurrency. With people's increasing interest in cryptocurrency markets, we expect cryptocurrency to become the next prominent currency in the world. Major banks like Citibank, financial institutions like JP Morgan, and technology giants like Google and IBM have also started investing in blockchain and cryptocurrency technologies. Financial transactions are therefore undergoing a paradigm shift towards a digital ledger that is internal to the system itself. We see cryptocurrency as the future of currency, used for trading in a decentralized manner.

Many people have a difficult time gaining insight into cryptocurrencies: whether to buy or sell given the high fluctuations, which cryptocurrency to buy, what the present price is, and what the rest of the world thinks about them. As far as we could find, there is no one-stop web dashboard that answers all of these questions about the massive cryptocurrency world.

2. Problem Statement

As we began to brainstorm in order to explore the cryptocurrency world, we asked ourselves four questions:

1) Can we predict the future price of the top cryptocurrencies using historical data and news sentiment polarity?

2) How can we understand the price and transaction behavior of the fluctuating cryptocurrency market, and is there any correlation behind the changes?

3) How can we visualize live, streaming data and show the effects of the different parameters that affect cryptocurrency prices?

4) Can we visualize cryptocurrency exchanges worldwide by volume in real time?

Proposal: We propose to answer all of these questions with an interactive web dashboard covering the cryptocurrency ecosystem, so that people are well informed about what is happening in the cryptocurrency market and can make an informed decision about the cryptocurrency of their choice.

Furthermore, we also plan to visualize the following using the D3 library (not applicable to every point) as part of our interactive dashboard:

  • Live fluctuations of cryptocurrency prices.
  • Live Order-book L2 snapshot.
  • Live Twitter sentiment analysis on streaming tweets about the top cryptocurrencies.
  • Market cap of cryptocurrencies around the world.
  • Social media analysis for each cryptocurrency.
  • Stream the latest news articles and perform live sentiment analysis.
  • Build and embed future price prediction based on both numerical historical data and news sentiment polarity.
  • Dynamic word cloud on the web front end for live news articles related to cryptocurrencies.

3. Data Science Pipeline

In this section, we describe our data science pipeline, which is shown in figure 1. We collect live data related to cryptocurrencies from various sources, including social media, news data, Google Trends, etc.

Figure 1 — Data Science Pipeline

Different stages of the pipeline are described as follows:

3.1 Data Collection

  • Cryptocompare:
    The Cryptocompare.com API provides all kinds of information related to cryptocurrencies, making it one of the biggest data sources of our project. We wrote a Python script that automatically collects data from the API, together with a scraper module that fetches data from the URLs returned by Cryptocompare. The data collected from cryptocompare.com has numerous data points for each cryptocurrency, for example its price, volume, exchanges, trades, market cap, etc.
  • Twitter Streaming API and Tweepy:
    Social media trends on almost any topic can be gauged from Twitter, where people from all over the world express their views on what is happening right now. We wrote a Python module that collects real-time Twitter data related to various cryptocurrencies through the streaming API.
  • News API:
    We collected the latest news from NewsAPI about the cryptocurrency market in general and about specific cryptocurrencies. These news articles, along with their headlines, were fetched live for sentiment analysis.
  • Google Trends Data:
    The number of Google searches reflects the popularity of a particular topic. Google reports this with a metric known as interest over time: a value of 100 marks the peak popularity of the term, and values near zero mark the lowest relative interest. As can be seen in figure 2, the price of Bitcoin moves almost in step with the Google Trends interest over time.
Figure 2 — Google Trends vs Bitcoin Price
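
The relationship described above can be sketched in a few lines. The example below assumes a simple scale-to-peak normalization for interest over time (an illustration of the metric as described, not Google's actual implementation) and uses the Pearson coefficient as one possible way to quantify how closely the two series move together. The function names and all numbers are made up for illustration and are not data from this project.

```python
import math

def interest_over_time(raw_counts):
    """Rescale raw popularity counts so that the peak becomes 100."""
    peak = max(raw_counts)
    return [round(100 * count / peak) for count in raw_counts]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up weekly values, for illustration only (not real data).
searches = [120, 300, 900, 1500, 600]
prices = [4000.0, 6500.0, 14000.0, 19000.0, 11000.0]

interest = interest_over_time(searches)
print(interest)                   # peak week is scaled to 100
print(pearson(interest, prices))  # close to 1 when the series move together
```

A coefficient close to 1 only says that the two series rise and fall together; it does not say which one drives the other.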

3.2 Data Cleaning and Data Storing

The data collected from the various sources was cleaned by filtering, removing null and duplicate values, and removing outliers; text data was additionally stemmed and stripped of stop words. The bulk of the data was then stored in Cassandra, which served as the data source for the front end, while some of the streaming data was kept in JSON format. Python scripts run as cron tasks at the backend and continuously filter the live data, which is then served in two ways: it is fed to the Keras price prediction model, and it is passed to the sentiment analysis module.
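
A minimal sketch of the numeric part of this cleaning step, using pandas, might look like the following. The column names, the helper `clean_prices`, and the interquartile-range rule for outliers are assumptions made for the example; the post does not state which outlier rule was used.

```python
import pandas as pd

def clean_prices(df, column="close", k=1.5):
    """Drop nulls, duplicate rows, and IQR outliers in `column`."""
    df = df.dropna().drop_duplicates()
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Keep rows within k * IQR of the quartiles (assumed outlier rule).
    in_range = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[in_range].reset_index(drop=True)

# Made-up records: one duplicate row, one missing value, one extreme value.
raw = pd.DataFrame({
    "time": [1, 2, 2, 3, 4, 5, 6],
    "close": [100.0, 101.0, 101.0, None, 99.0, 102.0, 5000.0],
})
cleaned = clean_prices(raw)
print(cleaned)
```

On genuinely volatile price data a fixed IQR rule can discard real spikes, so the threshold `k` would need tuning for each data source.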

3.3 Analysis & Visualization

To get a better understanding of the cryptocurrency market, we did some initial visualizations of the data to get a grasp of unforeseen spikes in the price trends of various cryptocurrencies. This in turn helped us build modules through which users can better understand those trends. We produced various live and interactive visualizations, such as a word cloud, topic modeling, a correlation matrix, live sentiment analysis (figure 3), and a global 3D volume representation of Bitcoin, using D3, Gensim, three.js, and Plotly.

Figure 3 — Live Sentiment Analysis

3.4 Machine learning

We predicted the next-day prices of various cryptocurrencies with two types of model: the first uses the historical OHLCV data with a 2-layer LSTM, and the second uses OHLCV data together with the news sentiment polarity. Both neural networks were trained with Keras.

3.5 Interactive Dashboard

We amalgamated all the exciting insights we gained about the cryptocurrency world into our interactive dashboard, CRYPTOIntel, which presents the findings and results that any user can benefit from. Furthermore, the dashboard extends to an Android application that inherits all the visualizations and features of the front end.

4. Methodology

4.1 EDA

The market price of cryptocurrencies oscillates rapidly every day, even for Bitcoin, the oldest cryptocurrency in the market. The analysis of market prices for the current top ten cryptocurrencies shows that Bitcoin's price was very low in its initial stages and reached its peak of around 19,000 USD in 2017, as shown in figure 4. The other cryptocurrencies entered the market much more recently, and their prices did not start out as low as Bitcoin's did. Bitcoin Cash and Ethereum are among the other top currencies that have shown rapid growth in the last couple of years. The price of Bitcoin can also be seen going down after 2017, which raises the question of whether Bitcoin is a bubble.

Figure 4 — Cryptocurrency Market Price (USD)

Figure 5 shows the price fluctuations, that is, the difference between the opening and closing price, for the top cryptocurrencies. The year 2017 was a breakthrough, since the price of Bitcoin increased rapidly during this time frame.

Figure 5 — Change in price of different cryptocurrencies

The plots in figure 6 show the change in the price of the top ten cryptocurrencies over 24 hours, along with a pie chart of the market capitalization of the different cryptocurrencies.

Figure 6 — 24 Hour Change Trends and Market Cap

4.2 Neural Networks (Machine Learning)

As discussed before, we have two neural networks in place for prediction, which are described in detail below.

4.2.1 Numerical Model
Cryptocurrencies are the future of currencies, and with their increasing popularity more and more people want to invest in them. To forecast the next day's price from historical data, we used a deep neural network, specifically Long Short-Term Memory (LSTM), a type of recurrent neural network available in Keras, since LSTMs have been shown to work well on sequence regression problems. The price of a cryptocurrency depends on many factors; we used close, high, low, open, volumefrom, and volumeto, and we also averaged the OHLC values to obtain an average price for each day, which we used as an input to our model. Because market volume takes much larger values than any of the other variables, we normalized the data with scikit-learn's min-max scaler, which rescales each feature to a fixed range (0 to 1 by default). Since price prediction is a time series problem, we shifted the average price one step forward in time and removed the resulting NaN values. Once the data was ready, we split it into a training set (80%) and a test set (20%). After reshaping the data into three dimensions, we used Keras to build and train the model and to predict the average price of the cryptocurrency.
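
The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the helper `prepare` is hypothetical, the shift is taken to mean that each row's target is the next day's average price, a single timestep per sample is used, and the min-max scaling is written out with NumPy (scikit-learn's `MinMaxScaler` applies the same formula with its default range). The demo data is synthetic.

```python
import numpy as np
import pandas as pd

def prepare(df, train_frac=0.8):
    """Build scaled (samples, timesteps, features) arrays and next-day targets."""
    df = df.copy()
    df["avg"] = df[["open", "high", "low", "close"]].mean(axis=1)
    # Assumed target: the next day's average price; the last row has none.
    df["target"] = df["avg"].shift(-1)
    df = df.dropna()

    features = df[["avg", "volumefrom", "volumeto"]].to_numpy(dtype=float)
    target = df["target"].to_numpy(dtype=float)

    split = int(len(df) * train_frac)
    # Min-max scale using the training rows only, so the training part lies in [0, 1].
    lo = features[:split].min(axis=0)
    hi = features[:split].max(axis=0)
    scaled = (features - lo) / (hi - lo)

    # Keras LSTM layers expect (samples, timesteps, features); one timestep here.
    x = scaled.reshape(len(scaled), 1, scaled.shape[1])
    return x[:split], target[:split], x[split:], target[split:]

# Synthetic OHLCV-style data, for illustration only.
days = 11
rng = np.random.default_rng(0)
close = 100 + np.arange(days, dtype=float)
demo = pd.DataFrame({
    "open": close - 1, "high": close + 2, "low": close - 2, "close": close,
    "volumefrom": rng.uniform(1e3, 2e3, days),
    "volumeto": rng.uniform(1e5, 2e5, days),
})
x_train, y_train, x_test, y_test = prepare(demo)
print(x_train.shape, x_test.shape)
```

Fitting the scaling statistics on the training rows only is a design choice made in this sketch to avoid leaking test-set information; test values can then fall slightly outside the 0 to 1 range.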

Figure 7 — Train and Test Loss

Our LSTM used 80 neurons and 2 layers and was trained for 50 epochs. We chose the Adam optimizer because it adapts the learning rate using running averages of the first and second moments of the gradient, and we used Mean Absolute Error as the loss function. The resulting train and test loss is shown in figure 7.
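
To make the role of the two moment estimates concrete, here is a small, self-contained sketch of the Adam update rule applied to a toy one-dimensional problem. It is not the Keras implementation used for the model; the function `adam_minimize`, the toy objective, and the hyperparameter values are chosen only for illustration.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-dimensional function with the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # running average of the gradient
        v = beta2 * v + (1 - beta2) * g * g   # running average of the squared gradient
        m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approaches the minimizer at x = 3
```

Dividing by the square root of the second-moment estimate is what makes the effective step size adapt: directions with consistently large gradients take smaller steps.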

We validated the model on the test set and calculated Root Mean Square Error on both training and testing set results.
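
For reference, Root Mean Square Error can be computed as in the short sketch below. The helper name `rmse` and the sample values are illustrative and are not results from our models.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: only the last prediction is off, by 2.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than the Mean Absolute Error used as the training loss.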

4.2.2 Numerical + Sentiment Model (For Bitcoin only)
Do news speculations play an important role when the price of Bitcoin rises or falls? And can we predict the future price using the numerical history along with news sentiment? To answer this, we scraped news articles related to Bitcoin from 2017, cleaned the title and body of each article with NLP techniques, and calculated its sentiment polarity using Spark's MLlib. We summed the sentiment over all the news articles for each day and joined the result to the historical data from 2017, adding the sentiment polarity as a new feature. The same process as above was followed to train and test the model. We were then able to predict the price of Bitcoin using OHLCV data and news sentiment polarity.
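
The daily aggregation and join described above could be sketched with pandas as follows. The column names, the helper `add_daily_sentiment`, and the choice to fill days without news with a neutral 0 are assumptions for this example; the polarity values are made up.

```python
import pandas as pd

def add_daily_sentiment(prices, articles):
    """Sum article polarity per day and attach it to the price rows."""
    daily = (
        articles.groupby("date", as_index=False)["polarity"].sum()
        .rename(columns={"polarity": "sentiment"})
    )
    merged = prices.merge(daily, on="date", how="left")
    # Assumption: days without any article get a neutral sentiment of 0.
    merged["sentiment"] = merged["sentiment"].fillna(0.0)
    return merged

# Made-up prices and article polarities, for illustration only.
prices = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-03"],
    "close": [998.0, 1021.0, 1043.0],
})
articles = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-03"],
    "polarity": [0.5, -0.25, 0.5],
})
merged = add_daily_sentiment(prices, articles)
print(merged)
```

The resulting `sentiment` column can then be scaled and fed to the model alongside the OHLCV features.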

4.2.3 Topic Modeling
Knowing what people are saying about cryptocurrencies and understanding their problems and opinions is critical. To this end, we performed topic modeling using LDA (Latent Dirichlet Allocation) from the Gensim package on a large collection of scraped news articles, to uncover its hidden structure and discover the trends in social media news. Data cleaning was performed to remove punctuation, extra spaces, and stopwords [2]. We then pre-processed the text with spaCy and lemmatized it. Finally, we trained the LDA model on the data set for the top 10 topics. The result was visualized as an interactive chart with the pyLDAvis package and embedded in the web front end, as shown in figure 8. In the final interactive visualization, the user can choose the topic to be selected and also adjust the alpha value.
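
The cleaning step that precedes topic modeling can be illustrated with the standard library alone. The sketch below is a simplified stand-in: the tiny `STOPWORDS` set and the helper `clean_text` are hypothetical, and the lemmatization and the LDA training with Gensim are not shown.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; a real list is far longer).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "on", "for"}

def clean_text(text):
    """Lowercase, strip punctuation and extra spaces, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return [token for token in text.split(" ") if token and token not in STOPWORDS]

tokens = clean_text("Bitcoin   is on the rise,  and Ethereum follows!")
print(tokens)
```

Token lists like this one are the kind of input from which a dictionary and bag-of-words corpus are built before the LDA model is trained.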

Figure 8 — Topic Modeling

4.3 Tools Used

  1. Spark — For News sentiment analysis
  2. Pandas — EDA, pre-processing and machine learning
  3. D3 — Front end
  4. Three.js & Oimo.js — WEBGL graphics
  5. Keras — Neural Network Models
  6. Gensim — Topic Modeling
  7. Matplotlib — EDA
  8. Plotly — EDA
  9. NLTK — News and Twitter sentiment Analysis
  10. Scikit-learn — Data normalization
  11. PHP — Web Front Dynamic Framework
  12. Android Studio — Android Application

5. Evaluation

5.1 EDA

EDA gave us a good grasp of the data, and the resulting analysis helps the user understand the cryptocurrency data better.

5.2 Machine Learning

5.2.1 Prediction
Predicting the prices of various cryptocurrencies using OHLCV data and news sentiment with the LSTM gave the following precision, recall, F1 score, and mean squared error:
Model Evaluation:
• Precision: 0.62
• Recall: 0.57
• F1 Score: 0.60
• Mean Squared Error: 0.05
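
Precision, recall, and F1 are classification metrics, and this post does not spell out how they were derived from a price forecast. One common approach, assumed here purely for illustration, is to score whether the model correctly predicts that the price will rise from one day to the next. The helper `direction_scores` and the sample values are hypothetical.

```python
def direction_scores(actual, predicted):
    """Precision, recall, and F1 for 'price went up' between consecutive days."""
    actual_up = [nxt > cur for cur, nxt in zip(actual, actual[1:])]
    # A predicted rise means the forecast exceeds the previous day's actual price.
    predicted_up = [pred > cur for cur, pred in zip(actual, predicted[1:])]

    tp = sum(a and p for a, p in zip(actual_up, predicted_up))
    fp = sum(p and not a for a, p in zip(actual_up, predicted_up))
    fn = sum(a and not p for a, p in zip(actual_up, predicted_up))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up daily prices and forecasts, for illustration only.
actual = [10.0, 11.0, 10.0, 12.0, 13.0]
predicted = [10.0, 10.5, 11.5, 9.0, 12.5]
precision, recall, f1 = direction_scores(actual, predicted)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it always lies between the two.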

Figure 9 shows the predicted prices of various cryptocurrencies for March 31st, 2019. We ran our model on March 30th, 2019 using the data of the previous 5 days. As can be seen in the figure, the predicted prices are very close to the actual trends, which supports the validity of the model.

Figure 9 — Prediction of various cryptocurrency prices on March 31st, 2019

Figure 10 shows the predictions generated by the Keras model cron task, which runs every day at 00:00 PST to predict the next day's price.

Figure 10 — Live predictions on the web front end

5.2.2 Topic Modeling
We built an interactive topic model of the top 10 topics in the live news relating to cryptocurrencies. The LDA model was trained using Gensim and visualized with pyLDAvis.

5.2.3 Live Bitcoin Global Transaction Nodes
The transaction nodes of the cryptocurrency exchanges all over the globe are shown live on a world map on the web front end.

5.2.4 Live Transaction Size
This is an interesting live visualization of the Bitcoin transactions happening around the world. In the WebGL graphics, Bitcoin transactions are rendered on a plane surface, with a cube representing the last mined block and balls indicating the transaction volume, with color indicating their sizes. Figure 11 gives a glimpse of how the visualization looks on the web front end.

Figure 11 — Live Bitcoin Transaction Volume

The above modules are interactive and easy to use, and they help the user gain a deeper understanding of the cryptocurrency market.

6. Data Product

Our data product is the CRYPTOIntel web dashboard and Android application, which present all of this information interactively in graphical form. CRYPTOIntel brings big data and machine learning techniques together in a single dashboard that covers the cryptocurrency market. The dashboard is available at this link: http://nml-cloud-20.cs.sfu.ca/cryptointel/. On a broader level, our data product has 5 main modules, which are as follows:

  • Live Market Trends Visualization
    Figure 11 shows the landing page of our data product, which presents several live visualizations related to various cryptocurrencies. The user can better understand the data through these visualizations, which include 24-hour market trends in various graphs, market trades for various cryptocurrencies, an L2 order book snapshot, live crypto prices, live news with sentiment, and a module for live price conversion.
Figure 11 — Realtime Visualization of Cryptocurrency World
  • Live Sentimental Analysis
    This is a two-part module that includes live sentiment analysis of streaming tweets relating to various cryptocurrencies and live sentiment classification of news articles in real time, as shown in figure 12.
Figure 12 — Live Sentiment Analysis
  • Prediction
    We used LSTM to predict the next day price of various cryptocurrencies using sentiment analysis and OHLCV data. The prediction is shown in figure 14 for Bitcoin.
Figure 14 — Prediction by Sentiment Analysis and OHLCV data
  • Global Bitcoin Transactions
    This is also a two-part module, with two major visualizations. The first part shows the reachable Bitcoin transaction nodes around the globe in real time, as shown in figure 15 [1]. The second part shows live Bitcoin transactions in real time in an interesting way, as shown in figure 11 above.
Figure 15 — Live Global Transaction Nodes
  • Android Application
    We have also extended our dashboard to an Android application, which provides the same level of graphical representation as the web application but is tailored for a mobile view, as shown in figure 16.
Figure 16 — Android Application

Here is the video explaining our data product:

7. Lessons Learnt

Taking on a project that dives into the cryptocurrency world made us learn and think as data scientists. It gave us not only insights into many things but also experience in overcoming challenges. Here we discuss what we learned from this project:

  • We learned how to use the power of natural language processing to give users an up-to-date news sentiment trend of the cryptocurrency market.
  • When we were building the price prediction module based on news articles, the original idea was to use the number of articles as a feature. After implementing it, we found that it did not perform well for price prediction, so we decided to use the sentiment polarity of the news articles as one of our features instead, and this turned out to be a better model.
  • Designing the front end involved numerous important design decisions. The most involved task was building a live data pipeline for the front-end modules, especially for the Google Trends data. There were many options that scrape via Python at the backend and serve the result as an API to the front end, but all of them had some kind of lag that hindered the user experience. The solution we settled on was using JavaScript to make continuous JSON calls. For the Google Trends data, we execute the Python module via PHP calls to fetch the data from Google Trends.
  • There were many options for the central database storage, and we went with Cassandra. We first tried MySQL, which was easy to set up and had a direct driver for PHP, but we ran into two issues. First, since we were using PySpark, it worked faster with Cassandra than with MySQL. Second, MySQL didn't provide fault tolerance for our data, and given the nature of the project we needed a fault-tolerant database. Considering all of this, Cassandra was the better choice: given its distributed nature and SFU's reliable Cassandra server, we were able to set up a good data server for our project.

8. Summary

In summary, CRYPTOIntel is a one-stop dashboard that answers users' queries about cryptocurrencies, from basic general information to price prediction. Users can learn about various cryptocurrencies through several interactive, user-driven visualizations, so that they can better understand the trends and make an informed decision. Examples include graphs of the price trends, up-down arrows denoting price fluctuations, an interactive word cloud in which the user can select the number of words and shuffle them for a better view, and interactive topic modeling in which the user can select the topic and the alpha value. With these visualizations, we aim to give users an end-to-end understanding of various cryptocurrencies, because visualizations are easier to understand than numbers. The data from several sources was put together to form these interactive visualizations.

References

[1] Bitnodes Crawler — https://github.com/ayeowch/bitnodes

[2] Topic Modeling with Gensim (Python) — https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
