Cryptocurrency Sentiment Data

Aaron Li

Published in

Qokka

3 min read · Apr 1, 2020

We released a cryptocurrency sentiment dataset: https://github.com/Qokka/crypto-sentiment-data

This dataset includes daily sentiment and price data from 2019 for cryptocurrencies that

are listed on Coinbase, or
have a high market cap with an active online community for 2+ years.

At Qokka | Crypto (https://crypto.qokka.com), we continuously analyze community discussions from Reddit and Telegram for more than 1,500 blockchain projects that have tradable cryptocurrencies. We include a community only if it is well established (e.g. /r/bitcoin on Reddit) or is officially managed by the project (e.g. https://t.me/algorand on Telegram). We use third-party data sources such as CoinMarketCap to collect and verify the status of the communities.

Most projects have only one active community. For example, Bitcoin is only on Reddit. Some projects have both an active Reddit community and an active Telegram group (e.g. EOS).

Here are some highlights of our data:

Data Collection

We are one of the very few teams collecting authentic community discussion data from Reddit and Telegram, whereas others include only Twitter data. Yet blockchain and cryptocurrency communities are mostly active on Reddit, Telegram and Discord (coming soon).

We built our own large-scale, real-time crawlers to collect all relevant data from various community platforms.

Data Cleaning

We only analyze informative discussion data; uninformative data, such as memes and emojis, is excluded. We also transform discussion data into a format that emphasizes more than what and how things are described and discussed.
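To make the idea of dropping uninformative messages concrete, here is a minimal sketch of one possible filter. It is not our production cleaning pipeline: the `is_informative` function, the `MIN_WORDS` threshold and the English-only word pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical threshold, chosen only for this illustration.
MIN_WORDS = 3

URL_RE = re.compile(r"https?://\S+")
# Simple ASCII word pattern; an assumption that only suits English text.
WORD_RE = re.compile(r"[A-Za-z]{2,}")


def is_informative(message: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the message still has at least `min_words`
    alphabetic words once URLs are stripped out."""
    text = URL_RE.sub(" ", message)
    return len(WORD_RE.findall(text)) >= min_words


messages = [
    "🚀🚀🚀",                                                # emoji only
    "lol",                                                   # too short
    "The new staking rewards look lower than last quarter.",
    "https://example.com/meme.png",                          # bare link
]

# Keep only the messages that pass the heuristic.
kept = [m for m in messages if is_informative(m)]
print(kept)
```

A rule this simple would misjudge short but meaningful messages, so treat it as a starting point rather than a recommendation.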

Data Analysis

As with the crawlers, we built our own systems, algorithms and machine learning models for sentiment analysis, topic modeling and interest profiling. Others rely on third-party services, such as IBM Watson, Google NLP, Amazon Comprehend and Social Market Analytics, which are not designed for analyzing conversational community discussions.

For sentiment analysis:

We train our models using data where labels are readily available.
We use deep learning methods that recognize semantic compositions in sentences.
We predict the sentiment of each sentence using 5 classes with a score of 1 to 5: very negative, negative, neutral, positive, very positive. This means our methods don’t just look for keywords in sentences (e.g. good, bad, great). Instead, they look at the overall structure and the relationships between the words to some extent, and in principle shouldn’t be confused by complex structures like double negatives. However, they may still be confused in less common cases (e.g. sarcasm).
We then combine the scores of sentences into a score between 1 and 5 for the discussion.
In the end, when we compute the sentiment for a time range (e.g. for an hour), we use a combination of metrics (e.g. positive percentage, negative percentage, number of discussions) and compute a score between 0 and 1. Note that this score is not normalized across cryptocurrencies, or across different time ranges for one cryptocurrency. We leave that to the discretion of our users, since they might need the flexibility to normalize the scores differently.
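The aggregation steps above can be sketched roughly as follows. This is not our actual formula: averaging sentence scores, the 2.5 and 3.5 cut-offs, the volume weight `k` and the function names are all assumptions made for illustration. The `zscore` helper shows one way a user might normalize the time-range scores of a single cryptocurrency.

```python
from statistics import mean, pstdev


def discussion_score(sentence_scores):
    """Combine per-sentence scores (1 to 5) into one discussion score.
    Using the plain mean is an assumption, not the published method."""
    return mean(sentence_scores)


def window_score(discussion_scores, k=10):
    """Map the discussions in a time range to a score in [0, 1].
    Assumed formula: the positive share minus the negative share,
    damped when there are few discussions (k is a made-up constant)."""
    n = len(discussion_scores)
    if n == 0:
        return 0.5  # no data: treat as neutral (assumption)
    pos = sum(s > 3.5 for s in discussion_scores) / n
    neg = sum(s < 2.5 for s in discussion_scores) / n
    weight = n / (n + k)  # more discussions, more trust in the signal
    return 0.5 + 0.5 * (pos - neg) * weight


def zscore(values):
    """Normalize a series of time-range scores for one cryptocurrency."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Four discussions in one hour, each a list of sentence scores.
hour = [[5, 4], [1, 2, 2], [3], [4, 5, 5]]
score = window_score([discussion_score(d) for d in hour])
print(round(score, 4))

# Normalizing three hourly scores for the same cryptocurrency.
normalized = zscore([0.4, 0.5, 0.6])
print(normalized)
```

Z-scoring within one cryptocurrency makes its hours comparable with each other; comparing across cryptocurrencies would need a different choice, such as ranking or min-max scaling over all of them.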

Have fun with our dataset! Please also join our community on Discord: https://discordapp.com/invite/UyrH3QK . Ask us questions and connect with others.

Written by Aaron Li