Quantitative Research on bitcointalk.org Forum Posts

Published in Wolverine Blockchain

Apr 6, 2018

Source (https://www.linkedin.com/pulse/non-technical-introduction-lda-topic-models-hotel-reviews-fatih-akici/)

1. Introduction

In the crypto asset space, textual information usually carries a lot of nuance about a project. Such textual information includes the project whitepaper, tweets, community discussion, and even the project's website.

However, this textual information is also highly unstructured and usually has to be assessed qualitatively. After all, it does not take much effort to throw together a bag of buzzwords, create a bloated whitepaper, and let investors figure it out themselves. The research group at Wolverine Crypto Trading therefore decided to conduct quantitative research on the textual information for various crypto assets. Specifically, we used the Latent Dirichlet Allocation (LDA) model, which uses statistical methods to group words into topics, to generate a topic list from the text.

This research is in collaboration with Professor Andrew Wu at the Ross School of Business and is inspired by Professor Wu’s prior works, Word Power: A New Approach for Content Analysis and Deciphering Fedspeak: The Information Content of FOMC Meetings.

2. Data

The research group scraped the bitcointalk.org forum and collected all user information and posts from the new coin announcement threads, netting 2.6 million posts across 1142 crypto assets.

For example, we scraped the Electroneum announcement thread, which includes 6093 comments made by the community and the Electroneum developers. Within the thread, the developers and community members often discuss the project development timeline, exchange-listing schedules, and marketing efforts. Analyzing these communications can give us a better idea of the current state of the project and its likelihood of future success.

However, reading and comprehending these posts manually is an extremely dull process. It is a repetitive and cognitively demanding task, so the research group decided that an automated, quantitative method could be valuable for extracting the insights hidden in this information.

3. Analysis

The first step is to pull all the posts made by the original poster of the thread, who is also the developer of the crypto asset in question.
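As a rough illustration of this filtering step, the sketch below keeps only the posts written by the author of the first post in a thread. The `author` and `text` field names, the sample thread, and the assumption that posts are stored in chronological order are all hypothetical; the scraper's actual data layout is not described here.

```python
from typing import Dict, List


def original_poster_posts(thread: List[Dict[str, str]]) -> List[str]:
    """Return the text of every post written by the author of the first post."""
    if not thread:
        return []
    # Assumption: posts are in chronological order, so the first post
    # belongs to the original poster (the project developer).
    op = thread[0]["author"]
    return [post["text"] for post in thread if post["author"] == op]


# Hypothetical sample thread, for illustration only.
sample_thread = [
    {"author": "dev_team", "text": "[ANN] Launching our new coin."},
    {"author": "alice", "text": "When is the exchange listing?"},
    {"author": "dev_team", "text": "Exchange listing is planned for next month."},
]
op_posts = original_poster_posts(sample_thread)
print(op_posts)
```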

Then we split the paragraphs into sentences and the sentences into words. This process is called tokenization, and each resulting word is called a token. We then clean the text by removing stop words, which are words (such as “the” or “of”) that contribute little meaning to a sentence and serve mainly a grammatical purpose.
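The tokenization and stop-word removal described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pipeline actually used in the research: the regular expressions and the tiny `STOP_WORDS` set are assumptions, and a real stop-word list would be much longer.

```python
import re
from typing import List

# A tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
              "the", "to", "we", "will"}


def split_sentences(text: str) -> List[str]:
    """Split a paragraph into sentences at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Lowercase a sentence and extract its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


paragraph = "The wallet is ready. We will list the coin on an exchange!"
sentences = split_sentences(paragraph)
tokens = [remove_stop_words(tokenize(s)) for s in sentences]
print(tokens)  # [['wallet', 'ready'], ['list', 'coin', 'exchange']]
```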

The last step is to feed the tokens into the LDA model. This process can be repeated multiple times because the LDA model has a “number of topics” parameter, set by the researcher, that determines how many topics the algorithm generates for the given text.
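To make the role of the “number of topics” parameter concrete, here is a small, self-contained sketch of LDA fitted by collapsed Gibbs sampling. This is one common way to fit LDA and is an assumption for illustration; the fitting method used in the research is not stated, and in practice a library implementation (for example in gensim or scikit-learn) would normally be used. The function names, the `alpha` and `beta` prior values, and the sample documents are all hypothetical.

```python
import random
from typing import List, Tuple


def lda_gibbs(docs: List[List[str]], num_topics: int, iterations: int = 100,
              alpha: float = 0.1, beta: float = 0.01,
              seed: int = 0) -> Tuple[List[str], List[List[int]]]:
    """Fit LDA by collapsed Gibbs sampling; return (vocab, topic-word counts)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return vocab, topic_word


def top_words(vocab: List[str], topic_word: List[List[int]], n: int = 10) -> List[List[str]]:
    """Return the n most frequent words assigned to each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: -row[i])
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Hypothetical tokenized posts, for illustration only.
docs = [
    ["dividend", "company", "share", "company"],
    ["wallet", "exchange", "listing", "wallet"],
    ["company", "dividend", "equity"],
    ["exchange", "wallet", "listing"],
]
vocab, topic_word = lda_gibbs(docs, num_topics=2)
topics = top_words(vocab, topic_word, n=3)
for number, words in enumerate(topics, start=1):
    print(f"Topic {number}: {', '.join(words)}")
```

Rerunning `lda_gibbs` with a different `num_topics` value is what "repeating the process" amounts to in this sketch. Because the sampler is random, the exact words in each topic depend on the seed and on the number of iterations.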

4. Preliminary Result

As a preliminary test run, the researchers analyzed a randomly selected list of seven coins. For each coin, 10 topics were generated, and each topic shows its top 10 most relevant words.

Source (https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html)

Each paragraph represents a topic and the words that belong to that topic. Researchers still have to infer the theme behind each set of words. For example, the first topic of Amber Coin includes words like “dividend” and “company”, so this topic is likely tied to founding a business or offering an equity token.

Please download the full report here.

Generated report of Amber Coin using LDA model

5. Next Step

Our immediate next step is to create a report template with better readability. Currently the algorithm only generates a very crude report, as shown above, with each line representing a topic and the words that belong to it. A tabular format might be more appropriate for presenting the list of topics and the key words that belong to each topic.
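One possible shape for such a tabular report is sketched below as a plain-text table with one row per topic. The function name, the column headers, and the sample topics are hypothetical (only “dividend” and “company” come from the Amber Coin example); this is not the report template itself.

```python
from typing import List


def format_topic_table(topics: List[List[str]]) -> str:
    """Render topics as a plain-text table with one row per topic."""
    header = ("Topic", "Key words")
    rows = [(f"Topic {i + 1}", ", ".join(words)) for i, words in enumerate(topics)]
    # Pad the first column to the width of its longest entry.
    width = max(len(name) for name, _ in rows + [header])
    lines = [
        f"{header[0]:<{width}} | {header[1]}",
        "-" * width + "-+-" + "-" * len(header[1]),
    ]
    lines += [f"{name:<{width}} | {words}" for name, words in rows]
    return "\n".join(lines)


# Hypothetical topics, for illustration only.
sample_topics = [
    ["dividend", "company", "share"],
    ["wallet", "exchange", "listing"],
]
table = format_topic_table(sample_topics)
print(table)
```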

In the long term, we want to link the topic analysis to each coin’s price and volume information. For example, we want to find which sets of words make a crypto project more likely to succeed (as represented by a healthy trading volume after the ICO). Conversely, we want to find which sets of topics are merely buzzwords that have little impact on the price or volume of the crypto asset.

Ultimately, our goal is to help the average investor filter through the large volume of information and focus on the pieces that are truly important to the project.

6. Get in Touch

The project leader, Dingan Derek Chen, can be reached at dinganc@umich.edu.


Written by dc