Topic Modelling of Blockchain Token Whitepapers using Latent Dirichlet Allocation

What separates a good crypto project from a bad one?

Aziz Budiman
Data And Beyond
8 min read · Jun 21, 2023


Photo by Traxer on Unsplash

Note: Information on crypto and blockchain whitepapers was extracted as of September 2019 from the Whitepaper Database. Also note that the outcome of this project is based on simplifying assumptions and should not be taken as any form of financial advice.

Introduction

To the moon! 🚀 The sacred words when the price of a cryptocurrency token skyrockets from fractions of a cent to hundreds or thousands of dollars. Cryptocurrencies, despite being highly risky and volatile assets, remain an attractive venture among young investors. There were over 3,000 tokens listed on CoinMarketCap at the start of 2023. But how do we know which crypto projects are in it for the long term and which are suspiciously headed for collapse?

In this project, we apply web scraping and natural language processing (NLP) techniques using the Python libraries BeautifulSoup and gensim to extract and model the textual content of official whitepapers from a list of blockchain tokens as of September 2019. The main objective is to assess the credibility of each token whitepaper and predict the probability of the project being liquidated, based on historical prices retrieved using the CoinGecko API. This article covers the topic modelling portion.
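
To make the extraction step concrete, below is a minimal scraping sketch. The Whitepaper Database's actual page structure is not reproduced in this article, so the URL and the table selectors here are hypothetical placeholders.

#Minimal sketch of step 1 (hypothetical URL and page structure)
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/whitepaper-database")  #placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

tokens = []
for row in soup.select("table tr"):  #assumes one token per table row
    cells = row.find_all("td")
    if len(cells) >= 2 and cells[1].find("a"):
        tokens.append({
            "name": cells[0].get_text(strip=True),
            "whitepaper_url": cells[1].find("a")["href"],
        })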

What is a Whitepaper?

In layman's terms, a whitepaper is a summary of a blockchain or token project. The main contents are the funding, the team behind the project, and the roadmap for the next few years. Some teams also maintain a GitHub repository to showcase the token's source code and technical documentation, such as the smart contracts and the types of transactions executed. More importantly, the content crypto investors tend to look out for is the token's use case, commonly known as tokenomics. This includes the initial coin offering (ICO), total supply, and token price.

Project Workflow for Topic Modelling:

  1. Extracted the list of tokens and whitepaper URLs from the website
  2. Parsed whitepapers in either PDF or Word format to build a whitepaper corpus
  3. Applied the textstat Python library to assess the readability of each whitepaper
  4. Utilized NLP libraries such as nltk and sklearn feature extraction to perform stopword removal, lemmatization, and tokenization, generating the corpus word cloud and dictionary (a pre-processing sketch follows this list)
  5. Applied Latent Dirichlet Allocation (LDA) from the gensim library to build a topic model from the pre-processed whitepaper corpus
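
The pre-processing code itself is not shown in this article, so the sketch below illustrates one way step 4 could look using nltk; the exact cleaning steps are an assumption, while the column names text, text_corpus, and processed_text match those used later in the article.

#One possible implementation of step 4 (an illustration, not the team's exact code)
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    #lowercase, tokenize, drop stopwords and non-alphabetic tokens, lemmatize
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

#Token lists for gensim; joined strings for the word cloud and TextBlob
data['text_corpus'] = data['text'].apply(preprocess)
data['processed_text'] = data['text_corpus'].apply(' '.join)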

Challenges and Bottlenecks

In any data science project, the ultimate challenge lies in the data extraction and cleaning process. Not everyone likes cleaning, but you need it to distill accurate insights from your data. Below are some of the bottlenecks faced during the project.

Photo by Darwin Vegher on Unsplash

  1. Invalid URLs: More than 60% of the URLs on the whitepaper database were either broken or invalid.
  2. Different file formats: Some whitepapers were in other, unknown formats that could not be downloaded and parsed into a corpus. Hence, we extracted only whitepapers in PDF or Word format.
  3. Different written languages: Not all whitepapers are written in English, and most translation tools could not handle some of the technical concepts mentioned in them. As a result, only whitepapers written in English were considered.
  4. Steep learning curve: The team had to pick up complex concepts in blockchain technology and smart contracts within a short period of time. Neither of us had strong technical knowledge, so we had to research quickly to gain a factual understanding. For this project, we did not factor in the specific use cases of each token, such as interoperability and zero-knowledge rollups.

Out of the over 2,800 whitepapers downloaded, approximately 880 could be parsed into a corpus for pre-processing.

Assessing Readability using Flesch Reading Ease

After pre-processing the dirty data, we moved on to creating a readability score for each token whitepaper. This determines whether the majority of the content can be understood in plain, layman's English. We apply the Flesch Reading Ease score from the textstat library to create the readability variable in the dataset. For further details on its derivation and calculation, you may refer to the link below.
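
For reference, the Flesch Reading Ease score is computed as:

FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

Higher scores indicate easier text: roughly 90–100 corresponds to primary-school level, 60–70 to high-school level, and 0–30 to college-graduate level. Dense technical documents can even produce negative scores, which explains the wide bins used below.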

Code to generate readability scores

#Import libraries
import textstat
import pandas as pd
import matplotlib.pyplot as plt
from itertools import groupby
from string import punctuation


#Collapse runs of repeated punctuation in the corpus
punc = set(punctuation)
def remove_dup_punc(x):
    new_text = []
    for k, g in groupby(x):
        if k in punc:
            new_text.append(k)  #keep a single copy of repeated punctuation
        else:
            new_text.extend(g)
    return ''.join(new_text)

#Create new column with duplicate punctuation removed
data = data.sort_index()
data['remove_punc'] = data['text'].apply(remove_dup_punc)

#Score each whitepaper with textstat's Flesch Reading Ease
data['readability'] = data['remove_punc'].apply(textstat.flesch_reading_ease)

#Bin the scores and plot their distribution
scores = pd.cut(data['readability'],
                bins=[-22000, -300, -200, -100, 0, 30, 70, 100, 200, 300],
                include_lowest=True)
ax = scores.value_counts(sort=False).plot.bar(color="y", figsize=(12, 8))
plt.show()
Readability Scores based on Flesch Reading Ease. Image from author.

The bar chart indicates that the majority of the whitepapers are readable by high-school students and above. Some outliers can be found where the overall content of the whitepaper is either less than half a page long or written in a non-English language.

Word Cloud of Token Whitepaper Corpus

As with most text analytics projects, we generate a word cloud to identify the most common words used in the corpus. For this project, we excluded obvious words such as ‘whitepaper’, ‘user’, and ‘blockchain’ in an attempt to surface other, potentially meaningful words before the topic modelling.

Code to extend stopword removal and generate word cloud

from wordcloud import WordCloud, STOPWORDS


#Join the processed text of all whitepapers into a single string
all_words = ",".join(list(data['processed_text'].values))

#Extend the default stopword set with domain-specific terms
#(STOPWORDS is a set, so update() is used rather than extend())
stopwords = set(STOPWORDS)
stopwords.update(["whitepaper", "user", "blockchain", "contract",
                  "use", "network"])

#Generate the word cloud and render it as an image
wordcloud = WordCloud(background_color="white", max_words=1000,
                      contour_width=3, contour_color='steelblue',
                      stopwords=stopwords).generate(all_words)

wordcloud.to_image()
Word cloud from Blockchain Whitepaper. Image from author.
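
As a side note, to_image() returns a PIL image that renders inline in a notebook; to save the image to disk instead, the library also provides wordcloud.to_file("wordcloud.png").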

The word cloud indicates that the majority of the content in the whitepapers focuses on developing platforms and systems for decentralized transactions and services. At the time of this project, tokens emphasised the application of consensus mechanisms and the development of smart contracts for borderless transactions.

Generating Polarity and Subjectivity Scores

Unlike customer reviews or newspaper articles, whitepapers are not natural candidates for sentiment analysis. The content is usually objective, with a neutral standpoint on the project roadmap, organizational structure, and actual use case of the token. In this case, we derive the polarity and subjectivity scores of each whitepaper using the textblob library. The polarity score aims to detect any emotional content in the whitepaper, whereas subjectivity determines whether the content is factual or opinionated.

Code to generate polarity and subjectivity scores

from textblob import TextBlob

#Extract the polarity score from TextBlob's sentiment tuple
def polarity(x):
    return TextBlob(x).sentiment.polarity

#Extract the subjectivity score from TextBlob's sentiment tuple
def subjectivity(x):
    return TextBlob(x).sentiment.subjectivity

data['polarity_score'] = data['processed_text'].apply(polarity)
data['subjectivity_score'] = data['processed_text'].apply(subjectivity)

data.head()
Polarity and subjectivity scores from corpus. Image from author.
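
For interpretation: TextBlob's polarity ranges from -1 (most negative) to +1 (most positive), and subjectivity from 0 (fully objective) to 1 (fully opinionated). Given the nature of whitepapers discussed above, scores near zero polarity and low subjectivity are what we would expect.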

Topic Modelling

Now comes the interesting part of the project: the topic modelling. After the hard work of pre-processing and tokenizing the corpus, we utilize the gensim library to create word vectors, build the vocabulary, and construct the dictionary used to create the bag-of-words representation.

#import Gensim libraries and pprint
import gensim
import gensim.corpora as corpora
from gensim.models import Word2Vec, CoherenceModel, LdaModel
from pprint import pprint


# Create word vectors from the corpus
# (in gensim 4.x the 'size' parameter is named 'vector_size')
model = Word2Vec(size=200, window=10, min_count=50, sg=0)
model.build_vocab(data['text_corpus'])

# Build the dictionary from the tokenized corpus
dictionary = corpora.Dictionary(data['text_corpus'])

#Filter out words that occur in fewer than 50 whitepapers
#or in more than 60% of them
dictionary.filter_extremes(no_below=50, no_above=0.6)
print('Total Vocabulary Size:', len(dictionary))

#Transform the tokenized corpus into bag-of-words vectors
bow_corpus = [dictionary.doc2bow(doc) for doc in data['text_corpus']]

#Fit an LDA model with 10 topics
lda_model = LdaModel(corpus=bow_corpus,
                     id2word=dictionary, num_topics=10,
                     random_state=42, update_every=1,
                     chunksize=100, passes=10, alpha='auto',
                     per_word_topics=True)
10 Topics Derived from Whitepaper Corpus. Image from author.
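
To list the topics as text rather than an image, the pprint import from the block above can be combined with gensim's print_topics:

#Show the top 10 weighted words for each of the 10 topics
pprint(lda_model.print_topics(num_words=10))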

Apart from generating the topic clusters, we also derived the perplexity and coherence scores from the LDA model. Perplexity is a performance metric for how well the LDA model predicts the text, whereas coherence measures the quality of the topics. To learn more about these metrics, you may refer to the article below by Shashank Kapadia.

Code for generating perplexity and coherence scores

# Compute perplexity (log_perplexity returns a per-word likelihood bound)
print('\nPerplexity:', lda_model.log_perplexity(bow_corpus))

# Compute coherence score using the c_v measure
coherence_model_lda = CoherenceModel(model=lda_model,
                                     texts=data['text_corpus'],
                                     dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score:', coherence_lda)
Perplexity and Coherence Scores of LDA model. Image from author.

From the visualization, the 10 topics derived had varying weights in terms of the percentage of tokens. In topic 1, the word “consensus” was mentioned frequently, followed by “layer” and “storage”. This suggests that the majority of token whitepapers cover common use cases around consensus mechanisms such as Proof-of-Work (PoW) and Proof-of-Stake (PoS).
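
The article does not show the code behind this visualization; one common choice for interactive LDA topic charts (an assumption here, not necessarily the tool the team used) is pyLDAvis:

#Interactive topic visualization with pyLDAvis (assumed tooling)
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis)  #renders inline in a notebook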

The perplexity score of -7.53 indicates that the LDA model performs well on the data and is able to generate the topic clusters accurately. As for the coherence score using the c_v method, a score of 0.50 shows that the quality of the topics is relatively good, with no repeated terms in each topic cluster.

Conclusion

Despite being an ambitious and highly tedious project under time constraints, the team managed to pull it off and deserves a pat on the back. The latter part of the project, the predictive modelling, could have been improved by tuning hyperparameters and employing regularization methods.

If you enjoyed reading my content:

  1. Give this article a clap 👏 and follow Aziz Budiman for stories and blogs on all things Data, Artificial Intelligence, Math, and FinTech.
  2. Show your support and buy me a coffee perhaps? It’s okay if you are unable to do so at this point in time.
  3. Feedback and comments are welcome as this is a platform to learn from one another.
  4. Let’s connect: LinkedIn | GitHub



Curious about AI, Fintech, and Math while sipping coffee from a tumbler. Leisure runner and guitarist.