TL;DR: Using NLP (spaCy and Gensim) for topic modelling of Hacker News favorite links scraped with Selenium.
I love Hacker News. With just a daily glance you can be up to date in technology, start-ups, etc. It is a link aggregator where users can upvote the links they like the most. Simple and effective.
Not long ago I noticed that I had accumulated a lot of “upvoted” links (I often use it as a bookmarking tool), and I realized that it might be interesting to analyze them and determine what my main interests were.
Could some NLP “magic” be done with them?
I want to answer the following questions:
- Do I have similar or disparate interests?
- In how many topics can I classify them?
- What technologies seem to interest me the most?
- How many upvoted links do I have?
You know the saying: “Divide and conquer”. I split the job in the following three parts:
- Grab the “favorites” (upvoted links) from my Hacker News account.
- Scrape each favorite with Selenium.
- Pre-process and do topic modelling with spaCy and Gensim.
The project has been developed in Jupyter notebooks and is available in this GitHub repository.
Grab the favorites from Hacker News
Scraping Hacker News is pretty straightforward with the requests and BeautifulSoup4 (BS4) libraries. The method is tightly coupled to the current structure of the Hacker News web pages.
The good thing is that they don’t seem to change much.
We use a requests.Session to handle the session after login (cookies).
The main loop iterates over each upvoted link and extracts: the url, title, #comments and #points (karma).
The links are stored in an “items” list.
In my account I have just over 200 upvoted (favorites) links.
NOTE: This code needs an update if the “upvoted” template changes.
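The extraction step inside the loop can be sketched like this. It is a minimal version: the CSS selectors (`tr.athing`, `.titleline`, `.score`, `.subtext`) are assumptions based on the current Hacker News markup and would need updating along with the rest of the code if the template changes.

```python
from bs4 import BeautifulSoup

def parse_upvoted_page(html):
    """Extract url, title, #comments and #points from one 'upvoted' page.

    The selectors ('tr.athing', '.titleline', '.score', '.subtext') are
    assumptions based on the current Hacker News markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for row in soup.select("tr.athing"):
        link = row.select_one(".titleline a")
        meta = row.find_next_sibling("tr")  # points/comments live in the next row
        score = meta.select_one(".score") if meta else None
        comments = 0
        if meta:
            for a in meta.select(".subtext a"):
                if "comment" in a.get_text():
                    comments = int(a.get_text().split()[0])
        items.append({
            "url": link["href"],
            "title": link.get_text(),
            "points": int(score.get_text().split()[0]) if score else 0,
            "comments": comments,
        })
    return items
```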
Scraping the favorites
Once all the favorite links are obtained, it’s time to scrape them.
I want the real HTML, so we need another tool: Selenium.
If you don’t know Selenium, it’s a tool designed to run tests in a web browser automatically.
For Selenium to work it is necessary to use a webdriver to control the browser that will make the calls.
Firefox and Chrome are the main browsers used with Selenium.
For this project I use Chrome because I have had a better experience with it than with Firefox.
We use the headless mode. This way it doesn’t open a new window and it also allows you to run it on remote servers without a graphical interface (GUI).
The initialization operation takes some time (a few seconds). To avoid this initialization on every call, I have created a simple class that manages its initialization and use so that it is only initialized when necessary.
The initialization specifies headless mode and a 10-second timeout for the DOM to become stable. This gives more guarantees of getting the code that a normal visitor to the site actually sees and experiences.
The main method, scrape, makes the headless browser go to the url and return the source code. If there are any errors, it closes the browser.
With our scraper class ready, we can start scraping the links.
For the 206 links in my case, it took 10 minutes.
Pre-processing and topic modelling
The most interesting part arrives (for an NLP fan!).
At this point we have a lot of HTML code, and we need to pre-process it and extract useful information.
The first challenge appears: How do we extract the relevant information from the HTML code? What tags and sections do we look at?
After several tests, I created a method that collects what I consider the most important information: the title, the headings and the rest of the visible text.
The method removes the sections (HTML tags) that are not considered content containers: script, nav, footer, etc.
Then it just extracts the text from paragraphs, table cells, list items, etc.
ScrapedWebVitamined is a special class with methods to retrieve the contents in different flavours.
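A minimal sketch of that extraction step is shown below. The tag lists are my assumptions about what counts as “content”; the real class in the repository is richer.

```python
from bs4 import BeautifulSoup

# Assumption: these tags never hold article content.
NON_CONTENT_TAGS = ["script", "style", "nav", "footer", "header", "aside"]
# Assumption: visible content lives in headings, paragraphs, list items and cells.
CONTENT_TAGS = ["h1", "h2", "h3", "h4", "p", "li", "td"]

def extract_visible_text(html):
    """Drop tags that are not content containers, then keep the title,
    the headings and the rest of the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NON_CONTENT_TAGS):
        tag.decompose()  # remove the tag and everything inside it
    title = soup.title.get_text(strip=True) if soup.title else ""
    chunks = [el.get_text(" ", strip=True) for el in soup.find_all(CONTENT_TAGS)]
    return " ".join(c for c in [title, *chunks] if c)
```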
We use a custom spaCy pipeline to process the scraped content and convert it into features.
Two custom stages are added, based on Jonathan Keller’s work.
The lemmatizer extracts the lemma of each word to collapse inflected forms such as verb conjugations. The “stop words” stage removes stop-word and punctuation tokens.
Before running the pipeline, we add more stop words: words that are very common in all the documents to process.
Ready to set up our NLP pipeline.
The pre-processing is pretty straightforward.
Just before we start, we remove any emojis present in the texts.
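For example, with a simple regex over the emoji Unicode blocks. The ranges below are an approximation, not an exhaustive list.

```python
import re

# Covers the main emoji blocks; an approximation, not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # misc symbols, pictographs, emoticons, transport...
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+",
    flags=re.UNICODE,
)

def remove_emojis(text):
    """Strip emoji characters, leaving the rest of the text untouched."""
    return EMOJI_RE.sub("", text)
```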
It is time to go from text to a numeric model. We generate a vocabulary where each word gets a unique index number.
For the topic modelling we use the Latent Dirichlet Allocation (LDA) model. LDA works on a bag-of-words (BoW) representation, where each document in our corpus becomes a set of (word-index, frequency) pairs.
One of the questions raised was: into how many topics can I classify my interests?
The LDA model needs a fixed number of topics to work with. How can we find out this number?
One possible solution is to try different values and use a performance measure.
For the performance measure we will use coherence (the higher, the better).
The test range goes from a few topics (somewhat unlikely) up to approximately 1/3 of the total number of favorite links analyzed.
In my case, executing the loop took just over 2 minutes.
Plotting the results, the winner is… 15 topics!
Not so bad, I thought I had less focus :)
If we re-run the LDA process for just 15 topics, we get the following:
Although we can already appreciate the main topics, pyLDAvis is an interactive visualization tool with which we can do a better analysis.
pyLDAvis uses PCA to obtain the “main axes” where the topics move.
We can observe which topics are more related and which are less related.
It is also interesting to see the size of each topic.
I seem to have answered the main questions raised.
I have similar interests that can be classified into 15 main topics for a total of 206 links.
The keyword “use” predominates, which is normal considering that most of the links are tutorials or how-tos.
The main topics that can be inferred are:
- Software development (tools, libraries)
- Infrastructure: Kubernetes, GCP, AWS, …
- Hardware stuff.
- ML and AI.
- Startups news.
However, I see the following weak points:
- The scraping and the feature extraction are key. Without good data, there is nothing.
- Pre-processing. Some words or punctuation marks have slipped in that shouldn’t be there.
- The content itself belongs to a single, highly cohesive domain: IT. This means the topics are not very differentiated from one another, so LDA is probably not the best model for this case.
To improve the results I can think of the following improvements:
- Build a knowledge graph. It is important to capture the subjects, actions and objects involved. This can better determine the type of content.
- Use of bigrams and trigrams. Instead of using only single keywords, it can be more clarifying to use pairs or trios of words.
- Better scraping and feature extraction. As mentioned before, without good information there is nothing to do.
- Use of other algorithms, such as lda2vec. The idea is to better capture word relationships, as word2vec does.
I will probably write again if I apply some of these possible enhancements.
Remember that all source code is available in the GitHub repository.
Please, if you liked it, give it a round of applause. And if you want to know more about DevOps, Kubernetes, Docker, etc … follow me :)
“Building a Topic Modeling Pipeline with spaCy and Gensim” by Jonathan Keller, Towards Data Science
“Evaluate Topic Models: Latent Dirichlet Allocation (LDA)” by Shashank Kapadia