Data Science Trends 2016–2021
Trend detection on Towards Data Science posts
Today I scraped 41,739 Medium post titles from the publication Towards Data Science. Based on these titles, I wanted to find out which topics have been trending over time, as well as which topics are trending today.
To perform the trend detection analysis, I needed a clever way of giving context to all this data: essentially, a way to convert every blog post title into a numerical format that preserves its meaning. A good way to approach this kind of challenge is usually to use a pre-trained model from the Hugging Face ecosystem. Since the goal of the analysis is to cluster similar titles together under different “topics”, I opted for the sentence-transformers semantic textual similarity model stsb-roberta-large, which is trained to produce a 768-dimensional embedding for each sentence/title, optimized so that semantically similar sentences end up close together. This results in a total of 41,739 sentence embeddings, i.e. a matrix with dimensions 41,739 × 768.
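For reference, the embedding step might look something like the sketch below. This is a minimal example, not the author's actual code: the list of titles is a hypothetical stand-in for the scraped dataset, and the variable names are illustrative.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical stand-in for the scraped titles; in the real analysis this
# list would contain all 41,739 Towards Data Science post titles.
titles = [
    "A Gentle Introduction to Transformers",
    "Data Science Trends 2016-2021",
]

# Load the pre-trained semantic textual similarity model.
model = SentenceTransformer("stsb-roberta-large")

# Encode every title into a 768-dimensional vector.
# The result is a NumPy array of shape (len(titles), 768),
# i.e. 41,739 x 768 for the full dataset.
embeddings = model.encode(titles, convert_to_numpy=True, show_progress_bar=True)
print(embeddings.shape)
```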
We’re interested in the similarity between titles, which is typically measured by calculating the cosine similarity between their embeddings, i.e. if we have two sentences A and B, which get embedded into…
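As a concrete illustration of this step, the cosine similarity between two title embeddings could be computed roughly as follows. This is a minimal sketch under the assumption that titles are encoded as NumPy vectors; the example titles and variable names are hypothetical, not taken from the original analysis.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("stsb-roberta-large")

# Two example titles standing in for sentences A and B (illustrative only).
title_a = "An Introduction to Gradient Boosting"
title_b = "Gradient Boosting Explained from Scratch"

# Embed each title into a 768-dimensional vector.
a = model.encode(title_a, convert_to_numpy=True)
b = model.encode(title_b, convert_to_numpy=True)

# Cosine similarity: dot(A, B) / (||A|| * ||B||), ranging from -1 to 1,
# where values close to 1 indicate semantically similar titles.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)
```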