Geek Culture

A new tech publication by Start it up (https://medium.com/swlh).

Data Science Trends 2016–2021

5 min read · May 31, 2021

Screenshot by author — live visualization at end of post

Today I scraped 41,739 Medium post titles from the publication Towards Data Science. Based on these titles, I wanted to find out which topics have trended over time and which topics are trending today.

To perform the trend detection analysis, I wanted a clever way of giving context to all this data: essentially, a way to convert the blog post titles into a numerical format that preserves their context. A good way to approach this kind of challenge is usually to use a pre-trained model from the Hugging Face library. Since the goal of the current analysis is to cluster similar titles together under different “topics”, I opted for the sentence-transformers semantic textual similarity model stsb-roberta-large, which is trained on paraphrases specifically to create a 768-dimensional embedding for each sentence/title, optimized to minimize the distance between semantically similar sentences. This results in a total of 41,739 sentence embeddings, i.e. a matrix with dimensions 41739×768.
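The embedding step described above might look roughly like the sketch below. It is not the code used for this post: the helper name `embed_titles` and its `encoder` parameter are hypothetical, and the only library call assumed is `SentenceTransformer(...).encode(...)` from the sentence-transformers package. The embedding width is whatever the chosen model produces, so nothing is hard-coded here.

```python
import numpy as np


def embed_titles(titles, encoder=None, model_name="stsb-roberta-large"):
    """Return one embedding row per title as a 2-D float32 array.

    `encoder` is a hypothetical hook: any object with an `encode(list_of_str)`
    method works, which makes it easy to swap in another model.
    """
    if encoder is None:
        # Imported lazily; the first call downloads the pre-trained weights.
        from sentence_transformers import SentenceTransformer

        encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(list(titles))
    return np.asarray(embeddings, dtype=np.float32)
```

Calling `embed_titles(titles)` on the full list of scraped titles would then yield a matrix with one row per title.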

We’re interested in the similarity between titles, which is typically measured by calculating the cosine similarity between each pair of title embeddings, i.e. if we have two sentences A and B, which get embedded into…
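As a minimal sketch of the pairwise cosine similarity mentioned above, assuming the embeddings are available as a NumPy array with one row per title (the function name and the toy vectors are illustrative, not taken from the original analysis):

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Cosine similarity between every pair of rows in `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # Scale each row to unit length; the clip guards against all-zero rows.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # For unit vectors, the dot product equals the cosine of the angle.
    return unit @ unit.T


# Toy stand-ins for title embeddings: rows 0 and 1 point the same way,
# row 2 is orthogonal to both.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(toy)
```

Here `sims[0, 1]` is 1 (same direction) and `sims[0, 2]` is 0 (orthogonal). For all 41,739 titles the full matrix would take roughly 14 GB in float64, so computing it in chunks or at lower precision may be more practical.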