Data Scientist | Psychologist. Passionate about anything AI-related! Get in touch:

The advantages and pitfalls of common distance measures

Distance Measures. Image by the author.

Many algorithms, whether supervised or unsupervised, make use of distance measures. Measures such as Euclidean distance and cosine similarity can be found in algorithms such as k-NN, UMAP, and HDBSCAN.
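To make the two measures named above concrete, here is a minimal sketch in plain Python. The function names and the example vectors `u` and `v` are illustrative assumptions and do not come from the article.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical example vectors: v points in the same direction as u
# but has twice the magnitude.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]

distance = euclidean_distance(u, v)    # sensitive to magnitude
similarity = cosine_similarity(u, v)   # sensitive only to direction
```

The two vectors are far apart in Euclidean terms yet point in the same direction, so their cosine similarity is (up to floating-point error) 1. This is one reason the choice of measure depends on what "close" should mean for your data.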

Understanding the field of distance measures is more important than you might realize. Take k-NN, for example, a technique often used for supervised learning. By default, it often uses Euclidean distance, which is a great distance measure in itself.

However, what if your data is high-dimensional? Would Euclidean distance still work? Or what if your data consists of geospatial information? …

Image by the author. Icons made by Vectors Market and Freepik from Flaticon.

And how you can supercharge them.

Over the last few years, I have noticed it has become increasingly popular to dislike Jupyter Notebooks, with many people stating you should switch from Jupyter to scripts (here, here, here, here, etc.).

Indeed, there are some disadvantages to using Jupyter Notebooks, but that does not mean you should ignore the trove of advantages that could help you become a more efficient Data Scientist!

Jupyter Notebooks can complement your workflow

As with most tools, it is a matter of using the tool for its intended purpose. …

Pros and Cons of working as a Data Scientist

Photo by wu yi on Unsplash

Let me start off by saying that I truly love the work that I am doing as a Data Scientist! I get to work on interesting technical problems that can highly impact people and businesses.

However, it is not all it's cracked up to be. There are quite a few people who have been transitioning to Data Science after it was called the sexiest job of the 21st century, only to become disillusioned with the field afterward!

In this article, I would like to guide you through the pros and cons of working as a Data Scientist. …

… and how to fix them!

One, perhaps underestimated, aspect of any data-related job is presenting and visualizing your results. Communicating the data that you have at your disposal can be incredibly difficult. With that comes the possibility of accidentally creating misleading graphs.

Although most of us know about the many issues pie charts can present (here, here, and here), there are many ways charts could be misleading.

Not all pie charts are bad! Image by the author.

To put this into perspective, I have created misleading charts myself in the past and still have to be careful not to do so!

And I would argue that most people have this problem. …

Image by the author.

NLP — Getting Started

An in-depth guide to topic modeling with BERTopic

Every day, businesses deal with large volumes of unstructured text, from customer interactions in emails to online feedback and reviews. To deal with this large amount of text, we look towards topic modeling, a technique to automatically extract meaning from documents by identifying recurrent topics.

A few months ago, I wrote an article on leveraging BERT for topic modeling. It blew up unexpectedly and I was surprised by the positive feedback I received!

I decided to focus on further developing the topic modeling technique the article was based on, namely BERTopic.

BERTopic is a topic modeling technique that leverages…


Introducing PolyFuzz, a framework for fuzzy string matching.

As a data scientist, you might be faced with tabular data that has at least one text-based column. Whether these are names, addresses, or company names, in my experience they almost always need to be cleaned, as they are often filled in by people and therefore highly prone to errors.

This is where Fuzzy String Matching comes in. It is a collection of techniques that are used to find the best match between two sets of strings. Although there are many algorithms available, I could not for the life of me find a solution that integrates many of these algorithms.
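As a rough illustration of matching between two sets of strings, the sketch below uses `difflib.SequenceMatcher` from the Python standard library. It is not PolyFuzz's API; the helper `best_match` and the example lists are hypothetical and only show the general idea under one particular similarity measure.

```python
from difflib import SequenceMatcher

def best_match(query, candidates):
    """Return the (candidate, score) pair most similar to query.

    The score is difflib's similarity ratio in [0, 1], computed
    case-insensitively.
    """
    scored = [
        (candidate, SequenceMatcher(None, query.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    return max(scored, key=lambda pair: pair[1])

# Hypothetical data: misspelled entries and the clean names to map them to.
from_list = ["Aple Inc", "Microsft", "Gogle LLC"]
to_list = ["Apple Inc", "Microsoft", "Google LLC"]

# Map every noisy string to its best-scoring clean counterpart.
matches = {name: best_match(name, to_list) for name in from_list}
```

Other algorithms (edit distance, n-gram TF-IDF, embeddings) would slot into the same pattern by swapping out the scoring function.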

Image by the author.


Getting Started, NLP

A minimal method for extracting keywords and keyphrases

Created by Wokandapix

When we want to understand key information from specific documents, we typically turn towards keyword extraction. Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text.

With methods such as Rake and YAKE! we already have easy-to-use packages that can be used to extract keywords and keyphrases. However, these models typically work based on the statistical properties of a text and not so much on semantic similarity.

In comes BERT. BERT is a bi-directional transformer model that allows us to transform phrases and documents into vectors that capture their meaning.

Created by the author with


Extracting informative words per class

In one of my previous posts, I talked about topic modeling with BERT which involved a class-based version of TF-IDF. This version of TF-IDF allowed me to extract interesting topics from a set of documents.

I thought it might be interesting to go a little bit deeper into the method since it can be used for many more applications than just topic modeling!

An overview of the possible applications:

  • Informative Words per Class: Which words make a class stand out compared to all others?
  • Class Reduction: Using c-TF-IDF to reduce the number of classes
  • Semi-supervised Modeling: Predicting the class of unseen…
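
The first application above, finding informative words per class, can be sketched as follows. This is a simplified approximation: it applies plain TF-IDF to one concatenated document per class, and the exact weighting is an assumption that may differ from the c-TF-IDF formula used in the post. The function `class_tfidf` and the example documents are hypothetical.

```python
import math
from collections import Counter

def class_tfidf(docs, labels):
    """Score each word per class, treating every class as one joined document."""
    # Pool the word counts of all documents that share a label.
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    n_classes = len(class_counts)
    # Number of classes in which each word occurs at least once.
    class_freq = Counter(word for counts in class_counts.values() for word in counts)

    scores = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        # Term frequency within the class times an inverse class frequency.
        scores[label] = {
            word: (count / total) * math.log(n_classes / class_freq[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical toy corpus with two classes.
docs = [
    "the match ended with a late goal",
    "the striker scored a goal",
    "the election ended with a recount",
    "the senator won the election",
]
labels = ["sports", "sports", "politics", "politics"]

scores = class_tfidf(docs, labels)
top_sports = max(scores["sports"], key=scores["sports"].get)
top_politics = max(scores["politics"], key=scores["politics"].get)
```

Words shared by every class (such as "the") score zero under this weighting, while words concentrated in a single class rise to the top.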

Image by the author.

Leveraging BERT and TF-IDF to create easily interpretable topics.

When a product owner approaches me to do some NLP-based analyses, I am typically asked the following question:

‘Which topic can frequently be found in these documents?’

Without any categories or labels, I am forced to look into unsupervised techniques to extract these topics, namely topic modeling.

Although topic models such as LDA and NMF have proven to be good starting points, I always felt it took quite some effort through hyperparameter tuning to create meaningful topics.

Moreover, I wanted to use transformer-based models such as BERT as they have shown amazing results in various NLP…

The Quadrant for Need of Psychological Skills in Data-driven Professions. Image by the author.

The Intersection of Psychology and Data Science

In one of my previous articles, I talked about transitioning from psychology (or any social science) to data science. The focus was mostly on the skills one needed to gain to become a fully-fledged data scientist.

However, what if you had already made the transition? What could you do to leverage your existing psychological knowledge as a data-driven professional? It would be such a shame to throw away years of studying and ignore all that you have learned!

I truly believe that psychologists have specific skills that can be used to become great data scientists! …

Maarten Grootendorst