Many algorithms, whether supervised or unsupervised, make use of distance measures. These measures, such as Euclidean distance or cosine similarity, can be found in algorithms such as k-NN, UMAP, and HDBSCAN.
Understanding distance measures is more important than you might realize. Take k-NN, for example, a technique often used for supervised learning. By default, it uses Euclidean distance, which by itself is a great distance measure.
However, what if your data is high-dimensional? Would Euclidean distance still work? Or what if your data consists of geospatial information? …
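To make the difference between these two measures concrete, here is a minimal sketch in plain Python. It shows how two vectors can be far apart by Euclidean distance yet identical by cosine similarity, which only compares direction:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Similarity of direction, ignoring magnitude (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two vectors pointing the same way but with different magnitudes:
a, b = [1.0, 2.0], [2.0, 4.0]
print(euclidean_distance(a, b))  # ~2.236: "far apart" by Euclidean distance
print(cosine_similarity(a, b))   # 1.0: identical by cosine similarity
```

This is exactly why the choice of measure matters: for high-dimensional text embeddings, cosine similarity is often preferred, while Euclidean distance can behave poorly.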
Over the last few years, I have noticed it has become increasingly popular to dislike Jupyter Notebooks, with many people stating that you should switch from Jupyter to scripts (here, here, here, here, etc.).
Indeed, there are some disadvantages to using Jupyter Notebooks, but that does not mean you should ignore the trove of advantages that could help you become a more efficient Data Scientist!
Jupyter Notebooks can complement your workflow
As with most tools, it is a matter of using the tool for its intended purpose. …
Let me start off by saying that I truly love the work that I am doing as a Data Scientist! I get to work on interesting technical problems that can highly impact people and businesses.
However, it is not all it's cracked up to be. There are quite a few people who have been transitioning to Data Science after it was called the sexiest job of the 21st century, only to become disillusioned with the field afterward!
In this article, I would like to guide you through the pros and cons of working as a Data Scientist. …
One, perhaps underestimated, aspect of any data-related job is presenting and visualizing your results. Communicating the data that you have at your disposal can be incredibly difficult. With that comes the possibility of accidentally creating misleading graphs.
To put this into perspective: I have created misleading charts myself in the past and still have to be careful not to do so!
And I would argue that most people have this problem. …
Every day, businesses deal with large volumes of unstructured text, from customer interactions in emails to online feedback and reviews. To deal with this large amount of text, we look towards topic modeling, a technique that automatically extracts meaning from documents by identifying recurring topics.
A few months ago, I wrote an article on leveraging BERT for topic modeling. It blew up unexpectedly and I was surprised by the positive feedback I had gotten!
I decided to focus on further developing the topic modeling technique the article was based on, namely BERTopic.
BERTopic is a topic modeling technique that leverages…
As a data scientist, you might be faced with tabular data that has at least one text-based column. Whether they are names, addresses, or company names, in my experience these almost always need to be cleaned, as they are often filled in by people and therefore highly prone to errors.
This is where Fuzzy String Matching comes in. It is a collection of techniques that are used to find the best match between two sets of strings. Although there are many algorithms available, I could not for the life of me find a solution that integrates many of these algorithms.
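To illustrate the basic idea (this is not the integrated solution discussed in the article, just one of the many matching algorithms it refers to), Python's standard library already ships a simple similarity ratio in `difflib`:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] based on longest matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(query: str, choices: list[str]) -> tuple[str, float]:
    """Return the candidate with the highest similarity score."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    return max(scored, key=lambda pair: pair[1])

# Matching a messy, human-entered company name against a clean list:
companies = ["Apple Inc.", "Alphabet Inc.", "Microsoft Corporation"]
print(best_match("apple inc", companies))  # ('Apple Inc.', ~0.95)
```

Real-world fuzzy matching typically swaps in stronger algorithms (Levenshtein distance, token-sort ratios, etc.), but the interface, scoring candidates and picking the best, stays the same.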
When we want to understand key information from specific documents, we typically turn towards keyword extraction. Keyword extraction is the automated process of extracting the words and phrases that are most relevant to an input text.
With methods such as RAKE and YAKE! we already have easy-to-use packages that can extract keywords and keyphrases. However, these models typically rely on the statistical properties of a text rather than on semantic similarity.
In comes BERT, a bidirectional transformer model that allows us to transform phrases and documents into vectors that capture their meaning.
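The ranking step this enables can be sketched as follows. Note that the `toy_embeddings` lookup table below is a stand-in for a real BERT encoder, invented purely so the logic runs end-to-end; in practice each phrase and the document would be embedded by the model:

```python
import numpy as np

# Stand-in for a real BERT encoder: a toy lookup table of 3-d "embeddings".
toy_embeddings = {
    "supervised learning": np.array([0.9, 0.1, 0.0]),
    "machine learning":    np.array([1.0, 0.0, 0.1]),
    "bread":               np.array([0.0, 0.0, 1.0]),
    "document":            np.array([0.8, 0.2, 0.1]),  # the full input text
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def extract_keywords(doc_key, candidates, top_n=2):
    """Rank candidate phrases by cosine similarity to the document embedding."""
    doc_vec = toy_embeddings[doc_key]
    scored = sorted(
        ((c, cosine(toy_embeddings[c], doc_vec)) for c in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:top_n]

top = extract_keywords("document", ["supervised learning", "machine learning", "bread"])
print(top)  # the two learning-related phrases outrank "bread"
```

The keywords that end up closest to the document vector are, by construction, the ones most semantically representative of it, which is exactly what the statistics-based methods above cannot capture.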
In one of my previous posts, I talked about topic modeling with BERT which involved a class-based version of TF-IDF. This version of TF-IDF allowed me to extract interesting topics from a set of documents.
I thought it might be interesting to go a little bit deeper into the method since it can be used for many more applications than just topic modeling!
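As a minimal numpy sketch of the class-based idea (the weighting below follows the tf · log(1 + A / f) formulation; the actual implementation in the article may differ in details):

```python
import numpy as np

def class_tfidf(counts):
    """Class-based TF-IDF over a (classes x terms) count matrix.

    Each row holds the combined word counts of all documents in one class.
    Weight = within-class term frequency * log(1 + A / f), where A is the
    average number of words per class and f a term's total frequency.
    Assumes every term occurs at least once somewhere (f > 0).
    """
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)  # within-class frequency
    f = counts.sum(axis=0)                           # frequency across classes
    A = counts.sum() / counts.shape[0]               # average words per class
    return tf * np.log(1 + A / f)

# Two classes, three terms; terms 1 and 2 each occur in only one class:
W = class_tfidf([[5, 5, 0],
                 [5, 0, 5]])
print(W.round(3))
```

Terms that are frequent within a class but rare across classes get the highest weights, which is why the top-weighted words per class read as a description of that class. That is what makes the method useful beyond topic modeling.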
An overview of the possible applications:
When I am approached by a product owner to do some NLP-based analysis, I am typically asked the following question:
‘Which topic can frequently be found in these documents?’
Without any categories or labels, I am forced to look into unsupervised techniques to extract these topics, namely Topic Modeling.
Although topic models such as LDA and NMF have been shown to be good starting points, I always felt it took quite a bit of hyperparameter tuning to create meaningful topics.
Moreover, I wanted to use transformer-based models such as BERT as they have shown amazing results in various NLP…
In one of my previous articles, I talked about transitioning from psychology (or any social science) to data science. The focus was mostly on the skills one needed to gain to become a fully-fledged data scientist.
However, what if you have already made the transition? What could you do to leverage your existing psychological knowledge as a data-driven professional? It would be such a shame to throw away years of studying and ignore all that you have learned!
I truly believe that psychologists have specific skills that can be used to become great data scientists! …