Semantic Similarity Calculations Using NLP and Python: A Soft Introduction

Tanner Overcash
5 min readMar 21, 2023

This article covers at a very high level what semantic similarity is and demonstrates a quick example of how you can take advantage of open-source tools and pre-trained models in your Python scripts. I hope you like the word ‘similarity’ because you’re about to read it a thousand times.

Follow along with the code, available here.

Similar Houses? Semantic Similarity? Get it? Photo by Maria Orlova: https://www.pexels.com/photo/similar-houses-in-highland-valley-covered-with-dense-forest-4946680/

Introduction

Semantic Similarity is a field of Artificial Intelligence (AI), specifically Natural Language Processing (NLP), that creates a quantitative measure of the meaning likeness between two words or phrases. At a high level, this is done by converting words, sentences, or phrases into a vector — a mathematical representation of that word, sentence, or phrase— using a process called sentence embedding. Using a function of similarity (great post about different functions of similarity here by Ashutosh Kumar), these embeddings are used to find that quantitative measure of similarity.

This measure of similarity can be attributed to different aspects of the subjects you’re comparing, so be sure you fully understand what your end goals are. For example, some similarity models compare the lengths of words or phrases to determine a similarity measure. Others work best for comparing words in specific lexicons…

--

--

Tanner Overcash

Data Science, Geospatial Analysis, and Software Engineering enthusiast. Owner/Founder of Until We All Come Home, LLC.