Unlocking the Power of Text Classification with Embeddings

Juan C Olamendy
4 min read · Nov 1, 2023


Introduction

Have you ever found yourself drowning in a sea of text data, struggling to make sense of it all?

A couple of days ago, I had to classify a massive amount of sentences into distinct categories, and it felt like an impossible task.

Manual classification was both time-consuming and error-prone.

The common approaches are to train a classifier using deep learning or to craft a prompt for ChatGPT.

The breakthrough came when I discovered the power of using embeddings for text classification.

By converting sentences into numerical vectors and calculating average locations (centroids) for each category, I found a very efficient method.

Today, I’ll walk you through my discovery of using embeddings for text classification.

Algorithm

Embrace the world of text classification with embeddings.

This approach leverages the power of natural language processing (NLP) to convert sentences into numerical vectors (embeddings), allowing for more accurate and efficient classification.

For example, let’s classify sentences into categories, such as concrete or abstract, to help improve writing by encouraging a mixture of both types of sentences.

Algorithm details:

  1. Collect and pre-classify sentences.
  2. Convert each sentence into a numerical vector (embedding) using a pre-trained NLP model (like BERT).
  3. Calculate the average location (centroid) of the embeddings for each category. This centroid represents the central point of the category based on the pre-classified samples.
  4. Classify new sentences:
  • convert the new sentence into an embedding,
  • measure its distance to the centroids of the categories,
  • assign the category whose centroid is closest.

Real-world Example

Pre-classified Sentences

Concrete: “The cat sat on the mat.”

Abstract: “Freedom is the right to choose.”

Generating Embeddings

Using a pre-trained NLP model, convert these sentences into numerical vectors (embeddings).

Calculating Average Locations

Find the central point (average location) of the embeddings for concrete and abstract sentences.

Classifying a New Sentence

New Sentence: “John works every day.”

Convert this new sentence into an embedding.

Measure its distance to the average locations of concrete and abstract sentences.

Figure out whether it is closer to the abstract or concrete centroid.

Code

The training phase: calculate the embeddings and centroids.
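A minimal sketch of the training phase. Toy 3-D vectors stand in here for real sentence embeddings (in practice they would come from a pre-trained model such as BERT, e.g. via the sentence-transformers library); the centroid computation itself is the same either way.

```python
import numpy as np

# Toy 3-D vectors stand in for real sentence embeddings. In practice
# they would come from a pre-trained model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   concrete_embeddings = model.encode(concrete_sentences)
concrete_embeddings = np.array([
    [0.9, 0.1, 0.0],   # "The cat sat on the mat."
    [0.8, 0.2, 0.1],
    [1.0, 0.0, 0.1],
])
abstract_embeddings = np.array([
    [0.1, 0.9, 0.8],   # "Freedom is the right to choose."
    [0.0, 1.0, 0.9],
    [0.2, 0.8, 1.0],
])

# The centroid of a category is simply the mean of its embeddings.
centroids = {
    "concrete": concrete_embeddings.mean(axis=0),
    "abstract": abstract_embeddings.mean(axis=0),
}
```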

Classify a new instance.
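The classification step can be sketched as follows, again with hypothetical centroids and a toy embedding standing in for the real 768-dimensional vector of “John works every day.”:

```python
import numpy as np

def classify(embedding, centroids):
    """Return the label of the nearest centroid (Euclidean distance)."""
    distances = {
        label: np.linalg.norm(embedding - centroid)
        for label, centroid in centroids.items()
    }
    return min(distances, key=distances.get)

# Hypothetical centroids from the training phase (toy 3-D vectors).
centroids = {
    "concrete": np.array([0.9, 0.1, 0.1]),
    "abstract": np.array([0.1, 0.9, 0.9]),
}

# Stand-in embedding for the new sentence "John works every day."
new_embedding = np.array([0.8, 0.2, 0.0])
print(classify(new_embedding, centroids))  # concrete
```

Cosine distance is a common alternative to Euclidean distance here; with normalized embeddings the two give the same ranking.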

Visualize the embeddings

Use a dimensionality reduction algorithm, t-SNE, to reduce the embeddings from 768 dimensions to 2D.

Then, use matplotlib to visualize the reduced embeddings in 2D space to provide insights into their separability and similarity.

Remember to set the perplexity to a low value when initializing the t-SNE object.

The perplexity parameter in t-SNE is related to the number of nearest neighbors used in manifold learning algorithms. The default value is 30, which is higher than the number of samples in this example.

If the perplexity is set higher than the number of samples, the algorithm cannot execute properly because it’s like asking the algorithm to consider more neighbors than available samples.

Another consideration is that after concatenating embeddings, the shape of the result is (6, 1, 768).

We need to remove this extra dimension using the numpy.squeeze() function, which will remove single-dimensional entries from the shape of an array.

Then the resulting shape is (6, 768).
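Putting these pieces together, here is a sketch of the visualization using scikit-learn's t-SNE and matplotlib. Random vectors stand in for the six real 768-dimensional embeddings; note the squeeze call and the low perplexity discussed above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Six random vectors stand in for the real embeddings; after
# concatenation their shape is (6, 1, 768), as noted above.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 1, 768))
labels = np.array(["concrete"] * 3 + ["abstract"] * 3)

# Remove the singleton dimension: (6, 1, 768) -> (6, 768).
embeddings = np.squeeze(embeddings, axis=1)

# Perplexity must stay below the number of samples (6 here).
tsne = TSNE(n_components=2, perplexity=2.0, random_state=0)
reduced = tsne.fit_transform(embeddings)  # shape (6, 2)

for label in ("concrete", "abstract"):
    mask = labels == label
    plt.scatter(reduced[mask, 0], reduced[mask, 1], label=label)
plt.legend()
plt.savefig("embeddings_2d.png")
```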

Embedding visualization

By reducing the dimensions using t-SNE (t-Distributed Stochastic Neighbor Embedding), we can create compelling visualizations.

Conclusion

Text classification with embeddings is a game-changer for anyone dealing with large amounts of textual data.

Whether you’re a technical writer aiming to improve your content or a data scientist working on sentiment analysis, this technique can save you time and boost accuracy.

Say goodbye to manual classification headaches and hello to the future of text analysis.

Try it out for yourself and witness the magic of embeddings transforming your text classification tasks.

If you like this article, share it with others ♻️ That would help a lot ❤️
