Training a text classification model with INSTRUCTOR Embeddings

Abdullah Mubeen
Published in spark-nlp
6 min read · Jan 4, 2024

Introduction

This article will teach you how to train a model, in particular a text classification model, using INSTRUCTOR embeddings with the Spark NLP library in Python. More about why I chose these particular technologies later!

View the notebook in Colab here if you want to jump straight into it!

BONUS: We will also talk a bit about how pipelines in Spark NLP are made, and at the end I will mention some useful resources to get you started with Spark NLP so you can solve your NLP problems easily!

Why INSTRUCTOR?

My fascination with INSTRUCTOR embeddings began with their universal applicability. One of their most compelling aspects is versatility: their design allows them to be applied in virtually any scenario, making them a Swiss Army knife in the world of NLP.

More about INSTRUCTOR

In recent deep learning approaches to text tasks such as classification, scalability is a challenge, particularly when the tasks vary widely, for example topic detection, plagiarism detection, and quality checks. InstructOR embeddings address this by enabling a single model to generate task-specific embeddings from the same document, which streamlines the process and aligns with the instruction-following trend seen in InstructGPT and ChatGPT.

What does InstructOR do?

InstructOR stands for Instruction-based Omnifarious Representations.

‘Omnifarious’ refers to its ability to handle diverse varieties and forms of tasks.

Let’s break it down and see how it works.

  • The model’s embeddings are the way it represents a document.
  • The model is given instructions/descriptions of the task the embeddings will be used for, along with the document that it should embed.
  • Since a single document can be represented uniquely for each task (instruction) that is provided, it means that we can solve a large variety of tasks.

Ultimately, InstructOR is capable of generating task-specific embeddings for a given document.
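To make the idea concrete, here is a minimal sketch in plain Python of how one document can be paired with several task instructions before embedding. The helper name build_instructor_inputs and the second instruction are hypothetical and only illustrate the input shape; the actual embedding step (handled later in this article by Spark NLP) is not shown.

```python
def build_instructor_inputs(document, instructions):
    """Pair one document with each task instruction (hypothetical helper).

    Each [instruction, document] pair would be embedded separately,
    so the same text can receive a different vector per task.
    """
    return [[instruction, document] for instruction in instructions]


# Illustrative instructions; the first one is the instruction used later in this article.
instructions = [
    "Represent the sentences for categorical text classification: ",
    "Represent the sentences for retrieving duplicate sentences: ",
]
document = "Photosynthesis converts light energy into chemical energy."

pairs = build_instructor_inputs(document, instructions)
for instruction, text in pairs:
    print(repr(instruction), "->", text)
```

The document stays the same in every pair; only the instruction changes, and that is what lets a single model produce a different representation for each task.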

Image by authors of the paper: https://arxiv.org/pdf/2212.09741.pdf

Why Spark NLP?

Spark NLP, built by John Snow Labs, is a leading NLP library that supports over 100 languages. It has models that excel at tasks such as NER, POS tagging, tokenization, and sentence similarity, along with advanced functionality such as relation extraction, assertion status detection, and de-identification. It offers user-friendly, scalable, state-of-the-art technology. The entire list of models can be found here.

Now let’s get into training the model

Admittedly, not everyone has access to high-end machines for training NLP models. For this reason, I opted for a cloud-hosted service. While there are numerous providers, each with their own advantages and drawbacks, I chose Colab for this project. The decision was influenced by the manageable size of my data, Colab’s user-friendly experience and collaboration features, and its sufficient computational capabilities for my needs.

Install Spark NLP

I decided to use Python for my code as it’s memory efficient and easy to use. Spark NLP also supports Scala and Java, and a Scala version of this tutorial can be found at the end.

After installing Spark NLP (for example with pip install spark-nlp pyspark), we can import the required functions.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from pyspark.ml import Pipeline
import pandas as pd

spark = sparknlp.start()

Load Data

The next step is to load the actual data. I have divided it into train and test splits and uploaded it to GitHub for ease of access.

!wget https://raw.githubusercontent.com/abdullahmubeen10/ClassifierDL_Training/main/test.csv
!wget https://raw.githubusercontent.com/abdullahmubeen10/ClassifierDL_Training/main/train.csv

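# Read the CSVs with pandas, drop the "Id" column, then convert to Spark DataFrames,
# which is what Pipeline.fit() and transform() expect.
test_df = spark.createDataFrame(pd.read_csv('/content/test.csv').drop("Id", axis='columns'))
train_df = spark.createDataFrame(pd.read_csv('/content/train.csv').drop("Id", axis='columns'))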

We have our data loaded, so now let’s take a quick peek.

Train DataFrame

Train model with INSTRUCTOR

The next step is building the pipeline.


# Convert the raw "Comment" text into Spark NLP's document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("Comment") \
    .setOutputCol("document")

# Embed each document with INSTRUCTOR, conditioned on the task instruction
embeddings = InstructorEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setInstruction("Represent the sentences for categorical text classification: ") \
    .setOutputCol("instructor_embeddings")

# Train a classifier on the embeddings, using "Topic" as the label
classifierdl = ClassifierDLApproach() \
    .setInputCols(["instructor_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("Topic") \
    .setMaxEpochs(20) \
    .setBatchSize(32)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    classifierdl
])

This pipeline, designed for text classification, begins straightforwardly with the DocumentAssembler. This essential component in Spark NLP converts raw text into a structured format, setting the stage for advanced processing. In this instance, it takes ‘Comment’ as its input and produces a ‘document’ output, preparing the data for further analysis.

Following the DocumentAssembler, we introduce InstructorEmbeddings, which transforms the text into embeddings tailored to the provided instruction: ‘Represent the sentences for categorical text classification: ’. This step is crucial for generating meaningful, context-rich embeddings suited to our classification needs.

The final piece of the pipeline is the ClassifierDLApproach, Spark NLP’s trainable deep learning text classifier. Here it reads the INSTRUCTOR embeddings, learns from the ‘Topic’ labels for up to 20 epochs with a batch size of 32, and writes its prediction to the ‘class’ column.
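The way stages connect through column names can be sketched without Spark at all. The toy check below, in plain Python, walks the stages in order and confirms that every input column was either present in the raw data or produced by an earlier stage. The check_wiring helper and the dictionary layout are assumptions made for illustration; they are not part of the Spark NLP API.

```python
def check_wiring(stages, raw_columns):
    """Return (stage name, missing column) pairs for inputs no earlier stage provides."""
    available = set(raw_columns)
    problems = []
    for stage in stages:
        for col in stage["inputs"]:
            if col not in available:
                problems.append((stage["name"], col))
        # Whatever this stage writes becomes available to the stages after it.
        available.add(stage["output"])
    return problems


# Column names mirror the pipeline above; "Topic" is the label column read by the classifier.
stages = [
    {"name": "DocumentAssembler", "inputs": ["Comment"], "output": "document"},
    {"name": "InstructorEmbeddings", "inputs": ["document"], "output": "instructor_embeddings"},
    {"name": "ClassifierDLApproach", "inputs": ["instructor_embeddings", "Topic"], "output": "class"},
]

problems = check_wiring(stages, raw_columns=["Comment", "Topic"])
print(problems)  # an empty list means every stage finds its inputs
```

A mismatch between one stage's output column and the next stage's input column is a common source of pipeline errors, so keeping the names lined up in order is worth a quick look before calling fit.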

Okay, now let’s go ahead and train the model.

pipelineModel = pipeline.fit(train_df)

Let’s see how it performs on the test data.

preds = pipelineModel.transform(test_df)
The results were sorted

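Now let’s use scikit-learn’s metrics to measure INSTRUCTOR’s performance.

from sklearn.metrics import classification_report

# Collect true and predicted labels into pandas; "class.result" holds a one-element list per row
preds_df = preds.select("Topic", "class.result").toPandas()
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])
print(classification_report(preds_df['Topic'], preds_df['result']))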
Pretty good!

Considering we only used 17.25% of the entire dataset, it’s pretty amazing.
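For readers who want to see what the report measures, here is a small sketch in plain Python that computes precision, recall, and F1 for a single class from lists of true and predicted labels. The labels below are toy values made up for illustration, not results from this experiment, and per_class_scores is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for a single class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration only.
y_true = ["Biology", "Biology", "Chemistry", "Physics", "Physics", "Chemistry"]
y_pred = ["Biology", "Chemistry", "Chemistry", "Physics", "Biology", "Chemistry"]

for label in ["Biology", "Chemistry", "Physics"]:
    precision, recall, f1 = per_class_scores(y_true, y_pred, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision asks how many of the predictions for a class were right, recall asks how many of the true examples of that class were found, and F1 is the harmonic mean of the two.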

Let’s train another classifier using different embeddings and compare the results.

Let’s use Spark NLP’s Universal Sentence Encoder.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

The details are described in the paper “Universal Sentence Encoder”.

Here’s a similar pipeline to the one we used for INSTRUCTOR, but now for UniversalSentenceEncoder.

# Same first stage as before: turn "Comment" into a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("Comment") \
    .setOutputCol("document")

# Embed each document with the Universal Sentence Encoder
USE_embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Same classifier settings as the INSTRUCTOR run, for a fair comparison
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category") \
    .setLabelColumn("Topic") \
    .setMaxEpochs(20) \
    .setBatchSize(32)

USE_pipeline = Pipeline().setStages([
    documentAssembler,
    USE_embeddings,
    classifier
])

We fit and transform this pipeline exactly as we did with INSTRUCTOR and feed it the same data.

USE_pipelineModel = USE_pipeline.fit(train_df)
USE_preds = USE_pipelineModel.transform(test_df)

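Now let’s use scikit-learn’s metrics to measure its performance.

from sklearn.metrics import classification_report

# Collect true and predicted labels into pandas; "category.result" holds a one-element list per row
USE_preds_df = USE_preds.select("Topic", "category.result").toPandas()
USE_preds_df['result'] = USE_preds_df['result'].apply(lambda x: x[0])
print(classification_report(USE_preds_df['Topic'], USE_preds_df['result']))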
Pretty similar!

The results are actually quite similar, but INSTRUCTOR seems to have a slight edge!

Let’s visualize these results for a better comparison.

The presented bar chart delineates a side-by-side comparison of both models in terms of Precision, Recall, and F1-Score across the disciplines of Biology, Chemistry, and Physics.

The line graph illustrates the performance in Precision, Recall, and F1-Score for each discipline (Biology, Chemistry, Physics) and overall model performance (Accuracy, Macro Avg, Weighted Avg).
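The ‘Macro Avg’ and ‘Weighted Avg’ figures differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its number of test examples (its support). Here is a minimal sketch with made-up numbers, not the results plotted above; the helper names are hypothetical.

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean of the per-class scores, weighted by each class's example count."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


# Hypothetical per-class F1 scores and supports, for illustration only.
f1_scores = {"Biology": 0.80, "Chemistry": 0.70, "Physics": 0.90}
support = {"Biology": 50, "Chemistry": 30, "Physics": 20}

print(f"macro avg:    {macro_average(f1_scores):.3f}")
print(f"weighted avg: {weighted_average(f1_scores, support):.3f}")
```

When the classes are roughly balanced the two averages stay close; a large gap between them usually points to a class imbalance in the test set.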

References :
