Scikit-LLM: Sklearn Meets Large Language Models

Fareed Khan
6 min read · May 25, 2023

Scikit-LLM is a game-changer in text analysis. It combines powerful language models like ChatGPT with scikit-learn, offering an unmatched toolkit for understanding and analyzing text. With scikit-LLM, you can uncover hidden patterns, sentiment, and context in various types of textual data, such as customer feedback, social media posts, and news articles. It brings together the strengths of language models and scikit-learn, enabling you to extract valuable insights from text like never before.

Official GitHub repository: https://github.com/iryna-kondr/scikit-llm

All examples are taken directly from the official repository.

Let’s get started!

Install Scikit-LLM

Start by installing Scikit-LLM, the powerful library that integrates scikit-learn with language models. You can install it using pip:

pip install scikit-llm

Obtain an OpenAI API Key

As of May 2023, Scikit-LLM is compatible with a specific set of OpenAI models, so you need to provide your own OpenAI API key for the integration to work.

Begin by importing the SKLLMConfig module from the Scikit-LLM library and add your OpenAI API key:

# importing SKLLMConfig to configure the OpenAI API (key and organization)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")

As stated in their GitHub repository —

If you have a free trial OpenAI account, the rate limits are not sufficient (specifically 3 requests per minute). Please switch to the “pay as you go” plan first.

When calling SKLLMConfig.set_openai_org, you have to provide your organization ID and NOT the name. You can find your organization ID here: https://platform.openai.com/account/org-settings

Zero Shot GPTClassifier

One of the cool things about ChatGPT is its ability to classify text without needing to be specifically trained for it. All it requires are descriptive labels.

Introducing ZeroShotGPTClassifier, a class in Scikit-LLM that lets you create such a model just like any other scikit-learn classifier.

# importing zeroshotgptclassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# get the demo classification dataset bundled with Scikit-LLM
X, y = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data
clf.fit(X, y)

# predicting the data
labels = clf.predict(X)
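
Since the predictions come back as plain label strings, you can score them with any scikit-learn metric. For example, a quick sanity check against the data the classifier was fitted on (illustrative only, not a proper held-out evaluation):

# comparing predictions against the known labels
from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(y, labels):.2f}")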

Not only that, Scikit-LLM makes sure that the response it receives actually contains a valid label. If it doesn't, Scikit-LLM picks a label at random, with probabilities weighted by how frequently each label appears in the training data. In short, Scikit-LLM handles the API plumbing and guarantees you always get usable labels back.
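
To make that fallback concrete, here is a rough sketch of the idea (an illustration only, not Scikit-LLM's actual internals):

import random
from collections import Counter

def resolve_label(response, training_labels):
    # keep the model's answer if it is one of the known labels
    if response in set(training_labels):
        return response
    # otherwise sample a label, weighted by training-data frequency
    counts = Counter(training_labels)
    labels, weights = zip(*counts.items())
    return random.choices(labels, weights=weights, k=1)[0]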

What if you don't have labeled data?

Here’s the interesting part — you don’t even need labeled data to train the model. You just need to provide a list of candidate labels:

# importing zeroshotgptclassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# get the demo classification dataset (features only, for prediction)
X, _ = get_classification_dataset()

# defining the model
clf = ZeroShotGPTClassifier()

# no training data, so we pass only the candidate labels for prediction
clf.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels
labels = clf.predict(X)

Isn’t that cool? You can train a classifier without explicitly labeled data, simply by specifying the potential labels.

As stated in their GitHub Repository —

In zero-shot classification, the effectiveness of the classifier depends on how the label itself is structured. It should be expressed in natural language, descriptive, and self-explanatory.

For example, in a semantic classification task, transforming a label from “<semantics>” to “the semantics of the provided text is <semantics>” could be beneficial.
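
Concretely, for a sentiment task the same idea might look like this (a sketch; the wording of the labels is my own illustration):

from skllm import ZeroShotGPTClassifier

# bare labels
labels = ["positive", "negative", "neutral"]

# more descriptive, self-explanatory labels
labels = [
    "the sentiment of the provided text is positive",
    "the sentiment of the provided text is negative",
    "the sentiment of the provided text is neutral",
]

clf = ZeroShotGPTClassifier()
clf.fit(None, labels)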

Multi-Label Zero-Shot Text Classification

Performing Multi-Label Zero-Shot Text Classification is easier than you might think:

# importing Multi-Label zeroshot module and classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# get the demo multi-label classification dataset bundled with Scikit-LLM
X, y = get_multilabel_classification_dataset()

# defining the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the model
clf.fit(X, y)

# making predictions
labels = clf.predict(X)

The only difference from the single-label zero-shot case is that when you create an instance of the MultiLabelZeroShotGPTClassifier class, you specify the maximum number of labels you want to assign to each sample (here: max_labels=3).

What if you don't have labeled data (multi-label case)?

In the example provided above, the MultiLabelZeroShotGPTClassifier is trained with labeled data (X and y). However, you can also train the classifier without labeled data by providing a list of candidate labels instead. In this case, y should be of type List[List[str]].

Here’s an example of training without labeled data:

# getting classification dataset for prediction only
X, _ = get_multilabel_classification_dataset()

# defining all the candidate labels to choose from
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

# creating the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the labels only
clf.fit(None, [candidate_labels])

# predicting the data
labels = clf.predict(X)
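
Each prediction is itself a list of up to max_labels labels. A quick way to inspect the output (assuming the snippet above has run):

# each element of 'labels' is a list of labels for the corresponding text
for label_set in labels[:3]:
    print(label_set)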

Text Vectorization

Text vectorization is the process of converting text into numbers so that machines can understand and analyze it more easily. Here, GPTVectorizer is a module from Scikit-LLM that converts a piece of text, no matter how long it is, into a fixed-size set of numbers called a vector.

# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer

# Creating an instance of the GPTVectorizer class and assigning it to the variable 'model'
model = GPTVectorizer()

# transforming the text in X into fixed-dimensional vectors
vectors = model.fit_transform(X)

Applying the fit_transform method of the GPTVectorizer instance to the input data X fits the model to the data and transforms the text into fixed-dimensional vectors. The resulting vectors are then assigned to the variable vectors.
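
To sanity-check the result, you can inspect the shape of the returned array (a minimal sketch, assuming fit_transform returns a 2-D numpy-compatible array):

import numpy as np

emb = np.asarray(vectors)
print(emb.shape)  # (number of documents, embedding dimension)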

Let’s demonstrate an example of combining the GPTVectorizer with an XGBoost classifier in a scikit-learn pipeline. This approach allows for efficient text preprocessing and classification:

# Importing the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# (X_train, y_train, X_test, y_test are assumed to come from your own train/test split)

# Creating an instance of LabelEncoder class
le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps
clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' using the trained pipeline
yh = clf.predict(X_test)
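
From here, the pipeline can be scored like any other scikit-learn estimator. For example (a sketch, assuming the train/test split above exists):

from sklearn.metrics import accuracy_score

# decoding the integer predictions back to the original string labels
predicted_labels = le.inverse_transform(yh)

print(f"Test accuracy: {accuracy_score(y_test_encoded, yh):.2f}")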

Text Summarization

GPT is really good at summarizing text. That’s why Scikit-LLM includes a module called GPTSummarizer. You can use it in two ways: on its own, or as a preprocessing step before doing something else (like dimensionality reduction, but for text instead of numbers):

# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function
X = get_summarization_dataset()

# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)

Please note that the max_words hyperparameter acts as a flexible limit on the number of words in a generated summary: it is only enforced through the prompt, not strictly. In other words, max_words sets a rough target for summary length, and the summarizer may occasionally produce slightly longer output depending on the context and content of the input text.
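
Because GPTSummarizer follows the scikit-learn transformer API, it can also be dropped into a pipeline as a preprocessing step, as mentioned above. A minimal sketch, reusing the vectorizer and classifier from earlier (this particular combination is my own illustration, not an official example):

from sklearn.pipeline import Pipeline
from skllm.preprocessing import GPTSummarizer, GPTVectorizer
from xgboost import XGBClassifier

steps = [
    ('summarize', GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)),
    ('vectorize', GPTVectorizer()),
    ('classify', XGBClassifier()),
]
pipe = Pipeline(steps)
# pipe.fit(X_train, y_train_encoded) would summarize, embed, then classify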

If you have any questions, feel free to ask me!
