Scikit-LLM: NLP with ChatGPT in Scikit-Learn

May 14, 2023

Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

Introduction

Classification and labelling are common tasks in natural language processing (NLP). In traditional machine learning workflows these tasks involve collecting labelled data, training a model, deploying it in the cloud, and making inferences. However, this process is time-consuming, requires a separate model for each task, and does not always yield optimal results.

With recent advancements in the area of large language models, such as ChatGPT, we now have a new way to approach NLP tasks. Rather than training and deploying separate models for each task, we can use a single model to perform a wide range of NLP tasks simply by providing it with a prompt.

In this article we will explore how to build models for multiclass and multi-label text classification using ChatGPT as a backbone. To achieve this, we will use the scikit-LLM library, which provides a scikit-learn compatible wrapper around the OpenAI REST API, so a model can be built in the same way as any other scikit-learn model.

Preparing the environment

As the first step, we need to install the scikit-LLM Python package.

pip install scikit-llm

Next, we need to prepare an OpenAI API key. To create one, follow these steps:

Go to the OpenAI platform and sign in with your account.
Click “Create New Secret Key” to generate a new key. Make sure to store the key somewhere safe: once the window showing it closes, you will not be able to view it again.
Additionally, you will need an organization ID, which is listed in the organization settings of your OpenAI account.

Now we can configure scikit-LLM to use the generated key.

from skllm.config import SKLLMConfig
SKLLMConfig.set_openai_key("<YOUR_KEY>")
SKLLMConfig.set_openai_org("<YOUR_ORGANISATION>")
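
Hardcoding the key in a script makes it easy to leak, so it may be preferable to read the credentials from environment variables. The helper below is a minimal sketch of that idea, not part of scikit-LLM: the function name load_openai_credentials and the variable names OPENAI_API_KEY and OPENAI_ORG_ID are our own assumptions.

```python
import os


def load_openai_credentials(env=None):
    """Return (api_key, org_id) read from environment variables.

    The variable names are an assumption made for this sketch;
    use whatever names your deployment defines.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    org = env.get("OPENAI_ORG_ID")
    # Fail fast with a clear message instead of sending an empty key to the API.
    missing = [
        name
        for name, value in (("OPENAI_API_KEY", key), ("OPENAI_ORG_ID", org))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return key, org


# Intended usage together with the configuration calls shown above:
#   key, org = load_openai_credentials()
#   SKLLMConfig.set_openai_key(key)
#   SKLLMConfig.set_openai_org(org)
```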

Use Case #1: Multiclass Reviews Classification

We will have a look at a very common task: text sentiment prediction. The dataset consists of movie reviews, and the possible sentiments are positive, neutral, or negative. A sample of the dataset is shown below:

+-----------------------------------------------------------------------------+----------+
| Review                                                                      |  Label   |
+-----------------------------------------------------------------------------+----------+
| I was absolutely blown away by the performances in 'Summer's End'.          |          |
| The acting was top-notch, and the plot had me gripped from start to finish. | Positive |
| A truly captivating cinematic experience that I would highly recommend.     |          |
+-----------------------------------------------------------------------------+----------+
| I was thoroughly disappointed with 'Silver Shadows'.                        |          |
| The plot was confusing and the performances were lackluster.                | Negative |
| I wouldn't recommend wasting your time on this one.                         |          |
+-----------------------------------------------------------------------------+----------+
| 'The Last Frontier' was simply okay.                                        |          |
| The plot was decent and the performances were acceptable.                   | Neutral  |
| However, it lacked a certain spark to make it truly memorable.              |          |
+-----------------------------------------------------------------------------+----------+

We first initialize ZeroShotGPTClassifier, which takes the model name as a parameter. In our example we will use the gpt-3.5-turbo model (the default ChatGPT model). A list of the available models can be found in the OpenAI documentation. Afterwards, we train the classifier using the fit() method and predict the labels by calling the predict() method. Scikit-LLM will automatically query the OpenAI API and transform the response into a regular list of labels. Additionally, Scikit-LLM will ensure that the obtained response contains a valid label. If this is not the case, a label will be assigned randomly (with label probabilities being proportional to label occurrences in the training set).

Note: as we are using zero-shot text classification, where the model does not see any prior training examples, it is crucial that the labels are expressed in natural language and are descriptive and self-explanatory.
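
The random fallback described above can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the library's actual implementation: the function name validate_label and the clean-up of the raw response (trimming whitespace and quotes, ignoring case) are assumptions.

```python
import random
from collections import Counter


def validate_label(response, training_labels, rng=None):
    """Return the model's answer if it is a known label; otherwise sample a
    label with probability proportional to its frequency in training_labels."""
    rng = rng or random.Random()
    counts = Counter(training_labels)

    # Assumed clean-up step: trim whitespace and quotes, compare case-insensitively.
    cleaned = response.strip().strip("'\"").lower()
    for label in counts:
        if label.lower() == cleaned:
            return label

    # Invalid answer: fall back to frequency-weighted random sampling.
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


train = ["positive", "positive", "negative", "neutral"]
print(validate_label(" Positive ", train))   # a valid answer is kept
print(validate_label("no idea", train))      # an invalid answer is replaced
```

With this training set, an invalid response would be replaced by "positive" about half of the time, since that label makes up two of the four training labels.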

from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

# demo sentiment analysis dataset
X, y = get_classification_dataset() 

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")
clf.fit(X, y)
labels = clf.predict(X)

In the example above we passed a labelled training dataset to the classifier. This is done solely to keep the API scikit-learn compatible. In fact, X is not used during training at all. Moreover, for y it is sufficient to provide the candidate labels in an arbitrary order. Therefore, even if no labelled training data is available, the model can still be built (as shown below).

from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

X, _ = get_classification_dataset()

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")
# no labelled data needed: X is ignored, y only lists the candidate labels
clf.fit(None, ['positive', 'negative', 'neutral'])
labels = clf.predict(X)
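
To make it clearer why fit() can work without any training texts, here is a toy stand-in that follows the same fit/predict contract. It is purely illustrative: the class KeywordZeroShotStub and its keyword lookup are hypothetical and replace the actual call to the language model.

```python
class KeywordZeroShotStub:
    """Toy stand-in for a zero-shot classifier (hypothetical, for illustration).

    A keyword lookup plays the role of the language model.
    """

    def __init__(self, keywords):
        # keywords: mapping from label to a list of trigger words
        self.keywords = keywords

    def fit(self, X, y):
        # X is deliberately ignored; only the distinct labels in y are kept.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        return [self._predict_one(text) for text in X]

    def _predict_one(self, text):
        text = text.lower()
        for label in self.classes_:
            if any(word in text for word in self.keywords.get(label, ())):
                return label
        return self.classes_[0]  # arbitrary fallback for this toy example


keywords = {
    "positive": ["great", "loved"],
    "negative": ["disappointed", "boring"],
    "neutral": ["okay"],
}
clf = KeywordZeroShotStub(keywords).fit(None, ["positive", "negative", "neutral"])
print(clf.predict(["I loved it", "It was simply okay", "I was disappointed"]))
```

Fitting the stub on a full labelled dataset or on the bare list of candidate labels leaves it in the same state, which is the property the zero-shot classifier relies on.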

Use Case #2: Multi-Label Reviews Classification

Another common NLP task is multi-label classification, meaning each sample might be assigned to one or several distinct classes.

+----------------------------------------------------------------+-------------------------+
| Review                                                         | Labels                  |
+----------------------------------------------------------------+-------------------------+
| The food was delicious and the service was excellent.          | Food, Service           |
| The hotel room was clean and comfortable.                      | Accommodation           |
| I loved the friendly staff and the beautiful decor.            | Service, Ambiance       |
| The movie was entertaining but the ending was disappointing.   | Entertainment           |
| The product arrived on time and was of great quality.          | Delivery, Quality       |
| The concert was electrifying and the band was energetic.       | Entertainment, Music    |
| The customer support was helpful and quick.                    | Service, Support        |
| The book had an engaging plot and well-developed characters.   | Literature, Storytelling|
| The hiking trail offered breathtaking views.                   | Outdoor, Adventure      |
| The museum had a wide collection of art and artifacts.         | Culture, Art            |
+----------------------------------------------------------------+-------------------------+

For this task we can use MultiLabelZeroShotGPTClassifier. The structure of the code remains the same; the only difference is that each label in y is a list.

from skllm.models.gpt.classification.zero_shot import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

# demo dataset for multi-label classification
X, y = get_multilabel_classification_dataset()

clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
clf.fit(X, y)
labels = clf.predict(X)
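
The max_labels argument used above limits how many labels a single sample can receive. The snippet below is a rough sketch of such a cap, assuming that max_labels is an upper bound per sample and that unknown labels are simply dropped; the helper cap_labels is hypothetical, and the library's real post-processing may differ.

```python
def cap_labels(predicted, candidate_labels, max_labels=3):
    """Keep valid, de-duplicated labels in their original order,
    returning at most max_labels of them."""
    allowed = set(candidate_labels)
    kept = []
    for label in predicted:
        if label in allowed and label not in kept:
            kept.append(label)
    return kept[:max_labels]


candidates = ["Food", "Service", "Ambiance", "Quality"]
raw = ["Service", "Food", "Service", "Price", "Quality", "Ambiance"]
print(cap_labels(raw, candidates))  # ['Service', 'Food', 'Quality']
```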

Similarly to the ZeroShotGPTClassifier, it is sufficient to provide only the candidate labels. However, this time the classifier expects y to be of type List[List[str]]. Since the actual structure or ordering of the labels is irrelevant, we can simply wrap our list of candidate labels in an additional outer list.

from skllm.models.gpt.classification.zero_shot import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset

X, _ = get_multilabel_classification_dataset()
candidate_labels = [
    "Quality", 
    "Price", 
    "Delivery", 
    "Service", 
    "Product Variety", 
    "Customer Support", 
    "Packaging", 
    "User Experience", 
    "Return Policy", 
    "Product Information"
]
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)
clf.fit(None, [candidate_labels])
labels = clf.predict(X)
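
Wrapping the candidate labels in an outer list works because only the set of distinct labels matters. The short sketch below illustrates this with a hypothetical helper, unique_labels, that flattens y; it is meant to show the idea rather than the library's internals.

```python
def unique_labels(y):
    """Flatten a List[List[str]] target into a sorted list of distinct labels."""
    return sorted({label for sample_labels in y for label in sample_labels})


# Per-sample labels, as they would appear in a labelled dataset.
per_sample = [["Food", "Service"], ["Service", "Ambiance"], ["Food"]]
# The same candidate labels wrapped in a single outer list.
wrapped = [["Ambiance", "Food", "Service"]]

print(unique_labels(per_sample))  # ['Ambiance', 'Food', 'Service']
print(unique_labels(wrapped))     # ['Ambiance', 'Food', 'Service']
```

Both inputs yield the same candidate set, so either form of y describes the same classification problem.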

Conclusion

Scikit-LLM is an easy and efficient way to build ChatGPT-based text classification models using conventional scikit-learn compatible estimators without having to manually interact with OpenAI APIs.