NLP Beginning

Amna Zafar
12 min read · Jun 14, 2024


There are two ways of setting up your working environment, using a Colab notebook or a Python virtual environment. Feel free to choose the one that resonates with you the most.

Using a Google Colab notebook

Using a Colab notebook is the simplest possible setup; boot up a notebook in your browser and get straight to coding!

If you’re not familiar with Colab, we recommend you start by following the introduction. Colab allows you to use some accelerating hardware, like GPUs or TPUs, and it is free for smaller workloads.

Search for Google Colab in your browser and open it.

Click on New notebook and start your work.

Now run the following command to install the Transformers library. We'll discuss what transformers are and what they do a bit later.

!pip install transformers

Using a Python virtual environment

If you prefer to use a Python virtual environment, the first step is to install Python on your system. Use this guide to get started. Once you have Python installed, you should be able to run Python commands in your terminal. Before proceeding to the next steps, you can run the following command to make sure Python is correctly installed: python --version. This should print out the Python version now available on your system.

Rather than installing packages globally, it is good practice to set up a dedicated environment for each project. In Python this is done with virtual environments, which are self-contained directory trees that each contain a Python installation with a particular Python version alongside all the packages the application needs. Creating such a virtual environment can be done with a number of different tools, but we’ll use the official Python package for that purpose, which is called venv.

First, create the directory you’d like your application to live in — for example, you might want to make a new directory called nlp-course at the root of your home directory:

mkdir ~/nlp-course
cd ~/nlp-course

From inside this directory, create a virtual environment using the Python venv module:

python -m venv .env

You should now have a directory called .env inside your otherwise empty nlp-course folder.
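
To start working inside the environment, activate it and then install the Transformers library there. The activation command below assumes a Linux or macOS shell; on Windows the script lives in .env\Scripts\ instead:

source .env/bin/activate
pip install transformers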

What is NLP?

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

The following is a list of common NLP tasks, with some examples of each:

  • Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not.
  • Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization).
  • Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words.
  • Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context.
  • Generating a new sentence from an input text: Translating a text into another language, summarizing a text.

Transformers, what can they do?

Transformer models are used to solve all kinds of NLP tasks, like the ones mentioned in the previous section. The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

Pipeline Function

The pipeline() function is the most high-level API of the Transformers library. It groups together all the steps needed to go from raw text to usable predictions, including all the necessary pre-processing and post-processing to make the model's output human-readable.

Input

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for this course my whole life.")

Output

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

Some of the currently available pipelines are:

  • feature-extraction (get the vector representation of a text)
  • fill-mask
  • ner (named entity recognition)
  • question-answering
  • sentiment-analysis
  • summarization
  • text-generation
  • translation
  • zero-shot-classification

Let’s have a look at a few of these!

Zero-shot classification

We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

Input

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Output

{'sequence': 'This is a course about the Transformers library',
'labels': ['education', 'business', 'politics'],
'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}

This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!

Text generation

Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.

Input

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

Output

[{'generated_text': 'In this course, we will teach you how to understand and use '
'data flow and data interchange when handling user data. We '
'will be working with one or more of the most commonly used '
'data flows — data flows of various types, as seen by the '
'HTTP'}]

You can control how many different sequences are generated with the argument num_return_sequences and the total length of the output text with the argument max_length.

Using any model from the Hub in a pipeline

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. Go to the Model Hub and click on the corresponding tag on the left to display only the supported models for that task. You should get to a page like this one.

Let’s try the distilgpt2 model! Here’s how to load it in the same pipeline as before:

Input

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Output

[{'generated_text': 'In this course, we will teach you how to manipulate the world and '
'move your mental and physical capabilities to your advantage.'},
{'generated_text': 'In this course, we will teach you how to become an expert and '
'practice realtime, and with a hands on experience on both real '
'time and real'}]

You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains checkpoints for multilingual models that support several languages.

Once you select a model by clicking on it, you’ll see that there is a widget enabling you to try it directly online. This way you can quickly test the model’s capabilities before downloading it.

How do Transformers work?

History of Transformer Models

Transformer models such as GPT, BERT, BART, and T5 have all been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.

An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.

Another example is masked language modeling, in which the model predicts a masked word in the sentence.
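
To make this concrete, the fill-mask pipeline listed earlier exposes exactly this objective. A minimal sketch, assuming the distilroberta-base checkpoint (whose mask token is <mask>); the exact predictions and scores will vary:

from transformers import pipeline

# Predict the most likely words for the masked position.
unmasker = pipeline("fill-mask", model="distilroberta-base")
unmasker("This course will teach you all about <mask> models.", top_k=2)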


Transformers are big models

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models’ sizes as well as the amount of data they are pretrained on. Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources.

Transfer Learning

The idea of transfer learning is to leverage the knowledge acquired by a model trained with lots of data on another task.

Example

Model A is trained specifically for task A. Now suppose you want to train a model B for a different task. One option would be to train the model from scratch, but this would take a lot of computation, time, and data. Instead, we can initialize model B with the same weights as model A, transferring the knowledge of model A to task B.

Training from scratch requires more data and computation to achieve comparable results

In this example, a BERT model is trained on the task of recognizing whether two sentences are similar or not.
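
In code, the "start from model A's weights" step is what from_pretrained does in the Transformers library. A minimal sketch, assuming the bert-base-uncased checkpoint and a two-label classification task:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Reuse the pretrained weights (model A); only the new classification
# head for the target task (model B) is initialized randomly.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)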

Fine-tuning, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait — why not simply train the model for your final use case from scratch? There are a couple of reasons:

  • The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
  • Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
  • For the same reason, the amount of time and resources needed to get good results are much lower.

Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.
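
As a rough sketch of what fine-tuning looks like with the Transformers Trainer API (the checkpoint name, the toy two-example dataset, and the hyperparameters here are illustrative assumptions, not a recommended recipe):

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A tiny toy dataset, just to show the mechanics of fine-tuning.
texts = ["I loved this movie!", "This was a terrible film."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuning-demo", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()

The point is that the starting weights come from the pretrained checkpoint, so even a small task-specific dataset can produce useful results.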

Encoder Models

The most popular example of an encoder-only architecture is BERT. Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
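
For instance, the ner pipeline from the list earlier is typically backed by an encoder model. A minimal sketch (the default checkpoint is downloaded on first use; aggregation_strategy="simple" groups sub-word tokens into whole entities):

from transformers import pipeline

# Named entity recognition: tag persons, organizations, locations, etc.
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")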

Decoder Models

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.

The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving text generation.

Sequence-to-Sequence Models

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.

The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
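
As a quick illustration, the summarization pipeline uses a sequence-to-sequence model under the hood. A minimal sketch (the default checkpoint is downloaded on first use, and the exact summary wording will vary):

from transformers import pipeline

# Summarization condenses a long text into a shorter one.
summarizer = pipeline("summarization")
summarizer(
    "Transformer models have become the standard approach for most natural "
    "language processing tasks. They are pretrained on large amounts of raw "
    "text and then fine-tuned on smaller, task-specific datasets, which makes "
    "them far cheaper to adapt than training a model from scratch."
)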

Why do LLMs work so well?

  1. Potential explanation: emergent abilities!
  2. An ability is emergent if it is present in larger models but not in smaller ones.
  3. Such abilities could not have been directly predicted by extrapolating from smaller models.
  4. Performance is near-random until a certain critical scale threshold, then improves sharply.

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance.

We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test, and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test, and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen, seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

Reinforcement Learning

Reinforcement Learning from Human Feedback (RLHF) is an approach in machine learning where human feedback is incorporated into the reinforcement learning (RL) process to improve the performance and alignment of an AI model with human preferences and values. This method is particularly useful for training complex models, such as large language models, to perform tasks in ways that are more aligned with human expectations and ethical considerations.

Here are the key components and steps involved in RLHF:

1. Reinforcement Learning (RL):

  • In traditional RL, an agent learns to make decisions by interacting with an environment and receiving rewards or penalties based on its actions. The goal is to maximize cumulative rewards over time.
  • The agent explores the environment, takes actions, and learns from the consequences of those actions to improve its performance.

2. Human Feedback:

  • Instead of relying solely on predefined reward functions, RLHF incorporates feedback from humans to guide the learning process.
  • Human feedback can be in the form of explicit ratings, rankings, demonstrations, or corrections provided by human evaluators.

3. Training Process:

  • Initial Training: The model is initially trained using standard RL techniques or supervised learning on a dataset.
  • Human Feedback Collection: Humans interact with the model, providing feedback on its performance. For example, humans might rate the quality of the model’s responses, rank multiple outputs, or correct mistakes.
  • Feedback Integration: The feedback is used to update the model’s policy or reward function. This can be done through various techniques, such as reward modeling, where a separate model predicts human feedback and guides the main model’s training (a minimal sketch of this idea appears after this list).

4. Iterative Improvement:

  • The process is iterative: the model is continuously refined based on new human feedback, leading to gradual improvements.
  • Each iteration involves collecting feedback, updating the model, and evaluating its performance to ensure it aligns better with human preferences.

5. Applications:

  • RLHF is widely used in training large language models, like those developed by OpenAI, to ensure the outputs are more useful, safe, and aligned with human values.
  • It is also used in robotics, where human feedback can help guide the learning process for physical tasks that are difficult to specify with explicit reward functions.
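
The reward modeling step mentioned above is often trained with a pairwise preference loss: the reward model should assign a higher score to the response humans preferred. A minimal sketch of that loss in PyTorch, using dummy scores in place of a real reward model's outputs:

import torch
import torch.nn.functional as F

# Dummy scalar rewards standing in for a real reward model's outputs.
reward_chosen = torch.tensor([1.3, 0.2])    # scores for human-preferred responses
reward_rejected = torch.tensor([0.4, 0.9])  # scores for rejected responses

# Pairwise preference loss: penalize cases where the rejected response
# scores higher than the chosen one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)

In practice, the scores come from a learned model that reads full prompt and response pairs, and the trained reward model then guides the RL fine-tuning of the main model.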
