Prompt Engineering (Part I): In-context Learning with GPT-3 and other Large Language Models

Amdework Asefa
10 min read · Sep 19, 2022


How are you doing, guys? I hope you are doing well.
Today in this blog we'll look at prompt engineering. Before we get to it, I think it is good to cover a few background concepts, outlined in the table of contents below:

Table of Contents:

  1. The difference between embedding, fine-tuning, and in-context learning
  2. The differences and similarities between few-shot, one-shot, and zero-shot learning
  3. What is prompt engineering?
  4. Characteristics of a good prompt and prompt design principles
  5. Conclusion
  6. References
A speaking robot (image source)

Embedding:

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs, such as sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned once and reused across models. To understand this better, let's look at word embeddings.

Word embedding is a representation of words as real-valued vectors, where words with similar meanings have similar representations. First, words are collected into a vocabulary (a dictionary, i.e. a list of unique words with their corresponding indexes), then converted to one-hot encoding vectors, and finally to vectors of continuous real values in a predefined vector space. This normally involves a mathematical mapping from a high-dimensional sparse vector space (a vector where most indices are zero, e.g. a one-hot encoding space in which each word takes its own dimension) to a lower-dimensional dense vector space (a vector where every element carries a real value, with no zeros padding it out). Hence, an embedding is a low-dimensional vector that captures much of the syntactic and semantic information of words and their relationships.

Word embedding processes text like this: first, each word in the vocabulary is encoded as a one-hot vector. For example, in the sentence “አንተ ግን ያው አንተ ነህ” (ante gin yaw ante neh, meaning “but you are you”), the vocabulary (the unique words) is (ነህ, አንተ, ግን, ያው). To create a vector that encodes the whole sentence, the one-hot vectors for each word (a sample is shown in Table 1.1) are concatenated.

Table 1.1: (a) word vocabulary with 4 unique words and (b) the corresponding one-hot encoding vectors

| Word | Index | One-hot vector |
| --- | --- | --- |
| ነህ | 1 | [1, 0, 0, 0] |
| አንተ | 2 | [0, 1, 0, 0] |
| ግን | 3 | [0, 0, 1, 0] |
| ያው | 4 | [0, 0, 0, 1] |

Table 1.2: A 4-dimensional dense word embedding of the sentence “አንተ ግን ያው አንተ ነህ”

However, a one-hot encoded vector is sparse and inefficient. Hence, we move to the second step: encoding each word with a unique number. Recalling the example above, assign 1 to “ነህ”, 2 to “አንተ”, 3 to “ግን”, and 4 to “ያው”. We can then encode the sentence “አንተ ግን ያው አንተ ነህ” as a dense vector such as [2, 3, 4, 2, 1]. This approach seems efficient because a dense vector is used instead of a sparse one. However, there is still a problem: the integer encoding is arbitrary, so it does not capture any relationship between words, and it can be challenging for a model to interpret.

Finally, an embedding is used: a dense vector of floating-point values whose entries are trainable parameters of the model. A sample embedding is depicted in Table 1.2. Word embeddings typically range from 8 dimensions for small datasets up to 1,024 dimensions for large datasets, though vectors of size 200 or 300 are common. A higher-dimensional embedding can capture more relationships between words but takes more data to learn. Various deep-learning-based word embedding models, such as Word2Vec, BERT, and RoBERTa, are commonly used in tasks like neural machine translation (NMT).
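
To make the three steps above concrete, here is a minimal NumPy sketch. The 4-dimensional embedding values are randomly generated placeholders (in a real model they are trained parameters), and the indices are 0-based here, whereas the text above counts from 1.

```python
import numpy as np

vocab = {"ነህ": 0, "አንተ": 1, "ግን": 2, "ያው": 3}          # word -> index
sentence = ["አንተ", "ግን", "ያው", "አንተ", "ነህ"]

# Integer encoding: one index per token.
ids = np.array([vocab[w] for w in sentence])            # -> [1, 2, 3, 1, 0]

# One-hot encoding: a sparse 5 x 4 matrix, one row per token.
one_hot = np.eye(len(vocab))[ids]

# Dense embedding: a |V| x d matrix; looking a word up is just a row slice.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))      # d = 4, values are illustrative
dense = embedding_matrix[ids]                            # 5 x 4, one row per token

print(ids)
print(one_hot.shape, dense.shape)
```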

Fine-tuning:

Fine-tuning means taking the weights of a trained neural network and using them as the initialization for a new model trained on data from the same domain (e.g., images).

It is used to:

  • speed up training
  • overcome a small dataset size

There are various strategies, such as training the whole initialized network or “freezing” some of the pre-trained weights (usually whole layers).

When we are working on a deep learning task, say one that involves training a convolutional neural network (ConvNet) on a dataset of images, our first instinct might be to train the network from scratch. However, in practice, deep neural networks like ConvNets have a huge number of parameters, often in the range of millions. Training a ConvNet on a small dataset (one with far fewer examples than parameters) greatly hurts its ability to generalize, often resulting in overfitting.

Therefore, in practice one more often fine-tunes an existing network that was trained on a large dataset like ImageNet (1.2M labeled images) by continuing to train it (i.e., running back-propagation) on the smaller dataset at hand. Provided that our dataset is not drastically different in context from the original dataset (e.g., ImageNet), the pre-trained model will already have learned features that are relevant to our own classification problem.
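
As a concrete illustration, here is a minimal fine-tuning sketch in PyTorch/torchvision (assuming torchvision ≥ 0.13; the number of classes, learning rate, and dataloader are placeholders, not a recipe from any particular project):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# "Freezing" strategy: keep all pre-trained weights fixed ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the classification head for our own (hypothetical) 5-class task.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)   # the new head is trainable

# Only the new head's parameters are optimized; everything else stays frozen.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop sketch: continue running back-propagation on the small dataset.
# for images, labels in small_dataloader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```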

In-context learning:

Informally, in-context learning describes a different paradigm of “learning” where the model is fed input normally as if it were a black box, and the input to the model describes a new task with some possible examples while the resulting output of the model reflects that new task as if the model had “learned”. While imprecise, the term is meant to capture common behavior that was noted in the GPT-3 paper by OpenAI as a phenomenon that GPT-3 displayed with surprising consistency.

In modern language models, tokens later in the context are easier to predict than tokens earlier in the context. As the context gets longer, loss goes down. In some sense this is just what a sequence model is designed to do (use earlier elements in the sequence to predict later ones), but as the ability to predict later tokens from earlier ones gets better, it can increasingly be used in interesting ways (such as specifying tasks, giving instructions, or asking the model to match a pattern) that suggest it can usefully be thought of as a phenomenon of its own. When thought of in this way, it is usually referred to as in-context learning.

Emergent in-context learning was noted in GPT-2 and gained significant attention with GPT-3.

Simply by adjusting a “prompt”, transformers can be adapted to do many useful things without re-training, such as translation, question answering, arithmetic, and many other tasks. Using “prompt engineering” to leverage in-context learning has become a popular topic of study and discussion.

At least two importantly different ways of conceptualizing and measuring in-context learning exist in the literature. The first, represented in Brown et al. (the GPT-3 paper), focuses on few-shot learning of specific tasks: the prompt contains a handful of input-output examples, and the model is expected to complete the pattern for a new input.
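
In this framing, the prompt itself plays the role of a tiny training set. Here is a minimal sketch using the (2022-era) openai Python package, assuming an API key is available in the OPENAI_API_KEY environment variable; the translation examples are purely illustrative.

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# The "training examples" live entirely in the prompt; no weights are updated.
prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: apple
French:"""

response = openai.Completion.create(
    model="text-davinci-002",   # a GPT-3-family model available at the time of writing
    prompt=prompt,
    max_tokens=5,
    temperature=0,
)
print(response["choices"][0]["text"].strip())   # expected completion: "pomme"
```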

All three concepts are summarized in Table 1 below.

Table 1: The difference between embedding, fine-tuning, and in-context learning

| | Embedding | Fine-tuning | In-context learning |
| --- | --- | --- | --- |
| What it does | Represents inputs (e.g., words) as dense, low-dimensional vectors that capture semantics | Continues training a pre-trained network on a smaller, task-specific dataset | Describes a new task (optionally with examples) directly in the prompt of a pre-trained model |
| Weights updated? | Learned as (reusable) model parameters | Yes, some or all layers are updated | No, the model is used as-is at inference time |
| Typical use | Input representation for downstream models | Adapting large models when labeled data is limited | Adapting LLMs such as GPT-3 to new tasks without re-training |

Few-shot, one-shot, and zero-shot learning

There are several approaches to machine learning when data is insufficient. N-shot learning is when a model is trained (or adapted) to classify an image or a text from no more than a handful of labelled examples. An N-shot learning task includes N labelled samples for each of K classes, so the entire support set S contains N*K samples in total. N-shot learning can be divided into three categories: zero-shot learning, one-shot learning, and few-shot learning. The choice among the three depends on the availability of training samples. N-shot learning is also used when the dataset is huge and labelling it would be costly, or when samples are available but engineering specific features for each task is hard.

Zero-shot learning

Zero-shot learning is the challenge of learning a model without any labelled examples of the target classes. Zero-shot learning involves little human intervention, and the models depend on previously learned concepts and additional existing data. This reduces the time and effort that data labelling takes. Instead of training examples, zero-shot learning provides a high-level description of new categories so that the machine can relate them to categories it has already learned about. Zero-shot learning methods can be used in computer vision, natural language processing, and machine perception.

Zero-shot learning essentially consists of two stages: training and inference. During training, an intermediate layer of semantic attributes is learned; in the inference stage, this knowledge is used to predict categories among a new set of classes. Here, a second layer models the relationship between the attributes and the classes and assigns a category using the attribute signature of each class. For example, a child asked to recognize a Yorkshire terrier for the first time may manage it by knowing it is a type of dog and combining that with a short description, say, from Wikipedia.
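
A toy sketch of this two-stage, attribute-based setup follows. The attribute names, the class signatures, and the trivial attribute "predictor" are all illustrative assumptions, not a real system (stage 1 would normally train a model that maps raw inputs to attributes).

```python
import numpy as np

# Attribute signatures (has_fur, barks, has_stripes) for classes never seen in training.
unseen_classes = {
    "yorkshire_terrier": np.array([1.0, 1.0, 0.0]),
    "zebra":             np.array([1.0, 0.0, 1.0]),
}

def predict_attributes(x: np.ndarray) -> np.ndarray:
    # Stage 1 (training) would learn this mapping from raw inputs to attributes;
    # here we simply pretend the input already is its predicted attribute vector.
    return x

def zero_shot_classify(x: np.ndarray) -> str:
    # Stage 2 (inference): match the predicted attributes to the closest class signature.
    attrs = predict_attributes(x)
    return min(unseen_classes, key=lambda c: np.linalg.norm(attrs - unseen_classes[c]))

print(zero_shot_classify(np.array([0.9, 0.8, 0.1])))   # -> yorkshire_terrier
```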

With a growing amount of research into models that use as little data and as few annotations as possible, zero-shot learning has found applications in critical areas like healthcare, for medical imaging and COVID-19 diagnosis from chest X-rays, as well as for unknown-object detection in autonomous vehicles. It is also widely used for text: the Hugging Face transformers library exposes zero-shot classification as a built-in pipeline.
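
For text, that pipeline makes zero-shot classification a few lines of code. The model choice and candidate labels below are just an example; the labels are supplied at inference time and the model was never explicitly trained on them.

```python
from transformers import pipeline

# A natural-language-inference model repurposed for zero-shot classification.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The team scored in the last minute to win the championship.",
    candidate_labels=["sports", "politics", "technology"],
)
print(result["labels"][0], result["scores"][0])   # most likely label and its score
```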

One-shot learning

One-shot learning performs classification using only a single example of each class, relying on previously learned representations. Facial recognition technology, including facial verification and identification, usually uses one-shot learning. Facial recognition systems learn a face embedding, which is a rich, low-dimensional feature representation. One-shot learning has traditionally used the Siamese network approach trained with a contrastive loss; later, the triplet loss was shown to work better and was adopted by the FaceNet system. Contrastive and triplet loss functions are now used to learn high-quality face embeddings, which have become the foundation of modern facial recognition.
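
Once faces are mapped to embeddings, one-shot verification reduces to a distance check against a single stored example. A minimal sketch, assuming a hypothetical embed(image) network (e.g., FaceNet-style) whose outputs are mocked here with random vectors:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(query_emb: np.ndarray, reference_emb: np.ndarray, threshold: float = 0.4) -> bool:
    """Return True if the two embeddings likely belong to the same person."""
    return cosine_distance(query_emb, reference_emb) < threshold

# Usage, with made-up 128-d vectors standing in for the outputs of embed(image):
rng = np.random.default_rng(0)
reference = rng.normal(size=128)                        # the single enrolled example
query = reference + rng.normal(scale=0.05, size=128)    # a slightly perturbed "new photo"
print(verify(query, reference))                         # True: the embeddings are close
```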

Few-shot learning

Few-shot learning, also known as low-shot learning, uses a small set of examples from new data to learn a new task. Formally, few-shot learning deals with a machine learning problem specified by experience E, task T, and performance P, where E contains only a limited number of examples with supervised information for the target task T. Few-shot learning is prominently used by OpenAI, as GPT-3 is a few-shot learner.

A 2019 study titled ‘Meta-Transfer Learning for Few-Shot Learning’ addressed the challenges that few-shot settings face. Since then, few-shot learning has often been framed as a meta-learning problem.

There are two ways to approach few-shot learning:

  • Data-level approach: if there is insufficient data to create a reliable model, one can add more data, for example through augmentation or by drawing on a large base dataset for additional features, in order to avoid overfitting and underfitting.
  • Parameter-level approach: limit the parameter space and use regularization and suitable loss functions to resolve the overfitting problem that few training samples cause; this helps the model generalize, and it can also improve performance by guiding the model through an otherwise very large parameter space. A rough sketch of both approaches is given below.
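
Here is a rough, self-contained sketch of both ideas with scikit-learn. The synthetic data, the noise-augmentation scheme, and the regularization strength are placeholders, not a benchmark.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))                  # only 10 labelled samples, 16 features
y = (X[:, 0] > 0).astype(int)

# Data-level approach: enlarge the training set, here with simple noise augmentation.
X_aug = np.vstack([X, X + rng.normal(scale=0.1, size=X.shape)])
y_aug = np.concatenate([y, y])

# Parameter-level approach: restrict the parameter space with strong L2 regularization.
clf = LogisticRegression(C=0.1)                # smaller C means stronger regularization
clf.fit(X_aug, y_aug)
print(clf.score(X, y))                         # accuracy on the original 10 samples
```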

The three N-shot learning methods are summarized in Table 2.

Table 2: The differences and similarities between zero-, one-, and few-shot learning

| | Zero-shot | One-shot | Few-shot |
| --- | --- | --- | --- |
| Labelled examples per new class | 0 | 1 | A few |
| How the new class is specified | A high-level description or attribute signature | A single reference example | A small support set of examples |
| Typical applications | Zero-shot text classification, medical imaging, unknown-object detection | Facial verification and identification | GPT-3-style prompting, meta-learning |

What they share: all three adapt a model trained on other data to new classes or tasks using little or no labelled data for those classes.

Prompt Engineering:

Prompt engineering is a natural language processing (NLP) concept that involves discovering inputs that yield desirable or useful results. Prompting is the equivalent of telling the genie in the magic lamp what to do; in this case, the magic lamp is DALL-E, ready to generate any image you wish for.

Prompting was not a feature designed by AI experts; it emerged on its own. In short, as these huge machine learning models were developed, prompting became the way to get the machine to act on your inputs.

Prompt engineering is a process used in AI where one or several tasks are converted to a prompt-based dataset that a language model is then trained to learn.
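
One way to read that definition: a conventional (input, label) dataset is rewritten as text that a language model can complete. A minimal sketch of such a conversion, using a made-up sentiment task and template:

```python
# The labelled examples and the template below are illustrative placeholders.
examples = [
    ("The battery dies within an hour.", "negative"),
    ("Absolutely love the camera quality!", "positive"),
]

TEMPLATE = "Review: {text}\nSentiment: {label}"

# Each (input, label) pair becomes a textual prompt the language model can consume.
prompt_dataset = [TEMPLATE.format(text=t, label=l) for t, l in examples]

# Joined with an unlabelled query, the same templates also form a few-shot prompt.
query = TEMPLATE.format(text="Shipping was slow but the product is great.", label="")
print("\n\n".join(prompt_dataset + [query]))
```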

The motivation behind prompt engineering can be difficult to understand at face value, so let’s describe the idea with an example. Imagine that you are establishing an online food delivery platform and you possess thousands of images of different vegetables to include on the site. The only problem is that none of the image metadata describes which vegetables are in which photos.

At this point, you could tediously sort through the images, placing potato photos in the potato folder, broccoli photos in the broccoli folder, and so forth. You could also run all the images through a classifier to sort them more easily, but, as you would discover, training the classifier model still requires labeled data. This is where prompting helps: a sufficiently capable pre-trained model can be instructed, in natural language, to describe or label the images without any task-specific training.

Characteristics of a good prompt

In short, a good prompt needs only a small amount of test/example data, produces good prediction accuracy, and yields its output quickly.

What are prompt design principles?

For large language models, commonly cited prompt design principles include: state the task clearly and specifically; provide the relevant context (a role, background information, or constraints); show the desired output format, ideally with one or a few examples (one-shot or few-shot); keep the prompt concise and unambiguous; and iterate, testing and refining the prompt against real inputs.
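
Putting those principles together, a prompt for a (hypothetical) support-ticket classifier might look like this; every detail below is invented purely for illustration.

```python
# An illustrative prompt that applies the principles above.
prompt = (
    "You are a support assistant for an online food-delivery platform.\n"       # context / role
    "Classify the customer message into exactly one of: delivery_issue, "
    "payment_issue, product_question. Answer with the label only.\n\n"          # clear task, constrained output
    "Message: \"The app charged my card twice for one order.\"\n"
    "Label: payment_issue\n\n"                                                  # a one-shot example
    "Message: \"My order arrived two hours late and the food was cold.\"\n"
    "Label:"                                                                    # where the model answers
)
print(prompt)
```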

Conclusion

In this blog we have looked at embedding, fine-tuning, in-context learning, and the three types of ML model learning methods (zero-, one-, and few-shot learning). In the next post we'll see a demonstration of LLM APIs.
See you there!
Have a nice time.

References

The PROMPT design principles for predictable multi-core architectures | Proceedings of the 12th International Workshop on Software and Compilers for Embedded Systems (acm.org)

Extrapolating to Unnatural Language Processing with GPT-3’s In-context Learning: The Good, the Bad, and the Mysterious | SAIL Blog (stanford.edu)
