A gentle introduction to the world of large language models (LLMs)

Pakhapoom Sarapat
SCB DataX
6 min read · Dec 22, 2023

Large language models (LLMs) have gained considerable attention since the release of the phenomenal chatbot ChatGPT by OpenAI in 2022. People around the world have been excited and surprised by its human-like responses, which had not been seen in earlier language models. The hype around LLMs has continued ever since, and everyone keeps talking about them, but what are they exactly?

What is an LLM?

An LLM is a machine learning model that takes text as input and returns text, code, numbers, tables, or diagrams as output, depending on the nature of the task the LLM is designed for. Machine learning models in general, however, take numbers or other symbols that can be manipulated with mathematical operations as input, so in the case of an LLM (or language models in general), an additional step is required to convert text into numbers. Components such as tokenizers, encoders, and embeddings work behind LLMs to transform text into a numerical form that can be processed by computers.
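To make this concrete, here is a minimal sketch of what a tokenizer does, using the Hugging Face transformers library and the GPT-2 tokenizer (both are my own choices for illustration; the article does not prescribe a specific library):

```python
# A minimal tokenization sketch: text in, integer token IDs out.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Large language models are fun!"
token_ids = tokenizer.encode(text)                    # text -> list of integers
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # integers -> token strings

print(tokens)     # token strings; 'Ġ' marks a leading space in GPT-2's vocabulary
print(token_ids)  # the numerical form the model actually consumes
```

These integer IDs are then mapped to embedding vectors inside the model, which is where the mathematical operations actually happen.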

Traditionally, an LLM is trained by masking the last word in a sentence and asking the model to predict the missing word; that is, the input is the sentence excluding the last word and the output is the last word. This approach aims to encode common patterns of natural language into a machine learning model so that it can recognize structures in the language, such as grammar, and gain some further understanding of the language. The resulting models are usually called pre-trained or foundation models because of their very large scale, both in the size of the training data and in the number of model parameters.
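As a toy illustration, such a training pair can be built by splitting a sentence into a context and a target word (a word-level sketch for clarity; real LLMs operate on tokens rather than whole words):

```python
# Constructing a next-word training pair from a sentence.
sentence = "large language models generate text"
words = sentence.split()

training_input = " ".join(words[:-1])  # everything except the last word
training_target = words[-1]            # the word the model must predict

print(training_input)   # "large language models generate"
print(training_target)  # "text"
```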

The training regime of pre-trained models is crucial because they can be used as a starting point to develop language models for any downstream task, which greatly saves time and computational resources. Many techniques for training foundation models have been proposed, such as not only masking the last word but randomly masking words anywhere in a sentence, or providing a certain word as input and letting the model predict what the surrounding sentence or context should look like.
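The random-masking variant can be sketched in a few lines (the 15% masking rate below is an assumption borrowed from BERT-style pre-training, not something this article specifies):

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

# Randomly replace roughly 15% of the words with a [MASK] placeholder.
words = "large language models learn patterns from text".split()
masked = [w if random.random() > 0.15 else "[MASK]" for w in words]

print(" ".join(masked))
# The model is then trained to recover the original word at each [MASK] position.
```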

What if LLMs are trained in another way?

The training process has been continuously developed and improved to achieve better benchmark results over the years. In 2020, a group of researchers proposed an alternative technique for training pre-trained models, called meta-learning (Brown et al., 2020). The idea is to feed a model examples of input data and the corresponding expected output for each specific task, such as arithmetic calculation, word correction, or machine translation. This process of teaching a model to learn each task is called in-context learning, as displayed in Figure 1. The approach allows models to learn from a wide variety of examples during the training process.

Figure 1: Meta-learning scheme, which is composed of several learning tasks, each handled via so-called in-context learning. Reprinted from Brown et al. (2020).

Traditionally, we may need to fine-tune foundation models, a process of adjusting a model to serve a downstream task more efficiently. However, fine-tuning does not always yield better performance, because the approach requires sufficiently large data to overcome the distribution learned from the massive training dataset, and such data may not always be available.

By training LLMs with meta-learning, we may skip the fine-tuning step, as meta-learning enables models to adapt to various tasks during the training process. Instead, it is recommended to use in-context learning with a few examples. In-context learning is called zero-shot learning if no examples are provided, one-shot learning if a single example is provided, and few-shot learning otherwise. Figure 2 shows examples of the different types of in-context learning.
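The three variants differ only in how many examples the prompt contains. Here is a sketch of what the prompts look like, loosely following the translation example in Brown et al. (2020):

```python
# Zero-, one-, and few-shot prompts for the same translation task.
task = "Translate English to French:"

zero_shot = f"{task}\ncheese =>"

one_shot = f"{task}\nsea otter => loutre de mer\ncheese =>"

few_shot = (
    f"{task}\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

print(few_shot)  # the model is expected to continue with "fromage"
```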

Figure 2: Examples of the zero-shot, one-shot, and few-shot methods used in in-context learning. Reprinted from Brown et al. (2020).

Does it work?

One of the results from models trained with meta-learning and evaluated with in-context learning is given in Figure 3. It shows accuracy against the number of examples for a task that removes extra characters from input text. For example, the model is expected to return the text “SCB DataX” if “S$C%B! D#a!t#aX” is given as input.

Figure 3: Performance on a task of removing special characters from text, comparing different in-context learning methods and model sizes. Reprinted from Brown et al. (2020).

The figure shows an increasing trend for every model (marked with different colors and line types), meaning that accuracy tends to be higher when more examples are provided. This improvement becomes more prominent as the model size (the number of model parameters) increases. Another observation is that one-shot learning shows much better accuracy than zero-shot learning: there is a drastic jump between zero examples and one example (displayed in the figure as 10 to the power of 0 on the horizontal axis).

Furthermore, bigger models (separated by color) perform more efficiently than smaller ones, and models given a detailed prompt, meaning a task description inserted at the beginning of the original input, achieve higher accuracy than those given only the examples and the original input (separated by solid and dashed lines). However, the difference diminishes as more examples are provided in the prompt.

To summarize the findings in Figure 3, there are a few ways to improve model performance with in-context learning. We can (i) use LLMs with a higher number of model parameters, which generally yields better performance, (ii) provide more examples to help LLMs recognize patterns and generate more reasonable responses, or (iii) design prompts that help LLMs understand the context and follow the instructions given in the prompt more efficiently.
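Options (ii) and (iii) can be combined in a single prompt. Below is a sketch of the two prompt styles compared in Figure 3, with and without a task description; the example pairs and exact wording here are my own illustration rather than the prompts actually used by Brown et al. (2020):

```python
# Few-shot prompts for the symbol-removal task, with and without a task description.
examples = [
    ("s.u!c/c!e.s s i/o/n", "succession"),
    ("S$C%B! D#a!t#aX", "SCB DataX"),
]

shots = "\n".join(f"{noisy} -> {clean}" for noisy, clean in examples)
query = "d*o&c#u^m$e%n!t ->"

plain_prompt = f"{shots}\n{query}"  # examples only

detailed_prompt = (
    "Remove the extra symbols from each word.\n"  # task description added up front
    f"{shots}\n{query}"
)

print(detailed_prompt)  # the model is expected to continue with "document"
```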

What’s next?

Adding more neural network layers on top of the original model architecture (option i) or more examples to the prompt (option ii) are straightforward ways to improve the models. A more interesting approach is to design prompts that lead to more desirable answers (option iii). This is called prompt engineering, a powerful technique because it can improve a model's answers without any model training or fine-tuning. Simply chaining instructions together with some related examples already does the job decently. In the next article, we will take a dive into the realm of prompting.

Acknowledgements

I would like to acknowledge Jussi Jousimo for his invaluable proofreading and editing assistance, and for his patience as well. I tortured him with my broken English and pushed him to rely on the context of the surrounding sentences so many times, yet he managed to clarify all the ambiguity and repair the grammar so that the content reads smoothly. I would also like to thank Nut Chukamphaeng for all those LLM papers he has been sending me; they are all on my reading list, waiting for the time to be read, which I hope will be soon. Lastly, I would like to express my gratitude to our beloved editorial board, which has been supporting and encouraging me to polish my writing skills.

Reference

· Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33.
