LLMs and Introduction to Generative A.I.

Mehmet Zahit ANGİ
Huawei Developers
Nov 14, 2023

Large Language Models and Their Sizes

Introduction

In recent years, one of the most prominent areas of artificial intelligence has been large language models. Although we frequently encounter these models in our daily lives, gaining knowledge about their background and methods requires time and effort. This article serves as an introduction to large language models, and over the course of the series we will delve deeper into their underlying principles.

In the first part of our series, we introduce large language models (LLMs). We will discuss the Transformer architecture and delve into topics that are important for LLMs, such as “Prompting and Prompt Engineering” and “Configuration Parameters”. In the following articles, we will explore “Pre-Training” methods and discuss the “Fine-Tuning” techniques used after the “Pre-Training” phase.

Next, we will explore the “Transformer” architecture and “Attention” mechanisms, along with some of the key papers behind them. Finally, we will review the article titled “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” which explains how these structures are utilized not only in the language domain but also in the field of computer vision.

The Rise of Large Language Models and Text Generation Before LLMs

Large Language Models (LLMs) have revolutionized the field of artificial intelligence, garnering significant attention in recent years as a type of generative artificial intelligence technology. These models, trained on vast amounts of data, have become increasingly popular and widely used in various fields due to their ability to generate human-like text and perform complex tasks in the language/text domain with high accuracy.

Before Large Language Models (LLMs), Recurrent Neural Networks (RNNs) were used for text generation. RNNs are artificial neural network architectures designed to understand language and predict the next word in a sequence. However, RNNs have limited capacity for long sentences and large documents. To extract the meaning of a sentence or text, it is necessary to know the meanings of the preceding words, and words with multiple meanings, or whose meaning depends on the context of the entire text, pose significant challenges for RNNs.

This is where “Attention” mechanisms come into play. Attention mechanisms focus on previous words that influence a particular word. This helps overcome the limitations faced by RNNs. Therefore, Attention mechanisms play a crucial role in large language models, allowing for a more effective understanding of context and meaning in language.

Transformer Architecture

As mentioned earlier, the Transformer architecture has revolutionized natural language tasks, pushing language models to a new level. This architecture typically consists of two main components: an encoder and a decoder. One of the strongest parts of these models is the “self-attention” mechanisms, which enable the understanding of relationships and contexts among words within a sentence.

Transformer Architecture

In these architectures, before feeding text to the model, it is necessary to tokenize the words and convert them into numerical representations. This allows the model to work not with the words themselves but with their numerical representations. Subsequently, the Embedding Layer transforms these numerical representations into high-dimensional vectors, encoding the meaning of each token (word). These vectors give each word a unique position in the Embedding Space, making the mathematical understanding of language easier.

Word Embedding
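To make this concrete, here is a minimal sketch of tokenization and an embedding layer in PyTorch; the whitespace tokenizer, tiny vocabulary, and embedding size are illustrative assumptions (real LLMs use subword tokenizers such as BPE and much larger vocabularies).

```python
# A minimal sketch of tokenization + an embedding layer in PyTorch.
# The whitespace tokenizer and tiny vocabulary are toy assumptions.
import torch
import torch.nn as nn

sentence = "the cat sat on the mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}

# Convert words to token IDs: shape (batch=1, seq_len)
token_ids = torch.tensor([[vocab[word] for word in sentence.split()]])

# The embedding layer maps each token ID to a high-dimensional vector
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)
token_vectors = embedding(token_ids)
print(token_vectors.shape)  # torch.Size([1, 6, 512])
```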

Additionally, to handle situations where a word can have different meanings in different positions, vectors representing the position of each word in the text are created through mathematical operations. These vectors help capture the positional relationships of words within the text. Combined with the word vectors created in the embedding layer mentioned earlier, they form an information table determining the meaning and position of each word. This table, which contains positional encoding information along with word embedding information, is then passed to the self-attention layer.

Positional Embedding
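As a sketch of one common approach, the sinusoidal positional encoding from the original Transformer paper can be computed and added to the token embeddings as shown below; the sequence length and model dimension are illustrative assumptions.

```python
# A minimal sketch of the sinusoidal positional encoding used in the original
# Transformer paper; sequence length and model dimension are toy assumptions.
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

token_vectors = torch.randn(1, 6, 512)                 # embeddings from the previous step
inputs = token_vectors + positional_encoding(6, 512)   # word meaning + position information
```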

“Attention Heads” are groups containing numerous “self-attention” mechanisms. They can learn different aspects of language in parallel and emphasize their relationships. For instance, one attention head may focus more on the syntactic structure for the meaning of a sentence, while another might pay more attention to word choice or the overall meaning of the sentence. In this way, different attention heads, by highlighting different features, contribute to a better understanding of the overall meaning of the sentence. In short, they enable a more comprehensive analysis of the text. The outputs from these processes are then sent to a “feed-forward neural network,” and probability scores are calculated for each token.

Multi-Headed Self-Attention and Feed Forward Network
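The sketch below illustrates this stage using PyTorch's built-in multi-head attention followed by a feed-forward network; the layer sizes and number of heads are illustrative assumptions.

```python
# A minimal sketch of multi-headed self-attention followed by a feed-forward
# network, using PyTorch built-ins; layer sizes and head count are assumptions.
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
feed_forward = nn.Sequential(
    nn.Linear(d_model, 2048),
    nn.ReLU(),
    nn.Linear(2048, d_model),
)

x = torch.randn(1, 6, d_model)               # embedded + position-encoded tokens
attn_out, attn_weights = attention(x, x, x)  # each head attends to every token
ff_out = feed_forward(attn_out)              # per-token transformation
print(ff_out.shape)  # torch.Size([1, 6, 512])
```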

After the “Feed-Forward” network, probability values are generated for each word in the vocabulary through “Softmax” normalization. The most likely predicted tokens are then determined from these probability vectors. Various methods can be used for the selection process in this step.

Softmax Normalization
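For illustration, a minimal sketch of this final step, assuming a toy vocabulary size:

```python
# A minimal sketch: softmax turns the final-layer scores (logits) into a
# probability distribution over the vocabulary; the vocabulary size is a toy assumption.
import torch

logits = torch.randn(1, 32000)               # one score per token in the vocabulary
probs = torch.softmax(logits, dim=-1)        # probabilities that sum to 1
next_token_id = torch.argmax(probs, dim=-1)  # greedy choice: the most likely token
```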

Text Generation with the Transformer


In the Transformer architecture, the encoder processes the input, performs embedding to convert it into numerical representations, and then passes these to the “multi-headed attention” layer. This stage captures the structure and meaning of the input. The decoder, utilizing the semantic understanding produced by the encoder, then generates new tokens in a loop, starting from a start-of-sequence token. This generation process continues until an end-of-sequence token is predicted.

While Encoder-Decoder models are commonly used in sequence-to-sequence tasks like translation, some models use only the Encoder (e.g., BERT), or only the Decoder (e.g., GPT, LLaMA).
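As a minimal sketch of this generation loop in practice, a decoder-only model can be used for text generation with the Hugging Face transformers library; the choice of GPT-2 and the prompt are illustrative assumptions.

```python
# A minimal sketch of autoregressive generation with a decoder-only model,
# using the Hugging Face transformers library; GPT-2 and the prompt are
# illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)  # loop until limit or end-of-sequence token
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```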

Prompting and Prompt Engineering

Prompting Example

Prompting is the process of providing an initial text to language models to generate the desired output.

Prompt Engineering involves designing and refining the prompt (the initial text) to influence how the model will continue or respond to it. Particularly when we want our model to generate text on a specific topic or in a specific genre, we can shape the output of our model using prompting and prompt engineering.

By employing a method called “in-context learning,” models can better understand certain tasks by adding additional data or examples into the prompt. This method is highly effective in larger language models that perform well in zero-shot inference. However, smaller language models may have difficulty providing the desired results only through this method (without clear examples).

Zero-Shot Inference is the ability of a model to make sensible predictions on a new and previously unseen topic (without specific training on that topic), utilizing the general language and knowledge structure it has learned. For instance, if the model’s training dataset does not have examples related to a specific topic or if the model has not received specific training on that subject, it can still produce logical outputs when provided with text or inputs related to that topic. This is made possible by the model’s general language knowledge and overall learning capabilities acquired during the training phase.

On the other hand, One-Shot Inference is a method where a single example is added to the prompt to guide the model’s understanding. This method is particularly useful in improving performance for smaller models.

The point we need to pay attention to is as follows:

Imagine we have a large language model and a smaller language model. When we focus on these two models:

Prompt: When the question “Who are the gods in Greek mythology?” is asked,

Possible Zero-shot Inference Response: “Gods like Zeus, Hera, Poseidon, Athena, Apollo, Artemis, Hermes, and Ares are important figures in Greek mythology.”

Possible One-Shot Inference Response: “Gods such as Zeus, Hera, Poseidon, Athena, Apollo, Artemis, Hermes, and Ares play significant roles in Greek mythology.”

Similar or identical responses can be obtained. The crucial point to consider in this case is how the models arrived at this conclusion.

Large language models can generate responses about previously unencountered topics based on the general features of language and learned information. On the other hand, smaller language models require a specific example prompt to generate a response about a topic they haven’t encountered before. In short, one operates based on a specific example, while the other relies on general language abilities.

Zero-Shot Inference on Larger Models
Zero-Shot Inference on Smaller Models
A One-Shot Inference Example

Few-shot inference takes this approach further by incorporating multiple examples, allowing the model to learn from various instances. The model, leveraging these examples, gains a clearer understanding and produces more accurate responses. This enhances the model’s ability to learn and generalize from a limited number of examples.

Let’s focus on an example related to this topic:

Prompt: “Limited examples about the characteristics of dogs.”
Example Response: “Known as loyal, playful, and protective companions, dogs typically form close relationships with humans and come in various breeds.”

In this case, imagine that the prompt provides a limited number of examples describing a few fundamental characteristics of dogs. The model, learning from these limited examples and generalizing, makes an inference about the general features of dogs. This demonstrates how the few-shot inference method enables the model to develop a better understanding of a subject using only a handful of examples.

In-Context Learning Example
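To make the difference concrete, here is a minimal sketch of how zero-, one-, and few-shot prompts can be assembled as plain strings; the sentiment-classification task and the example reviews are purely illustrative assumptions.

```python
# A minimal sketch of zero-, one-, and few-shot prompts built as plain strings;
# the sentiment task and example reviews are purely illustrative.
task = "Classify the sentiment of the review as positive or negative."

zero_shot = f"{task}\nReview: I loved this movie.\nSentiment:"

one_shot = (
    f"{task}\n"
    "Review: The plot was dull and predictable.\nSentiment: negative\n"
    "Review: I loved this movie.\nSentiment:"
)

few_shot = (
    f"{task}\n"
    "Review: The plot was dull and predictable.\nSentiment: negative\n"
    "Review: A stunning, heartfelt performance.\nSentiment: positive\n"
    "Review: The soundtrack was forgettable.\nSentiment: negative\n"
    "Review: I loved this movie.\nSentiment:"
)
```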

The fine-tuning method aims to enhance a generally trained language model’s performance for specific tasks by subjecting it to a new training process with additional data. We will delve into this topic in more detail in subsequent writings.

This process typically involves the following steps (a short sketch follows the list):

1. Customization Data: The first step involves preparing a customized dataset for a specific task. For example, specific datasets can be collected for tasks such as translation, text generation, or sentiment analysis.

2. Fine-tuning Process: In this step, a generally pretrained language model (such as GPT-3) is retrained on the customization dataset so that it performs a more specific task better.

3. Improving Performance: After the fine-tuning process, the model is more specifically trained for the target task and therefore performs better on it. For example, when fine-tuning is done for a translation task, the resulting model produces better translations.
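The sketch below illustrates these steps with the Hugging Face Trainer API; the BERT checkpoint, the IMDB sentiment dataset, and the training settings are illustrative assumptions, not a prescribed recipe.

```python
# A minimal fine-tuning sketch with the Hugging Face Trainer; the BERT checkpoint,
# the IMDB sentiment dataset, and the training settings are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# 1. Customization data: a task-specific dataset (here, sentiment analysis)
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# 2. Fine-tuning process: retrain the pretrained model on the task data
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(1000)),
)
trainer.train()

# 3. Improving performance: the resulting model is specialized for this task
```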

Generative Configuration — Inference Parameters


In language models, some adjustments can be made to influence the style and structure of the generated responses. Certain configuration settings allow us to tailor the output to meet specific requirements. These settings help control how creative, diverse, or consistent the texts generated by language models will be.

Some of these configuration parameters are listed below; a short usage sketch follows the list:

  • Max New Tokens Parameter: This parameter imposes a limit on the number of new tokens generated by the model. It is used to control text generation and prevent unwanted, overly long outputs.
  • Greedy Decoding: Greedy decoding is a simple method that selects the word with the highest probability in the next word prediction. While it is fast and straightforward, it can result in repeated words or sequences. It does not guarantee the best result in the long run as it only chooses the most probable word at each step.
  • Random Sampling: Random sampling provides diversity in generated texts by randomly selecting words based on the probability distribution, reducing the likelihood of word repetition. Unlike greedy decoding, random sampling does not always choose the most likely word, incorporating less likely words randomly, resulting in more creative text.
Greedy and Random Sampling
  • Top-k Sampling is a probability distribution method used during text generation in large language models. In top-k sampling, the probability of each word is ranked, and a selection is made from the top k tokens with the highest probability. It does not always choose the word with the highest probability, allowing for the selection of different words. This increases randomness, provides diversity in generated texts, and prevents the production of uniform texts.
  • Top-p Sampling, also known as nucleus sampling, also ranks the tokens by probability during text generation. Tokens are added to a candidate set, in order of decreasing probability, until their cumulative probability reaches a threshold value p; the next token is then sampled from this set (the “nucleus”). This method allows for more diverse text and helps avoid word repetition, and it is commonly used to generate more natural and varied sentences in text production.
Top-k and Top-p Sampling
  • Temperature is a parameter that determines the creativity and consistency of the model during text generation. A higher value provides more randomness and diversity, but the generated texts may become less balanced and have semantic inconsistencies. On the other hand, a lower value leads to less diverse choices, meaning the model produces more reliable and predictable texts. However, texts generated with a lower temperature value may appear repetitive and dull, following a certain pattern. In summary, a higher temperature value increases creativity, while a lower value enhances consistency.
Temperature Parameter
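The sketch below shows how these parameters map onto the Hugging Face generate() API; the model choice, prompt, and parameter values are illustrative assumptions.

```python
# A minimal sketch of how these inference parameters map onto the Hugging Face
# generate() API; the model, prompt, and parameter values are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=50,   # cap on the number of newly generated tokens
    do_sample=True,      # random sampling instead of greedy decoding
    top_k=50,            # consider only the 50 most likely tokens
    top_p=0.9,           # ...further restricted to the 90% probability nucleus
    temperature=0.7,     # lower = more consistent, higher = more creative
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```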

Conclusion

In this article, we introduced large language models (LLMs), discussing the innovations and revolutions they bring to the fields of artificial intelligence and natural language processing. We then explored the Transformer architecture and the “attention” mechanisms used in large language models, attempting to understand the logic behind their operation. Additionally, we discussed Prompt Engineering and Configuration Parameters, exploring how we can shape our models’ outputs. In our next article, we will delve deeper into the pre-training stage of large language models.

Stay tuned for upcoming articles.
