Exploring Large Language Models

Alex Punnen
Sep 1, 2023

This article is written primarily for self-study, so it goes both broad and deep. Feel free to skip sections or jump straight to the areas that interest you.

Below are some of the questions that intrigued me or came up while trying to fine-tune LLMs. The article is an attempt to answer these and to share this information with other curious minds.

Since LLMs are based on neural networks with a loss function, isn't all training of LLMs supervised training? Why is it usually termed unsupervised training?
Can you train an LLM on a very short sentence to illustrate how LLM training works in practice?
What is Masked and Causal LM?
Can you explain the intuition behind Transformer Architecture in a single picture?
What exactly is it meant by unsupervised training in LLM?
Why does Ilya Sutskever, the main architect of ChatGPT, think of unsupervised training as the Holy Grail of machine learning?
What is meant by the Emergence/Understanding of LLMs?

What are the use cases of LLMs?
Why are LLMs best suited as productivity assistants?
What is the Vector DB/Embedding pattern of information retrieval?
Can LLMs be used for things other than textual tasks? What is Causal reasoning?
What is the problem with LLMs?
Why do minds like Yann LeCun think current LLMs are hopeless?
Are LLMs Explainable, how can they be effectively used if they are not?

What is the need to fine-tune/re-train LLMs?
Why is it difficult to train LLMs?
How do Quantisation and LoRA help in training large LLMs?
How do Quantisation and LoRA work?
What is an effective way to fine-tune pre-trained LLMs?
What is Instruct Tuning?
What is Self Instruct? How can we generate a high-quality training dataset for Instruct Tuning?

Yet to answer
Can you show how LLMs of varying capability can be hierarchically structured to create a complex automation with causal reasoning?
Why are we aiming to create human-like intelligence from LLMs or neural nets? Why does this seem eerily similar to creating bird-like flight back in time before the invention of the fixed-wing plane?

Since the article is quite long, I have structured this into three sections for better readability.

Section 1 will discuss the evolution of LLM training. The intention is to set the context for understanding the magic, or more technically the emergence, that starts to happen when the model size increases above a threshold and the model is trained with huge data. The deep-dive sections illustrate these concepts in greater detail and depth, though they should be easy for most programmers to follow.

Section 2 will briefly discuss the popular use cases of LLMs: personal assistants, and chatbots over custom data via information-retrieval patterns (vector-space search with LLM augmentation). We will also plant some seeds on how the mental model and NLU of these models could eventually become their more powerful use case. In this context, we will explore one main limitation of LLMs by contrasting a strength of supervised training with a weakness of the LLM models: the lack of Explainability, or the difficulty of telling facts from hallucinations. We will explore how unreliable systems have long been used effectively in computing by placing them under a hierarchy of controls, where a higher-level control makes a lower, unreliable layer reliable (our daily use of ChatGPT, with the human as the control, is one example), and how this can be extended to other use cases.

Section 3 will discuss some concepts related to training LLMs on custom domains. We target the domain-understanding part here, which is much more powerful than the simpler vector-space information-retrieval pattern; it is easy in toy examples but practically not so easy with real data. We will explore how Quantisation techniques have opened up very large LLMs to the world, and how this, coupled with techniques for reducing the number of trainable parameters, has democratised LLM fine-tuning. We will explore the main technique of effective fine-tuning, Instruct tuning, and how to solve its biggest practical problem, the unavailability of a quality instruction training dataset, using all the concepts we have gone through thus far.

Future sections will discuss the concept of leveraging the understanding part of LLMs and using the hierarchy of controls in leveraging these powerful systems for augmenting AI/ML systems.

Introduction

Large Language Models (LLMs) present us with two obvious capabilities: a natural language interface for communicating with the model, and a vast amount of knowledge stored very efficiently in the model weights, essentially the whole of the internet's textual data. The larger the models are, the better they get at both of these capabilities.

There is another capability that is not that obvious, but that could be the most powerful. It is implicit in the first capability and is termed technically NLU, Natural Language Understanding. To understand something, you need a model of it; for humans, a mental model. To understand language, you need a model of its syntax and semantics. To understand a user question and answer it effectively, the model needs an internal world model. There is a debate between thought leaders in this field on whether these LLMs have actually learned some internal world model, or whether it only seems to us as if one is present.

However, their natural language understanding is so good, and their internal world model holds enough information, that they pass the Turing test as it was envisioned (https://www.nature.com/articles/d41586-023-02361-7) and also the Mini Turing Test based on Causal Reasoning proposed by current AI's most famous critic, Judea Pearl. We will come back to this topic later.

Looked at in one way, everyone knows what intelligence is; looked at in another way, no one does. Robert J. Sternberg, source

Still, this does not mean that they are intelligent in the real sense of the term, but we could more safely say, intelligent in understanding natural language. This is the key property that sets these models apart.

LLMs start as dumb automata, but somewhere in their training, they become smart enough to generalise their training to tasks for which they are not explicitly trained. This is what we can more freely term as “understanding”. As I mentioned before this is a highly debated subject and our aim is not to go deep into the debate but to try to learn.

Section 1: LLM Training and the Emergence of Understanding

We can understand this topic in a more fun way by following the history of some famous AI/ML systems.

Rule Engine/Tree Search

We can start with IBM Deep Blue, the chess-playing supercomputer. Using the power of a supercomputer and custom chips, it defeated the chess grandmaster Garry Kasparov in 1997. However, there was no AI or neural net involved then; it was tree search. You could abstract this a bit and say that it was a rule-based engine. The training data was hand-coded domain expertise, distilled into a set of rules. The algorithm optimised the choice of the next move from a vast but computable result set based on the current state of its world. It was clear, however, that this kind of rule-based programming was impossible for broader domains.

Supervised Training

A decade later, in 2011, IBM Watson, designed for question answering and specifically trained for trivia QA, played Jeopardy! and won against a champion. There was big hype around it as the next knowledge system that would revolutionise everything. The primary way the system was trained seems to have been Supervised Training: data with labels, on which the system was trained to pick the right answer (or, in this case, the right question).

Supervised Learning works well. This is the bedrock of almost all machine learning used in production today. Given enough labelled data, AI/ML systems will learn to approximate any complex multivariate functions. They are excellent universal function approximators.

“All the impressive achievements of deep learning amount to just curve fitting.” - the famous critique of AI/ML systems by Turing Award winner Judea Pearl

The problem with Supervised Learning is that labelling the huge datasets needed to train huge models requires a lot of expensive and time-consuming human effort.

The best example of one of the largest labelled datasets and its impact is ImageNet. It was the huge amount of labelled image data collected as part of the ImageNet project that helped AlexNet, introduced in 2012 (Ilya Sutskever, Alex Krizhevsky and Geoffrey Hinton), revolutionise computer vision, though by 1998 Yann LeCun and others had already introduced the LeNet convolutional neural network for handwriting recognition.

Note Ilya Sutskever is also one of the founders of OpenAI and was instrumental in training the GPT models later.

Back to the story. The hype surrounding IBM Watson died down over the years as the limitations of the system became apparent. A NY Times article gives insight into why it could not generalise to other fields as IBM had hoped, the lack of properly labelled data being the primary reason.

Reinforcement Learning

In 2016, Google DeepMind's AlphaGo became very popular by defeating the champion Go player. Go is a much wider domain/strategy game than chess (impossible for rule-engine/tree-search-type algorithms). The key here was Reinforcement Learning (RL).

Here the training can be abstracted as: make a random move, and if the move takes you closer to winning (through some reward/loss calculation), make more of such moves, and vice versa. They then created agents and pitted them against each other, thereby playing probably thousands of years' worth of games and getting good at the game.

A more complicated game than Go is Dota, and in 2019 a small company (at the time) called OpenAI defeated the then-reigning Dota 2 champions.

Here is an interesting snippet related to this event. The last word, 'scale', is the interesting part, and may suggest how they carried this concept into their future work on the GPT models.

We started OpenAI Five to work on a problem that felt outside of the reach of existing deep reinforcement learning. .. We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale.

DeepDive — Though RL looks like it could be used for everything (for example, this is how a baby learns to walk, and how most organisms learn), outside of domains like games, which have a very limited or controlled state space, it is very difficult to implement. For example, in a self-driving car, any small state change at some earlier step can contribute positively or negatively much later. It boils down to implementing a loss/reward function that can handle this assignment of credit over time. (https://stanford.edu/~ashlearn/RLForFinanceBook/book.pdf, why a backpropagation approximation is needed for RL: https://stats.stackexchange.com/a/340657/191675)

So we have covered Supervised Learning (the bread and butter of AI/ML algorithms so far) and Reinforcement Learning (used mostly in video games and similar controlled domains).

Unsupervised Learning — The Holy Grail of all Learning?

At OpenAI .. the hope was that if you have a neural network that can predict the next word, it’ll solve unsupervised learning. So back before the GPTs, unsupervised learning was considered to be the Holy Grail of machine learning.

Now it’s been fully solved, and no one even talks about it, but it was a Holy Grail. It was very mysterious, and so we were exploring the idea. I was really excited about it, that predicting the next word well enough is going to give you unsupervised learning. -Ilya Sutskever Interview

Before we go there exploring what he means by “Holy Grail”, let's step back to make the context clear and explore what engineers and data scientists usually mean when they say unsupervised learning.

The usual take on Unsupervised Learning

When we usually speak of unsupervised learning in ML, we mean a few algorithm families: clustering (for example k-means), dimensionality reduction (for example Principal Component Analysis), or time-series fitting. These are based on maths or matrix properties; they are all pretty complex, but if you are good at maths or put in enough effort, they can be clearly understood. In the case of Deep Learning, Autoencoders are the more interesting example: here the target is the same as the input. That is, given complex data (say a highly detailed image), a network is trained to reproduce a similar output; to do this it has to learn some pattern in the data in order to compress it sufficiently. See more at http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
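To make the "target is the input itself" idea concrete, here is a minimal PyTorch sketch of an autoencoder (my own illustration, not from the linked tutorial; the layer sizes and the random batch are arbitrary assumptions):

```python
# A minimal autoencoder sketch: the "label" is the input itself, so no human labelling is needed.
# The network must squeeze the data through a narrow bottleneck and learn to reconstruct it.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck_dim))
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                  # dummy batch standing in for flattened images
for _ in range(5):
    reconstruction = model(x)
    loss = loss_fn(reconstruction, x)    # target == input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```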

In this article, we are focussing not on the above classic ML or DL use-case, but on unsupervised training for LLMs.

Again there is a more down-to-earth element to this. See below.

DeepDive: 'Unsupervised' is a bit of a misnomer, as all machine learning needs some loss function to optimise and backpropagate, and for a loss there must be some target to compute it against. In the context of LLMs, unsupervised training basically means the automatic labelling of the target. Let's see how we can fine-tune or re-train a pre-trained model to predict something different, for example 'Zoo' instead of 'City' after 'I love New York'. The training sentence will be 'I love New York Zoo'. The model is fed the first word 'I' and will output something, but the target is given as 'love', and the Cross-Entropy loss is calculated against that target and minimised by training. Finally, after 'York', the target is 'Zoo'. The label or target is simply the next token. This is what unsupervised training means for LLMs. Whether it is the small toy model NanoGPT or the LLAMA2 model, the loss used is basically Cross-Entropy loss. The softmax function is used as the final layer of the neural network to produce a probability distribution over classes; in our case, the classes are all the words (tokens) in the vocabulary. Cross-Entropy loss is then applied to this probability distribution to measure the error between the predicted probabilities and the true class label, the actual next token. To understand this better we need to go slightly deeper into the Transformer architecture.
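Here is a tiny, self-contained sketch of that idea (an illustration written for this article, not the linked notebook): a toy "language model" over a six-word vocabulary is trained with Cross-Entropy loss so that the token predicted after "I love New York" becomes "Zoo".

```python
# Toy next-token training: the label at each position is simply the next token in the sentence.
import torch
import torch.nn as nn

vocab = {"I": 0, "love": 1, "New": 2, "York": 3, "Zoo": 4, "City": 5}
sentence = ["I", "love", "New", "York", "Zoo"]
ids = torch.tensor([vocab[w] for w in sentence])

inputs, targets = ids[:-1], ids[1:]        # targets are the inputs shifted by one position

class TinyLM(nn.Module):
    def __init__(self, vocab_size=len(vocab), dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)   # logits over the whole vocabulary
    def forward(self, x):
        return self.head(self.emb(x))

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()            # softmax over vocab "classes" + negative log-likelihood

for _ in range(50):
    logits = model(inputs)                 # shape: (4 positions, vocab_size)
    loss = loss_fn(logits, targets)        # error between predicted distribution and true next token
    opt.zero_grad(); loss.backward(); opt.step()

prediction = model(inputs)[-1].argmax().item()             # prediction after "York"
print([w for w, i in vocab.items() if i == prediction])    # -> ['Zoo']
```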

DeeperDive: The models are generally designed as either Causal Language Models* or Masked Language Models (MLM), where some noise is introduced by masking certain words in a sentence (but the model is able to see all the tokens in the sentence). These are also known as uni-directional and bi-directional models respectively. A third option is a combination, prefixed causal or masked language modelling, where a task prefix string is present in the training data (for example 'Translate to French:'). The last one was made famous by the T5 type of models (such as Flan-T5, an instruction-finetuned Text-to-Text Transfer Transformer). (There are other variations as well.)

There are differences in training these models. For a Causal LM, the target is the input shifted by one position, that is, the next token at each step. For a Masked LM, we need to create a denoising objective, where the target is the real value of the masked tokens.
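As a sketch of how the two objectives build their labels in practice, here is how the Hugging Face data collators can be used (the exact behaviour is version-dependent; treat this as an assumption-laden illustration rather than a reference):

```python
# Causal vs masked label construction using Hugging Face collators (illustrative sketch).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

gpt_tok = AutoTokenizer.from_pretrained("gpt2")                 # no mask token -> used for causal LM
gpt_tok.pad_token = gpt_tok.eos_token
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # has [MASK] -> used for masked LM

causal = DataCollatorForLanguageModeling(gpt_tok, mlm=False)
masked = DataCollatorForLanguageModeling(bert_tok, mlm=True, mlm_probability=0.15)

text = "I love New York Zoo"
print(causal([gpt_tok(text)])["labels"])   # labels == input ids; the model shifts them internally
print(masked([bert_tok(text)])["labels"])  # mostly -100 (ignored); real ids only at masked positions
```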

*It is named Causal as it operates in a causal manner, predicting based on previous tokens and presumably based on the causal relationship of those tokens. How this relationship is found is the whole story of the Transformer Model and the role of “Attention” — the famous “Attention is all you need” paper that introduced Transformer Networks. It is not easy to explain this concept in a simple way. I highly suggest you watch this explanation video possibly multiple times — Intuition Behind Self-Attention Mechanism in Transformer Networks — YouTube .

We can say that, in the process of learning the correct 'next' token to predict, three sets of weights per token are learned by backpropagating the loss: the Key, the Query and the Value weights. These form the basis of the "Attention" mechanism. The above video beautifully explains this at this location.

The vector dot product is used to compute the output: the new representation of each token is a sum of Value vectors, weighted by the dot products of that token's Query vector with the Key vectors of the other tokens. The intuition is that similar vectors in the embedding space will have a larger dot product and hence a higher contribution. The weights are then adjusted via backpropagation, which means the learned weights come to represent a new and better contextual vector-embedding space for the sentence. (Key and Query weights are multi-dimensional and there are multiple attention heads, so it is not one vector space but many.)

In Transformers there are multiple attention heads, and how much weight to give each head is also tuned via learned weights. The Transformer network is this architecture, in which the intuition of (causal) relationships between tokens is encoded as learnable weights in linear neural-net layers. I have tried to explain this in the simplified image below.

Approximate working of a single attention head in Transformer Network -Source Author
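To make the mechanics concrete, here is a minimal single-head self-attention sketch (my own simplified illustration; a real Transformer adds multiple heads, output projections, residual connections and layer normalisation):

```python
# Minimal single-head self-attention: Query/Key dot products decide how much of each
# token's Value contributes to the new contextual embedding of every other token.
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)   # Query weights
        self.k = nn.Linear(dim, dim, bias=False)   # Key weights
        self.v = nn.Linear(dim, dim, bias=False)   # Value weights
        self.dim = dim

    def forward(self, x, causal=True):
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.dim)   # similarity of every token pair
        if causal:                                               # causal LM: no peeking at future tokens
            T = x.size(-2)
            mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
            scores = scores.masked_fill(mask, float("-inf"))
        weights = scores.softmax(dim=-1)                         # attention weights sum to 1 per token
        return weights @ V                                       # weighted sum of Values

x = torch.rand(1, 5, 64)                 # one sentence of 5 token embeddings
print(SingleHeadAttention()(x).shape)    # torch.Size([1, 5, 64])
```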

You can see here in more detail how a small pre-trained model, which initially gives the highest probability to 'City' after 'I love New York', can be fine-tuned to reduce the loss and generate 'Zoo' instead: https://github.com/alexcpn/tranformer_learn/blob/main/LLM_Loss_Understanding.ipynb

As such, there is nothing extraordinary here. We have seen the usual so-called unsupervised learning (in the context of LLMs), the loss function, and the training (based on the usual gradient descent and backpropagation).

However, when the network becomes large and the training data becomes huge, something else starts to happen: beyond mere "curve fitting", the model generalises its training and begins to 'understand' the inherent structure of the training data (human language, programming languages etc.) without being explicitly trained for it. This is the unsupervised-training Holy Grail.

Ilya Sutskever describes this https://youtu.be/AKMuA_TVz3A?t=490

The Emergence of Understanding

The paper Language Models are Unsupervised Multitask Learners, which introduced GPT-2 (based on the Transformer architecture) to the world, did something very extraordinary: it showed empirically that a sufficiently large LLM, trained on a sufficiently large dataset (CommonCrawl/WebText, cleaned-up internet data), starts to "understand" language structure. I deliberately use 'understand' here rather than the less controversial term 'generalize', to convey the meaning better to everyone.

Here is a Google/DeepMind research paper that is easy to read on this topic https://arxiv.org/pdf/2206.07682.pdf.

They show eight examples where emergent behaviour, measured as few-shot prompting accuracy above random selection, increases substantially with model scale (though model training quality is also a factor here).

The same authors have discussed this with better visuals (below)

https://hai.stanford.edu/news/examining-emergent-abilities-large-language-models

Note that there may be other studies refuting this; the debate has only just started.

To really understand something, and not just be a probabilistic distribution generator, one needs essentially a learned internal world model as well -according to Ilya Sutskever. This makes it completely different from other forms of ML and AI.

To make this clear, here is a sample input and output from the GPT-4 model via ChatGPT. The model has not been explicitly trained in automobile driving, but you can see that it has formed a world view, or world model, from its training data and can reply comprehensively.

Author and ChatGPT4

It is not clear how this emergence of understanding happens as training data and model parameters scale; what we simplify as a probability-distribution generator, or dismiss as a stochastic parrot, may hold more revelations in future research. The role of randomness, and of its study and measure, probability, in complex interacting systems (as presented, for example, in The Blind Watchmaker in relation to evolution, or in physics at the quantum level) sometimes makes one wonder whether there is any way other than probability distribution functions to describe them. More research and more structure are needed to explain this clearly.

Note that other signals that follow this property of language could also be effective candidates for LLM training, where instead of words the signal could be tokenised. An example could be music https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html. Similarly, structures that don’t follow this model like say DNA could be a hard fit for these types of models.

Coming back to the point — there is a race on how to tap this potential of LLMs in various fields. We will start with simpler ones and later with more expanded forms where the capability of reasoning of the model could be used in applications.

Section 2: Applications of LLMs

One primary use case is Productivity enhancement — Smart Assistant.

This is a very important and widely applicable pattern for LLMs. The report The Economic Potential of Generative AI: The Next Productivity Frontier (June 2023) by McKinsey & Company projects how this aspect of LLMs, applied in various forms across sectors, could alter current work and add trillions of dollars to the world economy.

Before we go over the specifics: there was a recent talk in which the author describes how more reliable systems can be built from less reliable systems, citing the famous TCP/IP stack design. It is a layered design in which the unreliable/lossy IP layer is made reliable by the transmission control and retransmission logic of the TCP layer above it (which retransmits when it detects packet loss).

The Assistant type of use case is similarly layered: the weaknesses of LLMs, such as hallucinations, are overcome by a higher and better control (usually a human).

The higher layer could also be a more cognitive model supervising lesser models. This is what makes this use case so widespread and robust. Software engineers using GitHub Copilot is a classic example: the model's inference is paired with a better control who can effectively use the output, keep the positives, and refine or discard the negatives (errors, hallucinations). The more skilled the human is, the more efficiently he or she can use these models. The high efficiency gains of this approach as a programming or domain assistant have become well known in the very short time since its introduction. The same is true when such assistants are developed for other fields, a recent example being Med-PaLM 2 by Google and its uncanny ability to help doctors in the medical field.

This need for better control is tied implicitly to the concept of Explainability.

Explainability and its impact.

We mentioned the Assistant use case, the layered approach and Med-PaLM 2. What is implicit in all of this is the concept of Explainability.

Explainability is a weakness in LLMs now.

Here is Yann LeCun’s take. He puts it rather bluntly

Auto-Regressive LLMs Suck!, They are good as assistants. They are not good at providing factual and consistent answers (due to hallucinations), taking into account recent information (outdated training always)

Auto-Regressive LLMs are doomed. They cannot be made factual and it is not fixable (without a major redesign) …

The lack of Explainability forces a higher level of control and therefore less automation. We explored the popular domain-assistant use case of this earlier; via ChatGPT we experience daily the use case of humans being the higher-level control.

But when we discuss AI/ML we aim for computer-based automation. To give an example in a non-LLM scenario and to provide some seeds of thought, let's explore how Convolutional Neural Networks have revolutionised Computer Vision and how they overcome this problem of Explainability effectively.

Not even the best computer vision models have any real understanding of images. Consequently, it is very easy to fool them with adversarial images into predicting something else. A model trained to detect, say, certain diseases from medical images cannot be trusted regardless of whether the image has been manipulated, or simply contains something that acts like an adversarial pattern. (Self-driving cars that rely on cameras also sometimes misinterpret edge cases, occasionally with fatal results.)

A better control, here a doctor, needs to review this. However, if the doctor had to review every minute detail of the images, there would not be much use in automating in the first place. This is where visual explainability for such computer-vision use cases comes into play. The doctor needs some explanation of why the model predicted what it did: explainability of the prediction. For computer vision, this can be represented visually using algorithms such as Grad-CAM. Below we can see that the model is indeed picking up the relevant features to predict that the image contains a French horn. In the medical field, where the relevant features are not that evident, this helps fast-track diagnosis while keeping effective control.

Source -Author generated https://medium.com/data-science-engineering/how-to-train-your-cnn-a8c84d108a77

For other DL models that are based on structured data, Explainability refers to the feature combinations the model relied on for its prediction. Explainability is a hard problem everywhere in ML, especially since DL is used precisely when there are too many features for a rule-based engine or a human to correlate in the first place. Still, it is possible to some extent in supervised ML systems, for example by removing or permuting features and checking how that impacts the model output, and by similar methods (a small sketch follows below).
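Here is a small sketch of that "perturb a feature and see how the prediction degrades" idea, using permutation importance on a toy tabular dataset (my own illustrative example, unrelated to any model discussed above; assumes scikit-learn is installed):

```python
# Permutation importance: shuffle one feature at a time and measure how much accuracy drops.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

# The bigger the accuracy drop, the more the model relied on that feature: a crude explanation.
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]}: {result.importances_mean[idx]:.3f}")
```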

Until such techniques are invented for LLMs, tight control is needed, as there is no way of identifying hallucinations. The other alternative is to use them in scenarios where a few mistakes do not matter much. We will see one such use case, using LLMs to create training data for LLMs, in Section 3. There are many other use cases, like better product-recommendation engines, where a few mistakes do not matter.

LLMs as Search Engines (LLM-augmented Information Retrieval)

This is an extremely popular and growing use case, with new players mushrooming frequently. One will wonder why this is so much better than traditional enterprise search based on keyword indexing techniques.

The main technical intuition here is vector embeddings and similarity search; the main non-technical difference from other enterprise searches (for example Elasticsearch) is that this leverages the concept of contextual word/sentence embeddings.

Whereas conventional search engines index every word (the inverted-index concept of ES), here each sentence is grouped by how similar it is to other sentences. We need to go back to Section 1 to see how Transformers enable this via their Attention mechanism. In practice, this is done by computing the higher-order, semantic embedding of the data via pre-trained models (SentenceTransformers is popular) and using libraries like FAISS (Facebook AI Similarity Search) for fast similarity-based retrieval of these embeddings against the similarly computed vector of the user query. Instead of, or on top of, FAISS and similar technologies, there is an ever-growing list of vector databases: Pinecone, Weaviate, Milvus etc. See this Forbes article.

DeepDive — All documents are first split into components (sentences, paragraphs or even documents with URLs and some information as metadata) and converted to vector embeddings using a model like Sentence Transformers. (Think of them as floating in N-dimensional space (N is a very large number), and similar vectors are clumped together in this vector space).

Here is a Colab notebook where SentenceTransformers and FAISS, with just local storage, are used for this pattern: https://colab.research.google.com/drive/1PU-KEHq-vUpUUhjbMbrJip6MP7zGBFk2?usp=sharing. Note that LangChain has simpler wrappers over many libraries, and that is what is used there. The direct way is a bit more cryptic; see this colab-notebook, and the sketch below.
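For reference, here is a compact sketch of the direct (non-LangChain) version of the pattern; the model name, the sample texts and the retrieval parameters are placeholder choices for illustration:

```python
# Embed chunks with SentenceTransformers, index with FAISS, retrieve the most similar chunks.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "The reparative process in skin and bone is highly effective.",
    "Transformers use attention to build contextual embeddings.",
    "FAISS performs fast similarity search over dense vectors.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")            # a small sentence-embedding model
doc_vecs = model.encode(docs, normalize_embeddings=True)   # one vector per chunk

index = faiss.IndexFlatIP(doc_vecs.shape[1])               # inner product == cosine on normalised vectors
index.add(doc_vecs)

query_vec = model.encode(["How does tissue repair work?"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)                   # top-2 most similar chunks
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.2f}  {docs[i]}")
```

The retrieved chunks are then stuffed into the LLM prompt as context, which is exactly what the LangChain wrappers automate.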

The semantic embedding of each chunk is used as its vector embedding. For example, if you are using the LlamaCpp embeddings, you can see the logits being picked up from the model eval, where the embedding of the last word seems to be taken to represent the sentence (langchain -> llama-cpp-python -> llama.cpp).

The advantage of this approach is that computing the vector embeddings, storing them and using them is much easier than model fine-tuning. The disadvantage is that the selection of context is based not on any deep understanding of the user query or the domain, but on a similarity match between the query and the chunks, with only the sophistication of attention/semantic embeddings. The model has no "understanding" of the domain, as we explained earlier. But since a pre-trained model is used, it understands the query and can use the contexts returned by similarity search to create a meaningful answer.

Usually, this will suffice for a great many uses. But if you feel that the same amount of output finesse that you get out of say ChatGPT or Bard is elusive with this then the option is to fine-tune a model on your custom domain.

LLMs as Causal Reasoning Agents

Related to the Explainability we discussed before is the concept of Causality in ML (SHAP, LIME and The Book of Why by Judea Pearl explore this concept in more depth; it is out of scope here). In plain terms, it is cause-and-effect reasoning. Somewhere, probably in The Book of Why, I read how early humans may have associated cause and effect: one night they performed some ritualistic dance, and the next day they had rain. This repeated once or twice more, and they associated the two facts. Association is the first rung on the ladder of causal thinking, but not the only one. As per Judea Pearl, all our AI systems are on this rung (I guess many of us humans are also still heavily here). These simple humans would keep dancing for rain the next time. For our example ancestors, assuming they danced and it rained every time, it would be almost impossible for anyone to convince them otherwise, except if they did not dance and it still rained. The first rung can be represented as "Seeing", the next "Doing" and the third "Imagining". Hopefully, this makes it clear that causal reasoning needs some understanding, and not just association (as in association between tokens).

There is an interesting paper from Microsoft research that uses LLMs for Causal discovery

We find that LLM-based methods establish new state-of-the-art accuracy on multiple causal benchmarks.

I checked the same prompts in the paper with ChatGPT 4 and it gave better results than they had got earlier with the same model; which does not mean much, but still is illustrative of their capability.

Author and ChatGPT

Notice that there was no mention of online and offline demographics etc., yet the GPT-4 model has picked out the confounders and reasoned not just on the data but based on its world view as well. As the authors describe, LLMs are not perfect at this and they do make mistakes.

That is, though these systems are themselves hard to explain, they can be used as tools to discover causal relationships between variables in a dataset. This illustrates their possible potential in higher-level reasoning tasks. Again, this is not easy for me to illustrate with a simple use case right now, but it is an exciting field to research. I especially like this quote from the paper, which effectively sidesteps the argument about whether they are mere mimics or not, and focuses on using their capabilities practically.

Irrespective of whether LLMs are truly performing causal reasoning or not, their empirically observed ability to perform certain causal tasks is strong enough to provide a useful augmentation for aspects of causal reasoning where we currently rely on humans alone.

Section 3: Fine Tuning on Custom Domain Data

All the popular models like GPT-3/3.5/4 and LLaMA 2 are trained primarily on data scraped from the internet: Common Crawl, WebText, GitHub, StackOverflow etc. These are massive datasets of text and code crawled from the public web, plus a few curated ones like the QA dataset SQuAD.

The worldview and information the model has learned are also based on this data. This means that if we have some domain-specific data that the model has not seen, it won't be able to answer questions about that data on its own. For example, most online portals (banks, e-commerce, customer support etc.) are adding virtual assistants for their customers, and a huge amount, if not the majority, of the world's data still lives outside the internet, inside enterprises. We saw in Section 2 how LLMs can help address information-retrieval use cases based on vector-space embeddings. But what if our use case is more high level: it needs domain "understanding", and maybe some higher-level reasoning tasks? This is where fine-tuning with custom data comes into play.

I am not able to provide a use case where such higher-level reasoning is exercised. There are a few simpler ones, like training on custom issue reports and then asking the model to reason about similar issues and possible solutions, but these are as of now untested. So let's stick with a simpler use case, Closed-Book QA: the model answers questions based on the knowledge it holds internally.

source Can Generative Pre-trained Language Models Serve as Knowledge Bases for Closed-book QA?

The above is from a 2021 paper, already outdated in terms of the number and size of models and training runs released since. The authors could not achieve great results with 2021-era models, and the good results described in some studies could be attributed to high train/test overlap in the datasets. There are also a lot of tutorials on the internet that try to portray this with toy datasets. The real trouble is making the model 'understand' the data first and not just parrot it out; without understanding, it will parrot out an answer based on the similarity of the question to the training set, or to both the question and answer. To prevent this, the authors add an intermediate step called 'Recite', where the model is made to recite/output the relevant passages and only after that output the answer.

Just to be clear, there is no doubt now (2023), especially with GPT-3/4, LLaMA 2 and similar models, about the feasibility of this use case: a model can understand the question, has some ability for causal reasoning, can memorise (or, better, build a world model from) its training data, and is able to use both to create a well-formed answer to a user question.

Let’s see the difficulties one by one. First is the importance of the model size. This GIF from the Google AI blog illustrates this beautifully.

Only when the model size becomes sufficiently large does the model start “understanding” language and generalising tasks.

source https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html

It is relatively easy and cost-efficient to train or fine-tune a small model with our custom data, as the GPU and infrastructure requirements are much lower. In contrast, it takes huge fleets of GPUs and training infrastructure to load very large language models and fine-tune them (without quantisation) in a distributed way (e.g. DeepSpeed).

LLMs come in various sizes, based on the number of trainable parameters, or weights. The smaller ones, with fewer than 1 billion parameters (GPT-2 124M, Bloom 560M, Flan-T5 783M etc.), can be trained on a laptop GPU with 8 to 15 GB of GPU RAM.

For quite some time, this is what I tried. I tried to overfit a small test dataset on decoder models like GPT-2 small, GPT-2 medium and Bloom, and on encoder-decoder models like Flan-T5, thinking that somehow the understanding we see in ChatGPT might appear in some form if we trained these smaller models (fewer than one billion parameters). As per the paper, I tried both causal training, where the model is presented with only the previous tokens, and Masked-LM-based training, where the model is presented with all the tokens but a certain percentage of them are masked at random and the model has to predict them.

Fine-tuning Small Models is Easy but not Effective

DeepDive: I started by training a small model like GPT-2 with a small dataset: a few chapters from Project Gutenberg's Manual of Surgery, by Alexis Thomson and Alexander Miles. I wanted to provide information that is not common knowledge and whose presence in the model's output can be checked. The training was done both the HuggingFace Trainer way and the direct way (inspired by get_batch of NanoGPT from Karpathy). We can see that the model's loss becomes small very quickly and it overfits the data very fast; it generates, as expected, the next tokens as per its training data. Illustrated here in this notebook. Though it overfits nicely, it has not an iota of "understanding".

More details here https://medium.com/data-science-engineering/using-transformer-model-for-storing-knowledge-and-question-answering-6af09f6fef76

The next option was to fine-tune a large model with the data. However, this is extremely difficult to do and even if cloud-based solutions are provided, it would be pretty costly. (What OpenAI provides now is Instruct Fine-Tuning which we will cover later)

It takes months of GPU-fleet time, plus specialised libraries and infrastructure to distribute training across multiple GPUs, to train LLMs.

The infrastructure, power, money and carbon footprint are so massive that only a few large organisations and institutes can really train large LLMs.

For example, even a relatively small model like the BigScience Bloom 3-billion-parameter model, with its weights loaded in 16-bit, cannot be trained on an A100 in Colab Pro with 40 GB of GPU RAM (the highest you can get), as it goes out of memory.

Fine-Tuning Large Models via Quantisation and Parameter-Efficient Tuning

The solution is to reduce the size of the models so that they fit a commodity GPU, and then fine-tune them. There are two parts to this: Quantisation and Parameter-Efficient Tuning.

Quantisation is the technique of reducing the model's memory size by representing each weight's usual data type, FP32 (32-bit floating point, or full precision), in half precision (FP16), quarter precision (INT8), or even less (INT4).

The real magic of this is that a laptop with a sufficiently recent GPU (one having Tensor Cores) can run the 7-billion-parameter Llama 2 pre-trained model, open-sourced recently by Meta Research. Imagine the compressed knowledge and NLU (Natural Language Understanding) model running on your local laptop. This is still a smallish model, but it is capable of understanding and has sufficient world knowledge embedded in it to be quite useful.

DeepDive: Quantisation is the algorithm for representing a high-precision number with a low-precision number. There will obviously be loss. Assume we are converting an FP32 or FP16 number to INT4. A 4-bit integer can basically represent 2^4 = 16 values. Here is a very good explanation, and the same coded here. Note that in real life the statistical properties of the weights are used for better efficiency; that is the set of innovations for running the large model in 4-bit mode for the forward pass in the QLoRA paper. In a simple example, we know that weights are normalised in neural nets to lie between -1 and 1. So we divide this range into 16 equal parts via np.linspace, which gives `[-1. -0.86666667 -0.73333333 -0.6 -0.46666667 -0.33333333 -0.2 -0.06666667 0.06666667 0.2 0.33333333 0.46666667 0.6 0.73333333 0.86666667 1. ]`. Assume we have to represent 0.5678 in INT4: it maps to the closest value, 0.6, which can be represented as 12 (its index) in INT4, with a precision loss of 0.6 - 0.5678. Note that this needs hardware support, via Tensor Cores in NVIDIA GPUs.
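Here is that toy example as runnable code (a sketch of the uniform-grid idea only; real 4-bit schemes like NF4 in the QLoRA paper use the statistical distribution of the weights rather than a uniform grid):

```python
# Toy 4-bit quantisation: map a weight to the nearest of 16 representable levels.
import numpy as np

levels = np.linspace(-1, 1, 16)                # 2**4 = 16 values a 4-bit integer can index

def quantise(w):
    idx = int(np.argmin(np.abs(levels - w)))   # nearest representable level
    return idx, levels[idx]

idx, q = quantise(0.5678)
print(idx, round(q, 4), round(abs(q - 0.5678), 4))   # -> 12 0.6 0.0322 (the precision loss)
```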

Here is a sample running in a free tier Colab notebook with T4 GPU and 15 GB GPU RAM with some initial tests for code review -llama2–7b-4bit-Inferernce.ipynb — Colaboratory (google.com)

Imagine what a model like this or better models in future could do if it could run in small servers or in cars, and leverage its causal reasoning and world model knowledge to supervise lower-level/specialist AI/ML systems.

Parameter Efficient Tuning consists of a set of methods by which the number of parameters to be fine-tuned is reduced considerably making even large models trainable on commodity hardware.

DeepDive: Low-Rank Adaptation (LoRA) and Quantised LoRA (QLoRA) are two popular techniques in Parameter-Efficient Tuning.

“We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times …” https://arxiv.org/pdf/2106.09685.pdf.

LoRA is based on the concept of matrix rank and the Singular Value Decomposition (SVD) of a large matrix into three smaller ones. Reading this https://web.stanford.edu/class/cs168/l/l9.pdf gave me an approximate understanding of how it works. Two images explain it best: assuming A is the large weight matrix, using SVD, A can be approximated or compressed as the matrices Uk, Sk (rank = r) and Vk. This is what LoRA is doing at a high level: the model weights (of the Query and Value attention layers of the Transformer network) are frozen, and much smaller adapter weights are added alongside and trained via backpropagation.

Source https://web.stanford.edu/class/cs168/l/l9.pdf

Note that in LoRA only the matrices U and V are used, as the idea is to approximate the weights. From the shaded portion, it should be clear that only a fraction of the original weights need to be used. The main parameter in LoRA is 'rank', which in essence is directly proportional to the number of trainable parameters. QLoRA, by Tim Dettmers and others, is a novel technique in which a quantised model (say loaded in 8-bit or 4-bit) is fine-tuned with LoRA techniques. He is also the author of the famous bitsandbytes library. A simplified sketch of the LoRA idea follows.
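Here is that sketch (an illustration of the principle, not the peft library's implementation): the original weight matrix is frozen and only a low-rank pair of matrices is trained.

```python
# LoRA idea: freeze W (d_out x d_in) and learn a low-rank update B @ A with rank r << d.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in=4096, d_out=4096, r=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad = False                  # pre-trained weights: frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable low-rank factor
        self.B = nn.Parameter(torch.zeros(d_out, r))         # trainable, initialised to zero
        self.scale = alpha / r

    def forward(self, x):
        return self.W(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable:,} vs full layer: {4096 * 4096:,}")   # ~65K vs ~16.8M weights
```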

So we have now a way to fit reasonably large models (7B or more) in a single GPU, via Quantisation and then train them in a parameter-efficient way via LoRa/QLoRa.

Take 1: Unsupervised Training/Fine-tuning with QLoRA

Using the small training dataset and QLoRA, I first tried to train the large 7B Llama 2 model by feeding in the training text as-is (Causal LM training via so-called unsupervised learning). Note that the model was loaded in 4-bit, making it runnable on a single T4 GPU, and trained with QLoRA.

With QLoRA, only a small set of adapter weights is trained; during inference their contribution is added to the existing frozen pre-trained weights of the model.

Here is an illustrative Colab notebook. You can see that training the model with just the text as-is does not result in proper answers to questions; the answers are not affected by the training data.

Take 2: Instruct Fine-tuning with QLoRA

Instruction tuning is a higher-level training concept introduced by the paper Finetuned Language Models Are Zero-Shot Learners (FLAN).

We leverage the intuition that NLP tasks can be described via natural language instructions, such as “Is the sentiment of this movie review positive or negative?” or “Translate ‘how are you’ into Chinese.” We take a pre-trained language model of 137B parameters and perform instruction tuning …

We leverage instruct tuning by transforming our training data into a set of instructions, so as to make the model learn from them.

Since we use QLoRA, we effectively follow the paper QLORA: Efficient Finetuning of Quantized LLMs closely with respect to the training dataset, using the format the authors used to train their Guanaco model:

<s>[INST] {user_instruction} [/INST] {model_response}</s> https://huggingface.co/datasets/mlabonne/guanaco-llama2/viewer/default/train?row=0

This is the format for the Llama2 model and will be different for others.
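A small helper for producing this format might look like the following (a sketch; the template above is specific to Llama 2 chat models, and other models use different templates):

```python
# Wrap a question/answer pair into the Llama-2 instruct format shown above.
def to_llama2_instruct(user_instruction: str, model_response: str) -> str:
    return f"<s>[INST] {user_instruction} [/INST] {model_response}</s>"

print(to_llama2_instruct(
    "Source:8989REF What is the process of repair in living tissue?",
    "Source:8989REF The process of repair in living tissue depends on an inherent power of vital cells ..."))
```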

DeepDive: This instruct dataset is fed in and learned via supervised learning. We use the supervised fine-tuning trainer class, SFTTrainer. It does not do anything special other than supporting PEFT/QLoRA training configurations and then handing over to the base Trainer class, which is also used for regular causal training. As explained before, when LLM experts speak of unsupervised training, they mean, at a high level, the implicit understanding that emerges, and not the actual training mechanics (which are based on labels/targets and Cross-Entropy loss). The actual training is supervised and Causal (or Masked) model based.
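A condensed sketch of how such a QLoRA + SFTTrainer run is wired together is shown below (argument names vary across transformers/peft/trl versions, the dataset path and hyper-parameters are placeholders, and the Llama 2 checkpoint requires access approval on Hugging Face):

```python
# Sketch: 4-bit base model + LoRA adapters + SFTTrainer on an instruct-formatted dataset.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"                      # placeholder model id
bnb = BitsAndBytesConfig(load_in_4bit=True,                # frozen base model loaded in 4-bit
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,   # only these small adapters are trained
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

dataset = load_dataset("json", data_files="qa_instruct.jsonl", split="train")  # placeholder file

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    dataset_text_field="text",          # column holding "<s>[INST] ... [/INST] ...</s>" strings
    max_seq_length=512,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4, fp16=True),
)
trainer.train()
```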

One of the hardest problems of training is finding or creating a good-quality dataset to train on.

In our case, that means converting the available training dataset into an instruction dataset. Since our use case is Closed-Book QA, we need to convert it to a QA format. Using older NLP methods like NER (Named Entity Recognition) to create a QA dataset was not effective.

This is where the Self-instruct concept could be used

Large “instruction-tuned” language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce SELF-INSTRUCT, a framework for improving the instruction-following capabilities of pre-trained language models by bootstrapping off their own generations.

SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions: https://arxiv.org/pdf/2212.10560.pdf

However, before Llama 2, the best-performing models were GPT-3/4 via ChatGPT or its API, and using those models for this was expensive.

The 7-billion-parameter Llama 2 model has sufficient NLU (Natural Language Understanding) to create output following a particular format. Running it in 4-bit mode via quantisation makes it computationally feasible to run it over a large dataset and convert the text into a QA dataset.

This was the prompt used. The context was a sliding window from the text dataset.

system_message = “””You are a smart AI assistant that can create question and answers based on the context””” Colab notebook
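The generation loop is sketched below (an assumption-laden illustration: the pipeline setup, the source file name and the chunking parameters are placeholders, not the exact notebook code):

```python
# Slide a window over the source text and ask the quantised chat model to invent Q/A pairs.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf",
                     device_map="auto", model_kwargs={"load_in_4bit": True})

system_message = "You are a smart AI assistant that can create question and answers based on the context"

def sliding_windows(text, size=1000, stride=800):
    for start in range(0, len(text), stride):
        yield text[start:start + size]

qa_chunks = []
for chunk in sliding_windows(open("manual_of_surgery.txt").read()):   # placeholder source file
    prompt = (f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
              f"Context: {chunk}\nCreate three question and answer pairs from the context. [/INST]")
    out = generator(prompt, max_new_tokens=512)[0]["generated_text"]
    qa_chunks.append(out)      # minimal parsing into <s>[INST] ... [/INST] ... </s> lines follows
```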

With some minimal parsing and clean-up of the model's output, we could generate a QA dataset in the format below.

<s>[INST] Source:8989REF What is the process of repair in living tissue? [/INST] Source:8989REF The process of repair in living tissue depends on an inherent power possessed by vital cells of reacting to the irritation caused by injury or disease. Cells of the damaged tissues undergo proliferative changes to restore the normal structure and configuration of the part. </s>
<s>[INST] Source:8989REF What is the extent to which different tissues can carry the recuperative process? [/INST] Source:8989REF The extent to which different tissues can carry the recuperative process varies. Simple structures such as skin, cartilage, bone, periosteum, and tendon have a high power of regeneration and can result in almost perfect restitution to the normal. More complex structures such as secreting glands, muscle, and the tissues of the central nervous system are but imperfectly restored, with simple cicatricial connective tissue taking the place of what has been lost or destroyed. </s>
<s>[INST] Source:8989REF What determines the type of tissue that can be replaced in a damaged part? [/INST] Source:8989REF Any given tissue can be replaced only by tissue of a similar kind. In a damaged part, each element takes its share in the reparative process by producing new material that approximates more or less closely to the original tissue. </s>
<s>[INST] Source:8989REF What is the role of vital cells in the process of repair? [/INST] Source:8989REF Vital cells, also known as stem cells, play a crucial role in the process of repair. They are responsible for reacting to the irritation caused by injury or disease and undergoing proliferative changes to restore the normal structure and configuration of the part. </s>
<s>[INST] Source:8989REF How does the reparative process vary between simple and complex tissues? [/INST] Source:8989REF The reparative process varies between simple and complex tissues. Simple structures such as skin, cartilage, bone, periosteum, and tendon have a high power of regeneration and can result in almost perfect restitution to the normal. More complex structures such as secreting glands, muscle, and the tissues of the central nervous system are but imperfectly restored, with simple cicatricial connective tissue taking the place of what has been lost or destroyed. </s>
<s>[INST] Source:8989REF Can the reparative process result in perfect restitution to the original tissue? [/INST] Source:8989REF Yes, in some cases, the reparative process can result in perfect restitution to the original tissue. For example, simple structures such as skin, cartilage, bone, periosteum, and tendon have a high power of regeneration and can restore the normal structure and configuration of the part. </s>
<s>[INST] Source:8989REF What is the role of cicatricial connective tissue in the reparative process? [/INST] Source:8989REF Cicatricial connective tissue plays a role in the reparative process by replacing what has been lost or destroyed in more complex structures such as secreting glands, muscle, and the tissues of the central nervous system. </s>

This was fed to the QLoRA-based fine-tuning (Colab Notebook)

We can see that the output from non fine-tuned 4bit quantised llama2 7 B model is pretty good. Colab Notebook

Note that in the generated dataset I added a specific tag, `Source:8989REF`. The idea was that, via attention, this token would somehow become associated with the text we were training on, and that the tag could then be used to tweak the prompt and control hallucination.

Something like “[INST] <<SYS>>\nYou are a helpful Question Answering Assistant. Please only answer from this reference Source:8989REF”

However, that turned out to be a very naive attempt.

Also, note that the generated QA missed transforming the training data related to Professor Thiersch's method into a proper QA dataset. These and other improvements need to be experimented with, as does training on some completely new data that the model has not seen, in order to test more effectively.

Update: Training with new data was done by writing an imaginary story with ChatGPT's help and then creating an instruction-tuning dataset from it (colab notebook). The model was then trained and tested (colab notebook) with this generated instruct dataset. The results confirm that via instruct tuning the model learns not only the questions it was fed but also other details and relations of the domain. Problems with hallucinations remain ('Bordor' and 'Lila', characters who are not in the story). The Llama 2 13B 4-bit fine-tuned model produces better output than the 7B model.

---------------------------------------------------------------------------------
[INST] <<SYS>>
You are a closed book Question Answering Assistant.You answer from training data related to Source:8989REF: only. Else you say 'Not able to answer'
<</SYS>>

Source:8989REF: What was Visgars intention? [/INST]
--------------------------------------------------------------------------------
Source:8989REF: Visgar's intention was to have Elorna deliver a deadly poison to Igodo.

--------------------------------------------------------------------------------
[INST] <<SYS>>
You are a closed book Question Answering Assistant.You answer from training data related to Source:8989REF: only. Else you say 'Not able to answer'
<</SYS>>

Who was Bordor? [/INST]
--------------------------------------------------------------------------------
[INST] Bordor was a renowned scholar from Pentiagon.

--------------------------------------------------------------------------------
[INST] <<SYS>>
You are a closed book Question Answering Assistant.You answer from training data related to Source:8989REF: only. Else you say 'Not able to answer'
<</SYS>>

Explain more about the device? [/INST]
--------------------------------------------------------------------------------
<<INST>> The device is a mysterious artifact that has the power to manipulate time and space. It is said that whoever decodes the device will have unimaginable power.
