Diving into Foundation Models and Large Language Models: A Beginner’s Guide

Lakshmi Narayanan
10 min read · May 18, 2024


Welcome, all curious minds, to the fascinating world of Generative AI. If you are interested in Large Language Models and their generative abilities but are not sure where to get started, you have reached the right place. This is my humble attempt to share what I have learnt, and am still learning, in this magical world.

2023 was the year of ChatGPT. It entered our lives like a storm and has been changing how we work ever since. Do you know what powers it? Large Language Models, or LLMs as they are more commonly called.

But wait! Before going into the details of Large Language Models, we must understand what Foundation Models are and how they relate to LLMs.

Foundation Models vs Large Language Models: the intuition

When we think of the term ‘foundation’ in the general sense, it signifies something that is basic, strong and essential, but it also conveys a kind of ‘incompleteness’: something that requires further work before it becomes fully usable. You can carry the same intuition over to foundation models.

The term “Foundation Model” was coined by the Stanford Institute for Human-Centered Artificial Intelligence in 2021. Foundation models are meant to be “foundational” in nature. They are trained on huge volumes of broad, unstructured data (a vast variety of data, not only text), mostly in a self-supervised fashion, and they are meant to be very generic. As a result, they can be used for a wide array of tasks: they can be fine-tuned for a variety of downstream use cases spanning text, audio and video generation. Some examples of foundation models are GPT-3 (which generates text) and DALL-E 2 (which generates images from a text prompt).

Large Language Models belong to the family of Foundation Models. They can be thought of as the subset of foundation models meant for tasks related to language. Please remember that foundation models are not restricted to language; they are also built to serve other modalities such as vision (images, video), sound (audio data such as speech), machine signals, etc.

Foundation models can be fine-tuned to accomplish many tasks such as summarization, text classification, sentiment analysis, information retrieval, text generation, speech transcription, image generation and many more. The point to highlight here is that foundation models are not just good at generation-related applications; we can very well use them for non-generative applications such as classification and prediction. For these traditional use cases, we just need to fine-tune the foundation model with a small set of labelled examples specific to the problem, and we can often get better results than traditional task-specific models while needing far less labelled data.
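To make that idea concrete, here is a minimal sketch of fine-tuning a small pretrained model for sentiment classification with just a handful of labelled examples, using the Hugging Face transformers library. The model name, the tiny dataset and the hyperparameters are all illustrative assumptions, not a production recipe.

```python
# A minimal fine-tuning sketch: adapt a pretrained foundation model to a
# small labelled classification task. Model, data and hyperparameters are
# illustrative assumptions only.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# A tiny labelled dataset specific to our problem (0 = negative, 1 = positive).
data = Dataset.from_dict({
    "text": ["The product works great", "Terrible support experience"],
    "label": [1, 0],
})

model_name = "distilbert-base-uncased"  # assumed small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the raw text so the model can consume it.
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data,
)
trainer.train()  # fine-tunes the pretrained weights on our labelled examples
```

In practice you would use a larger labelled set and a held-out validation split, but the workflow stays the same: start from pretrained weights and continue training on your own examples.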

As I mentioned earlier, Large Language Models (LLMs) gained extreme popularity with the advent of ChatGPT in late 2022. ChatGPT is a Large Language Model application developed by fine-tuning the foundation GPT model for conversational question answering.

LLMs are foundation models trained on humongous amounts of textual data. These models consist of billions, or even trillions, of parameters. In layman’s terms, we can think of model parameters as the skills the model acquires during training as it is fed examples. The more parameters, the more skills the model possesses, but that also comes at a higher inference cost. For the unaware, inference is the process of invoking the trained model to make predictions or generations.

LLMs are thus capable of generating text, and they can do other text-based tasks too. An LLM’s generative ability comes from predicting the most probable next word given a sequence of input text, based on all the examples it has seen during training.
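Here is a small sketch of what “predict the most probable next word” looks like in code, using GPT-2 as a stand-in for a large model. The prompt and the choice of GPT-2 are just for illustration; real LLMs do the same thing with far larger vocabularies and contexts.

```python
# Feed a prefix to a pretrained language model and inspect the
# highest-probability next tokens.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # scores for every vocabulary token at every position

# Probabilities over the vocabulary for the token that comes next.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10}  {prob.item():.3f}")
```

Generation is simply this step repeated: pick a next token, append it to the input, and predict again.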

Again, citing the example of ChatGPT: we have all asked it to write a poem given a context, complete a story given an introduction, answer questions, suggest solutions to an issue we describe, or generate ideas for creative writing, and I am sure many more such use cases. The LLM does all of that based on the data it has been trained on, and it is not just a huge volume of data but a large variety of data as well.

The diagram below illustrates how the age of Large Language Models took off with GPT-3, and how ever-larger models have been developed since. For example, GPT-4 (released in 2023, not shown in the picture below) is estimated to have around 1 trillion parameters.

Source: https://arxiv.org/abs/2310.05694

Most Popular LLMs

· GPT Models: The Generative Pre-trained Transformer (GPT) family of models, developed by OpenAI, has more than 175B parameters. ChatGPT is the most popular GPT-powered application, with the largest worldwide user base. Two flavours of GPT models are generally available now: GPT-3.5 Turbo and GPT-4.

· Gemini: The Gemini family of models is from Google. There are three types of models: Gemini Nano (~1.8 billion parameters), Gemini Pro, and Gemini Ultra. These models are designed to operate on different devices, from smartphones to dedicated servers, and they can handle not only text but also images, audio, video, code, and other kinds of information.

· PaLM 2: PaLM 2 is another LLM from Google, with ~340 billion parameters. It is proficient at many natural language tasks and powers many of Google’s AI features.

· Llama 2: Llama 2 is a very popular open-source LLM that is free for research and commercial use, and it serves as the base for many other LLMs. These models are from Meta, the parent company of Facebook. Llama 2 comes in three variants with 7 billion, 13 billion, and 70 billion parameters respectively.

· Claude 2: Claude 2, developed by a company called Anthropic, is built around the idea of constitutional AI. It is designed to be helpful, honest, harmless, and, crucially, safe for enterprise customers to use.

· Falcon: Falcon is another family of open-source LLMs that has consistently performed well on various AI benchmarks. The models come from the UAE’s Technology Innovation Institute (TII). The family includes models with up to 180 billion parameters that can outperform PaLM 2, Llama 2, and GPT-3.5 on some tasks. Falcon is released under a permissive Apache 2.0 license, so it is suitable for commercial and research use.

· Mixtral 8x7B: Mistral’s Mixtral 8x7B, despite having significantly fewer parameters (46.7 billion, and thus being able to run faster or on less powerful compute), outperforms many larger LLMs. It beats Llama 2 70B on most benchmarks with 6x faster inference, and it matches or outperforms GPT-3.5 on most standard benchmarks, making it one of the strongest open-weight models in terms of cost/performance trade-offs. It is also released under an Apache 2.0 license, so it is suitable for commercial and research use.

Applications/Tasks done using Large Language Models

Some common applications/tasks where these models can be used are:

· Content generation: LLMs are capable of creating high-quality content based on the instruction provided by the user (also called the prompt). Enterprises adopting LLMs can save a lot of time and resources by leveraging these models.

· Classification: LLMs can classify text into predefined categories: sorting articles into sports, editorial, politics and more, marking emails as spam or not spam, categorizing social media posts (harmful vs harmless content), classifying customer feedback, etc.

· Summarization: LLMs are excellent tools for summarizing long textual content, helping extract key information by providing concise summaries. In an age when we want to consume information as bite-sized nuggets, this is an extremely useful feature. Documents such as research papers are now easier to comprehend with the help of LLMs.

· Language Translation: Many businesses can benefit by leveraging the ability of LLMs to translate text accurately.

· Information Extraction: LLMs help identify and extract specific pieces of information from large volumes of unstructured data. This capability has become very popular through Retrieval Augmented Generation (RAG), which searches the end user’s knowledge base for relevant pieces of information and presents them in an easily consumable format (see the sketch after this list). This helps organizations efficiently look up and gain insights from large volumes of data.

· Chatbots: This has probably been the most popular application of LLMs. Chatbots are extremely useful for customer service and 24x7 helpdesk use cases, as they can engage with users very naturally.
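As promised above, here is a minimal sketch of the retrieval step behind RAG: find the passage in a small knowledge base that is most relevant to the user’s question, then place it in the prompt sent to the LLM. TF-IDF stands in for the embedding model a production system would use, the documents are made up, and call_llm is a hypothetical placeholder for whichever LLM API you actually use.

```python
# Minimal RAG retrieval sketch: rank documents by similarity to the question,
# then augment the prompt with the best match before calling the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "A refund is processed within 5 business days of approval.",
    "Premium support is available 24x7 for enterprise customers.",
    "Passwords can be reset from the account settings page.",
]

question = "How long does a refund take?"

# Embed the documents and the question, then rank documents by similarity.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)
query_vector = vectorizer.transform([question])
scores = cosine_similarity(query_vector, doc_vectors)[0]
best_passage = knowledge_base[scores.argmax()]

# Augment the prompt with the retrieved context before calling the model.
prompt = f"Answer using only this context:\n{best_passage}\n\nQuestion: {question}"
print(prompt)
# answer = call_llm(prompt)  # hypothetical call to your LLM of choice
```

A real deployment would use dense embeddings and a vector database instead of TF-IDF, but the pattern (retrieve, then generate) is the same.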

Now, let’s look at some of the advantages and disadvantages of this technology and what we need to be aware of to use it well.

First let’s look at the advantages:

· Performance: As mentioned above, these models have seen such a large volume and variety of data during training that they are really good at the tasks they are used for, leading to productivity gains.

· Automation: These models can automate various tasks and help speed up others. One excellent application is the customer service chatbot, where the first level of support (based on standard operating procedures or frequently asked questions) is handled by the model, while more critical queries that AI cannot address are taken up by a human, improving efficiency.

· Customizable: Foundation models are generic and can be fine-tuned to solve specific use cases. This requires additional training data specific to the problem or domain, which is then used for further training and fine-tuning to meet the unique requirements of that use case.

· Enhanced User Experience: These models make interactions with chatbots, virtual assistants, and search engines better. As a result, when an end user asks the system a query, the AI gives them a very natural conversational experience.

Having talked about some of the major advantages, here are some of the disadvantages associated with these models:

· Cost: The cost of training these large models is very high, as they have to be trained on petabytes of data. They require many high-end GPU-enabled machines, which makes it difficult for small enterprises to train their own models. They are also expensive at inference time, because these models have billions of parameters or more.

· Trust & Ethical Concerns: So far we have discussed the efficiency and performance of models trained on huge volumes of data, but that comes at a cost. The data used to train these models is a black box. We do not know whether these models have been trained on hate content or other undesirable content such as spam, harassment, offensive language, personal information, etc., and it is highly likely that they have, because the training data is scraped from the internet. This leads to trustworthiness issues. Additionally, any bias in the data trickles down into the predictions generated by these models.

· Hallucination: Another major challenge with these models is hallucination, i.e., the model gives fabricated answers that are not reliable or factual. The response of the LLM may be partially or totally incorrect, and most of the time these fabricated answers are conveyed in a convincing fashion. For example, ChatGPT, the most popular LLM application in town, powered by GPT models, can quite frequently generate incorrect code.

While the advantages are many, it is also important to understand the limitations and work to address them, so that the models we build are not only efficient but also trustworthy. There should be transparency about the data that goes into training these models to enable trust, and thorough end-to-end governance is critical at all stages of the model lifecycle, including post deployment. Reinforcement Learning from Human Feedback (RLHF) keeps a human in the loop to refine the model’s results and teach it appropriate behaviour through a reward-based system; OpenAI employs this technique to make ChatGPT better.
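To give a flavour of that reward-based idea, here is a toy illustration of the pairwise preference loss commonly used to train the reward model in RLHF: responses humans preferred should score higher than responses they rejected. The reward values below are made-up numbers standing in for a real reward model’s outputs; only the formula is the point.

```python
# Toy Bradley-Terry style preference loss used when training a reward model.
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Small when the human-preferred response scores higher than the rejected one."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

# The human ranked response A above response B; the reward model currently agrees.
print(preference_loss(reward_chosen=2.1, reward_rejected=0.4))  # ~0.17, low loss
# If the model scored the rejected answer higher, the loss would be large,
# pushing its parameters (via gradient descent) toward the human preference.
print(preference_loss(reward_chosen=0.4, reward_rejected=2.1))  # ~1.87, high loss
```

The trained reward model is then used as the feedback signal when further tuning the LLM itself, which is how human judgments end up shaping the model’s behaviour.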

Over time these LLMs will become smarter and smarter and should be able to solve many more complex problems. There is a lot of active research happening in this area. The latest is GPT-4o, released in May 2024, which is OpenAI’s most capable model to date and supports multimodality, meaning the model can reason across audio, text and vision in real time. OpenAI also announced a new text-to-video model called Sora in February 2024. Sora can create high-definition video clips from a short textual prompt and is set to transform the video generation arena. Sora is believed to be as transformative for video as ChatGPT has been for text, as OpenAI has surpassed previous attempts in this field by generating longer and higher-quality videos. It has not yet been released to the public and is undergoing testing by a select group of people.

If you are interested, go through the links below:

https://openai.com/index/hello-gpt-4o/

https://openai.com/sora


Lakshmi Narayanan

I am a motivated Data Scientist who is always eager to explore the latest advancements in the field. I am passionate about solving problems.