Solving your FOMO about everything in LLMs

Abhijith Neil Abraham
11 min read · Apr 4, 2024


Table of Contents

  • RAG (Retrieval Augmented Generation)
  • Document Embeddings, Vector Stores and Querying
  • Frameworks: LangChain and the Llama frameworks (LlamaIndex, llama.cpp, Ollama, and llamafile)
  • How to find the best LLMs?
  • Other Popular Terminologies
  • Frameworks for pre-training and fine-tuning LLMs
  • LLM Pipelines
  • Cloud Platforms for LLM deployment and usage
  • Where can I ask good-quality questions about LLMs?
  • Additional Resources

The whole LLM innovation has brought new interpretations to the AI landscape that didn’t exist previously. RAG (Retrieval Augmented Generation) is one of the most popular terms you might be seeing all across the internet now. Diving deeper into the LLM world, we also see the word “llama” gaining popularity in projects such as llama.cpp, LlamaIndex, etc. But how do you figure out what you should be looking for? In this blog, we will explain some concepts that will give you a complete overview of everything the LLM industry has been talking about and hyping up.

RAG (Retrieval Augmented Generation)

Imagine you’re in a library full of books, but you have a question that needs to be answered from a specific book. The librarian knows the perfect book for your question, so they consult it and give you a comprehensive answer based on their knowledge of that book. In the LLM scenario, the books are the dataset, and the librarian is the AI that comprehends the knowledge and responds with a simplified, understandable answer. This process is known as RAG (Retrieval Augmented Generation).

A RAG system coupled with a vector database. (Credits: https://www.determined.ai/blog/rag)

To understand how RAG works from a technical perspective, you first need to understand Document Embeddings and Vector Stores.

Document Embeddings, Vector Stores and Querying

Computers understand numbers better than words, so it is essential to convert documents into numerical representations that computers can work with. These collections of numbers that capture semantic meaning are called “vectors”. Vectors are typically in a format that resembles an array or a collection of arrays. Every word, sentence, or document that is converted has its own vector, distinguishable from the others, and these vectors also capture relationships between words, sentences, or documents.
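For a quick feel of what this looks like in code, here is a minimal sketch of turning sentences into embedding vectors. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, which are just example choices; any embedding model or API would do.

# A minimal sketch: turning text into embedding vectors.
# The sentence-transformers package and the model name are example choices, not requirements.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by 10%.",
]

# Each sentence becomes a fixed-length vector (384 dimensions for this particular model)
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)

# Semantically similar sentences end up with similar vectors
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity: same meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity: unrelated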

Once we have these numerical representations (embeddings), we need a place to store them, especially when we have a lot of documents. A vector store acts like a library for these embeddings. Just like a library organizes books in a way that makes it easy to find what you’re looking for, a vector store organizes embeddings so that the system can quickly and efficiently find the numerical representation of a document when needed.

One popular method of organizing, i.e., indexing these vector embeddings in a hierarchical fashion is HNSW indexing. Think of HNSW as a road map for a very large and complex city. The top layer of the map shows major highways that connect large parts of the city. As you go down into the layers, you see more and more local roads, until you’re looking at individual streets. When HNSW needs to find a specific point, it doesn’t start by checking every street; instead, it begins with the highways to get close to the area and then narrows down to local roads and streets, step by step, until it finds the exact location. This method makes finding a point much quicker than if you had to look at every street from the start.
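Here is a small sketch of building and querying an HNSW index using the hnswlib library; the library choice and the parameter values are assumptions for illustration, and FAISS or most vector databases expose very similar knobs.

# A minimal HNSW sketch with hnswlib; parameters are illustrative only.
import hnswlib
import numpy as np

dim = 384  # must match the dimension of your embeddings
doc_embeddings = np.random.rand(1000, dim).astype("float32")  # stand-in for real document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M: how many links each node keeps (roads per intersection);
# ef_construction: how thoroughly the graph is built.
index.init_index(max_elements=1000, ef_construction=200, M=16)
index.add_items(doc_embeddings, np.arange(1000))

# At query time the search starts at the top "highway" layer and works its way down.
index.set_ef(50)  # higher ef = more accurate but slower search
query_vector = np.random.rand(dim).astype("float32")
labels, distances = index.knn_query(query_vector, k=5)
print(labels, distances)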

In the context of RAG, when you perform a query on a RAG system combined with vector stores, it doesn’t just search for an answer blindly. Instead, it uses embeddings to turn your question into a numerical list that represents its meaning. Next, RAG uses this numerical list to search through a vector store. Because the store contains embeddings of many documents, RAG can quickly find which ones are most similar to your question. It’s like finding the best resources that match the topic of your inquiry. With the relevant documents identified, RAG reads through them and combines their information with what it already knows. This process helps it craft a response that’s not only based on its training but is also informed by the latest or most specific information available in the selected documents.
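Putting the pieces together, a bare-bones RAG query loop might look like the sketch below. The embedding model, the OpenAI model name, and the index and chunks variables (an HNSW index like the one above, plus the matching list of text chunks) are all assumptions; swap in whatever stack you actually use.

# A bare-bones RAG loop: embed the question, retrieve similar chunks, let the LLM answer.
# The embedding model, OpenAI model name, and the index/chunks objects are illustrative.
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question, index, chunks, k=3):
    # 1. Turn the question into an embedding
    q_vec = embedder.encode([question])
    # 2. Retrieve the k most similar document chunks from the vector index
    labels, _ = index.knn_query(q_vec, k=k)
    context = "\n\n".join(chunks[i] for i in labels[0])
    # 3. Ask the LLM to answer using only the retrieved context
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content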

Popular vector databases you can use in your RAG applications include Pinecone, Weaviate, Milvus, Qdrant, and Chroma; libraries such as FAISS and hnswlib can also serve as lightweight local vector indexes.

Frameworks: LangChain and the Llama frameworks (LlamaIndex, llama.cpp, Ollama, and llamafile)

LangChain: LangChain is one of the earliest frameworks to gain widespread popularity with the release of LLMs. These are the areas where LangChain helps you the most with your LLM applications:

  • LLM Interaction: Interacting directly with LLMs, using prompt templates, and optimizing and organizing your prompts.
  • Building Chains: Chains are a sequence of interconnected components that help you execute tasks in a specific order. In other words, instead of a single LLM API call, you can do multiple calls as a “chain”, which allows you to execute a complete logical sequence. The logical sequence can be defined and orchestrated manually in the chains.
  • Building Agents: Agents are given a specific goal, and the language model decides for itself how to reach it; every API call the agent makes is driven by the model’s own reasoning about the best course of action toward that goal.
  • Retrieval: Langchain allows connection from external data sources to LLMs and provides an interface for augmented generation.

A very basic code example of using LangChain:

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Initialize components
llm = OpenAI(temperature=0.9)
prompt = PromptTemplate(template="What is {subject}?", input_variables=["subject"])
chain = LLMChain(llm=llm, prompt=prompt)

# Execute the chain
result = chain.run(subject="gravity")
print(result)

LlamaIndex: This is one of the best tools acting as a data framework for your LLMs. You can connect your existing data sources (vector databases, APIs, PDFs, docs, SQL, and more), structure the data as you want it (indices, graphs, etc.), and query and retrieve over it. In simple words, you can build your RAG application where LlamaIndex helps you connect to amounts of data far beyond your LLM context length, then run your custom queries and get the responses you need, all in a few lines of code!

Here is an example of how you can do this on a local system:


import os

# Provide your OpenAI API key; LlamaIndex uses OpenAI by default for embeddings and generation
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load all documents from the directory and build an in-memory vector index over them
documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query the index: relevant chunks are retrieved and passed to the LLM to generate the answer
query_engine = index.as_query_engine()
response = query_engine.query("YOUR_QUESTION")
print(response)

llama.cpp:

You might want to play around with LLMs locally rather than always relying on API calls, especially when you want to integrate LLM reasoning capabilities directly into your own hardware. With llama.cpp, the aim is to enable LLM inference in your local setup or in the cloud with minimal effort and maximum performance. It also supports a wide variety of architectures and processors, including GPUs. A wide range of models, including LLaMA 🦙, LLaMA 2 🦙🦙, Mistral 7B, and Mixtral MoE, are supported for inference with llama.cpp.
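If you prefer Python over the C++ CLI, the llama-cpp-python bindings wrap the same engine. Here is a minimal sketch, assuming you have already downloaded a quantized GGUF model (the path below is a placeholder):

# A minimal sketch with the llama-cpp-python bindings around llama.cpp.
# The model path is a placeholder; download a GGUF file (e.g. from Huggingface) first.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # quantized GGUF model
    n_ctx=2048,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=64,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])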

Ollama: Similar to llama.cpp, Ollama also helps you run LLMs locally, with a neat CLI interface. The repo showcases several examples of neatly packaged projects built with the help of Ollama, including REST APIs, plugins, and extensions.
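Ollama also ships a Python client; here is a minimal sketch, assuming the Ollama server is running locally and you have already pulled a model (for example with ollama pull llama2):

# A minimal sketch with the ollama Python client.
# Assumes a running local Ollama server and an already-pulled model.
import ollama

response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])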

Llamafile: This framework strips away much of the complexity of running LLMs by packaging everything into a single-file executable, a “llamafile”, that runs locally on most computers with no installation. Like llama.cpp, it supports a wide range of architectures, processors, and GPUs.

How to find the best LLMs?

There are tons of LLMs out there, and it can be hard to keep track of everything. Finding the best LLM for your use case is not always straightforward; there are multiple metrics that define a good LLM. You can either pick an LLM from a leaderboard, or take a specific LLM for your domain and evaluate its performance on your use case.

  1. LLM Leaderboards:
  • HELM Leaderboard: This leaderboard includes both open-source and closed-source models, scored against benchmarks from different domains; example benchmarks are MedQA — EM for the medical domain and LegalBench — EM for legal-domain reasoning.
  • LMSYS Chat Arena Leaderboard: This widely used leaderboard is based on human preference votes, so you get to see which LLMs the community actually rates highest in head-to-head comparisons.
  • Huggingface Open LLM Leaderboard: Tracks and evaluates the open-source models uploaded to Huggingface.

2. Custom Evaluation:

Here are some custom evaluation metrics that you can refer to when evaluating your LLM.

  • G-Eval: General evaluation metric for defining any metric with free text.
  • Summarization: Evaluate the summarization capabilities of the LLM.
  • Hallucination: Measures the tendency of the LLM to generate incorrect or fabricated information.
  • Faithfulness: Assesses the faithfulness of the generated text to the input data.
  • Contextual Relevancy: Evaluates the relevance of the generated text to the context.
  • Answer Relevancy: Measures the relevance of the generated answer to the input query.
  • Contextual Recall: Assesses the recall of the generated text in relation to the context.
  • Contextual Precision: Measures the precision of the generated text in relation to the context.
  • RAGAS: Evaluate the performance of a RAG pipeline.
  • Bias: Measures the potential biases in the generated text.
  • Toxicity: Assesses the potential toxicity of the generated text.

Popular LLM evaluation frameworks that implement metrics like these include DeepEval, RAGAS, TruLens, and promptfoo.
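As a taste of what custom evaluation looks like in practice, here is a minimal sketch using DeepEval’s answer relevancy metric. The exact API may differ between versions, and the metric uses an LLM as a judge, so an OpenAI key is assumed to be set.

# A minimal sketch with DeepEval's answer relevancy metric.
# The metric calls an LLM judge under the hood, so OPENAI_API_KEY must be set.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Aspirin can cause stomach upset, heartburn, and in some cases bleeding.",
)

metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])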

Other Popular Terminologies

  1. Quantization:

To explain Quantization, we need to explain precision first.

Here’s a simple way to understand different levels of precision:

  • High Precision (32-bit floating-point): Imagine you’re writing down the exact amount of money you spent today, down to the last cent, using a very precise number like $123.456789. This level of detail is like using a 32-bit floating-point number, where you’re keeping track of a lot of information after the decimal point.
  • Lower Precision (8-bit integer): Now, if you were asked to round that number to the nearest dollar, you might say you spent $123. This rounding down is similar to reducing precision — you lose some of the detailed information (the exact cents), but you still have a general idea of the amount. In computer terms, this could be akin to an 8-bit integer representation.

Quantization in the context of Large Language Models (LLMs) is a technique used to reduce the precision of the model’s numerical values, which can significantly decrease the model’s size and speed up its computations, often with a very minimal loss in accuracy.

  • Before Quantization: The model’s weights are stored with high precision (like the detailed expense amount in the example above). These weights can represent very subtle differences because they’re stored as 32-bit floating-point numbers.
  • After Quantization: The weights are stored in a lower precision format (like rounding to the nearest dollar in the example above). This might be an 8-bit integer format where the range and granularity of values the weights can take are much more limited.
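To make this concrete, here is a toy sketch in plain NumPy that maps 32-bit float weights to 8-bit integers with a single scale factor and back again. Real quantization schemes (per-channel scales, zero points, calibration) are more involved; this is only meant to show the precision trade-off.

# A toy illustration of 8-bit quantization: map float32 weights to int8 and back.
import numpy as np

weights = np.random.randn(5).astype(np.float32)  # "detailed" 32-bit weights

scale = np.abs(weights).max() / 127.0  # one scale factor for the whole tensor
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # 8-bit storage
deq_weights = q_weights.astype(np.float32) * scale  # approximate reconstruction

print("original   :", weights)
print("quantized  :", q_weights)
print("dequantized:", deq_weights)
print("max error  :", np.abs(weights - deq_weights).max())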

2. Hallucination

Hallucinations in Large Language Models (LLMs) refer to instances where the model generates incorrect, fabricated, or nonsensical information that is not grounded in its training data or the input provided. These inaccuracies can manifest as minor details, entire passages of text, or even plausible-sounding but entirely fictitious content.

Source of Hallucinations:

  • Hallucinations often arise due to the model’s attempt to generate coherent and contextually appropriate responses based on patterns it learned during training. Since LLMs predict the next word or token based on the preceding context without true understanding or reasoning, they can generate content that seems plausible but is factually incorrect or irrelevant.

3. Multimodality

Multi-modality in Large Language Models (LLMs) refers to the capability of these models to understand, process, and generate information across various forms of data, such as text, images, audio, and video.

Applications:

  • Visual Question Answering: The model can answer specific questions about the content of an image or a video.
  • Image Captioning: Generating descriptive text for visual content.
  • Speech-to-Text and Text-to-Speech Conversion: Understanding spoken language and generating spoken language outputs from text, allowing more natural interaction.
  • Cross-Modal Retrieval: Searching for images based on text queries or vice versa, enabling richer information retrieval experiences.
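As one concrete example, visual question answering from the list above can be done through a multimodal chat API. The sketch below uses the OpenAI client with a vision-capable model; the model name and image URL are placeholders and may need updating for your setup.

# A minimal visual question answering sketch with a vision-capable chat model.
# The model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)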

4. LoRA (Low-Rank Adaptation):

LoRA (Low-Rank Adaptation) focuses on reducing the number of trainable parameters while fine-tuning models on specific tasks or domains, thereby reducing the model size significantly, and making it easier to deploy and perform inference faster.

  1. Pre-trained Model: You start with a large language model that has already been trained on a vast dataset. This model has learned a lot of information and can perform various tasks. However, it might not be specialized or optimized for a specific, new task you want it to perform.
  2. Low-Rank Matrices: Instead of retraining the entire model (which is time-consuming and resource-intensive), LoRA introduces small matrices into specific layers of the model. These matrices are much smaller than the original model weights and are the only components that get trained or adjusted.
  3. Adaptation: When you train these low-rank matrices, they adapt the model’s behavior in a targeted way. This allows the model to learn new tasks or improve its performance on specific tasks without altering its core knowledge base.
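In code, libraries like Huggingface’s PEFT make these three steps only a few lines. Here is a minimal sketch; the base model, target modules, and hyperparameters are illustrative assumptions, not recommendations.

# A minimal LoRA sketch with Huggingface's PEFT library.
# The base model, target modules, and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank matrices
    lora_alpha=16,              # scaling factor for the LoRA updates
    target_modules=["c_attn"],  # which layers get the extra matrices (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters are trainable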

Frameworks for pre-training and fine-tuning LLMs

Pre-training:

Pre-training LLMs from scratch is a huge task, and some large players in the market have come up with frameworks that do for LLMs what TensorFlow and PyTorch did for smaller models, while still using PyTorch/TensorFlow abstractions under the hood to build LLMs.

  1. DeepSpeed by Microsoft: An open-source deep learning optimization library that provides advanced model training and inference capabilities, focusing on speed, scale, and usability.
  2. Megatron-LM by NVIDIA: A framework developed by NVIDIA, designed to facilitate training large-scale transformer language models efficiently and effectively.
  3. GPT-NeoX by EleutherAI: An initiative by EleutherAI to replicate and extend GPT-3-like models, providing open-source alternatives with scalability in mind.
  4. LLM Foundry by MosaicML: A framework aimed at simplifying the process of training and deploying large language models, enhancing accessibility and manageability.
  5. Ludwig by Ludwig AI: A toolbox that allows you to train and test deep learning models without writing code, emphasizing simplicity and flexibility.
  6. LLM-Pretrain-SFT by XYJigsaw: A repository focusing on Supervised Fine-Tuning (SFT) for pretraining language models, facilitating customization and optimization.
  7. TinyLlama by JZhang38: This project offers a compact model with a guide on pretraining smaller language models, ideal for environments with limited resources.

Fine-Tuning:

You can fine-tune models more easily by combining any of the above pre-training frameworks with an open-source model pulled from Huggingface, and then configuring a YML file with everything you need.

Axolotl is just the tool you need for this.

An example YML file looks like this:

# use google/gemma-7b if you have access
base_model: mhenrichsen/gemma-7b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

# huggingface repo
datasets:
- path: mhenrichsen/alpaca_2k_test
type: alpaca
val_set_size: 0.1
output_dir: ./out

adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:


gradient_accumulation_steps: 3
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
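With a config like this in place, training is typically launched through Axolotl’s CLI, for example with accelerate launch -m axolotl.cli.train your_config.yml; the exact invocation may vary between Axolotl versions and your accelerate setup.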

LLM Pipelines

LLM Pipeline Examples by Google Cloud Platform: This repository provides comprehensive examples and guides on the end-to-end process of working with Large Language Models, from pre-training to fine-tuning smaller models.

Cloud Platforms for LLM deployment and usage:

Microsoft, Google, and AWS are leading the way in services for LLM deployment, forming partnerships with LLM providers such as OpenAI, Mistral, etc. There are also newer players in the market, such as Together AI and Runpod, that provide low-cost APIs for connecting to LLMs.

Where can I ask good-quality questions about LLMs?

Communities such as the Hugging Face forums, the r/LocalLLaMA and r/MachineLearning subreddits, and the Discord servers of projects like LangChain and LlamaIndex are good places to ask (and search for) LLM questions.

Additional Resources:

  1. Data for LLMs: Navigating the LLM Data Pipeline
  2. What I learned from looking at 900 most popular open source AI tools
