Data Science AI Assistant with Gemma 2b-it: a RAG 101

Luca Massaron
25 min read · May 9, 2024

Demystifying Retrieval-Augmented Generation pipelines by building a data science assistant

credits: DALL·E 3

If you want to see the code running using a Kaggle notebook: https://www.kaggle.com/code/lucamassaron/data-science-ai-assistant-with-gemma-2b-it

For the Kaggle competition “Google — AI Assistants for Data Tasks with Gemma,” I’ve prepared an AI Assistant to accomplish the task of “Explaining or teaching basic data science concepts,” as the competition required.

The project has also been an occasion to explain how a basic retrieval-augmented generation (RAG) system works: the role of the data, which constitutes the backbone of the system; the function of embeddings and distance measures; how to retrieve the information relevant to answering a question; and how to process that information, first with a distillation prompt and then by assembling the answer the user asked for in a meaningful and useful way.

In this project, the lion’s share of the work is done by Gemma, the state-of-the-art open LLM released by Google, in its 2b-it version, the smallest in terms of parameters. Gemma is not the only Google technology in the project: I also use ScaNN (ScaNN Github repository) to recall information. Apart from Gemma, ScaNN, and the Hugging Face packages for transformers and embeddings, there are no ready-made solutions such as vector stores or RAG frameworks. You can actually see how everything works under the hood and, if you like it, reuse it for your own projects.

1. What is a RAG, and how can it help to explain or teach basic data science concepts?

A Retrieval-Augmented Generation (RAG) system improves the text generation of a large language model by grounding its answers in knowledge retrieved from an external source.

Hence, it combines a retriever to fetch relevant information and a generator to produce accurate responses based on this retrieved knowledge. Basically, it is just like running a search engine query (the retriever), taking the best results, and then asking a large language model such as Gemma or Gemini to process that information (the generator) and answer the initial question.

Such an approach ensures AI models have access to up-to-date and relevant facts, improving the quality and reliability of their generated text, especially in tasks like question-answering where factual accuracy is crucial, and LLMs are infamous for sometimes coming up with made-up information (hallucinations).

In this case, Google Gemma already seems quite adept at answering basic questions about data science, but the idea is to further improve its competence by providing it with reliable information about AI, statistics, machine learning, and data science in general.
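
To make the flow concrete before any real code, here is a minimal sketch of a RAG pipeline in Python. The retrieve and generate callables are hypothetical placeholders; the rest of this notebook builds their concrete counterparts (a ScaNN-backed retriever and Gemma as the generator).

# A minimal, illustrative sketch of the RAG flow (not the notebook's actual code):
# `retrieve` and `generate` are hypothetical placeholders for the retriever and
# for Gemma, which are implemented for real later in this notebook.
def answer_with_rag(question, knowledge_base, retrieve, generate, top_k=5):
    """Retrieve relevant passages, then let the LLM answer using them as context."""
    # 1. Retriever: fetch the passages most similar to the question
    passages = retrieve(question, knowledge_base, top_k=top_k)

    # 2. Generator: ask the LLM to answer, grounded in the retrieved context
    prompt = f"Context: {' '.join(passages)}\nQuestion: {question}\nAnswer:"
    return generate(prompt)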

2. Setting up the necessary stuff

In the first cell of this notebook, some key packages for the project are installed or updated to the latest version:

  1. The first command installs or upgrades the torch package quietly, specifying version compatibility for CUDA 11.7 from the PyTorch repository.
  2. The second command installs or upgrades the transformers package to version 4.38.2, a popular library for natural language processing tasks.
  3. The third command installs the accelerate package, Hugging Face’s library for device placement and efficient model execution, which is needed to load the quantized model.
  4. The fourth command installs the bitsandbytes package from the standard PyPI index; it enables the 4-bit quantization used later to load Gemma.
  5. The fifth command installs or upgrades the sentence_transformers package, providing pre-trained sentence embedding models.
  6. The sixth command installs or upgrades the scann package, Google’s library for efficient approximate nearest neighbor search.
  7. The seventh command installs or upgrades the wikipedia-api package, which provides a programmatic interface for interacting with Wikipedia data.
!pip install -q -U torch --index-url https://download.pytorch.org/whl/cu117
!pip install -q -U transformers=="4.38.2"
!pip install -q accelerate
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q -U sentence_transformers
!pip install -q -U scann
!pip install -q -U wikipedia-api

In the next cell, the code uses the os module to set some environment variables. The first line sets the CUDA_VISIBLE_DEVICES variable to “0,” which instructs CUDA-enabled applications to use only the GPU with index 0 for computation, which is useful for managing GPU resources in multi-GPU systems. The second line sets TOKENIZERS_PARALLELISM to “false,” disabling parallelism in the Hugging Face Tokenizers library, which is potentially useful for troubleshooting or ensuring single-threaded execution. These environment variable configurations help control GPU usage and tokenizer behavior within the Python environment where this code is executed.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Moreover, since warnings may occur when using new versions of Python packages (aligning versions is often a task in itself), the following cell imports the warnings package and suppresses warnings during this session.

import warnings
warnings.filterwarnings("ignore")

In the next cell, the notebook loads Python libraries and modules for natural language processing tasks. It also includes libraries like re for regular expressions, NumPy and pandas for data manipulation, tqdm for progress bars, scann for approximate nearest neighbor search, and wikipediaapi for accessing Wikipedia content (yes, we are going to use Wikipedia as a knowledge base).

import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import scann
import wikipediaapi

import torch

import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
)
from sentence_transformers import SentenceTransformer
import bitsandbytes as bnb

3. Proceeding by building blocks

Before proceeding with the notebook, I need to explain how I will build the solution in a way that is clear, easily explainable, and both reusable and hackable.

The AI assistant will be a class containing all you need for it to work and with methods for changing some settings (such as the temperature, which corresponds to its creativity, or the impersonated role, which influences how it responds) and for asking questions.

All the helper functions used by the class, however, are defined outside it. This makes them easier to present as stand-alone code snippets, easier to reuse in different projects, and easier to upgrade or hack. As you change an external function, you immediately change the behavior of the class without having to re-instantiate it (re-indexing the whole knowledge base takes some time, which would slow down experimentation).

Here, as a first piece of code, the next cell presents a function that returns the device onto which the model and the data will be mapped when working with the PyTorch library (used under the hood by the Hugging Face packages). It works on a CPU-only machine, on a machine with an NVIDIA GPU, and on macOS with MPS.

def define_device():
    """Define the device to be used by PyTorch"""

    # Get the PyTorch version
    torch_version = torch.__version__

    # Print the PyTorch version
    print(f"PyTorch version: {torch_version}", end=" -- ")

    # Check if the MPS (Metal Performance Shaders) backend is available on macOS
    if torch.backends.mps.is_available():
        # If MPS is available, print a message indicating its usage
        print("using MPS device on MacOS")
        # Define the device as MPS
        defined_device = torch.device("mps")
    else:
        # If MPS is not available, determine the device based on GPU availability
        defined_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Print a message indicating the selected device
        print(f"using {defined_device}")

    # Return the defined device
    return defined_device

The next cells present two functions built on the SentenceTransformers package (the package home page). These functions take lists of texts and map them into embeddings.

Embeddings, such as those processed by packages like SentenceTransformers, are numerical representations of text or sentences that capture their semantic meaning. These embeddings are created by transforming words or sentences into high-dimensional vectors, where similar vectors represent similar meanings.

In the context of SentenceTransformers, these embeddings are generated using models like BERT or XLNet that have been fine-tuned to produce meaningful sentence representations. These embeddings can be used for various tasks like clustering, semantic textual similarity, and information retrieval (in our project, we actually need a retrieval function) by comparing the vectors using metrics like cosine similarity.
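
To make this tangible, the short standalone snippet below (purely illustrative, not part of the assistant) embeds two sentences with the same gte-large model used later on and measures how close they are with cosine similarity. The two functions that follow are the ones actually used by the assistant.

# Illustrative snippet (not part of the assistant's code): embed two sentences
# and compare them with cosine similarity. The model name matches the one used
# later in the notebook.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-large")

sentences = ["Gradient boosting builds trees sequentially.",
             "Random forests average many independently grown trees."]
vectors = embedder.encode(sentences)

# Cosine similarity: dot product of the vectors divided by the product of their norms
cosine = np.dot(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(f"Cosine similarity: {cosine:.3f}")  # values closer to 1.0 mean closer meanings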

def get_embedding(text, embedding_model):
    """Get embeddings for a given text using the provided embedding model"""

    # Encode the text to obtain embeddings using the provided embedding model
    embedding = embedding_model.encode(text, show_progress_bar=False)

    # Convert the embeddings to a list of floats and return
    return embedding.tolist()

def map2embeddings(data, embedding_model):
    """Map a list of texts to their embeddings using the provided embedding model"""

    # Initialize an empty list to store embeddings
    embeddings = []

    # Iterate over each text in the input data list
    no_texts = len(data)
    print(f"Mapping {no_texts} pieces of information")
    for i in tqdm(range(no_texts)):
        # Get embeddings for the current text using the provided embedding model
        embeddings.append(get_embedding(data[i], embedding_model))

    # Return the list of embeddings
    return embeddings

The next cell contains a simple function that removes artifacts such as special tokens, double asterisks, or redundant spaces that sometimes appear in the output of large language models.

def clean_text(txt, EOS_TOKEN):
    """Clean text by removing specific tokens and redundant spaces"""
    txt = (txt
           .replace(EOS_TOKEN, "")  # Remove the end-of-sequence token
           .replace("**", "")       # Remove double asterisks
           .replace("<pad>", "")    # Remove "<pad>" tokens
           .replace("  ", " ")      # Replace double spaces with single spaces
          ).strip()                 # Strip leading and trailing spaces from the text
    return txt

The following function, instead, simply adds an indefinite article to a role name, which is useful for making a prompt nicer and easier to read.

def add_indefinite_article(role_name):
    """Check if a role name already starts with an article, and if not, add the correct indefinite one"""

    # Check if the first word is already an article
    articles = ["a", "an", "the"]
    words = role_name.split()
    if words[0].lower() not in articles:
        # Use "a" or "an" based on the first letter of the role name
        indefinite_article = "an" if words[0][0].lower() in "aeiou" else "a"
        role_name = f"{indefinite_article} {role_name}"

    return role_name

After the previous functions, mostly devoted to processing text for better readability, the next class first helps load and initialize Gemma by quantizing it to 4-bit, reducing its memory footprint and allowing for faster responses, and then generates text from it. Gemma is the core of our generative functions, making it a crucial element for processing information and returning it to the user in the most usable and useful form.

The GemmaHF class serves as a wrapper for the Transformers implementation of Gemma. Upon initialization, it sets up the model and tokenizer using the specified model name and a maximum sequence length for the tokenizer.

In short, the method initialize_model is designed to set up and configure a 4-bit quantized causal language model (LLM) and tokenizer. It begins by defining the data type for computation as float16. Then, it creates a configuration for quantization using the BitsAndBytesConfig class with settings for 4-bit quantization. The function loads a pre-trained model (Gemma 2b-it in the project, but you can also try the 7b version) with the specified quantization configuration. It also loads a tokenizer with the selected device mapping and maximum sequence length settings. Finally, the method returns the initialized model and tokenizer, which are ready for use by our AI assistant.

Finally, its generate_text method takes a prompt as input and generates a text using the instantiated tokenizer and model, allowing for customization of parameters such as maximum new tokens and temperature for sampling. Under the hood, it encodes the prompt, generates text based on it, decodes the output into text, and returns a list of generated text results.

class GemmaHF():
    """Wrapper for the Transformers implementation of Gemma"""

    def __init__(self, model_name, max_seq_length=2048):
        self.model_name = model_name
        self.max_seq_length = max_seq_length

        # Initialize the model and tokenizer
        print("\nInitializing model:")
        self.device = define_device()
        self.model, self.tokenizer = self.initialize_model(self.model_name, self.device, self.max_seq_length)

    def initialize_model(self, model_name, device, max_seq_length):
        """Initialize a 4-bit quantized causal language model (LLM) and tokenizer with specified settings"""

        # Define the data type for computation
        compute_dtype = getattr(torch, "float16")

        # Define the configuration for quantization
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=compute_dtype,
        )

        # Load the pre-trained model with quantization configuration
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map=device,
            quantization_config=bnb_config,
        )

        # Load the tokenizer with specified device and max_seq_length
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            device_map=device,
            max_seq_length=max_seq_length
        )

        # Return the initialized model and tokenizer
        return model, tokenizer

    def generate_text(self, prompt, max_new_tokens=2048, temperature=0.0):
        """Generate text using the instantiated tokenizer and model with specified settings"""

        # Encode the prompt and convert to PyTorch tensors
        input_ids = self.tokenizer(prompt, return_tensors="pt", padding=True).to(self.device)

        # Determine if sampling should be performed based on temperature
        do_sample = True if temperature > 0 else False

        # Generate text based on the input prompt
        outputs = self.model.generate(**input_ids,
                                      max_new_tokens=max_new_tokens,
                                      do_sample=do_sample,
                                      temperature=temperature)

        # Decode the generated output into text
        results = [self.tokenizer.decode(output) for output in outputs]

        # Return the list of generated text results
        return results

Here we arrive at the core generative function (so far, we have only initialized the generative engine, Gemma).

The generate_summary_and_answer function generates an answer for a given question using context from a dataset. It embeds the input question (using the get_embedding function we saw previously), finds similar contexts in the dataset, and extracts the relevant context based on the similarity indices. It then builds a prompt for summarizing that context and a second prompt for producing the answer, delegating text generation to the generative method of a "model" object, which can be any wrapper around a Gemma implementation (Hugging Face Transformers, Keras, Gemma C++, or any other available). Afterward, the function cleans the generated summary and answer and returns the cleaned answer. In short, it is a sequence of steps that produces an informative response starting from an input question and the knowledge base data previously provided.

The two-step processing of the information retrieved from the knowledge base is necessary because retrieval based on embedded vectors sometimes returns irrelevant information. Embeddings are high-dimensional representations, and the distance measures and methods used to find the documents most similar to your question are often approximate for performance reasons, which can produce unexpected retrieval results. Summarizing the relevant information first, a task that Gemma executes with prowess, yields a shorter, more compact, and surely more relevant context for the second step, in which Gemma writes the answer to your question.

In this process, the temperature (the level of creativity) and the role may result in different answers and different answering styles. I decided to rely on the “expert data scientist” role, but you may opt for the “ELI5 divulgator” or the “verbose scholarly narrator” (at your own risk XD).

Finally, notice the part of the generative prompt that says: “If the context doesn’t provide any relevant information, answer with <I couldn’t find a good match in my knowledge base for your query. Hence, I answer based on my own knowledge>.” This serves both to keep the assistant useful and to alert the user that the answer may be peculiar when the question is off-topic, too difficult, or lacking sufficient information in the knowledge base.

def generate_summary_and_answer(question, data, searcher, embedding_model, model,
                                max_new_tokens=2048, temperature=0.4, role="expert"):
    """Generate an answer for a given question using context from a dataset"""

    # Embed the input question using the provided embedding model
    embedded_question = np.array(get_embedding(question, embedding_model)).reshape(1, -1)

    # Find similar contexts in the dataset based on the embedded question
    neighbors, distances = searcher.search_batched(embedded_question)

    # Extract context from the dataset based on the indices of similar contexts
    context = " ".join([data[pos] for pos in np.ravel(neighbors)])

    # Get the end-of-sequence token from the tokenizer (fall back to a default)
    try:
        EOS_TOKEN = model.tokenizer.eos_token
    except AttributeError:
        EOS_TOKEN = "<eos>"

    # Add an indefinite article to the role, if needed
    role = add_indefinite_article(role)

    # Generate a prompt for summarizing the context
    prompt = f"""
    Summarize this context: "{context}" in order to answer the question "{question}" as {role}\
    SUMMARY:
    """.strip() + EOS_TOKEN

    # Generate a summary based on the prompt
    results = model.generate_text(prompt, max_new_tokens, temperature)

    # Clean the generated summary
    summary = clean_text(results[0].split("SUMMARY:")[-1], EOS_TOKEN)

    # Generate a prompt for providing an answer
    prompt = f"""
    Here is the context: {summary}
    Using the relevant information from the context
    and integrating it with your knowledge,
    provide an answer as {role} to the question: {question}.
    If the context doesn't provide
    any relevant information answer with
    [I couldn't find a good match in my
    knowledge base for your question,
    hence I answer based on my own knowledge] \
    ANSWER:
    """.strip() + EOS_TOKEN

    # Generate an answer based on the prompt
    results = model.generate_text(prompt, max_new_tokens, temperature)

    # Clean the generated answer
    answer = clean_text(results[0].split("ANSWER:")[-1], EOS_TOKEN)

    # Return the cleaned answer
    return answer

4. Wrapping up everything

At this point, the next cell wraps all the functions into an AIAssistant class.

The AIAssistant class implements an AI assistant that interacts with users by providing answers based on a given knowledge base (basically, a list of texts containing the knowledge).

Upon initialization, the class stores the Gemma wrapper and loads the embedding model. Its learn_knowledge_base method then maps the knowledge base to embeddings and indexes it for efficient similarity search, building a searcher with the ScaNN library. The class also includes methods to query the knowledge base, adjust the assistant’s temperature (creativity), and define its answering style.

  • The query function generates and prints an answer to a user query by utilizing the generate_summary_and_answer function.
  • The set_temperature function allows adjusting the assistant's creativity level, while the set_role function defines the answering style of the AI assistant.

This class wraps together the functionality of an AI assistant that combines embeddings, a powerful language model such as Gemma, and similarity search to provide informative responses to user queries based on a predefined knowledge base.

Here are a few notes about ScaNN. ScaNN (Scalable Nearest Neighbors) is a library developed by Google Research that offers efficient and scalable nearest neighbor search. Much of its speed comes from techniques such as partitioning and quantization, and in particular from anisotropic vector quantization.

Anisotropic vector quantization is a quantization scheme that penalizes the component of the quantization error parallel to the original datapoint more heavily than the orthogonal component. Since the parallel component is the one that most affects inner-product (dot-product) scores, this weighting preserves the ranking of the best candidates and improves retrieval accuracy in maximum inner product search. You can read everything about this method in the paper:

Guo, Ruiqi, et al. “Accelerating large-scale inference with anisotropic vector quantization.” International Conference on Machine Learning. PMLR, 2020. (Paper Link)

or by browsing the code repository at https://github.com/google-research/google-research/tree/master/scann

What is interesting to note is that in my solution, I do not use the cosine distance but simply the dot product, as suggested by this paper:

Steck, Harald, Chaitanya Ekanadham, and Nathan Kallus. “Is Cosine-Similarity of Embeddings Really About Similarity?.” arXiv preprint arXiv:2403.05440 (2024). (Paper Link)

And it works pretty well!
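
A tiny numeric check (illustrative only, not part of the assistant) shows why this choice is reasonable: on L2-normalized vectors the dot product and cosine similarity coincide exactly, so with unnormalized embeddings, as in this notebook, the two measures differ only by the vector magnitudes. The AIAssistant class below then puts everything together.

# Illustrative only: on L2-normalized vectors, dot product and cosine similarity
# give identical scores; with unnormalized embeddings they differ only by the
# vector magnitudes.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8)).astype(np.float32)   # five toy "document" vectors
query = rng.normal(size=(8,)).astype(np.float32)    # one toy "query" vector

# Normalize the vectors to unit length
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

dot_scores = docs_n @ query_n
cos_scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print(np.allclose(dot_scores, cos_scores))  # True: same scores, same ranking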

class AIAssistant():
    """An AI assistant that interacts with users by providing answers based on a provided knowledge base"""

    def __init__(self, gemma_model, embeddings_name="thenlper/gte-large", temperature=0.4, role="expert"):
        """Initialize the AI assistant."""
        # Initialize attributes
        self.embeddings_name = embeddings_name
        self.knowledge_base = []
        self.temperature = temperature
        self.role = role

        # Initialize the Gemma model (it can be Transformers-based or any other implementation)
        self.gemma_model = gemma_model

        # Load the embedding model
        self.embedding_model = SentenceTransformer(self.embeddings_name)

    def store_knowledge_base(self, knowledge_base):
        """Store the knowledge base"""
        self.knowledge_base = knowledge_base

    def learn_knowledge_base(self, knowledge_base):
        """Store and index the knowledge base to be used by the assistant"""
        # Store the knowledge base
        self.store_knowledge_base(knowledge_base)

        # Map and index the knowledge base
        print("Indexing and mapping the knowledge base:")
        embeddings = map2embeddings(self.knowledge_base, self.embedding_model)
        self.embeddings = np.array(embeddings).astype(np.float32)

        # Instantiate the searcher for similarity search
        self.index_embeddings()

    def index_embeddings(self):
        """Index the embeddings using ScaNN"""
        self.searcher = (scann.scann_ops_pybind.builder(db=self.embeddings, num_neighbors=10, distance_measure="dot_product")
                         .tree(num_leaves=min(self.embeddings.shape[0] // 2, 1000),
                               num_leaves_to_search=100,
                               training_sample_size=self.embeddings.shape[0])
                         .score_ah(2, anisotropic_quantization_threshold=0.2)
                         .reorder(100)
                         .build())

    def query(self, query):
        """Query the knowledge base of the AI assistant."""
        # Generate and print an answer to the query
        answer = generate_summary_and_answer(query,
                                             self.knowledge_base,
                                             self.searcher,
                                             self.embedding_model,
                                             self.gemma_model,
                                             temperature=self.temperature,
                                             role=self.role)
        print(answer)

    def set_temperature(self, temperature):
        """Set the temperature (creativity) of the AI assistant."""
        self.temperature = temperature

    def set_role(self, role):
        """Define the answering style of the AI assistant."""
        self.role = role

    def save_embeddings(self, filename="embeddings.npy"):
        """Save the embeddings to disk"""
        np.save(filename, self.embeddings)

    def load_embeddings(self, filename="embeddings.npy"):
        """Load the embeddings from disk and index them"""
        self.embeddings = np.load(filename)
        # Re-instantiate the searcher
        self.index_embeddings()

5. Providing the knowledge base from Wikipedia

I decided to retrieve some information from Wikipedia to provide a knowledge base for the AI Assistant to work confidently with data science questions.

Why Wikipedia?

Wikipedia provides a vast and diverse range of information on various topics, making it a rich context data source. Its structured organization, thanks to the Wikipedia API interface, allows for easy extraction and processing.

The following code starts with two helper functions that clean the text from tags and formatting. The extract_wikipedia_pages function then extracts references, such as pages or other Wikipedia categories, from a given category, and the get_wikipedia_pages function crawls all the pages and information related to the initial Wikipedia categories.

# Pre-compile the regular expression pattern for better performance
BRACES_PATTERN = re.compile(r'\{.*?\}|\}')

def remove_braces_and_content(text):
    """Remove all occurrences of curly braces and their content from the given text"""
    return BRACES_PATTERN.sub('', text)

def clean_string(input_string):
    """Clean the input string."""

    # Remove extra spaces by splitting the string by spaces and joining back together
    cleaned_string = ' '.join(input_string.split())

    # Remove consecutive carriage return characters until there are no more consecutive occurrences
    cleaned_string = re.sub(r'\r+', '\r', cleaned_string)

    # Remove all occurrences of curly braces and their content from the cleaned string
    cleaned_string = remove_braces_and_content(cleaned_string)

    # Return the cleaned string
    return cleaned_string
def extract_wikipedia_pages(wiki_wiki, category_name):
    """Extract all references from a category on Wikipedia"""

    # Get the Wikipedia page corresponding to the provided category name
    category = wiki_wiki.page("Category:" + category_name)

    # Initialize an empty list to store page titles
    pages = []

    # Check if the category exists
    if category.exists():
        # Iterate through each article in the category and append its title to the list
        for article in category.categorymembers.values():
            pages.append(article.title)

    # Return the list of page titles
    return pages
def get_wikipedia_pages(categories):
    """Retrieve Wikipedia pages from a list of categories and extract their content"""

    # Create a Wikipedia object
    wiki_wiki = wikipediaapi.Wikipedia('Gemma AI Assistant (gemma@example.com)', 'en')

    # Initialize lists to store explored categories and Wikipedia pages
    explored_categories = []
    wikipedia_pages = []

    # Iterate through each category
    print("- Processing Wikipedia categories:")
    for category_name in categories:
        print(f"\tExploring {category_name} on Wikipedia")

        # Get the Wikipedia page corresponding to the category
        category = wiki_wiki.page("Category:" + category_name)

        # Extract Wikipedia pages from the category and extend the list
        wikipedia_pages.extend(extract_wikipedia_pages(wiki_wiki, category_name))

        # Add the explored category to the list
        explored_categories.append(category_name)

    # Extract subcategories and remove duplicate categories
    categories_to_explore = [item.replace("Category:", "") for item in wikipedia_pages if "Category:" in item]
    wikipedia_pages = list(set([item for item in wikipedia_pages if "Category:" not in item]))

    # Explore subcategories recursively
    while categories_to_explore:
        category_name = categories_to_explore.pop()
        print(f"\tExploring {category_name} on Wikipedia")

        # Extract more references from the subcategory
        more_refs = extract_wikipedia_pages(wiki_wiki, category_name)

        # Iterate through the references
        for ref in more_refs:
            # Check if the reference is a category
            if "Category:" in ref:
                new_category = ref.replace("Category:", "")
                # Add the new category to the explored categories list
                if new_category not in explored_categories:
                    explored_categories.append(new_category)
            else:
                # Add the reference to the Wikipedia pages list
                if ref not in wikipedia_pages:
                    wikipedia_pages.append(ref)

    # Initialize a list to store extracted texts
    extracted_texts = []

    # Iterate through each Wikipedia page
    print("- Processing Wikipedia pages:")
    for page_title in tqdm(wikipedia_pages):
        try:
            # Make a request to the Wikipedia page
            page = wiki_wiki.page(page_title)

            # Check if the page summary does not contain certain keywords
            if "Biden" not in page.summary and "Trump" not in page.summary:
                # Append the page title and summary to the extracted texts list
                if len(page.summary) > len(page.title):
                    extracted_texts.append(page.title + " : " + clean_string(page.summary))

                # Iterate through the sections in the page
                for section in page.sections:
                    # Append the page title and section text to the extracted texts list
                    if len(section.text) > len(page.title):
                        extracted_texts.append(page.title + " : " + clean_string(section.text))

        except Exception as e:
            print(f"Error processing page {page_title}: {e}")

    # Return the extracted texts
    return extracted_texts

To develop an AI assistant capable of answering questions about data science, I’ve chosen to begin with topics such as machine learning, data science, statistics, deep learning, and artificial intelligence. As evident from the output, the range of topics is truly impressive, even for a seasoned data scientist!

categories = ["Machine_learning", "Data_science", "Statistics", "Deep_learning", "Artificial_intelligence"]
extracted_texts = get_wikipedia_pages(categories)
print("Found", len(extracted_texts), "Wikipedia pages")

In the end, the script processes 16161 Wikipedia pages.

As a last step, the extracted knowledge base is saved to disk for later usage.

wikipedia_data_science_kb = pd.DataFrame(extracted_texts, columns=["wikipedia_text"])
wikipedia_data_science_kb.to_csv("wikipedia_data_science_kb.csv", index=False)
wikipedia_data_science_kb.head()

6. A test run

We are now ready to test our AI assistant!

We instantiate it using the Gemma 2b-it and the gte-large embeddings and provide Wikipedia extracts as a knowledge base.

The General Text Embeddings (GTE) model is a BERT-based embedding model developed by the Alibaba DAMO Academy. It is available in three versions (large, base, and small) and is specifically designed for English text. Compared with other embedding models, the gte-large variant demonstrates superior performance in retrieval tasks, but it also needs more storage space for its embedding vectors than its competitors (we do not worry much about that here because ScaNN stays fast for this application).
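
If you want to check the storage footprint yourself, a quick optional snippet like the one below prints the embedding size; the 1024 dimensions I expect for gte-large are an assumption worth confirming on your installed model version.

# Optional check: print the size of the vectors produced by gte-large.
# (1024 dimensions is the expected value, but confirm it on your model version.)
from sentence_transformers import SentenceTransformer

gte = SentenceTransformer("thenlper/gte-large")
print(gte.get_sentence_embedding_dimension())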

The instantiation will take a short while, and then you can ask the AI assistant a few questions.

# Initialize the name of the embeddings and model
embeddings_name = "thenlper/gte-large"
model_name = "/kaggle/input/gemma/transformers/2b-it/1"

# Create an instance of AIAssistant with specified parameters
gemma_ai_assistant = AIAssistant(gemma_model=GemmaHF(model_name), embeddings_name=embeddings_name)

# Map the intended knowledge base to embeddings and index it
gemma_ai_assistant.learn_knowledge_base(knowledge_base=extracted_texts)

# Save the embeddings to disk (for later use)
gemma_ai_assistant.save_embeddings()

# Set the temperature (creativity) of the AI assistant and set the role
gemma_ai_assistant.set_temperature(0.0)
gemma_ai_assistant.set_role("data science expert whose explanations are useful, clear and complete")

Let’s start with a warm-up question: “What is the difference between data science, machine learning, and artificial intelligence?”

gemma_ai_assistant.query("What is the difference between data science, machine learning, and artificial intelligence?")

Here is the answer:

Sure, here’s the difference between data science, machine learning, and artificial intelligence:

Data Science is an interdisciplinary field that focuses on extracting knowledge and insights from data in various forms, both structured and unstructured. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.

Machine Learning is a subfield of soft computing within computer science that explores the study and construction of algorithms that can learn from and make predictions on data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.

Artificial Intelligence is a broader field of study that encompasses the study of reasoning, knowledge representation, planning, learning, natural language processing, perception, and support for robotics. It involves tasks such the development of algorithms and techniques that can mimic human intelligence.

Now, here is a more complicated question you may also encounter in a data science interview!

gemma_ai_assistant.query("Explain how linear regression works")

Sure, here’s an explanation of how linear regression works:

Linear regression is a statistical method used to predict a dependent variable based on one or more independent variables. It is a simple but powerful technique that can be used to model a wide range of relationships between the dependent and independent variables.

The basic idea behind linear regression is that the dependent variable can be expressed as a linear function of the independent variables. This means that the dependent variable can be represented by a straight line.

The linear predictor function is a function that relates the independent variables to the dependent variable. The weights of the independent variables are chosen to optimize the relationship between the prediction and the criterion.

The process of linear regression involves the following steps:

1. Define the linear predictor function.
2. Choose a regression method.
3. Fit the model to the data.
4. Evaluate the model’s performance.

The linear predictor function is a function that expresses the relationship between the dependent and independent variables in a linear equation. The coefficients of the independent variables in this equation represent the weights of the variables in the linear predictor function.

The regression method is chosen based on the data and the research question. There are many different regression methods available, each with its own strengths and weaknesses.

Once the linear predictor function and regression method have been chosen, the model is fitted to the data. This process involves finding the values of the coefficients that minimize the sum of the squared errors between the predicted values and the actual values.

The model’s performance is then evaluated to determine how well it fits the data. There are many different metrics that can be used to evaluate model performance, including mean squared error (MSE), root mean squared error (RMSE), and adjusted R-squared.

If the model’s performance is satisfactory, it can be used to make predictions about the dependent variable based on the independent variables.

Let’s ask for more complex methods and algorithms, such as decision trees.

gemma_ai_assistant.query("What are decision trees, and how do they work in machine learning?")

Sure, here’s a summary of the context:

Decision Tree Learning is a supervised machine learning approach used to predict the value of a target variable based on several input variables.

Types of Decision Trees:

* Classification Tree (CART): Predicts the class of an instance based on the most prominent features.
* Regression Tree (CRT): Predicts a real-valued value of an instance based on the most prominent features.

Key Concepts:

* Tree Structure: A tree in which each internal node represents a feature, and branches represent combinations of features.
* Splitting Rules: Criteria used to split the data at each node.
* Metrics for Splitting: Measures the homogeneity of the target variable within the subsets.
* Ensemble Methods: Combine multiple decision trees to improve accuracy and interpretability.

Applications of Decision Tree Learning:

* Classification: Predicting the class of an instance.
* Regression: Predicting a real-valued value of an instance.

Notable Decision Tree Algorithms:

* ID3
* C4.5
* CART
* Chi-square Automatic Interaction Detection (CHAID)
* Random Forest
* Bootstrap Aggregated Trees (BAGging)
* Decision List

The next question about cross-validation is a return to fundamentals.

gemma_ai_assistant.query("What is cross-validation, and why is it used in machine learning?")

Cross-validation is a statistical technique used in machine learning to assess how well a predictive model will generalize to an independent data set. It involves resampling and sample splitting methods that use different portions of the data to test and train a model on different iterations. This allows the model to be evaluated under different conditions and to identify potential overfitting or underfitting issues. Cross-validation can be used for both model selection and hyperparameter tuning.

Finally, there is a tricky question on regularization. How will the AI Assistant handle it?

gemma_ai_assistant.query("Explain the concept of regularization and its importance in preventing overfitting in machine learning models")

Regularization is a process of introducing additional information into a machine learning model to prevent overfitting. Overfitting occurs when a model becomes too closely fit to the training data, leading to poor performance on unseen data. Regularization introduces a penalty term that penalizes complex functions, forcing the model to be simpler and more generalizable.

Regularization has several important benefits for preventing overfitting:

- Reduces variance: By penalizing complex functions, regularization reduces the variance of the model’s predictions, leading to improved generalization performance.
- Improves generalization ability: Regularization forces the model to be more generalizable, allowing it to perform well on unseen data.
- Reduces bias-variance tradeoff: Regularization helps to balance the bias and variance of the model, leading to a more accurate and robust model.

There are various regularization techniques in machine learning, including:

- L1 regularization (Lasso): Minimizes the sum of the absolute values of the model’s weights.
- L2 regularization (Ridge): Minimizes the squared value of the model’s weights.
- Tikhonov regularization: Minimizes a combination of the L1 and L2 norms of the model’s weights.

7. Preparing for deploying the model

To deploy the model, you need the files we saved and a copy of the functions and classes used in this notebook. The procedure is the same, except that you don’t need to embed the knowledge base again; you just reload the previously computed embeddings. Keep in mind, however, that the code seen so far runs at a good speed only when a GPU is available.

If you only have access to a CPU machine at inference time, you can leverage the C++ version of Gemma, which, thanks to 8-bit switched floating point (sfp) compressed weights, offers adequate text processing speed. I take the compiled version from another notebook (see https://www.kaggle.com/code/lucamassaron/gemma-cpp for more details on the compilation procedure), from which I copy the Gemma C++ executable. I also make the executable runnable and install Google SentencePiece, whose libraries are necessary for the executable to work (in particular, the libsentencepiece.so library).

!cp -r /kaggle/input/gemma-cpp/gemma_cpp /kaggle/working/gemma_cpp # Copy compiled Gemma C++
!chmod +x ./gemma_cpp/gemma # Make Gemma C++ executable
!conda install -q -c conda-forge sentencepiece -y # Install Google SentencePiece (https://github.com/google/sentencepiece)

The following Python code defines a class named GemmaCPP, which works as a wrapper for interacting with the C++ implementation of Gemma (https://github.com/google/gemma.cpp).

The class has an initializer method that takes four parameters: gemma_cpp, tokenizer, compressed_weights, and model. These parameters initialize attributes of the class instance with the same names, which are later used to build the command line for the compiled Gemma C++ binary. Additionally, the class contains a method named generate_text, which takes a prompt as input along with optional args and kwargs (for compatibility with the other Gemma implementations). Within this method, a shell command is constructed from the prompt and the other parameters, formatted appropriately to be executed with the Gemma C++ executable.

The subprocess.Popen function is then called to execute the shell command, capturing the standard output (with the standard error stream redirected into it). The stdout data is decoded from bytes to a string and accumulated until the model’s output is complete. Finally, the method returns the output text wrapped in a list. This code enables text generation with Gemma’s C++ implementation from Python, bridging the two languages.

import subprocess
import sys
import re

class GemmaCPP():
    """Wrapper for the C++ implementation of Gemma"""

    def __init__(self, gemma_cpp, tokenizer, compressed_weights, model):
        self.gemma_cpp = gemma_cpp
        self.tokenizer = tokenizer
        self.compressed_weights = compressed_weights
        self.model = model

    def eliminate_long_dots(self, input_string):
        """Eliminate long sequences of dots from the input string"""
        # Define a regular expression pattern to match sequences of 2 or more dots
        pattern = r'\.{2,}'

        # Replace all occurrences of the pattern with a space
        output_string = re.sub(pattern, ' ', input_string)

        return output_string.strip()

    def beautify_string(self, input_string):
        """Clean the input string by removing non-letter characters at the beginning
        and isolated letters at the end after multiple spaces"""
        # Remove non-letter characters at the beginning of the string
        output_string = re.sub(r'^[^a-zA-Z]+', '', input_string.strip())

        # Remove isolated letters at the end of the output string after multiple spaces
        output_string = re.sub(r'\s{3,}(.+)\Z', '', output_string.strip())

        return output_string

    def generate_text(self, prompt, *args, **kwargs):
        """Generate text using the cpp tokenizer and model"""

        # Define the shell command (quotes are stripped from the prompt to keep the command valid)
        prompt = prompt.replace('"', '').replace("'", "")
        shell_command = (f'echo "{prompt}" | {self.gemma_cpp} -- --tokenizer {self.tokenizer} '
                         f'--compressed_weights {self.compressed_weights} '
                         f'--model {self.model} --verbosity 0')

        # Execute the shell command, redirecting stderr into stdout
        process = subprocess.Popen(shell_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

        output_text = ""
        reading_block = "[ Reading prompt ]"

        # Read the process output character by character and stop after the second reading block
        for k, char in enumerate(iter(lambda: process.stdout.read(1), b'')):
            single_char = char.decode(sys.stdout.encoding)
            output_text += single_char
            if len(output_text) % 20 == 0:
                count_reading_blocks = output_text.count(reading_block)
                if count_reading_blocks > 1:
                    break

        # Remove long sequences of dots and the reading block, beautify the string
        output_text = output_text.replace(reading_block, "")
        output_text = self.eliminate_long_dots(output_text)
        output_text = self.beautify_string(output_text)
        output_text = prompt + output_text

        # Return the output text
        return [output_text]

Now that everything is ready, I can instantiate a Gemma AI Assistant based on Gemma C++ and the previously extracted and processed knowledge base.

embeddings_name = "thenlper/gte-large"
gemma_cpp = "./gemma_cpp/gemma"
tokenizer = "/kaggle/input/gemma/gemmacpp/2b-it-sfp/1/tokenizer.spm"
compressed_weights = "/kaggle/input/gemma/gemmacpp/2b-it-sfp/1/2b-it-sfp.sbs"
model = "2b-it"

# Create an instance of the class AIAssistant based on Gemma C++
gemma_ai_assistant = AIAssistant(
    gemma_model=GemmaCPP(gemma_cpp, tokenizer, compressed_weights, model),
    embeddings_name=embeddings_name
)

# Loading the previously prepared knowledge base and embeddings
wikipedia_data_science_kb = pd.read_csv("wikipedia_data_science_kb.csv")
knowledge_base = wikipedia_data_science_kb.wikipedia_text.tolist()

# Uploading the knowledge base and embeddings to the AI assistant
gemma_ai_assistant.store_knowledge_base(knowledge_base=knowledge_base)
gemma_ai_assistant.load_embeddings(filename="embeddings.npy")

Let’s try a new query on machine learning topics and see how long it takes to get an answer when only CPUs (Kaggle Notebooks have 4 cores) are working:

gemma_ai_assistant.query("In short, what are the key differences between gradient boosting and random forests?")

Gradient Boosting:

* Uses a sequential ensemble of weak learners to iteratively improve the overall model.
* Each weak learner is trained on a subset of the training data and makes a local decision.
* The weak learners are then combined in a weighted manner to form the final model.
* Gradient boosting is easy to interpret and can be used to create complex models.

Random Forests:

* Uses an ensemble of decision trees to iteratively improve the overall model.
* Each decision tree is trained on a subset of the training data and makes a local decision.
* The decision trees are then combined in a way that reduces the variance of the final model.
* Random forests are more robust to overfitting than gradient boosting.

Here are some of the key differences between gradient boosting and random forests:

| Feature | Gradient Boosting | Random Forests |
| --- | --- | --- |
| Training process | Iterative, weak learners are trained on a subset of the training data | Iterative, decision trees are trained on a subset of the training data |
| Model combination | Weighted combination of weak learners | Averaging of the predictions from the decision trees |
| Interpretability | Easy to interpret | Less easy to interpret |
| Robustness to overfitting | Less robust | More robust |

In general, gradient boosting is a good choice for problems where you want an easy-to-interpret model that can be used for both prediction and classification. Random forests are a good choice for problems where you want a robust model that is less likely to overfit.

8. Conclusions

It seems that the AI Assistant is working fine, promptly answering questions in a correct and usable way. The same approach and code could also be used for the other tasks of this competition, such as:

  • Answering common questions about the Python programming language
  • Explaining or teaching concepts from Kaggle competition solution write-ups
  • Answering common questions about the Kaggle platform

All you need is to prepare the context data by extracting it from a website, a dataset, or other sources such as the Meta Kaggle dataset.
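
As a rough sketch of what that repurposing might look like (the CSV file and its column name are hypothetical, and the assistant instance is the one built earlier):

# Hypothetical example: swap in a knowledge base of Kaggle solution write-ups.
# The file name and column name are made up for illustration purposes.
kaggle_writeups = pd.read_csv("kaggle_solution_writeups.csv")
new_knowledge_base = kaggle_writeups["writeup_text"].tolist()

# Re-index the assistant on the new knowledge base and adjust its role
gemma_ai_assistant.learn_knowledge_base(knowledge_base=new_knowledge_base)
gemma_ai_assistant.set_role("Kaggle competitions expert")
gemma_ai_assistant.query("How do winning solutions usually approach feature engineering?")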

Enjoy your new AI assistant and build your own RAGs by replicating this simple and straightforward approach!

Luca Massaron

Data scientist molding data into smarter artifacts. Author on AI, machine learning, and algorithms for Wiley, Packt, Manning. 3x Kaggle Grandmaster.