Experimenting with a Locally Deployed Llama 3 Model for RAG using Ollama and Streamlit

George Wen
4 min read · Apr 24, 2024


Introduction

In my previous post, I began with a minimalistic Colab notebook to experiment with RAG using Google’s Gemma model. From there, I expanded by deploying the Mistral model with FastAPI and a straightforward React frontend to simulate a production environment. Additionally, I explored a basic code-generation use case using a VS Code extension and an LLM running locally via Ollama.

Links:

Level up your coding with a local LLM and CodeGPT

Elevating LLM Deployment with FastAPI and React: A Step-By-Step Guide

A Quick Experiment on Building Your Own GEN AI Application Utilising RAG and Google’s Gemma

This post explores the recently launched Llama 3 model for a RAG use case, deployed locally via Ollama. We’ll utilise Streamlit to create a user-friendly chatbot UI.

Figure 1: System Diagram

Introduction to the Technologies

  1. GEN AI + RAG: For a brief intro to these two topics, please refer to my earlier posts.
  2. Llama 3: Meta’s latest open-source large language model, claimed to be the most capable openly available LLM to date. Independent tests suggest the Llama 3 70B model is comparable in quality to Gemini 1.5 and Opus/GPT-4, while the even more powerful Llama 3 400B+ model, still in training, is expected to deliver another significant jump in performance.
  3. Streamlit: An open-source app framework for Machine Learning and Data Science projects, enabling developers to create beautiful, interactive web applications with ease. We’ll use Streamlit to build the chatbot UI for its simplicity.
  4. Ollama: A platform designed to streamline the deployment and customization of large language models, including Llama 3, Phi 3, Mistral, and Gemma. It provides tools and resources that allow users to easily run these models on macOS, Linux, and Windows (preview). This makes Ollama particularly suitable for developers looking to tailor AI models to specific tasks or integrate them into larger systems.

Step-by-Step Implementation:

1. Setting Up the Environment:

The environment requires Python libraries such as Streamlit and LangChain, plus supporting libraries for handling PDF documents and creating embeddings.

To manage your local Python environment effectively, I recommend Anaconda.

Execute the following two commands to install the necessary libraries and to download the Llama 3 model locally.

pip install streamlit langchain pypdf

ollama pull llama3

The initial setup involves creating a function to index and store PDF files in a local directory as vector representations using FAISS. This process is crucial for enabling the rapid retrieval of information during the query-response cycle. The same code has also been used in my previous post.
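A minimal sketch of such an indexing function is shown below. The directory names, chunk sizes, and the choice of OllamaEmbeddings as the embedding model are illustrative assumptions, and the FAISS vector store additionally requires the faiss-cpu package:

import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

PDF_DIR = "docs"           # folder containing the PDFs to index (illustrative)
INDEX_DIR = "faiss_index"  # where the FAISS index is persisted (illustrative)

def build_vector_store():
    # Load every PDF in the directory and split it into overlapping chunks
    documents = PyPDFDirectoryLoader(PDF_DIR).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(documents)

    # Embed the chunks with the locally served Llama 3 model and persist them in FAISS
    embeddings = OllamaEmbeddings(model="llama3")
    vector_store = FAISS.from_documents(chunks, embeddings)
    vector_store.save_local(INDEX_DIR)
    return vector_store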

2. Creating the Retrieval System:

The get_retriever function initialises the vector store if it doesn’t exist and loads it to serve as the retrieval backbone for the RAG system. This is cached to enhance performance, ensuring that the indexing process is not redundantly executed.
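As a rough sketch, assuming the build_vector_store helper and INDEX_DIR constant from the previous step, the caching can be done with Streamlit’s cache_resource decorator:

import os
import streamlit as st
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

@st.cache_resource  # run the indexing/loading only once per Streamlit session
def get_retriever():
    embeddings = OllamaEmbeddings(model="llama3")
    if os.path.exists(INDEX_DIR):
        # Reuse the persisted index on subsequent runs
        # (recent langchain versions require allow_dangerous_deserialization
        # when loading a locally pickled FAISS index)
        vector_store = FAISS.load_local(
            INDEX_DIR, embeddings, allow_dangerous_deserialization=True
        )
    else:
        # Build the index from the PDFs on the first run
        vector_store = build_vector_store()
    return vector_store.as_retriever(search_kwargs={"k": 4})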

3. Developing the Conversational Interface:

A conversation chain is created, combining the retrieval system and Llama 3 model to handle user queries. This involves setting up a conversational memory to maintain context and improve response relevance.
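One way to wire this together is LangChain’s ConversationalRetrievalChain with a buffer memory; the sketch below assumes the get_retriever function above and the llama3 model pulled earlier via Ollama:

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_models import ChatOllama

def get_conversation_chain(retriever):
    # Llama 3 served locally through Ollama
    llm = ChatOllama(model="llama3", temperature=0)

    # Keep the running chat history so follow-up questions stay in context
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

    return ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
    )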

Streamlit is used to create an interactive web interface. The UI code segment starts by configuring the page and setting up a header. The main interaction loop displays messages and handles user input, demonstrating a seamless integration of backend AI operations with a frontend developed in Streamlit.

A quick tutorial on building a chatbot with Streamlit can be found here: https://docs.streamlit.io/develop/tutorials/llms/build-conversational-apps
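The page configuration and message-replay portion of the UI might look roughly like this; the page title, header text, and session-state keys are illustrative assumptions:

import streamlit as st

st.set_page_config(page_title="Local Llama 3 RAG Chatbot", page_icon="💬")
st.header("Chat with your PDFs (Llama 3 + Ollama)")

# Initialise the chat history once per session
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay earlier messages so the conversation persists across Streamlit reruns
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])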

4. Handling Conversations:

The application captures user input through Streamlit’s chat interface, processes it using the conversational chain, and displays responses. This is a continuous loop, allowing for real-time interaction with the AI.
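A sketch of that loop, assuming the get_retriever and get_conversation_chain helpers above, could look like this:

# Build the chain once per session so the conversational memory survives reruns
if "chain" not in st.session_state:
    st.session_state.chain = get_conversation_chain(get_retriever())

if prompt := st.chat_input("Ask a question about your documents"):
    # Show the user's message and record it in the session history
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Run the RAG chain and display the answer
    with st.chat_message("assistant"):
        result = st.session_state.chain.invoke({"question": prompt})
        answer = result["answer"]
        st.markdown(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})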

Figure 2: Sample conversation

Conclusion

This experiment showcases how easy it has become to run open-source LLMs locally, thanks to advances in tooling. We can now experiment with cutting-edge models that rival commercial offerings, democratizing AI and pushing its boundaries. Implementations like this will become essential tools for developers aiming to leverage AI’s full potential in real-world applications.
