Building a RAG-Powered AI with Symfony and Elasticsearch
Nowadays, more and more applications integrate AI features to take products to the next level. With the rise of large language models (LLMs), several strategies are available to adapt these models to specific tasks.
Among these approaches, two stand out: RAG (Retrieval-Augmented Generation) and fine-tuning.
In this article, we will explore the difference between these two methods and implement a RAG system.
What is RAG
RAG stands for Retrieval-Augmented Generation. This approach combines text generation with the retrieval of external information.
It is a hybrid approach that pairs an information-retrieval module with a text generation model. The idea is to use a search system (such as a database or an index) to fetch relevant information and use it to enrich the generated text. This process can significantly improve the quality of the responses generated by the model and strengthen its domain specialization.
How does RAG work
Retrieval-Augmented Generation (RAG) is a two-step process that combines information retrieval with text generation. This hybrid approach enhances the quality and relevance of the responses generated by the model, making it much more powerful and contextually aware.
- Information Retrieval (Retriever): The system queries a database or a text corpus to find relevant documents or passages based on the question asked.
- Text Generation (Generator): The generation model uses the retrieved information to produce a more accurate and detailed answer.
This contrasts with a traditional text generation model, which simply generates responses based on its own training without relying on external data.
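To make the two steps concrete, here is a tiny, self-contained pseudo-example in PHP. The `$retrieveRelevantChunks` and `$generateAnswer` closures are hypothetical stand-ins for a real retriever and a real LLM call, and the sample context string is purely illustrative:

```php
<?php

// Hypothetical stand-ins for a real retriever and a real LLM call.
$retrieveRelevantChunks = fn (string $question): array => [
    'Illustrative chunk: pump P-101, technician A. Martin, duration 3h.',
];
$generateAnswer = fn (string $prompt): string => '...LLM response...';

// 1. Retrieval: find the passages most relevant to the question.
$question = 'How long did the last pump intervention take?';
$context = $retrieveRelevantChunks($question);

// 2. Generation: ask the LLM to answer using only that context.
$prompt = "Context:\n" . implode("\n", $context) . "\n\nQuestion: {$question}";
echo $generateAnswer($prompt);
```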
What is Fine-tuning
Fine-tuning involves retraining a pre-trained model on a specialized dataset while preserving the general knowledge it acquired during its initial training. This allows the model to adapt to a specific domain or task more effectively.
How does Fine-tuning work
Fine-tuning is the process of adapting a pre-trained model to better perform specific tasks or operate within a certain domain. This allows the model to leverage its general knowledge, while learning to understand and generate content that is highly relevant to the specialized task at hand.
1. Choose a base model: Select a pre-trained model suited to your needs, such as LLaMA, GPT-4, or BERT. Different models excel at different tasks — text analysis, text generation, translation, etc.
2. Prepare domain-specific data: Gather and preprocess a dataset that reflects the specialized context or task you want the model to learn. High-quality, well-annotated data is essential for optimal performance.
3. Retrain the model: Fine-tune the selected model using your new dataset. This step adjusts the model’s parameters to align better with your specific requirements while retaining its general language understanding.
RAG vs Fine-Tuning
💡 We have outlined the key differences between fine-tuning and RAG. However, the two approaches can be combined to leverage the strengths of both. By integrating RAG with fine-tuning, you can create a more powerful, accurate, and domain-specific model while ensuring it stays up to date with new information.
RAG architecture
This section takes a deeper dive into the RAG architecture, exploring the structure and components of a RAG pipeline in detail.
Chunking
Chunking is the process of splitting documents into smaller, meaningful segments to fit within the context window of a model. This step is essential because language models have token limits, and breaking content into well-structured chunks ensures that relevant information is retrieved efficiently.
There are different chunking strategies:
- Fixed-length chunking: Splits text at regular intervals (e.g., every 512 tokens).
- Semantic chunking: Uses natural language processing to split text based on meaning, preserving context across sentences and paragraphs.
- Overlapping chunking: Creates overlapping segments to avoid losing contextual links between chunks.
Choosing the right chunking approach depends on your use case, document type, and the importance of preserving semantic coherence.
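As a concrete example, LLPhant (the package used in the implementation below) ships a DocumentSplitter that performs length-based splitting. A minimal sketch; the file path and the 800-character limit are illustrative values:

```php
<?php

use LLPhant\Embeddings\DataReader\FileDataReader;
use LLPhant\Embeddings\DocumentSplitter\DocumentSplitter;

// Load raw documents from a local file (path is an example).
$reader = new FileDataReader(__DIR__ . '/data/handbook.txt');
$documents = $reader->getDocuments();

// Length-based chunking: split each document into chunks bounded by the given size.
$chunks = DocumentSplitter::splitDocuments($documents, 800);
```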
Embedding
Once the chunks are created, they are encoded into vector representations using an embedding model and stored in a vector database. These vectors serve as mathematical representations that enable efficient similarity searches during the retrieval phase.
Choosing the right embedding model is crucial, as different embeddings are trained on various datasets, making some more effective for specific tasks.
To evaluate the suitability of an embedding model for your use case, consider:
- The training data: Some embeddings perform better on code, scientific texts, or general language.
- Domain specialization: Certain embeddings are optimized for legal, medical, or financial documents.
- Benchmarking results: You can refer to the Massive Text Embedding Benchmark (MTEB) to compare embeddings across tasks.
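In LLPhant, this step is handled by an embedding generator. A minimal sketch using the Ollama backend relied on later in this article; the model name is an example, and `$chunks` is assumed to come from the chunking step above:

```php
<?php

use LLPhant\Embeddings\EmbeddingGenerator\Ollama\OllamaEmbeddingGenerator;
use LLPhant\OllamaConfig;

// Configure the embedding model served by a local Ollama instance
// (the model name is an example; pick one suited to your domain).
$config = new OllamaConfig();
$config->model = 'nomic-embed-text';

$embeddingGenerator = new OllamaEmbeddingGenerator($config);

// Turn the chunks produced earlier into vector representations.
$embeddedChunks = $embeddingGenerator->embedDocuments($chunks);

// A single query can be embedded the same way during retrieval.
$queryVector = $embeddingGenerator->embedText('pump maintenance duration');
```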
Vector Database
A vector database is a specialized system designed to store and manage vector representations of data. It enables efficient similarity searches by leveraging mathematical distances between vectors, allowing fast and accurate retrieval of relevant information. Most vector databases rely on Approximate Nearest Neighbor (ANN) search to keep these lookups fast at scale.
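With Elasticsearch as the vector database, LLPhant provides an ElasticsearchVectorStore that wraps the official PHP client. A sketch, assuming a local node and an illustrative index name, with `$embeddedChunks` coming from the embedding step:

```php
<?php

use Elastic\Elasticsearch\ClientBuilder;
use LLPhant\Embeddings\VectorStores\Elasticsearch\ElasticsearchVectorStore;

// Elasticsearch client pointing at a local node (URL is an example).
$client = ClientBuilder::create()
    ->setHosts(['http://localhost:9200'])
    ->build();

// The vector store manages the index that holds the embedded chunks.
$vectorStore = new ElasticsearchVectorStore($client, 'interventions');

// Persist the embedded chunks so they can be searched later.
$vectorStore->addDocuments($embeddedChunks);
```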
Retrieval
This step involves querying the vector database to find the most relevant information based on your input. The system searches for similar vectors using similarity metrics (e.g., cosine similarity, Euclidean distance) to retrieve the most contextually relevant data.
By retrieving these relevant vectors, the model gains additional context, enriching its response with precise and well-aligned information.
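To make the similarity metric concrete, here is a plain-PHP cosine similarity function. In practice, the vector database computes this kind of score for you, usually approximately via ANN, and returns the top-k closest chunks:

```php
<?php

/**
 * Cosine similarity between two vectors of equal length:
 * 1.0 means identical direction, 0.0 means unrelated.
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}
```

In LLPhant, the vector store exposes this lookup directly, so you rarely compute the scores by hand.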
Generator
At this stage, the large language model (LLM) generates a response using both the user’s query and the contextual information retrieved from the vector database.
The generator processes a combined input, merging the original query with the retrieved documents. By incorporating this external knowledge, the model produces a more accurate, contextually relevant response, reducing the risk of hallucinations and improving the overall quality of the generated text.
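LLPhant ties the retrieval and generation steps together with its QuestionAnswering helper. A minimal sketch, assuming the `$vectorStore` and `$embeddingGenerator` from the previous sections and an illustrative Ollama chat model:

```php
<?php

use LLPhant\Chat\OllamaChat;
use LLPhant\OllamaConfig;
use LLPhant\Query\SemanticSearch\QuestionAnswering;

// Chat model used for the generation step (model name is an example).
$chatConfig = new OllamaConfig();
$chatConfig->model = 'llama3';
$chat = new OllamaChat($chatConfig);

// QuestionAnswering embeds the query, searches the vector store,
// and passes the retrieved chunks to the LLM as context.
$qa = new QuestionAnswering($vectorStore, $embeddingGenerator, $chat);

$answer = $qa->answerQuestion('Which technician handled the last pump intervention?');
```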
Implementation
Building on the theoretical concepts discussed earlier, we will implement a practical project that follows the same principles.
You can find the full project in the GitHub repository: Symfony-Rag
💡 You can find all the necessary information in the README to run and configure the project.
Goals
The objective of this project is to implement a Retrieval-Augmented Generation (RAG) system that adheres to each step outlined in the theoretical part. This serves as a foundational project, enabling easy experimentation with different use cases, embedding models, and large language models (LLMs) to explore viable applications.
We will also use the following package, which simplifies the integration of AI capabilities into our project, enabling efficient communication with various LLMs.
theodo-group/llphant
Project Structure
The project is structured to keep things simple and easy to follow. The key components include:
- GenerateEmbeddingsCommand: Responsible for generating vector embeddings from input data.
- RagController: Responsible for handling user queries and retrieving relevant answers based on stored embeddings.
Embedding Generation Process
The project includes a Symfony console command, GenerateEmbeddingsCommand, which automates the process of creating embeddings and storing them in Elasticsearch.
1. Load JSON Data
- The command reads a JSON file (`intervention.json`) containing structured data about interventions (e.g., type, equipment, technician, duration, etc.).
2. Process and Format Documents
- Each record is converted into a structured LLPhant Document with metadata, including a unique hash, source type, and chunk number.
3. Generate Embeddings
- The OllamaEmbeddingGenerator processes the formatted documents, creating vector embeddings for each entry.
4. Store Embeddings in Elasticsearch
- The generated embeddings are stored in an ElasticsearchVectorStore, allowing for efficient similarity searches and retrieval operations.
💡 In this example, I use data in JSON format. You can easily load data from a PDF or text document instead. Everything you need for chunking, embedding, and vector storage is available in the theodo-group/llphant package.
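For orientation, here is a condensed sketch of what such a command can look like. It is not the exact code from the repository; the file path, index name, and model name are illustrative, and the Document metadata fields follow LLPhant’s own examples:

```php
<?php

namespace App\Command;

use Elastic\Elasticsearch\ClientBuilder;
use LLPhant\Embeddings\Document;
use LLPhant\Embeddings\EmbeddingGenerator\Ollama\OllamaEmbeddingGenerator;
use LLPhant\Embeddings\VectorStores\Elasticsearch\ElasticsearchVectorStore;
use LLPhant\OllamaConfig;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

#[AsCommand(name: 'app:generate-embeddings')]
class GenerateEmbeddingsCommand extends Command
{
    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // 1. Load the JSON data (path is an example).
        $records = json_decode(file_get_contents(__DIR__ . '/../../data/intervention.json'), true);

        // 2. Turn each record into an LLPhant Document with basic metadata.
        $documents = [];
        foreach ($records as $i => $record) {
            $document = new Document();
            $document->content = json_encode($record);
            $document->sourceType = 'intervention';
            $document->sourceName = 'intervention.json';
            $document->chunkNumber = $i;
            $documents[] = $document;
        }

        // 3. Generate the embeddings with Ollama (model name is an example).
        $config = new OllamaConfig();
        $config->model = 'nomic-embed-text';
        $embeddedDocuments = (new OllamaEmbeddingGenerator($config))->embedDocuments($documents);

        // 4. Store them in Elasticsearch for later similarity searches.
        $client = ClientBuilder::create()->setHosts(['http://localhost:9200'])->build();
        (new ElasticsearchVectorStore($client, 'interventions'))->addDocuments($embeddedDocuments);

        $output->writeln(sprintf('Indexed %d documents.', count($embeddedDocuments)));

        return Command::SUCCESS;
    }
}
```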
Question-Answering System
The RagController is responsible for handling user queries and retrieving relevant answers based on stored embeddings.
1. User Input
- The controller renders a form where users enter their queries.
2. Query Processing
- When the form is submitted, the question is retrieved and processed.
3. Embedding & Search
- The OllamaEmbeddingGenerator generates embeddings for the query.
- The ElasticsearchVectorStore is queried for relevant embeddings.
- A semantic search is performed through LLPhant to find the most relevant chunks.
4. Answer Generation
- The OllamaChat model processes retrieved context and generates a response.
- The final answer is displayed on the frontend (`rag.html.twig`).
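For reference, a simplified sketch of what such a controller can look like. It is not the exact code from the repository; the route, index name, and model names are illustrative, and the embedding model must match the one used when the index was built:

```php
<?php

namespace App\Controller;

use Elastic\Elasticsearch\ClientBuilder;
use LLPhant\Chat\OllamaChat;
use LLPhant\Embeddings\EmbeddingGenerator\Ollama\OllamaEmbeddingGenerator;
use LLPhant\Embeddings\VectorStores\Elasticsearch\ElasticsearchVectorStore;
use LLPhant\OllamaConfig;
use LLPhant\Query\SemanticSearch\QuestionAnswering;
use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Attribute\Route;

class RagController extends AbstractController
{
    #[Route('/', name: 'rag', methods: ['GET', 'POST'])]
    public function index(Request $request): Response
    {
        $answer = null;

        // The form is assumed to POST a "question" field.
        if ($question = $request->request->get('question')) {
            // Must be the same embedding model used when indexing (example name).
            $embeddingConfig = new OllamaConfig();
            $embeddingConfig->model = 'nomic-embed-text';

            // Chat model used to generate the final answer (example name).
            $chatConfig = new OllamaConfig();
            $chatConfig->model = 'llama3';

            // Same Elasticsearch index the console command populated.
            $client = ClientBuilder::create()->setHosts(['http://localhost:9200'])->build();
            $vectorStore = new ElasticsearchVectorStore($client, 'interventions');

            // Embed the question, retrieve similar chunks, and generate the answer.
            $qa = new QuestionAnswering(
                $vectorStore,
                new OllamaEmbeddingGenerator($embeddingConfig),
                new OllamaChat($chatConfig)
            );
            $answer = $qa->answerQuestion($question);
        }

        return $this->render('rag.html.twig', ['answer' => $answer]);
    }
}
```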
You will have access to a small form at http://127.0.0.1:8080/ to interact with the RAG system.
Conclusion
In this article, we explored the fundamental differences between Fine-Tuning and Retrieval-Augmented Generation (RAG), highlighting how each approach contributes to adapting large language models (LLMs) to specific tasks. While fine-tuning excels at domain-specific learning through model retraining, RAG provides a dynamic way to retrieve and incorporate external knowledge, ensuring up-to-date and contextually relevant responses.
Through our practical implementation, we demonstrated how to build a RAG-based system using Symfony, Elasticsearch, and Ollama. By generating embeddings, storing them in a vector database, and integrating them with an LLM, we created a system that efficiently retrieves and generates accurate responses.
As AI continues to evolve, combining RAG and fine-tuning offers a powerful strategy for enhancing language models — balancing domain expertise, adaptability, and real-time knowledge retrieval. Whether for chatbots, recommendation systems, or enterprise search solutions, this hybrid approach enables more intelligent and responsive AI applications.