Utilise a RAG framework to become a 10x developer

Dmitry Ivanov
Picsart Engineering
14 min read · Jun 5, 2024
A 10x developer, as seen by AI

Nowadays, there is a lot of discussion about the influence of LLMs on developers' work. Different LLMs show great results in code generation, and developers have started using them in their daily jobs. But do these tools actually affect developers' performance that much? Do the developers who start using tools like Copilot become those mysterious 10x developers?

Well, to answer that question, let’s take a look at how an average developer spends time during their day…

The Code Time Report, issued in January 2022, shows that only about 40% of developers spend more than one hour per day coding, and only 10% spend more than two hours a day.

Image from Code Time Report

So, what do these developers do most of the day? The majority of a developer's day is normally spent trying to understand how the existing system works from different perspectives: product, code, and architecture. That is the task with the highest cognitive load: understanding how the small pieces of the system are connected and where to place changes so as not to break anything. Imagine a bomb-defusal game: add a small change in the wrong place, and the entire system crashes.

Software developer applying a patch to the system in production (Image from Wikipedia)

Of course, there are a lot of good practices in the industry aimed at reducing that cognitive load on developers, such as architecture patterns, frameworks, different types of testing, CI/CD, platforms, clouds, etc. All of these help a lot, but we still have to deal with a high cognitive load. In CodeScene's Code Red white paper, we find interesting numbers showing that developers spend almost half of their time dealing with technical debt.

Image from CodeScene's Code Red white paper

Of course, this is a trade-off between the speed of delivering new features on one side and code quality and domain knowledge on the other. The business wants to move quickly in different directions, and sometimes the product is not ready for such pivots from a code perspective.

So, what is changing now? Developers can generate code faster with the help of AI tools, but that also means the cognitive load of refactoring or adding new features only increases. Developers will continue to spend more time thinking than writing (or generating) code. So the percentage of time spent writing code will decrease, but the load will increase. We all know this leads to burnout, which is bad not only for individuals but for business too. This might not be the case for small projects or some research activities, but over the long run and on big products it will be.

So, can AI help here? Instead of concentrating on and optimizing ~20% of daily developer work, let's try to optimize the other 80% (it's interesting to see how the Pareto principle could apply here)! And let's try to reduce the other metrics that affect productivity: time spent on technical debt and time spent onboarding onto the project.

But here is the trap: LLMs need a lot of time to train or fine-tune, and we don't have that luxury. Knowledge about the codebase from weeks ago is not enough, and sometimes it might even be dangerous, for example if your colleagues fixed some critical bug last night: you need to know about that immediately! So we want the power of an LLM combined with super-fresh knowledge about the code. One way to achieve that is the Retrieval-Augmented Generation (RAG) architecture.

What is RAG?

RAG is a technique for enhancing the accuracy and reliability of AI by providing more context to user queries.

Let's see how that works in the context of our goal: helping developers work with the existing codebase of a system. Let's ask an LLM a question related to our code:

In order to answer this question, the LLM needs context about our codebase. So we need to provide that context:

Of course, our developer does not want to copy the whole codebase into every question they ask the LLM! Let's put a proxy agent in the middle, which handles such requests, searches for relevant data, and enhances the developer's query with that data:

And that is it! In theory, the agent could be a simple script that concatenates all the files from the repository and sends them, together with the developer's query, to the LLM. However, LLMs have a limited context size, meaning they cannot handle more than a certain amount of data. More importantly, handling each token (basically a word or punctuation mark) costs money, even for models running locally: we are still paying for electricity. So the very first and most important optimization is to make this search efficient, so that the data returned is relevant to the developer's query.
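
For illustration, here is what such a naive "concatenate everything" agent could look like. This is only a sketch: askLlm is a hypothetical stand-in for whatever client you use to call the model, and the point is simply to show why this approach quickly hits the context-size and cost limits described above:

import java.io.File

// Hypothetical stand-in for an LLM client call; not part of the stack used later in this article.
fun askLlm(prompt: String): String = TODO("send the prompt to your LLM of choice")

fun naiveAsk(repoPath: String, question: String): String {
    // Concatenate every source file in the repository...
    val context = File(repoPath).walk()
        .filter { it.isFile && it.extension in setOf("kt", "java") }
        .joinToString("\n\n") { "// ${it.path}\n" + it.readText() }

    // ...and send all of it together with the developer's question.
    // For any non-trivial repository this blows past the model's context window.
    return askLlm("Codebase:\n$context\n\nQuestion: $question")
}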

As we can see from the diagram above, the crucial part of RAG is retrieving relevant data from our knowledge base. The better the data we find, the better the context we provide to the LLM, and the better the answer we get for the query. One of the best solutions for this is a vector database. Vector databases capture the conceptual similarity of unstructured data by representing it as vectors, enabling retrieval based on data similarity. The vectors are primarily created through embeddings, which translate the data into a more manageable, lower-dimensional vector form, typically using neural network models.
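
To make "similarity between vectors" concrete, here is a minimal sketch in plain Kotlin (no Spring AI involved) of the cosine similarity measure that vector stores typically use to rank stored embeddings against the embedding of a query:

import kotlin.math.sqrt

// Cosine similarity: 1.0 means the vectors point in the same direction,
// 0.0 means they are unrelated, -1.0 means they are opposed.
fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "Embeddings must have the same dimension" }
    val dot = a.indices.sumOf { a[it] * b[it] }
    val normA = sqrt(a.map { it * it }.sum())
    val normB = sqrt(b.map { it * it }.sum())
    return dot / (normA * normB)
}

fun main() {
    // Toy 3-dimensional "embeddings"; real models produce hundreds or thousands of dimensions.
    val queryEmbedding = doubleArrayOf(0.9, 0.1, 0.3)
    val docEmbedding = doubleArrayOf(0.8, 0.2, 0.4)
    println(cosineSimilarity(queryEmbedding, docEmbedding)) // close to 1.0 => similar
}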

The same approach is used to populate the database: we take our data, convert it into vectors using an embedding model, and store them in the vector database.
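
As a rough sketch of that round trip, using the Spring AI types that appear later in this article (and assuming a VectorStore bean is already configured), populating and querying the store boils down to something like this:

import org.springframework.ai.document.Document
import org.springframework.ai.vectorstore.SearchRequest
import org.springframework.ai.vectorstore.VectorStore

// Minimal sketch: adding documents lets the store embed and index them;
// querying embeds the query text and returns the most similar documents.
fun roundTrip(vectorStore: VectorStore) {
    vectorStore.add(listOf(Document("README: the service greets users with a welcome message")))

    val hits = vectorStore.similaritySearch(SearchRequest.query("how do we greet users?").withTopK(3))
    hits.forEach { println(it.content) }
}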

Building a fully local RAG with Ollama, Spring AI and Kotlin

So let's try to build an AI-powered console chat application that helps developers during their day by analysing source code and answering their questions about the existing codebase. We are going to use the following stack:

  • Ollama to run LLM locally
  • Spring Boot with Spring AI
  • Kotlin

To simplify our setup, we will use the SimpleVectorStore provided by Spring AI, which is a simple in-memory implementation of a vector database that provides all the features we need. We also assume that we are going to analyze Kotlin/Java projects.

Let’s take a look at the application overview:

  • The green boxes are the parts of RAG that we are going to implement. These include ChatService, which answers user queries using the RAG approach, and FileProviderService + DocumentPipeline, which implement the loader part of RAG.
  • The grey boxes are the blocks provided by Spring AI, such as ChatClient, VectorStore, EmbeddingClient, and document enrichers.
  • The LLM and the embedding model are handled by Ollama, but they could also be provided by any other service, such as OpenAI.
code-chat application overview

Our very first step would be to run LLM locally. The easiest way to do that is to use Ollama. Download and install it from https://ollama.com/download. Now we are ready to run LLM locally:

ollama run llama3:8b

Next, we need to create a new Spring project, for instance using the generator in IntelliJ IDEA:

Next, we need to modify build.gradle and add the repositories:

repositories {
    mavenCentral()
    maven { url 'https://repo.spring.io/milestone' }
    maven { url 'https://repo.spring.io/snapshot' }
}

And dependencies:

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter'
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.jetbrains.kotlin:kotlin-reflect'

    implementation platform("org.springframework.ai:spring-ai-bom:0.8.1-SNAPSHOT")
    implementation 'org.springframework.ai:spring-ai-ollama-spring-boot-starter'
    implementation 'org.springframework.ai:spring-ai-tika-document-reader'

    testImplementation 'org.springframework.boot:spring-boot-starter-test'
    testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
}

Populate Vector store

Now we can start building the database population pipeline. First, we define our vector database as a SimpleVectorStore:

import org.springframework.ai.embedding.EmbeddingClient
import org.springframework.ai.vectorstore.SimpleVectorStore
import org.springframework.ai.vectorstore.VectorStore
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class VectorStoreConfig {

    @Bean
    fun vectorStore(embeddingClient: EmbeddingClient): VectorStore = SimpleVectorStore(embeddingClient)
}

Next, let's define the FileProviderService, which provides our pipeline with files from the project directory:


// Logging/logger are assumed to come from the Log4j Kotlin API (org.apache.logging.log4j.kotlin);
// the original project may use its own logging helper instead.
import org.apache.logging.log4j.kotlin.Logging
import org.springframework.stereotype.Service
import java.io.File

@Service
class FileProviderService {

    companion object : Logging

    // we are limiting our files to Kotlin/Java projects
    private val fileExtensions = setOf("kt", "java", "gradle", "xml", "yaml", "properties")

    fun loadFiles(path: String, callback: (File) -> Unit) {
        val directory = File(path)
        if (!directory.exists() || !directory.isDirectory) {
            logger.error { "The specified path is not a valid directory." }
            return
        }
        directory.walk().forEach { file ->
            if (file.isFile && file.extension in fileExtensions) {
                val content = file.readText(Charsets.UTF_8)
                if (!isBinaryContent(content) && !file.shouldIgnore(content)) {
                    callback(file)
                }
            }
        }
    }

    /**
     * Simple check to determine if the content is likely binary.
     * This function checks for the presence of null characters as a heuristic.
     */
    private fun isBinaryContent(content: String): Boolean = content.contains('\u0000')

    private fun File.shouldIgnore(content: String) =
        content.isBlank()
            || with(absolutePath) {
                contains("/build/") || contains("/integration/") || contains("/test/") || contains("/contract/")
            }
            || with(name) {
                startsWith(".") || endsWith(".class") || contains("Test")
            }
}

Now let's build our pipeline:

  1. Convert the file to a Document via TikaDocumentReader
  2. Store the file path in the Document metadata
  3. Enrich the Document metadata with SummaryMetadataEnricher and KeywordMetadataEnricher
  4. Split the Document into chunks using TokenTextSplitter
data class CodeConfiguration(
    // should add summary
    val enrichSummary: Boolean,
    // should add keywords
    val enrichKeywords: Boolean,
    // source path
    val path: String,
    // project name, used to store the vector DB to a file
    val project: String,
)

@Component
class DocumentPipeline(
    val loader: FileProviderService,
    val client: ChatClient,
    val vectorStore: VectorStore,
    val config: CodeConfiguration
) {

    companion object : Logging

    fun loadFiles() {
        with(config) {
            logger.info { "Start loading files from: $path" }
            populate()
        }
    }

    private fun CodeConfiguration.populate() {
        val basePath = File(path)
        // CURRENT is presumably SummaryMetadataEnricher.SummaryType.CURRENT, statically imported in the original project
        val summaryMetadataEnricher = SummaryMetadataEnricher(client, listOf(CURRENT))
        val keywordMetadataEnricher = KeywordMetadataEnricher(client, 5)
        val splitter = TokenTextSplitter()

        loader.loadFiles(path) { file ->
            val metadata = mapOf("path" to file.relativeTo(basePath).toString())

            TikaDocumentReader(FileSystemResource(file)).get()
                .map { Document(it.content, it.metadata + metadata) }
                .enrichIf(enrichSummary, summaryMetadataEnricher)
                .enrichIf(enrichKeywords, keywordMetadataEnricher)
                .let {
                    splitter.apply(it).also { docs ->
                        vectorStore.add(docs)
                    }
                }
        }
    }

    private fun List<Document>.enrichIf(condition: Boolean, enricher: DocumentTransformer): List<Document> =
        if (condition) enricher.apply(this) else this
}

The very first optimisation we make in our loader pipeline is to store the file path in the metadata. That will allow us later to find the original file and pass it to the LLM as extended context when answering user questions.

The next steps enrich documents with a summary and keywords. These are very important steps in the process, because the code might not contain any comments, or might be poorly written, so similarity search on the raw text would be ineffective. With the help of these enrichers we attach a meaningful description to each piece of code that we are going to store in the DB.

Splitting the text into chunks should also improve the results of similarity search; however, for now we just split files based on tokens, which is not a very efficient strategy for source code. For more information, check the "What's next?" section.

How do Spring AI enrichers work?

Basically, both the summary and keyword enrichers use the LLM to generate that data, with the following prompts:

 public static final String DEFAULT_SUMMARY_EXTRACT_TEMPLATE = """
Here is the content of the section:
{context_str}

Summarize the key topics and entities of the section.

Summary: """;

...

public static final String KEYWORDS_TEMPLATE = """
{context_str}. Give %s unique keywords for this
document. Format as comma separated. Keywords: """;

So, when we enrich a document, two extra queries are sent to the LLM per document. That is worth keeping in mind, because it increases both the cost and the running time of the pipeline. However, based on experiments with random code, those steps are absolutely necessary.
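
A simple way to see what the enrichers actually produced is to log the document metadata right after the enrichment step in the pipeline above. In this sketch, documents stands for the enriched List<Document>; the exact metadata keys depend on the Spring AI version, so we just print everything:

// Inside DocumentPipeline, after enrichment but before splitting:
// dump each document's metadata to inspect the summary and keyword entries added by the enrichers.
documents.forEach { doc ->
    logger.info { "Metadata for ${doc.metadata["path"]}: ${doc.metadata}" }
}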

Chat service

Now that we have all the data in the vector store, let's create a simple chat service that will answer questions:

import org.springframework.ai.chat.ChatClient
import org.springframework.ai.chat.messages.Message
import org.springframework.ai.chat.prompt.Prompt
import org.springframework.ai.chat.prompt.PromptTemplate
import org.springframework.ai.chat.prompt.SystemPromptTemplate
import org.springframework.ai.document.Document
import org.springframework.ai.vectorstore.SearchRequest
import org.springframework.ai.vectorstore.VectorStore
import org.springframework.beans.factory.annotation.Value
import org.springframework.core.io.Resource
import org.springframework.stereotype.Service
import java.io.File

data class ChatResponse(
    val responseText: String,
    val files: Set<String>
)

@Service
class ChatService(
    val client: ChatClient,
    val vectorStore: VectorStore,
    val config: CodeConfiguration
) {
    @Value("classpath:/prompts/system.txt")
    private val systemPrompt: Resource? = null

    @Value("classpath:/prompts/user.txt")
    private val userPrompt: Resource? = null

    fun ask(message: String): ChatResponse {
        val request = SearchRequest.query(message).withTopK(5)
        val docs: List<Document> = vectorStore.similaritySearch(request)

        val systemPromptTemplate = SystemPromptTemplate(systemPrompt)
            .createMessage(mapOf("documents" to docs.toFiles()))
        val userMessage = PromptTemplate(userPrompt)
            .createMessage(mapOf("message" to message))

        val messages = listOf(systemPromptTemplate, userMessage)
        return ChatResponse(
            responseText = client.call(Prompt(messages)).result.output.content,
            files = docs.getNames()
        )
    }

    private fun List<Document>.getNames(): Set<String> =
        mapTo(mutableSetOf()) {
            it.metadata["source"] as String
        }

    private fun List<Document>.toFiles(): String =
        mapTo(mutableSetOf()) {
            // use the "path" metadata stored by the loader to resolve the original file
            val path = it.metadata["path"] as String
            File("${config.path}/$path")
        }.filter {
            it.exists()
        }.joinToString("\n") {
            val text = it.readText()
            """
            ```
            $text
            ```
            """.trimIndent()
        }
}

At the beginning, we define the prompt templates that we are going to use:

  • System, which defines the system context
  • User, which defines the user prompt

Both are defined as resource files:

prompts/system.txt:

You're assisting with questions about project code, acting as a team lead.
Given the CONTEXT provided, with code and documentation from the project, use that to answer user questions.
If unsure, simply state that you don't know. Do not add generic answers, use only the CONTEXT provided.
If the question is not about generating new code, provide some code samples from the CONTEXT section.
If the question is about architecture, answer using ADRs and CONTEXT. Identify architecture patterns if possible.
Respond format: MARKDOWN

CONTEXT information is below, files are split with "```":
---------------------
{documents}
---------------------

prompts/user.txt:

User question: {message}

Next, we use similarity search to obtain the top 5 documents from the database:

val request = SearchRequest.query(message).withTopK(5)
val docs: List<Document> = vectorStore.similaritySearch(request)

After that, we read the source code using the file paths from the returned documents' metadata and build the prompts using the templates:

val systemPromptTemplate = SystemPromptTemplate(systemPrompt)
    .createMessage(mapOf("documents" to docs.toFiles()))
val userMessage = PromptTemplate(userPrompt)
    .createMessage(mapOf("message" to message))

Finally, we return the raw response from the LLM and add the file names to the response as additional context.

Console chat

Now, let's put all the pieces together. We will use CommandLineRunner and Scanner to handle console input/output:

import experiments.aichat.service.chat.ChatService
import experiments.aichat.service.loader.DocumentPipeline
import org.springframework.boot.CommandLineRunner
import org.springframework.stereotype.Component
import java.util.*

@Component
class ConsoleChat(
    val loadPipeline: DocumentPipeline,
    val chatService: ChatService
) : CommandLineRunner {

    override fun run(vararg args: String?) {
        loadPipeline.loadFiles()

        val scanner = Scanner(System.`in`)
        println("Enter your messages (type '/exit' to quit):")

        while (true) {
            val input = scanner.nextLine()

            if (input.isBlank()) continue
            if (input.equals("/exit", ignoreCase = true)) break

            val response = chatService.ask(input)
            println(response.responseText)
            println("More info: ${response.files}")
        }
    }
}

We also need to configure Spring AI to use our local LLM. We can do that via application.yaml; we are going to use llama3 for both chat and embeddings:

spring:
  application:
    name: ai-chat
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        enabled: true
        options:
          model: llama3:8b
      embedding:
        enabled: true
        options:
          model: llama3:8b
  main:
    web-application-type: none
  output:
    ansi:
      enabled: ALWAYS

code:
  # enrich data with summary
  summary: true
  # enrich data with keywords
  keywords: true
  # path to source code
  path: .
  # project name
  name: ai-chat

And that is it… now we can run our application and chat with it, asking questions about the code!

How to switch to another AI service?

With Spring AI it's easy to switch to another service: we can swap both chat and embeddings, or combine the two for different cases. But remember that the enrichers used while populating the store rely on the chat client.

To start using OpenAI, we need to add the starter to build.gradle:

    implementation 'org.springframework.ai:spring-ai-openai-spring-boot-starter'

And create another profile: application-open-ai.yaml

spring:
  ai:
    ollama:
      chat:
        enabled: false
      embedding:
        enabled: false
    openai:
      api-key: <your-key-here>
      chat:
        enabled: true
        options:
          model: gpt-4o
          temperature: 0.1
      embedding:
        enabled: true
        options:
          model: text-embedding-3-small

You could also create different profiles for the projects you want to analyze.

Save & Load Vector store

One of the benefits of using SimpleVectorStore is that the store can be saved to a text file and loaded back. This means we have persistence between tool executions, and we can share the store with other team members through a VCS!

We can enhance our DocumentPipeline to load and save the vector DB:

class DocumentPipeline(
    ...
    val vectorStore: VectorStore,
    val config: CodeConfiguration
) {
    ....

    fun loadCache(): Date? {
        with(getCacheFile()) {
            if (vectorStore is SimpleVectorStore && this.exists() && this.isFile) {
                (vectorStore as SimpleVectorStore).load(this)
                val date = Date(this.lastModified())
                logger.info { "Vector DB loaded from file: ${this.absolutePath}, last modified: $date" }
                return date
            }
        }
        return null
    }

    fun loadFiles() {
        with(config) {
            logger.info { "Start loading files from: $path" }
            populate()
            saveToFile()
        }
    }

    private fun CodeConfiguration.populate() {
        ....
    }

    private fun saveToFile() {
        if (vectorStore is SimpleVectorStore) {
            with(getCacheFile()) {
                (vectorStore as SimpleVectorStore).save(this)
                logger.info { "Vector DB saved to file: ${this.absolutePath}" }
            }
        }
    }

    private fun getCacheFile() = File("${config.project}.cache")
}

@Component
class ConsoleChat(
    val loadPipeline: DocumentPipeline,
    val chatService: ChatService
) : CommandLineRunner {

    override fun run(vararg args: String?) {
        loadPipeline.loadCache()
        loadPipeline.loadFiles()
        ......
    }
}

The full source code of this application can be found in the repository.

What's next?

Besides technical improvements (like monitoring file changes and updating the vector store, or using a better console), there are a few areas where the RAG pipeline itself can be improved: advanced chunking and multi-sourcing, both aimed at giving the LLM better context.
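
As a rough sketch of the first of those improvements, a watcher based on the JDK WatchService could trigger re-indexing on file changes. Here, reindex is a hypothetical callback (for example, backed by DocumentPipeline), and the sketch ignores debouncing and recursive directory registration:

import java.nio.file.FileSystems
import java.nio.file.Paths
import java.nio.file.StandardWatchEventKinds.ENTRY_CREATE
import java.nio.file.StandardWatchEventKinds.ENTRY_MODIFY

// Watches the project directory and re-indexes changed files.
// `reindex` is a hypothetical callback, e.g. backed by DocumentPipeline.
fun watchSources(path: String, reindex: (String) -> Unit) {
    val watchService = FileSystems.getDefault().newWatchService()
    Paths.get(path).register(watchService, ENTRY_CREATE, ENTRY_MODIFY)

    while (true) {
        val key = watchService.take() // blocks until something changes
        key.pollEvents().forEach { event ->
            reindex(event.context().toString()) // file name relative to the watched directory
        }
        if (!key.reset()) break // directory is no longer accessible
    }
}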

Advanced chunking for code

A very important aspect of similarity search is data chunking. Selecting an appropriate chunking strategy requires careful consideration of several essential factors, including the nature of the indexed content, the embedding model, its optimal block size, the expected length and complexity of user queries, and how the retrieved results are utilized in a specific application. When working with source code, we need to split data into chunks using code semantics. We need to consider language and frameworks to produce useful chunks.

Let's take a small example: if we use a fixed-size chunking strategy, we could end up with a chunk of data that does not hold any useful information for analysis, for instance a fragment of a large method with loops over i, j, k variables and temporary structures. That leads us to the conclusion that code quality matters not only for humans but also when we work with AI: if your code is well structured, your methods, classes, and variables have meaningful names, and you have some documentation in place, it will be much easier for the LLM to find the right answers to your questions.

At the same time, the structure of source code is different from a "normal" text document. Linked elements might be declared at different ends of a file or even in multiple files. To maximize our chances of providing the right context to the LLM, we need to eliminate the side elements and highlight the important ones.
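
As a minimal illustration of chunking by code semantics, here is a naive sketch that only looks at top-level class/object/fun keywords and ignores nesting, comments, and strings; a real implementation would use a proper parser (PSI, tree-sitter, etc.). It splits a Kotlin file on declaration boundaries instead of a fixed token count:

// Naive semantic chunker: starts a new chunk at every top-level `class`/`object`/`fun` declaration.
fun chunkKotlinSource(source: String): List<String> {
    val chunks = mutableListOf<StringBuilder>()
    source.lineSequence().forEach { line ->
        val topLevelDeclaration = !line.startsWith(" ") && line.trimStart().let {
            it.startsWith("class ") || it.startsWith("object ") || it.startsWith("fun ")
        }
        if (chunks.isEmpty() || topLevelDeclaration) {
            chunks += StringBuilder()
        }
        chunks.last().appendLine(line)
    }
    return chunks.map { it.toString() }.filter { it.isNotBlank() }
}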

Integrate more systems

At the moment, we have integrated only the codebase itself, but another important part of onboarding a real developer onto a project is learning how things are done in the organization: for instance, answering questions like "What are the steps needed to create a new REST endpoint?" or "How do I define a new DB entity?"

One of the best ways to achieve that is to integrate a VCS (version control system) and a task tracker (like JIRA) into the RAG pipeline. When a developer asks something, we could ask the LLM to analyze not only the code but also the reasons why that code ended up there. That way we gain historical context, and our system can provide better answers to such questions.
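
For instance, here is a hedged sketch of the VCS part, reusing the Document and VectorStore types from the pipeline above and assuming git is available on the PATH; it loads recent commit history into the same vector store:

import org.springframework.ai.document.Document
import org.springframework.ai.vectorstore.VectorStore
import java.io.File

// Loads recent commit messages into the vector store so that "why is this code here?"
// questions can be answered with historical context.
fun loadGitHistory(repoPath: String, vectorStore: VectorStore, limit: Int = 200) {
    val process = ProcessBuilder("git", "log", "--pretty=format:%h%x09%an%x09%s", "-n", limit.toString())
        .directory(File(repoPath))
        .redirectErrorStream(true)
        .start()

    val documents = process.inputStream.bufferedReader().readLines().map { line ->
        val (hash, author, subject) = line.split("\t", limit = 3)
        Document(subject, mapOf("commit" to hash, "author" to author, "type" to "git-history"))
    }
    vectorStore.add(documents)
}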

Another source of knowledge might be "side" documents, which are related to the project but are not part of the source code: API specifications (for instance, autogenerated OpenAPI specifications), project structure (build-time information imported from Gradle/Maven), or various test reports (like test coverage).
