Retrieval augmented generation (RAG) experiments with human-guided machine evaluation

c3d3
Tech@World School History
8 min read · Feb 28, 2024

The purpose of this article is to share how we develop one of the World School History Project’s AI applications safely through experimentation and continuous evaluation with humans in the loop.

As well as building a knowledge base from school history curricula, textbooks and other learning materials, the World School History Project also seeks to make it possible to interact with the knowledge base as an end user. On our product roadmap for this year are two applications:

  • an interactive map that allows users to explore different topics and how different regions of the world treat them (if at all); and
  • a chat interface that allows users to ask questions of the knowledge base in natural language (as one would with ChatGPT).

Here we share how we design experiments to guide our development of the second of these, the chat interface — so as to:

  • respond with the most relevant information, i.e. the response should address the user’s question as directly as possible;
  • respond only with information contained in the knowledge base, i.e. the answer should be derived solely from the resources in the knowledge base and not from sources that the language model has been trained on, which also implies minimising hallucinations.

By sharing the way we design experiments, we also hope to give a non-technical audience more insight into how “AI” systems are developed and how important human input is in evaluating their performance.

Retrieval Augmented Generation (RAG)

In terms of the broad approach we adopt for our solution, we use what is known as retrieval augmented generation, which has many components that can be manipulated to improve the quality of responses. You can think of RAG as asking an AI model (in this case a large language model) to read some resources and then answer questions based only on what is contained in them, in a similar way to when students answer reading comprehension questions.

Retrieval augmented generation (RAG) allows us to exploit both the natural language capabilities of large language models (such as the GPTs which power ChatGPT; you may also have heard of other model families such as Gemini, Llama, Mistral, Mixtral, Zephyr) and the information in the knowledge base by grounding the responses in the resources in the knowledge base.

Below is a simplified schematic of retrieval augmented generation (RAG) from a user’s perspective (i.e. excluding the components related to building the knowledge base).

Simplified RAG schematic from a user perspective

From this basic schematic, we can further elaborate on the components that feature in each of the processes:

  • Query processing: Translate the user’s query into a machine-understandable format.
    A language model is required to convert the user’s natural-language query into a format that can be used for retrieval. This is likely to involve converting it into an embedding (a vector of numbers) that can be used to query the knowledge base.
  • Retrieval: Retrieve relevant resources from the knowledge base based on the query.
    The resources in the knowledge base whose semantic relatedness to the query exceeds a threshold are retrieved. This semantic relatedness is determined by some similarity or distance metric between the embedding (vector) representing the query and the embeddings representing each of the resources/documents/chunks in the knowledge base. Additional metadata might also be included in the query to further narrow down or rerank the results; for example, the user may wish to specify which countries’ curricula and learning materials should be considered in formulating the response.
    Note also that the language model used to embed the query should be the same as the one used to embed the resources so that they exist in the same “space” and can be compared. See this previous article to get an overview of how our knowledge base has been developed.
  • Response augmentation and generation: Output a relevant human-comprehensible response to the user’s query.
    A generative (large) language model (e.g. GPT-4, Llama 30B) is used to generate the output to the user. As well as the user query, the prompt given to the model also includes instructions to base the response exclusively on the resources retrieved. Depending on the language model, the prompt may need to be modified to ensure the model does this, and that it returns nothing if no resources in the knowledge base exceed the relatedness threshold.
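To make these three steps more concrete, below is a minimal sketch in Python. It is illustrative only, not our production code: the resource texts, the threshold value and the prompt wording are placeholders, and it uses one of the open embedding models mentioned later via the sentence-transformers library.

```python
# Minimal RAG sketch (illustrative only): embed the query, retrieve resources
# above a similarity threshold, and build a grounded prompt for a generative model.
from sentence_transformers import SentenceTransformer

# Query processing: the same model must embed both queries and resources.
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Toy "knowledge base" of resource chunks (placeholders, not real curricula).
resources = [
    "The French Revolution is covered in the upper-secondary curriculum of ...",
    "The Meiji Restoration appears in lower-secondary textbooks in ...",
]
resource_embeddings = embedder.encode(resources, normalize_embeddings=True)

def retrieve(query: str, threshold: float = 0.75) -> list[str]:
    """Retrieval: keep resources whose cosine similarity to the query exceeds the threshold."""
    query_embedding = embedder.encode(query, normalize_embeddings=True)
    similarities = resource_embeddings @ query_embedding  # cosine, since vectors are normalised
    return [r for r, s in zip(resources, similarities) if s >= threshold]

def build_prompt(query: str, retrieved: list[str]) -> str:
    """Augmentation: instruct the generative model to use only the retrieved resources."""
    context = "\n\n".join(retrieved) if retrieved else "(no relevant resources found)"
    return (
        "Answer the question using ONLY the resources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Resources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

question = "How is the French Revolution taught in different countries?"
print(build_prompt(question, retrieve(question)))
```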

Experimental conditions: What makes a difference to performance?

With respect to the three components (query processing, retrieval, and response augmentation and generation), we can identify elements within each that we can manipulate so as to alter the performance of our RAG solution (those mentioned below are not the only ones possible, but they are the ones we currently include in our experiments).

Query processing

  • Embedding model used to represent the query and the resources in the knowledge base. Together with the reranking model, this determines to what extent the resources retrieved are relevant. For our initial batch of experiments, we tried thenlper/gte-small, BAAI/bge-small-en-v1.5, and intfloat/e5-small-v2; these were all small models that achieved decent performance at the time of writing (although by now there may well be other models that perform better; see the MTEB leaderboard).
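As a rough illustration of what this condition means in practice, the snippet below scores the same made-up query/resource pair with each of the three candidate models; the resulting similarities, which drive retrieval, will differ from model to model.

```python
# Compare how the three candidate embedding models score the same query/resource
# pair (the texts are placeholders). Note: e5 models recommend "query: "/"passage: "
# prefixes in practice; that detail is omitted here for brevity.
from sentence_transformers import SentenceTransformer

query = "How is the French Revolution taught in secondary school?"
resource = "The French Revolution is covered in the upper-secondary history curriculum of ..."

for model_name in ["thenlper/gte-small", "BAAI/bge-small-en-v1.5", "intfloat/e5-small-v2"]:
    model = SentenceTransformer(model_name)
    q, r = model.encode([query, resource], normalize_embeddings=True)
    print(f"{model_name}: cosine similarity = {float(q @ r):.3f}")
```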

Retrieval

  • Reranking model to rerank the results based on other attributes beyond the text embeddings. The model determines the relative weights that should be given to the text similarity and the other attribute(s). We tried two conditions: one with no reranking, and one with reranking based on entity matches between the query and resources (it is likely that we will include additional features in future experiments and iterations; a simplified sketch follows this list). For the reranking condition, we used the Cohere reranking model, but in future experiments we will also try open source models, such as BAAI/bge-reranker-large, RankVicuna or RankZephyr.
  • Threshold for inclusion of resources in the generation of the response. This determines both the volume and relevance of results included. We chose to try with thresholds of 0.65, 0.75, 0.85, and 0.95.
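The sketch below is a simplified stand-in for these two conditions: it blends the embedding similarity with a naive entity-overlap score and then applies the inclusion threshold. The weighting, the entity sets and the threshold value are placeholders, and in our actual reranking condition we called Cohere’s reranker rather than hand-rolling the scoring.

```python
# Simplified reranking + thresholding (a stand-in, not the Cohere reranker we used):
# blend embedding similarity with a naive entity-overlap score, then filter and sort.
def entity_overlap(query_entities: set[str], resource_entities: set[str]) -> float:
    """Fraction of query entities that also appear in the resource (0.0 to 1.0)."""
    if not query_entities:
        return 0.0
    return len(query_entities & resource_entities) / len(query_entities)

def rerank(candidates: list[dict], query_entities: set[str],
           entity_weight: float = 0.3, threshold: float = 0.75) -> list[dict]:
    """Each candidate has 'text', 'similarity' and 'entities'; entity_weight controls
    how much the entity match counts relative to the text similarity (placeholder value)."""
    for c in candidates:
        c["score"] = ((1 - entity_weight) * c["similarity"]
                      + entity_weight * entity_overlap(query_entities, c["entities"]))
    kept = [c for c in candidates if c["score"] >= threshold]
    return sorted(kept, key=lambda c: c["score"], reverse=True)

candidates = [
    {"text": "…", "similarity": 0.81, "entities": {"french revolution", "france"}},
    {"text": "…", "similarity": 0.78, "entities": {"meiji restoration", "japan"}},
]
print(rerank(candidates, query_entities={"french revolution"}))
```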

Response augmentation and generation

  • Generative model (typically an LLM) to generate the response using the resources retrieved. In our initial batch of experiments we used llama-7b, a very small open model, and gpt-3.5-turbo, which has powered some versions of ChatGPT. We chose to experiment with the small llama-7b model to see whether it could still be good enough for our purposes. As with embedding and reranking models, new models with better and better performance are constantly being released, so for more extensive experiments we will select other models.
  • Prompt used in conjunction with the retrieved resources, which is fed to the generative model to produce the response. Different prompts can make a huge difference to the output, but in this first batch of experiments we essentially experimented with only one prompt, which stipulated that the model should include in its responses only information supported by the resources retrieved. (As a side note, for open models such as Llama, prompts need to be formatted according to certain templates to produce the best responses; for this reason the prompts given to our two models had to be slightly different. A sketch of this step follows this list.)
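Below is a sketch of the generation step for both models. The prompt wording is illustrative rather than our exact prompt; the gpt-3.5-turbo call assumes the current openai Python client with an API key in the environment, and the Llama function shows the kind of template formatting (here the Llama-2 chat format) that open models expect.

```python
# Sketch of the generation step; prompt wording and template are illustrative.
from openai import OpenAI

SYSTEM_PROMPT = (
    "Answer using ONLY the resources provided. If the resources do not contain "
    "the answer, reply that the knowledge base has no relevant material."
)

def generate_with_gpt(query: str, retrieved: list[str]) -> str:
    """Grounded generation with gpt-3.5-turbo (assumes OPENAI_API_KEY is set)."""
    client = OpenAI()
    context = "\n\n".join(retrieved)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Resources:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

def build_llama_prompt(query: str, retrieved: list[str]) -> str:
    """Open models such as Llama expect a specific template; this uses the
    Llama-2 chat format ([INST] ... [/INST]) as an example."""
    context = "\n\n".join(retrieved)
    return (
        f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
        f"Resources:\n{context}\n\nQuestion: {query} [/INST]"
    )
```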

To summarise, for our initial batch of experiments, we have (3 x 2 x 4 x 2 =) 48 conditions in total from combining:

  • 3 x embedding models;
  • 2 x reranking conditions;
  • 4 x resource inclusion thresholds;
  • 2 x generative models.
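Enumerating the grid explicitly makes the 48 conditions easy to iterate over; the run_experiment call below is a hypothetical stand-in for running the full pipeline and recording its outputs.

```python
# Enumerate the 3 x 2 x 4 x 2 = 48 experimental conditions.
from itertools import product

embedding_models = ["thenlper/gte-small", "BAAI/bge-small-en-v1.5", "intfloat/e5-small-v2"]
reranking_conditions = [None, "entity-match"]
thresholds = [0.65, 0.75, 0.85, 0.95]
generative_models = ["llama-7b", "gpt-3.5-turbo"]

conditions = list(product(embedding_models, reranking_conditions, thresholds, generative_models))
assert len(conditions) == 48

for embedder, reranker, threshold, generator in conditions:
    # run_experiment(embedder, reranker, threshold, generator)  # hypothetical pipeline call
    pass
```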

The results of these initial experiments will be used to guide more extensive experiments to achieve an optimal system.

Evaluation criteria and metrics: How do we know how good our responses are?

To return to the objectives of the system (as outlined above), we are trying to maximise the following:

  • Relevance of the resources retrieved — the resources retrieved to be used in the response should be relevant to the user’s query (this implies that they should be correctly ranked/scored);
  • Resource fidelity of the response — support for the response should be found in the resources retrieved (note that this says nothing about the veracity of the response with respect to the world, but only says that it is a faithful reflection of the resources retrieved; therefore if the resources contain falsehoods, these would also be reflected in the responses);
  • Human readability of the response — this is more subjective but is perhaps the factor that makes a given AI solution appear more “intelligent”.

To determine each of the above, we adopt a strategy that takes human evaluation as the ground truth and then uses this to evaluate and tune machine evaluation. For the human reviews, we asked two reviewers to evaluate the following:

  • Resource relevance: For each of the resources retrieved, could the resource be used to answer the query? Yes/No
  • Resource fidelity: Is the content of the response (no matter whether true or false) supported by the resources retrieved? Yes/No
  • Readability: How readable is the response? Scale from 0 to 10.

We then translate these human responses into scores:

  • Resource relevance and resource fidelity scores: 0 if neither reviewer says Yes; 1 if one of the reviewers says Yes; 2 if both reviewers say Yes.
  • Readability: Mean of normalised scores (normalisation means we take each reviewer’s scores relative to their own rating distribution, to account for individual differences in how they rate, e.g. one reviewer might be stingier with their scores or have higher standards).
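One simple way to implement this scoring is sketched below; the z-score normalisation (each reviewer’s ratings relative to their own mean and spread) is our assumption here, and the ratings shown are made up.

```python
# Turn reviewer judgements into scores: Yes/No counts for relevance and fidelity,
# and per-reviewer normalised readability (z-scores are one possible normalisation).
from statistics import mean, pstdev

def yes_count(reviewer_a_says_yes: bool, reviewer_b_says_yes: bool) -> int:
    """0 if neither reviewer says Yes, 1 if one does, 2 if both do."""
    return int(reviewer_a_says_yes) + int(reviewer_b_says_yes)

def normalise(ratings: list[float]) -> list[float]:
    """Express one reviewer's readability ratings relative to their own distribution."""
    mu, sigma = mean(ratings), pstdev(ratings)
    return [(r - mu) / sigma if sigma else 0.0 for r in ratings]

# Made-up readability ratings (0-10) from two reviewers over the same responses.
reviewer_1 = [7, 8, 5, 9]
reviewer_2 = [5, 6, 3, 7]  # a "stingier" reviewer
readability_scores = [mean(pair) for pair in zip(normalise(reviewer_1), normalise(reviewer_2))]
```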

Experiments to evaluate and optimise machine evaluation with human input: Which machine evaluation can we trust the most?

Having obtained the scores from the human reviewers, we can use them to evaluate machine-computed evaluation metrics. The human scores are correlated with the following machine-computed metrics, each of which corresponds to one of the criteria above:

  • Similarity scores between the resources retrieved and the responses (several scoring strategies and metrics are possible, but for our initial experiments we used cosine similarity between embeddings as computed by the three embedding models we experimented with). The extent to which high scores from a particular machine similarity scoring method are associated with the resource being deemed relevant by human reviewer(s) indicates the extent to which the machine similarity scoring method provides a proxy for resource relevance;
  • Averages of the similarity scores between the resources retrieved and the response; different summary statistics (e.g. mean, median) can be used to give a single score representing the similarity between the response and all the resources it should be derived from. The extent to which the scores from a particular machine scoring method correlate with the human resource fidelity scores indicates the extent to which that method provides a proxy for resource fidelity;
  • Readability scores of the responses (there are many readability scoring methods, libraries and tools, but for our initial experiments we used textstat and readable). The extent to which the scores from a particular machine readability scoring method correlate with the human readability scores indicates the extent to which the machine readability scoring method provides a proxy for readability.
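The check itself can be as simple as a rank correlation between a machine metric and the corresponding human scores, as sketched below; the numbers are placeholders, not results from our experiments, and the same pattern applies to the similarity-based relevance and fidelity proxies.

```python
# Check how well a machine-computed metric tracks the human scores.
# All numbers below are placeholders, not real experimental data.
import textstat
from scipy.stats import spearmanr

responses = [
    "The French Revolution is taught in ...",
    "The Meiji Restoration appears in ...",
    "The transatlantic slave trade is covered in ...",
    "The Cold War features in ...",
]

# Machine readability score for each generated response.
machine_readability = [textstat.flesch_reading_ease(r) for r in responses]

# Human readability scores (mean of normalised reviewer ratings) for the same responses.
human_readability = [0.8, -0.3, 0.5, -1.0]

# Spearman rank correlation: how strongly does the machine metric agree with the humans?
correlation, p_value = spearmanr(machine_readability, human_readability)
print(f"Readability proxy correlation: {correlation:.2f} (p = {p_value:.3f})")
```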

In other words, the extent to which the machine-computed scores agree with the human evaluation scores gives us a level of confidence with which we can trust a particular machine evaluation method or metric. (For those who are interested in other LLM/RAG evaluation techniques, this article gives quite a nice high-level view of the different approaches.)

This article is one in an ongoing series of articles on how we develop the tech behind the World School History project. We strive to ensure we use AI responsibly, which is why we always involve humans in the development both of our knowledge base (see our previous article to learn more) and of the applications we seek to develop on top of it. Another objective of this article was to give lay audiences a flavour of what is involved in developing AI systems.
