Enhancing Retrieval-Augmented Generation with Synthetically Generated Attributes

Published in gft-engineering, Mar 26, 2024

Natural language processing techniques and artificial intelligence models are constantly evolving. Because of the sheer number of use cases that come with adopting Large Language Models (LLMs), novel approaches are published practically daily.

Retrieval-augmented generation (RAG) is a mainstream methodology that integrates the content-generation capabilities of LLMs with particular knowledge repositories, such as databases, SharePoint, or PDFs. Thanks to this integration, the models can generate replies from data that is not publicly available, in addition to what they learned from the original training dataset.

However, the effectiveness of a RAG system depends on the quality of the data it retrieves. We propose a method that uses synthetically generated attributes to improve this quality. In this article, we describe the method and examine how adding synthetic attributes to the corpus can improve the accuracy of the retrieval process. We look at how domain-specific attributes such as themes, categories, chronological context, geographical relevance, and document purpose can be synthesised to provide a more accurate retrieval mechanism.

We will also examine the advanced retrieval methods made possible by these features, including dynamic ranking, attribute-weighted retrieval, and attribute-based filtering, all of which help to select content that is more pertinent and focused.

1. Enriching the contents by adding synthetic attributes

In this first step, we propose augmenting the content with synthetic attributes during the ingestion process. These attributes are generated by LLMs and serve as advanced metadata, adding context to the content. Some examples:

Topics and categories: Utilising LLMs for categorization and topic extraction, documents can be tagged with specific topics or categories, helping in thematic retrieval.
Temporal context: Assigning time-related tags ensures the retrieval of timely and relevant content, essential for time-sensitive topics.
Geographical relevance: Geographical tags can significantly enhance retrieval by aligning content with specific location-based requirements.
Purpose or intent: Identifying the intent behind documents helps refine retrieval, aligning it more closely with the query’s purpose, whether informative, persuasive, or instructional.

These attributes add new capabilities to the retrieval process, allowing content to be fetched in a more contextualised way.
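The ingestion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attribute names, the prompt wording, the `EnrichedDocument` class, and the `enrich` function are all hypothetical, and `fake_llm` is a stand-in for whatever model client is actually used.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical attribute names, mirroring the examples listed above.
ATTRIBUTE_KEYS = ("topics", "time_period", "regions", "intent")

# Hypothetical prompt; a real one would be tuned to the domain.
PROMPT_TEMPLATE = (
    "Return a JSON object with the keys topics (list of strings), "
    "time_period (string), regions (list of strings) and intent "
    "(informative, persuasive or instructional) for this text:\n\n{text}"
)


@dataclass
class EnrichedDocument:
    doc_id: str
    text: str
    attributes: dict = field(default_factory=dict)


def enrich(doc_id: str, text: str, llm: Callable[[str], str]) -> EnrichedDocument:
    """Ask the LLM for attributes and attach them to the document as metadata."""
    raw = llm(PROMPT_TEMPLATE.format(text=text))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}  # keep the document even if the model returns malformed JSON
    if not isinstance(parsed, dict):
        parsed = {}
    # Keep only the expected keys; missing ones are stored as None.
    attributes = {key: parsed.get(key) for key in ATTRIBUTE_KEYS}
    return EnrichedDocument(doc_id, text, attributes)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return json.dumps({
        "topics": ["monetary policy"],
        "time_period": "2023",
        "regions": ["EU"],
        "intent": "informative",
        "extra": "ignored",
    })


doc = enrich("doc-1", "The ECB raised interest rates in July 2023.", fake_llm)
print(doc.attributes)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider, and the fallback to empty attributes means a malformed model response does not block ingestion.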

2. Advanced retrieval using synthetic attributes

Once the content has been enriched, the RAG system is extended to use the attributes during the selection process, taking into account:

Attribute-based filtering: This enables more precise searches, aligning retrieval with the specific needs of the generation task.
Attribute-weighted retrieval: When multiple attributes are relevant, the system can prioritize them, adjusting retrieval to the context of the query.
Dynamic ranking: Beyond traditional scoring, RAG now uses synthetic attributes in ranking according to domain or specific use cases. This ensures that the results are aligned to the context of the query.

This advanced RAG method ensures better alignment of the retrieved content with the generative task to resolve the initial query, encouraging more contextualised and higher quality responses.
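A minimal sketch of how the three mechanisms could fit together is shown below. The `Candidate` class, the function names, the weights, and the blending factor `alpha` are assumptions made for illustration; the base score stands for whatever score the existing retriever already produces.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    base_score: float  # score from the existing retriever (assumed to be in [0, 1])
    attributes: dict


def _as_list(value):
    """Treat single-valued and multi-valued attributes uniformly."""
    return value if isinstance(value, list) else [value]


def matches(candidate, filters):
    """Attribute-based filtering: every filter value must be present."""
    return all(
        wanted in _as_list(candidate.attributes.get(key))
        for key, wanted in filters.items()
    )


def attribute_score(candidate, preferences, weights):
    """Attribute-weighted retrieval: sum the weights of the preferences that match."""
    return sum(
        weights.get(key, 0.0)
        for key, wanted in preferences.items()
        if wanted in _as_list(candidate.attributes.get(key))
    )


def rank(candidates, filters, preferences, weights, alpha=0.5):
    """Dynamic ranking: blend the base score with the attribute score."""
    kept = [c for c in candidates if matches(c, filters)]

    def final_score(c):
        return (1 - alpha) * c.base_score + alpha * attribute_score(c, preferences, weights)

    return sorted(kept, key=final_score, reverse=True)


candidates = [
    Candidate("a", 0.90, {"topics": ["finance"], "regions": ["US"], "intent": "informative"}),
    Candidate("b", 0.70, {"topics": ["finance"], "regions": ["EU"], "intent": "informative"}),
    Candidate("c", 0.95, {"topics": ["sports"], "regions": ["EU"], "intent": "informative"}),
]

ranked = rank(
    candidates,
    filters={"topics": "finance"},                            # hard constraint
    preferences={"regions": "EU", "intent": "informative"},   # soft preferences
    weights={"regions": 0.7, "intent": 0.3},
)
print([c.doc_id for c in ranked])  # ['b', 'a']
```

In this toy example, document "c" is removed by the topic filter despite having the highest base score, and "b" overtakes "a" because it matches the preferred region. The weights and `alpha` would normally be tuned per domain or use case.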

3. Integration of retrieved content into generation

The final phase passes the retrieved content and its attributes, as embeddings, to the LLM. Here, the synthetic attributes provide contextual insight to the LLM, improving the relevance and quality of its output.

For example, when generating a thematic report, the model uses embeddings from articles tagged with the relevant topics and categories, which keeps the generated content thematically aligned and comprehensive.
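The article describes this step in terms of embeddings. As a simpler stand-in, the sketch below assumes the retrieved passages and their attributes are serialised as text into the prompt, which is one common way of handing context to a model. The `build_prompt` function, the passage structure, and the prompt wording are hypothetical.

```python
def format_value(value):
    """Render list-valued attributes as a comma-separated string."""
    if isinstance(value, list):
        return ", ".join(str(v) for v in value)
    return str(value)


def build_prompt(question, passages):
    """Place each retrieved passage, prefixed by its attributes, in the prompt."""
    blocks = []
    for number, passage in enumerate(passages, start=1):
        attrs = "; ".join(
            f"{key}: {format_value(value)}"
            for key, value in sorted(passage["attributes"].items())
        )
        blocks.append(f"[{number}] ({attrs})\n{passage['text']}")
    context = "\n\n".join(blocks)
    return (
        "Answer using only the context below. "
        "Each passage is preceded by its attributes.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


passages = [
    {
        "text": "The ECB raised interest rates in July 2023.",
        "attributes": {"topics": ["monetary policy", "inflation"], "regions": ["EU"]},
    },
]
prompt = build_prompt("Summarise EU monetary policy in 2023.", passages)
print(prompt)
```

Keeping the attributes next to each passage lets the model see why a passage was selected, for instance its topic or region, when composing the answer.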

Benchmarking and Iterative enhancement

To implement this framework, it is useful to define a benchmark procedure for the content retrieved by the RAG system. The three steps above are then implemented incrementally, adding synthetic attributes one at a time to improve retrieval performance. It is common for a new synthetic attribute to have side effects on results that were previously good. Continuous measurement and progressive experimentation are therefore key to refining the system.
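One way to make this measurable is to keep a small set of labelled queries and compare retrieval results before and after each new attribute is introduced. The sketch below uses hit rate at k as the metric, which is an assumption rather than something the framework prescribes; the function names and the toy data are purely illustrative.

```python
def has_hit(ranked, relevant_ids, k):
    """True if any of the top-k retrieved documents is labelled relevant."""
    return any(doc_id in relevant_ids for doc_id in ranked[:k])


def hit_rate_at_k(results, relevant, k=3):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(has_hit(ranked, relevant[query], k) for query, ranked in results.items())
    return hits / len(results)


def regressions(before, after, relevant, k=3):
    """Queries that had a hit before the change and lost it afterwards."""
    return sorted(
        query
        for query in before
        if has_hit(before[query], relevant[query], k)
        and not has_hit(after.get(query, []), relevant[query], k)
    )


# Toy labelled set: query -> ids of the documents judged relevant.
relevant = {"q1": {"d1"}, "q2": {"d4"}}

# Toy retrieval runs: query -> ranked document ids.
baseline = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d6", "d7"]}
with_region = {"q1": ["d2", "d3", "d5"], "q2": ["d4", "d6", "d7"]}

print(hit_rate_at_k(baseline, relevant))             # 0.5
print(hit_rate_at_k(with_region, relevant))          # 0.5
print(regressions(baseline, with_region, relevant))  # ['q1']
```

In this toy run the aggregate score is unchanged, yet one query that used to succeed now fails. Tracking per-query regressions alongside the aggregate metric is what exposes the side effects of a newly added attribute.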

Complementing vector databases

An advantage of this framework is that it can complement, and potentially replace, vector databases by working with traditional SQL and NoSQL databases such as PostgreSQL or Elastic. In some scenarios, RAG capabilities can therefore be implemented by reusing existing databases, avoiding the complexity of a vector database.

Also, by enriching contents with synthetic attributes we can create structured metadata that allows these databases to perform complex retrieval tasks.
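As an illustration, the sketch below stores synthetic attributes as ordinary columns and retrieves with plain SQL. It uses Python's built-in `sqlite3` module as a stand-in for a production database such as PostgreSQL; the table layout, column names, sample rows, and the `retrieve` function are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY,
        body   TEXT,
        topic  TEXT,     -- synthetic attribute
        region TEXT,     -- synthetic attribute
        year   INTEGER,  -- synthetic attribute
        intent TEXT      -- synthetic attribute
    )
    """
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("d1", "ECB rate decision", "finance", "EU", 2023, "informative"),
        ("d2", "Fed rate outlook", "finance", "US", 2024, "informative"),
        ("d3", "Cup final report", "sports", "EU", 2024, "informative"),
        ("d4", "Old banking rules", "finance", "EU", 2015, "instructional"),
    ],
)


def retrieve(conn, topic, region, min_year, limit=5):
    """Filter on topic and year, then prefer the requested region and newer documents."""
    cursor = conn.execute(
        """
        SELECT doc_id, body
        FROM documents
        WHERE topic = ? AND year >= ?
        ORDER BY (region = ?) DESC, year DESC
        LIMIT ?
        """,
        (topic, min_year, region, limit),
    )
    return cursor.fetchall()


results = retrieve(conn, topic="finance", region="EU", min_year=2020)
print(results)  # [('d1', 'ECB rate decision'), ('d2', 'Fed rate outlook')]
```

The `WHERE` clause plays the role of attribute-based filtering, while the `ORDER BY` expression is a crude form of attribute-aware ranking. In a real deployment, indexes on the attribute columns and a full-text search facility would likely be added.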

Conclusions

Incorporating synthetically generated attributes into RAG not only refines retrieval accuracy but can also simplify deployment and operation by avoiding vector databases in some scenarios. This opens up new possibilities for RAG systems, extending their applicability across various domains and scenarios.