The 3 Major Challenges Hindering the Implementation of RAG: Can 3 New RAG Frameworks Really Save the Day? (HippoRAG)

pamperherself
9 min read · Aug 26, 2024



Creating AI knowledge bases always comes with a set of frustrating challenges: poor content quality produced by models, inability to retrieve key information, responses that don’t match user queries, or answers that are too vague. Moreover, the retrieval of text often fails to pinpoint the most accurate sentence or segment.

These are the main issues currently faced by RAG (Retrieval-Augmented Generation) products. To address these limitations, several new RAG frameworks have been proposed this year, including GraphRAG, HippoRAG, and EfficientRAG. These frameworks aim to overcome traditional RAG limitations by integrating knowledge graphs, improving retrieval algorithms, and optimizing tokenization techniques.

These new frameworks primarily target critical pain points in current RAG systems, such as the independent encoding of each token, leading to a lack of contextual understanding, and the inability to effectively handle multi-hop question answering tasks that require integrating multiple information sources.

This article focuses on HippoRAG, since I've already covered other knowledge-graph retrieval topics in articles like GraphRAG and Comprehensive Guide to GNN, GAT, and GCN: A Beginner's Introduction to Graph Neural Networks After Reading 11 GNN Papers.

EfficientRAG's framework and paper are relatively straightforward, so I'll only briefly cover its core technologies in section 05.

01

The Three Current Bottlenecks of RAG:

Independent Token Encoding Leading to a Lack of Contextual Understanding:

Current tokenization techniques encode each stored token independently, without considering its context, which leads to shortcomings on tasks that require integrating new knowledge across multiple passages.

When RAG performs retrieval, it may stop matching after finding the first keyword: once it has matched content related to one token, it won't continue matching content related to the others. As a result, it cannot effectively combine information from multiple sources, and the context around each independently encoded token is lost.

Lack of Controllability in RAG Retrieval:

There’s a saying that traditional search can only precisely search for keywords, while modern large models can only perform fuzzy searches.

The reason behind this is that current RAG models typically use semantic-based retrieval methods, where the LLM (Large Language Model) interprets new text based on its trained weights and generates text accordingly.

Poor Retrieval Matching:

A RAG model might return results that match a passage’s semantics, but the details might not fully align with the user’s query. The matching might be at the paragraph level, but the specific sentence within the paragraph may not match the user’s question.

These issues are generally mitigated by increasing the length of text segments, so that more content is encoded together as a single chunk. However, this increases the data-processing workload, especially for large datasets and long documents.

  • Tokenization: Tokenization is the process of breaking down a continuous text (sentence or paragraph) into smaller units, usually words, subwords, or characters. These units are called tokens.
  • Token lengths for chunks can range from smaller sizes (e.g., 128 or 256 tokens) to larger ones (e.g., 512 or 1024 tokens). If the model has a maximum sequence length (e.g., GPT-4 8k can handle up to 8,192 tokens, and GPT-4-32k up to 32,768 tokens, roughly equivalent to 50 pages of text), the chunk size must be adjusted accordingly so the text can be processed properly.
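To make the chunking mechanics concrete, here is a minimal Python sketch of fixed-size chunking, assuming the `tiktoken` library; the 256-token chunk size and the overlap value are illustrative choices, not settings from any of the papers.

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into fixed-size token chunks with a small overlap.

    Each chunk is later embedded independently, which is exactly why
    context that straddles a chunk boundary can be lost at retrieval time.
    """
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # slide forward, keeping a little shared context
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
    return chunks
```

The overlap softens, but does not solve, the boundary problem: a fact split across two chunks still has no single embedding that captures it whole.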

Below are two images illustrating the issues caused by independent token encoding, using a somewhat extreme example with only 256 tokens. Compared to GPT-4’s 8k and 32K tokens, this is indeed quite small.

  • Inaccurate Answer Due to Insufficient Chunk Length: The relevant information was cut off.
  • Incomplete Answer Due to Overly Short Chunks: Only half of the original content is included.

02

The Main Structure of HippoRAG: LLM + Knowledge Graph + PageRank Algorithm

The PageRank algorithm is a graph-based method used to evaluate the importance of nodes, typically applied in network analysis and information retrieval. In knowledge graphs, the PageRank algorithm can help identify and rank nodes most relevant to the query.
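To see the difference between plain and personalized PageRank on a graph, here is a small `networkx` sketch; the toy graph and its node names are made up for illustration.

```python
import networkx as nx

# Toy knowledge graph: nodes are entities, edges are relations between them.
G = nx.Graph()
G.add_edges_from([
    ("Prof. Thomas", "Alzheimer's"),
    ("Prof. Thomas", "Stanford"),
    ("Stanford", "California"),
    ("Alzheimer's", "neurodegeneration"),
])

# Plain PageRank ranks nodes by their global importance in the graph.
global_scores = nx.pagerank(G)

# Personalized PageRank biases the random walk toward the query's nodes,
# so the ranking reflects relevance to this particular query.
query_scores = nx.pagerank(G, personalization={"Stanford": 0.5, "Alzheimer's": 0.5})

print(sorted(query_scores.items(), key=lambda kv: -kv[1]))
```

The personalization dictionary is what makes the walk query-specific: restart probability mass is concentrated on the query entities instead of being spread uniformly across the graph.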

At its core, HippoRAG is inspired by the neocortex and hippocampus in the brain: the neocortex is responsible for long-term memory, complex reasoning, and language understanding, while the hippocampus handles spatial memory, converting short-term memory into long-term memory, and quick indexing.

HippoRAG synergistically orchestrates LLMs, knowledge graphs, and the Personalized PageRank algorithm to mimic the different roles of the neocortex and hippocampus in human memory.

The second and third rows in the image below illustrate the brain’s hippocampus and neocortex processes in response to a query.

The blue stripe represents the hippocampus, while the red area is the neocortex.

In the second image on the right, when a question is asked, the neocortex first processes and examines the relationships between the elements. Once it identifies "Prof. Thomas," it maps it to the hippocampus for further analysis.

Through the interaction between the hippocampus and the neocortex, humans can update and apply knowledge, becoming increasingly intelligent. HippoRAG aims to replicate this process to reconstruct RAG.

03

Hippocampal Memory Indexing Theory

The hippocampal memory indexing theory, proposed by Teyler and DiScenna, aims to achieve pattern separation and pattern completion:

  • Pattern Separation: Ensures that representations of different perceptual experiences are distinct, avoiding confusion.
  • Pattern Completion: Allows the retrieval of a complete memory from partial stimuli.

Memory encoding facilitates pattern separation:

When a question is asked, the neocortex first receives and processes the perceptual stimuli, converting them into more actionable and possibly higher-order features.

These features are then passed through the parahippocampal regions (PHR) to the hippocampus.

In the hippocampus, significant signals are indexed and linked, forming a hippocampal index used for memory storage.

Pattern Completion:

When the hippocampus receives partial perceptual signals from the PHR, pattern completion drives the memory retrieval process.

The hippocampus uses its context-dependent memory system (a densely connected neural network in the CA3 region) to recognize and retrieve the complete and relevant memory.

The retrieved memory is sent back through the PHR to the neocortex, where it is simulated and replayed.

This complex memory processing allows the brain to integrate new information by modifying the hippocampal index rather than updating the representations in the neocortex. This means the brain can effectively process and store new memories without completely altering the structure of existing long-term memories.

The connection between the hippocampus and neocortex had always been unclear to me, but now I finally understand how they work together.

04

Knowledge Graphs Combined with the PageRank Algorithm

HippoRAG simulates the hippocampal process of indexing and retrieving information using knowledge graphs and the PageRank algorithm, which corresponds to the hippocampus’s ability to retrieve memories when receiving partial information.

The LLM acts as the neocortex, responsible for processing complex information.

Retrieval Encoders serve as the PHR, coordinating information transfer between the hippocampus and the neocortex.

Offline Indexing / Pattern Separation

The LLM processes incoming text passages and uses Open IE (open information extraction) to convert the text into structured knowledge triples, e.g., ("Thomas", "researches", "Alzheimer's").

Retrieval Encoders match the information extracted by the LLM with nodes in the knowledge graph, performing a function similar to pattern separation, ensuring that new information is indexed and associated in the knowledge graph in the correct manner.

The Retrieval Encoders then pass the knowledge triples to the hippocampus, which constructs and stores the corresponding index, connecting it with existing knowledge.
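A minimal sketch of this offline indexing step is below. The `llm_extract_triples` function is a hypothetical stand-in for an Open IE prompt to the LLM (it returns a canned example here); the graph bookkeeping uses `networkx`.

```python
import networkx as nx

def llm_extract_triples(passage: str) -> list[tuple[str, str, str]]:
    # Hypothetical: in practice this would prompt an LLM for Open IE and
    # parse (subject, relation, object) triples out of the response.
    return [("Prof. Thomas", "researches", "Alzheimer's"),
            ("Prof. Thomas", "works at", "Stanford")]

def index_passage(kg: nx.MultiDiGraph, passage: str, passage_id: str) -> None:
    """Add one passage's triples to the knowledge graph."""
    for subj, rel, obj in llm_extract_triples(passage):
        for node in (subj, obj):
            kg.add_node(node)
            # Remember which passages mention each node, so a ranked node
            # can later be traced back to retrievable text.
            kg.nodes[node].setdefault("passages", set()).add(passage_id)
        kg.add_edge(subj, obj, relation=rel)

kg = nx.MultiDiGraph()
index_passage(kg, "Prof. Thomas at Stanford researches Alzheimer's.", "p1")
```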

Online Retrieval / Pattern Completion

When a query is received, the LLM performs Named Entity Recognition (NER) to extract the important entities from the query, such as "Stanford" and "Alzheimer's"; the Retrieval Encoders then pass this entity information to the hippocampus (KG + PageRank) for node matching.

The hippocampus uses the PageRank algorithm to retrieve the most relevant information or memory fragments from the knowledge graph based on the entities and contextual information in the query.

To reduce computational load, node specificity is introduced, similar to the concept of inverse document frequency (IDF).

Inverse Document Frequency (IDF): In information retrieval, IDF is a common technique used to measure the importance of a word within a document collection. Words that appear infrequently (i.e., high IDF value) are typically considered more distinctive and, therefore, given more weight in retrieval.
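For example, in a collection of 1,000 passages, a term that appears in only 10 of them gets idf = log(1000/10) ≈ 4.6 (using the natural log), while a term that appears in 900 gets idf ≈ 0.1, so the rare term carries far more weight in retrieval.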

This means that if a node (or piece of information) appears in fewer passages, its specificity will be higher, and vice versa.

In the information or memory retrieval process, the probability of each query node is multiplied by its node specificity and then combined with the Personalized PageRank algorithm (PPR). This approach allows the adjustment of retrieval probabilities for each node and its neighborhood, giving more influential nodes greater impact during retrieval.
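Here is a minimal, self-contained sketch of that weighting using `networkx`; the graph, the passage counts, and the 1/count definition of specificity are illustrative assumptions rather than HippoRAG's exact implementation.

```python
import networkx as nx

# Toy knowledge graph (same entities as the earlier PageRank sketch).
kg = nx.Graph()
kg.add_edges_from([
    ("Prof. Thomas", "Alzheimer's"),
    ("Prof. Thomas", "Stanford"),
    ("Stanford", "California"),
])

# How many passages each node appears in (made-up numbers).
passage_count = {"Stanford": 40, "Alzheimer's": 3, "Prof. Thomas": 5}

def retrieve(kg: nx.Graph, query_nodes: list[str]) -> dict[str, float]:
    # Node specificity: rarer nodes get more restart probability, like IDF.
    personalization = {n: 1.0 / passage_count.get(n, 1) for n in query_nodes}
    # Personalized PageRank then spreads that mass over each node's neighborhood.
    return nx.pagerank(kg, personalization=personalization)

# "Stanford" is common and "Alzheimer's" is rare, so Alzheimer's-adjacent
# nodes end up dominating the ranking for this query.
scores = retrieve(kg, ["Stanford", "Alzheimer's"])
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```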

(For comparison, GNNs use the attention mechanism for this kind of node weighting, while GraphRAG uses its community hierarchy.)

05

Feature/Method Comparison:

This month, a new RAG framework concept, EfficientRAG, was introduced. Unlike the knowledge graph-based approaches mentioned above, it combines traditional tokenization with multi-round retrieval to answer questions that require information from multiple documents.

The reason it’s called EfficientRAG is that it introduces lightweight labeling and filtering models, reducing the need to repeatedly call large language models (LLMs) to generate new queries during each retrieval, thereby decreasing latency and computational costs.

The framework architecture is as follows (a runnable sketch of the full loop appears after the list):

  • Query Reception: The system first receives the user’s query.
  • Retriever: The retriever extracts relevant text chunks from the database that contain potentially useful information for answering the query.
  • Labeler & Tagger: This module labels each retrieved chunk, determining whether the key information within it is helpful for answering the query. Specifically, the system marks these chunks as or to indicate whether the chunk contains enough information to continue deriving the next question or if the retrieval can end.
  • Filter: The filter processes the text chunks generated by the labeling and tagging module, extracts useful information, and combines it with the original query to generate a new, more specific query for the next retrieval round.
  • Iterative Retrieval: The system continues retrieving with each newly generated query until all chunks are tagged as final (no further sub-question is needed) or the maximum number of iterations is reached.
  • Answer Generation: Finally, all useful text chunks are passed to the LLM to generate the final answer.
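Putting the pipeline together, here is a runnable sketch of the loop. All component names (`retrieve`, `tag_chunk`, `filter_and_rewrite`, `generate_answer`) are placeholders standing in for the retriever, the lightweight labeler/tagger and filter models, and the final LLM call; they are not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Tag:
    is_useful: bool    # does this chunk help answer the query?
    is_terminal: bool  # does it say retrieval can end?

# --- Placeholder components; the real ones are a retriever and small learned models ---
def retrieve(query):
    return ["chunk about sub-topic A", "chunk about sub-topic B"]

def tag_chunk(query, chunk):
    return Tag(is_useful=True, is_terminal=True)

def filter_and_rewrite(original_query, useful_chunks):
    return original_query + " (narrowed by filtered evidence)"

def generate_answer(query, chunks):  # the only step that calls the expensive LLM
    return f"Answer to {query!r} from {len(chunks)} chunks"

def efficient_rag(query: str, max_iters: int = 3) -> str:
    useful_chunks, current_query = [], query
    for _ in range(max_iters):
        # The lightweight tagger, not the LLM, judges each retrieved chunk.
        tagged = [(c, tag_chunk(current_query, c)) for c in retrieve(current_query)]
        useful_chunks += [c for c, t in tagged if t.is_useful]
        if all(t.is_terminal for _, t in tagged):
            break  # every chunk says we have enough information
        # The filter composes the next, more specific sub-query cheaply.
        current_query = filter_and_rewrite(query, useful_chunks)
    return generate_answer(query, useful_chunks)

print(efficient_rag("multi-hop question"))
```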

The core feature of this framework lies in the labeler & tagger and filter, which break down complex multi-hop questions into simple sub-questions and filter out irrelevant information at each step, reducing the need to call computation-heavy LLMs and minimizing the interference of unnecessary information.

Conclusion

Currently, RAG feels like it's in a stagnation phase. Although different frameworks are being explored, real-world implementations remain weak. Past open-source RAG projects like QAnything, Langchain-Chatchat, and MaxKB, as well as the knowledge-base Q&A in workflow tools like Coze and Flowise, haven't delivered very satisfying results.

In comparison, AI Workflow and GPTs seem more practical and capable of addressing real-world needs. I’m starting to explore relevant tutorials as well.


by: pamperherself

