Enterprise Knowledge Enhancement with GenAI and LangChain

Luis Camilo Jimenez
GlobalLogic LatAm
May 3, 2024

Summary

LangChain emerges as an innovative framework tailored for the development of smart search applications leveraging advanced language models such as those from OpenAI, Meta (Llama), and Cohere. Its modular and scalable structure is ideal for creating dynamic applications that integrate sophisticated language comprehension capabilities. LangChain’s architecture allows for the customized assembly of processing chains, adaptable to the specific needs of an organization, with easy integration of document loaders, text transformers, and vector storage systems for efficient searches.

In this article we will demonstrate the efficacy of LangChain through the implementation of a project that utilizes the Retrieval-Augmented Generation (RAG) approach, particularly for question-answering tasks over a document corpus. The process involves loading, splitting, storing, retrieving, and generating information, structured in a pipeline that efficiently processes natural language queries.

In practice, the project showcases modularity across various workflow components such as data extraction, embedding generation, query response, and user interface creation. This modular design ensures maintainability and scalability, allowing for easy modification and extension without overhauling the application.

To evaluate the project’s performance, we exercised the implemented solution using the QAGenerator and ConversationalGenerator. The results highlighted the strengths and weaknesses of each approach. The QAGenerator provided accurate information but lacked conversational memory, leading to disconnected responses in follow-up queries. The ConversationalGenerator, however, maintained dialogue coherence, underscoring the importance of conversational memory in user interactions.

Lastly, the comparison with ChatGPT 3.5 underlines the advantage of having an up-to-date database, as LangChain-powered generators could offer current information, unlike ChatGPT 3.5 which is limited by its last knowledge update. This project illustrates LangChain’s role in developing AI communication systems with real-time data retrieval capabilities, offering a considerable advantage in the development of context-aware NLP applications.

LangChain: A Versatile Framework

In today’s era, marked by the imperative for rapid and accurate access to relevant information, LangChain emerges as an innovative framework for creating smart search applications powered by advanced language models such as those from OpenAI, Meta (Llama), and Cohere (LangChain, n.d.-a). LangChain enables the seamless integration of sophisticated language comprehension capabilities into tailored applications, enhancing the retrieval and contextualization of information.

For instance, an organization might utilize LangChain to develop an internal virtual assistant capable of processing natural language queries and retrieving specific answers from its corporate knowledge base. The language model leverages internal documentation to grasp the context and unique terminology of the entity, thus providing pertinent results. Developers can swiftly customize the system using LangChain’s flexible architecture to meet the ever-changing needs.

LangChain offers a modular and scalable framework for building dynamic applications driven by language models. Its architecture of interchangeable components allows developers to assemble customized processing chains tailored to their specific needs.

For example, LangChain’s document loaders can be effortlessly integrated with various data sources. Then, text transformers prepare the data for processing by language models. Finally, vector storage systems index the generated embeddings, enabling efficient and precise searches.

The modularity of LangChain facilitates ongoing experimentation and evolution. Key components such as data sources, language models, and storage can be replaced or extended without the need to rewrite the entire application. This allows for flexible adaptation to new requirements and advancements in language model technology.
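
To make this concrete, the following minimal sketch wires together the kinds of components just described using classic LangChain classes; the file path, chunk sizes, and retrieval parameters are illustrative assumptions rather than values from any particular project.

# Minimal illustrative pipeline: load -> split -> embed -> store -> retrieve.
# Assumes a classic LangChain installation and an OPENAI_API_KEY in the environment.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

documents = TextLoader("docs/example.md").load()  # placeholder data source
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(documents)
vector_store = Chroma.from_documents(chunks, embedding=OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# Any component can be replaced (another loader, splitter, or vector store)
# without touching the rest of the pipeline.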

The following sections of this article will delve into the specifics of the project:

  • Project Use Case describes the RAG approach and its implementation in the project.
  • Project Organization outlines the modular structure of the project and its components.
  • Workflow details the data flow through the various stages of the pipeline.
  • Results presents the findings from testing the QAGenerator and ConversationalGenerator, and compares their performance with ChatGPT 3.5.
  • Conclusions summarizes the key takeaways and the significance of LangChain in developing context-aware NLP applications.

Project Use Case

Within the scope of our project, we have embraced the approach of Retrieval-Augmented Generation (RAG) (LangChain, n.d.-b). This method has surfaced as a robust and efficient solution for natural language processing applications, particularly in question-answering (QA) tasks over a corpus of documents. Our implementation is grounded on the principles set by LangChain and its comprehensive documentation, which presents a modular and cohesive structure for constructing QA processing chains.

The RAG architecture unfolds across five fundamental stages:

  • Loading: Initially, it is imperative to load the data to be analyzed. In this regard, we utilize a document loading interface provided by LangChain, which supports the integration of diverse data formats and sources.
  • Splitting: Subsequently, the documents are fragmented into smaller pieces. This segmentation is crucial as it prepares the content for more efficient processing, facilitating its management by retrieval and generation algorithms.
  • Storage: Once the texts are split, they are stored in a suitable structure, which, in many implementations, involves embedding text fragments with vector representations, thus facilitating semantic similarity-based retrieval.

Figure 1: Pipeline for Indexing (LangChain, n.d.-b).

  • Retrieval: In this critical phase, relevant information is retrieved from the database using similarity search to identify text fragments that best align with the presented query.
  • Generation: Lastly, a large language model (LLM) is employed to generate a relevant answer based on the question and the data retrieved in the preceding stages.

Figure 2: Pipeline for Retrieval & Generation (LangChain, n.d.-b).

In our project, we have gone a step further by modularizing each of these critical components using Python interfaces and constructors. This design strategy ensures system scalability, allowing for the easy expansion or modification of each component in response to changing requirements or technological advancements.
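
As an illustration of what such interfaces and constructors look like in Python, the sketch below uses typing.Protocol; the names echo those used later in the article, but the bodies here are illustrative only, not the project's actual code.

from typing import Protocol

class IGenerator(Protocol):
    # Contract: any generator turns a text query into a text answer.
    def get_answer(self, query: str) -> str:
        ...

class IGeneratorFactory(Protocol):
    # Contract: a factory assembles a concrete generator from a retriever and a language model.
    def create_generator(self, retriever, llm) -> IGenerator:
        ...

A QA-oriented factory or a conversational one can then be injected interchangeably wherever an IGeneratorFactory is expected, which is exactly the kind of substitution the project relies on.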

Project Organization

Our project, which is hosted on the GenAI Knowledge Management with LangChain (López & Jiménez, 2023) GitHub repository, is divided into modules, with each one responsible for a specific part of the workflow, such as data extraction, embedding generation, query response, and user interface creation. This modular organization facilitates the maintainability and expansion of the project since components are interchangeable and can be adapted to different data sources and language models without costly changes to the application’s structure.

  • main.py: The main file that initiates the application and coordinates the execution of different stages and modules.
  • models: Directory containing the concrete implementations of the modules.
  • factories: Directory containing the concrete implementations of the factories that create instances of the modules and configure their parameters.
  • protocols: Directory containing the interfaces or protocols that define the contracts for the classes in the models and factories modules, following SOLID principles and allowing for greater modularity and scalability of the project.

View GitHub repo

Workflow

Store Data: The process begins with data extraction, carried out by the extractor module, which works in conjunction with data_store to store relevant information and with preprocessor to ensure data is preprocessed before being stored or used.

class Application:
    ⁝ ⁝ ⁝
    def store_data(self):
        data_store: IDataStore = self.data_store_factory.create_data_store()
        data_preprocessor: IDataPreprocessor = self.data_preprocessor_factory.create_data_preprocessor()
        source_list: List[Dict] = self.extractor_factory.get_source_list()
        for source in source_list:
            extractor: IExtractor = self.extractor_factory.create_extractor(source)
            extractor.extract(data_preprocessor, data_store)

Store Vectors: Once the data is stored and preprocessed, the loader, text_splitter, and embedding modules come into play. The first is responsible for loading documents from different data sources, the second for dividing the documents into more manageable fragments, while the third transforms these fragments into vector representations, also known as embeddings. These embeddings are later stored by vector_store in a vector database, facilitating fast and efficient searches.

def store_vector(self):
    loader: BaseLoader = self.loader_factory.create_loader()
    text_splitter: TextSplitter = self.text_splitter_factory.create_text_splitter()
    embeddings: Embeddings = self.embeddings_factory.create_embeddings()
    vector_store: VectorStore = self.vector_store_factory.create_vector_store(embeddings, loader, text_splitter)
    return vector_store

Chat: When the user interacts with the system through the interface provided by gui, it is the generator module that comes into action, using retriever to search the vector database for the text fragments most relevant to the user's query. Subsequently, a language model, managed by language_model, is used to generate coherent and useful responses based on the search results.

def chat(self):
    embeddings: Embeddings = self.embeddings_factory.create_embeddings()
    vector_store: VectorStore = self.vector_store_factory.get_vector_store(embeddings)
    retriever: VectorStoreRetriever = self.retriever_factory.create_retriever(vector_store)
    llm: BaseLanguageModel = self.language_model_factory.create_language_model()
    gui: IGUI = self.gui_factory.create_gui(retriever, llm)
    generator: IGenerator = self.generator_factory.create_generator(retriever, llm)
    gui.run(generator)

View full code
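
Stripped of the factory indirection, the chat stage boils down to a handful of plain LangChain calls. The following sketch shows the equivalent flow; the persistence path and parameter values are assumptions for illustration.

# The chat stage without the factory indirection (illustrative path and parameters).
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector_store = Chroma(embedding_function=embeddings, persist_directory="chroma_db")  # assumed path
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)
result = chain({"question": "what is langchain?", "chat_history": []})
print(result["answer"])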

Store Data

Extractor

The extractor module is responsible for extracting information from various sources. The IExtractor interface has an extract method that takes care of the extraction process.
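
The IExtractor protocol itself is not reproduced here; a plausible minimal form, inferred from the extract call shown below, could look like this (illustrative only).

from typing import Protocol

class IExtractor(Protocol):
    # IDataPreprocessor and IDataStore are the project's other protocols, described below.
    def extract(self, data_preprocessor: "IDataPreprocessor", data_store: "IDataStore") -> None:
        ...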

The GitHubExtractor class implements the extraction logic specified by IExtractor. Its task is to navigate GitHub repositories, identify relevant files, and process their content.

class GitHubExtractor(Extractor):
    ⁝ ⁝ ⁝
    def extract(self, data_preprocessor: IDataPreprocessor, data_store: IDataStore):
        ⁝ ⁝ ⁝
        self._process_directory(self.base_path)

The _process_directory method iterates over the contents of a directory in the repository and, using _get_files, makes calls to the GitHub API to obtain the list of files. In this process, certain files or directories may be skipped based on the criteria defined in _skip.

def _process_directory(self, path: str):
    if self._skip(path):
        return
    url = self._get_url(path)
    ⁝ ⁝ ⁝
    files_ok, files = self._get_files(url, self.headers)
    ⁝ ⁝ ⁝
    for file in files:
        self._process_file_or_directory(file)

def _skip(self, path):
    # Determine if a given path should be skipped based on predefined criteria.

def _get_url(self, path):
    # Construct the URL for accessing the GitHub repository contents at a given path.

def _get_files(self, url, headers) -> Tuple[bool, List[Dict[str, str]]]:
    # Get a list of files from the GitHub repository using the GitHub API.

Entries are then processed one by one in _process_file_or_directory, which distinguishes files from directories and handles only Markdown files (.md and .mdx): their content is downloaded, processed by the data_preprocessor, and then stored with data_store.

def _process_file_or_directory(self, file: Dict[str, str]):
    if file["type"] == "file" and file["name"].endswith((".mdx", ".md")):
        text = self._get_file_text(file)
        if text is not None and isinstance(text, str):
            data = self.data_preprocessor.process(text)
            metadata: Dict[str, str|int] = {"owner": self.owner, "repo": self.repo, "path": file["path"]}
            self.data_store.store_data(data, metadata)
    elif file["type"] == "dir":
        self._process_directory(file["path"])

def _get_file_text(self, file: Dict[str, str]) -> str:
    # Download the text content of a file from GitHub and return it.

Note that GitHubExtractor only needs to work with the IDataPreprocessor and IDataStore interfaces, which keeps it decoupled from the concrete preprocessing and storage implementations.

View full code

Data Preprocessor

The preprocessor module allows for the preparation and cleaning of data before it is stored. The TextCleaner class serves as a concrete implementation of the IDataPreprocessor interface and provides specific methods for processing and cleaning text.

The main function of TextCleaner is to process text using the process method. This method applies a series of transformations that include the removal of HTML tags, URLs, copyrights, emojis, and excessive whitespace, among others. The goal is to ensure that the text is in a cleaner and more uniform format for subsequent use.

class TextCleaner:
    def process(self, data: str|bytes) -> str|bytes:
        # Process the text by applying a series of cleaning and normalizing transformations.
        text = self._remove_html_tags(data)
        ⁝ ⁝ ⁝
        text = self._remove_excess_whitespace(text)
        return text

    def _remove_html_tags(self, text: str) -> str:
        # Remove HTML tags from the text.
        ⁝ ⁝ ⁝

View full code
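
The individual cleaning helpers are elided here; the sketch below shows one plausible way such helpers could be written with regular expressions (the patterns are assumptions, not the project's actual ones).

import re

class TextCleanerSketch:
    # Illustrative helpers of the kind TextCleaner composes in its process method.
    def _remove_html_tags(self, text: str) -> str:
        return re.sub(r"<[^>]+>", " ", text)

    def _remove_urls(self, text: str) -> str:
        return re.sub(r"https?://\S+", "", text)

    def _remove_excess_whitespace(self, text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()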

Data store

The data_store module provides a standardized process for storing documents, abstracting the details specific to the underlying storage system.

The JSONLDataStore class implements the IDataStore interface to handle files in the JSON Lines (JSONL) format. This format allows for the storage of structured data, where each line in the file is a separate JSON object.

The store_data function enables the storage of data along with associated metadata. The data and metadata are packaged into a dictionary and written to the JSONL file.

class JSONLDataStore:
    ⁝ ⁝ ⁝
    def store_data(self, data: str|bytes, metadata: Dict[str, str|int]):
        file_path = self.file_system_helper.get_file_path()
        data_dict = {"metadata": metadata, "data": data}
        with open(file_path, "a") as jsonl_file:
            jsonl_file.write(json.dumps(data_dict) + "\n")

View full code

The FileSystemHelper class manages the creation and maintenance of the data file. It handles the file name and paths, and ensures that the data directory is prepared for storage.

View full code
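
FileSystemHelper is not reproduced in the article either; a minimal sketch of such a helper, using the directory and file name that appear in the configuration shown later, might look like this (the constructor signature is an assumption).

import os

class FileSystemHelperSketch:
    # Builds the JSONL file path and makes sure the data directory exists.
    def __init__(self, file_directory: str = "data", file_name: str = "github_data_store_file.jsonl"):
        self.file_directory = file_directory
        self.file_name = file_name

    def get_file_path(self) -> str:
        os.makedirs(self.file_directory, exist_ok=True)
        return os.path.join(self.file_directory, self.file_name)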

Store Vectors

Loader

The loader module is responsible for loading documents from various data sources. The JSONLLoader class extends from the base class BaseLoader of LangChain and implements its load and lazy_load methods to retrieve stored data, returning them as a list (or an iterator in the case of lazy_load) of Document objects. Each line of the JSONL file is read and converted into a Document object, which includes both the data and the associated metadata.

class JSONLLoader(BaseLoader):
    ⁝ ⁝ ⁝
    def load(self) -> List[Document]:
        file_path = self.file_system_helper.get_file_path()
        with jsonlines.open(file_path) as reader:
            return [Document(page_content=obj.get("data", ""), metadata=obj.get("metadata", {})) for obj in reader]

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time so the file stays open while it is being consumed.
        file_path = self.file_system_helper.get_file_path()
        with jsonlines.open(file_path) as reader:
            for obj in reader:
                yield Document(page_content=obj.get("data", ""), metadata=obj.get("metadata", {}))

View full code

Text Splitter

The text_splitter module segments long documents into more manageable pieces. The ITextSplitterFactory interface facilitates the integration of different implementations.

The specific implementation RecursiveCharacterTextSplitter, provided by LangChain, is responsible for breaking down the content of documents based on the size of the fragment (chunk_size) and the overlap between fragments (chunk_overlap).

class RecursiveCharacterTextSplitterFactory:
    ⁝ ⁝ ⁝
    def create_text_splitter(self) -> TextSplitter:
        return RecursiveCharacterTextSplitter(chunk_size=self.chunk_size, length_function=len, chunk_overlap=self.chunk_overlap)

View full code
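
As a quick illustration of how chunk_size and chunk_overlap behave, the splitter can be exercised directly; the parameter values below are arbitrary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, length_function=len)
chunks = splitter.split_text("LangChain is a framework for developing applications powered by language models. " * 5)
print(len(chunks), [len(chunk) for chunk in chunks])  # several chunks of at most ~100 characters, overlapping by ~20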

Embedding

The embedding module transforms documents into vector representations, also known as embeddings. It uses the IEmbeddingsFactory interface to integrate with different embedding generation implementations.

OpenAIEmbeddings, provided by LangChain and configured with the OpenAI text-embedding-ada-002 model, turns texts into vector representations using OpenAI models.

class OpenAIEmbeddingsFactory:
    ⁝ ⁝ ⁝
    def create_embeddings(self) -> Embeddings:
        return OpenAIEmbeddings(model=self.embedding_model)

View full code
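
For reference, the resulting embeddings object maps a text to a fixed-length vector of floats, 1,536 dimensions in the case of text-embedding-ada-002; a minimal usage example (requiring an OpenAI API key) follows.

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vector = embeddings.embed_query("what is langchain?")
print(len(vector))  # 1536 dimensions for text-embedding-ada-002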

Vector Store

The vector_store module manages the vector representations of document content, known as embeddings, within the system. This module facilitates the creation, storage, and retrieval of these vectors, allowing for searches based on semantic similarity.

The IVectorStoreFactory interface defines the methods create_vector_store and get_vector_store. The create_vector_store method generates a new VectorStore from a list of Documents, a text splitter TextSplitter, and the Embeddings object. On the other hand, get_vector_store returns the VectorStore once it has been created.

The ChromaFactory class provides a specific implementation that uses LangChain's Chroma class, backed by the open-source Chroma vector database, to store the vectors.

class ChromaFactory:
    ⁝ ⁝ ⁝
    def create_vector_store(self, embeddings: Embeddings, loader: BaseLoader, text_splitter: TextSplitter) -> VectorStore:
        documents = loader.load()
        splitted_documents = text_splitter.split_documents(documents)
        return Chroma.from_documents(documents=splitted_documents, embedding=embeddings, persist_directory=self.path)

    def get_vector_store(self, embeddings: Embeddings) -> VectorStore:
        return Chroma(embedding_function=embeddings, persist_directory=self.path)

View full code
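
Once persisted, the store can be reopened and queried by semantic similarity; the following brief sketch assumes a persistence path of chroma_db for illustration.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

store = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="chroma_db")  # assumed path
docs = store.similarity_search("what is langchain?", k=4)
for doc in docs:
    print(doc.metadata.get("path"), doc.page_content[:80])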

Chat

Retriever

The retriever module searches for and retrieves information from a vector database based on vector representations to find the most relevant text fragments for a specific query. It relies on the IRetrieverFactory interface to integrate various implementations.

The specific class NumDocsRetrieverFactory is responsible for configuring and creating instances of VectorStoreRetriever, a class provided by LangChain that allows for efficient searches in vector databases, setting the number of documents (num_docs) that should be retrieved in each search.

class NumDocsRetrieverFactory:
    ⁝ ⁝ ⁝
    def create_retriever(self, vector_db: VectorStore) -> VectorStoreRetriever:
        return vector_db.as_retriever(search_kwargs={"k": self.num_docs})

VectorStoreRetriever acts as a wrapper around VectorStore, leveraging its search methods, such as similarity and MMR (Maximal Marginal Relevance), to query and retrieve the most similar and relevant texts in relation to a given query.

View full code
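
For example, the same vector store can back either a plain similarity retriever or an MMR retriever, which trades a little raw similarity for diversity among the returned fragments; the parameters below are illustrative.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vector_store = Chroma(embedding_function=OpenAIEmbeddings(), persist_directory="chroma_db")  # assumed path
similarity_retriever = vector_store.as_retriever(search_kwargs={"k": 4})
mmr_retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})
docs = mmr_retriever.get_relevant_documents("what is langchain?")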

Language Model

The language_model module generates responses based on the information retrieved and the conversation history. This module facilitates the transition from retrieving data from the vector database to synthesizing responses.

The ILanguageModelFactory interface provides integration with different language models. ChatOpenAI, configured with the gpt-3.5-turbo model, makes predictions by interacting with the OpenAI API to generate responses.

class ChatOpenAIFactory:
    ⁝ ⁝ ⁝
    def create_language_model(self) -> BaseLanguageModel:
        return ChatOpenAI(model=self.model_name, temperature=self.temperature, max_tokens=self.max_tokens)

View full code

Generator

Through the generator module, the user's queries are processed and transformed into answers.

The IGenerator interface establishes a single method, get_answer, which takes a query in the form of a text string and returns a response, also as a text string.

There are two concrete implementations of this interface: QAGenerator and ConversationalGenerator.

QAGenerator provides a direct approach, passing the query directly through the processing chain and returning the generated response, without maintaining a conversation history.

class QAGenerator(Generator):
    ⁝ ⁝ ⁝
    def get_answer(self, query: str) -> str:
        answer = self.chain.run(query)
        return answer

View full code

QAGeneratorFactory is responsible for compiling the necessary processing chain that feeds the QAGenerator. Using the RetrievalQA class from LangChain, QAGeneratorFactory sets up a chain with the language model llm and the document retrieval system retriever.

class QAGeneratorFactory:
    ⁝ ⁝ ⁝
    def create_generator(self, retriever: VectorStoreRetriever, llm: BaseLanguageModel) -> IGenerator:
        chain = RetrievalQA.from_chain_type(llm=llm, chain_type=self.chain_type, retriever=retriever)
        return QAGenerator(chain)

View full code

ConversationalGenerator is designed to maintain a history of the conversation, allowing it to generate responses that take into account the previous context. This history is stored in a list of tuples, where each tuple contains a query and the associated response. The get_answer method collects the query and chat history, passes them through the processing chain, and stores the query and the generated response in the history.

from langchain.chains.base import Chain

class ConversationalGenerator(Generator):
    history = []
    ⁝ ⁝ ⁝
    def get_answer(self, query: str) -> str:
        result = self.chain({"question": query, "chat_history": self.history})
        answer = result["answer"]
        self.history.append((query, answer))
        return answer

View full code

The creation of chain falls to ConversationalGeneratorFactory, which configures LangChain's ConversationalRetrievalChain with the language model llm and the retrieval system retriever. This chain is capable of condensing the conversation history and the current query to retrieve relevant documents and generate conversational responses informed by previous interactions.

class ConversationalGeneratorFactory:
    ⁝ ⁝ ⁝
    def create_generator(self, retriever: VectorStoreRetriever, llm: BaseLanguageModel) -> IGenerator:
        chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, verbose=self.verbose)
        return ConversationalGenerator(chain)

View full code

GUI

The gui module facilitates the articulation of the user's conversation with the system. This module is responsible for receiving user queries, presenting generated responses, and managing the flow of conversation. The IGUI interface defines an essential method, run, which is responsible for initiating and maintaining the interaction cycle with the user.

The run method takes as an argument an object that implements the IGenerator interface, ensuring that it can interact with any concrete implementation. The method first displays an introduction and then runs a loop that reads the user's query, processes it through the generator's get_answer method, and presents the generated response.

The concrete class CommandLineChatGUI provides a specific implementation of IGUI, defining how to interact with the user through the command line. This class manages text input and output, presenting welcome messages, user prompts, and the generated responses.

class CommandLineChatGUI(ChatGUI):
    ⁝ ⁝ ⁝
    def run(self, generator: IGenerator|None = None):
        generator = generator or self.generator
        self._intro()
        while True:
            self._user_message()
            query = self._get_query()
            if self._quit(query):
                break
            answer = generator.get_answer(query)
            self._ia_message(answer)

View full code

Here, _intro displays the welcome messages, _user_message presents the prompt for the user to enter their query, _get_query collects and returns the user's query, _quit determines whether the query indicates a desire to exit the chat, and _ia_message displays the generated answer.

The text lines and messages displayed are configurable through the arguments passed to the initializer of CommandLineChatGUI, allowing for a personalized user experience.

class CommandLineChatGUIFactory:
    ⁝ ⁝ ⁝
    def create_gui(self, retriever: VectorStoreRetriever, llm: BaseLanguageModel) -> IGUI:
        return CommandLineChatGUI(generator=self._get_generator(retriever, llm), ..., ia_lines=self._get_ia_lines())

    def _get_generator(self, retriever: VectorStoreRetriever, llm: BaseLanguageModel) -> IGenerator:
        return QAGeneratorFactory(self.config_loader).create_generator(retriever=retriever, llm=llm)

    ⁝ ⁝ ⁝
    def _get_ia_lines(self):
        return ["[red]AI: [/red]"]

View full code

Results

The project is configured through the config.yaml file, where it is established that the data source will be the documentation in the GitHub repository of LangChain (Chase, 2022).

github:
  repos:
    - owner: langchain-ai
      repo: langchain
      path: docs/docs
⁝ ⁝ ⁝
loader:
  jsonl:
    file_directory: data
    file_base_name: github_data_store
    file_type: jsonl
    file_name: github_data_store_file.jsonl
⁝ ⁝ ⁝

View full code

Running the extraction against this repository produced the file data/github_data_store_file.jsonl, which contains the repository's documentation.

{"metadata": {"owner": "langchain-ai", "repo": "langchain", "path": "docs/docs_skeleton/docs/guides/langsmith/README.md"}, "data": "This section has been moved to [/docs/docs/guides/langsmith]("}
{"metadata": {"owner": "langchain-ai", "repo": "langchain", "path": "docs/docs/_templates/integration.mdx"}, "data": "[comment: Please, a reference example here \"docs/integrations/arxiv.md\"]:: [comment: Use this template to create a new .md file in \"docs/integrations/\"]:: # Title_REPLACE_ME ..."}
⁝ ⁝ ⁝

View JSONL file

With this data, the vector database was created, and the application was run using the QAGenerator and ConversationalGenerator implementations, issuing two basic queries about LangChain: "what is langchain?" and "give me an example".

Run App Config

Since the gpt-3.5-turbo model from OpenAI was used for this project, the responses were compared with those generated by OpenAI's ChatGPT 3.5, which runs on the same model, to verify that the responses indeed come from the vector database rather than from the model's pre-trained knowledge.

ChatGPT 3.5

The ChatGPT 3.5 model demonstrated inherent limitations within its knowledge framework, which has a cutoff in January 2022. When queried about LangChain, the model acknowledged its incapacity to provide post-training information, illustrating one of the fundamental restrictions of language models that do not continuously update or incorporate real-time information retrieval techniques.

You
what is langchain?
ChatGPT
I'm not aware of any specific technology or concept called "Langchain" as of my last knowledge update in January 2022. It's possible that it may be a term or concept that has emerged after that date, or it could be a relatively niche or specialized term that is not widely recognized.
If "Langchain" is a specific term or concept that you're referring to, I would recommend providing more context or details about what it pertains to so that I can try to provide a more accurate explanation or answer.

QAGenerator

Conversely, the QAGenerator, despite having access to an updated database, exhibited a lack of coherence when responding to successive questions on the same topic. This behavior highlights the absence of conversational memory, a critical component that enables AI systems to maintain coherent and contextual dialogue. Although the QAGenerator was able to offer an accurate initial description of LangChain, its inability to remember previous interactions resulted in disconnected responses, casting doubt on its efficacy for applications requiring fluid and continued interaction.

AI: What would you like to ask me about?

You:
what is langchain?
AI:
LangChain is a framework for developing applications powered by language models. It enables applications that are context-aware and can reason based on the provided context. LangChain provides abstractions for working with language models and off-the-shelf chains for accomplishing specific higher-level tasks. It also provides standard interfaces and external integrations for various modules. You can install LangChain, set up your environment, and start building applications using the framework.

You:
give me an example
AI:
Sure! Here's an example of how I can help you with everyday tasks:

User: What's the weather like today?
AI: Let me check. (AI retrieves weather information) It looks like it will be partly cloudy with a high of 75 degrees Fahrenheit.

User: Can you recommend a good restaurant nearby?
AI: Of course! What type of cuisine are you in the mood for?

User: I need help with my math homework.
AI: I'd be happy to assist you. What specific math problem are you struggling with?

User: Can you set a reminder for me to pick up groceries tomorrow?
AI: Certainly! What time would you like the reminder to be set for?

User: What's the capital of France?
AI: The capital of France is Paris.
I hope this example gives you a better idea of how I can assist you with various tasks.

ConversationalGenerator

In contrast, the ConversationalGenerator demonstrated superior proficiency in terms of coherence and context retention. It maintained a consistent line of dialogue by providing detailed information on LangChain, followed by an example that aptly reflected the prior explanation. This approach is indicative of a functional conversational memory, which is essential for constructing coherent and ongoing dialogues.

AI: What would you like to ask me about?
You:
what is langchain?
AI:
LangChain is a tool or framework designed to help with various natural language processing (NLP) tasks. It provides a set of functionalities and utilities for working with language models (LLMs), managing prompts, optimizing prompts, and generating text. LangChain also supports the creation of chains, which involve sequences of calls to LLMs or other utilities. It offers integrations with other tools and provides end-to-end examples for common applications such as question answering, chatbots, and agents. Additionally, LangChain supports data augmented generation, where external data sources are used to enhance the generation process. Overall, LangChain aims to simplify and streamline the development of NLP applications.
You:
give me an example
AI:
Yes, LangChain can be used in a natural language processing task such as question answering over specific documents. For example, you can use LangChain to build a system that can answer questions based on a given document or database. This involves using LangChain's prompt management and optimization capabilities to formulate the question and retrieve the relevant information from the document. LangChain provides end-to-end examples and documentation on how to implement this task.

Conclusions

The exploration of LangChain as a versatile framework for building smart search applications presents a compelling case for its utility in enhancing information retrieval and contextualization through language models. LangChain’s modular and scalable architecture provides a solid foundation for developers to create tailored applications that can adapt and evolve with the changing needs of an organization.

The application of LangChain in our project showcases its strength in handling complex tasks such as question-answering over a corpus of documents. By leveraging the Retrieval-Augmented Generation (RAG) method, we were able to construct a pipeline that effectively transforms raw data into a coherent QA system. This was achieved by following the stages of loading, splitting, storage, retrieval, and generation, with each step carefully designed to work in tandem to process and understand natural language queries efficiently.

The project’s organization into modular components allowed for a high degree of maintainability and scalability. Each module, from data extraction to user interface creation, could be developed, tested, and improved independently, providing flexibility and ease of updates.

The results of the project were particularly revealing. The QAGenerator and ConversationalGenerator each showcased unique capabilities and limitations. While the QAGenerator could provide accurate information on LangChain, it lacked conversational memory, resulting in a series of disjointed responses when asked follow-up questions. In contrast, the ConversationalGenerator demonstrated an ability to maintain context and coherence over successive interactions, showcasing the importance of conversational memory in providing an engaging and helpful user experience.

Our comparison between the QAGenerator’s outputs and those of ChatGPT 3.5 highlighted the advantages of having an updated database. ChatGPT 3.5 was constrained by its last knowledge update and could not provide information on LangChain post its training cut-off in January 2022. Meanwhile, our LangChain-powered generators could provide current and relevant information, demonstrating the importance of integrating real-time data retrieval capabilities in language models.

In conclusion, LangChain stands out as an adaptable framework that can significantly improve the development of NLP applications. Its ability to integrate with various language models and databases empowers developers to create sophisticated applications that are context-aware and can respond to the dynamic requirements of information retrieval tasks. The project exemplifies how the strategic use of LangChain can lead to the development of intelligent systems capable of maintaining meaningful dialogue, a critical asset in the field of AI-driven communication.
