How we built our first loan application chatbot using LLMs: Part 2

Shawn Lim
Technology @ Funding Societies | Modalku
12 min read · Dec 5, 2023

Co-authored with Abir Datta, Staff Engineer

Recap

  • We built a chatbot called Shane.
  • Shane can provide 24/7 customer support, and the response time has been dramatically reduced to an average of 15 seconds. This ensured that customers could get assistance whenever they needed it.
  • Quality of response remains high, with Shane being able to automate many questions related to financing and investing. It can accurately escalate to agents accordingly.

This is a 2-part series of how we made it happen:

  • Part 1: Suitable for audiences from different backgrounds, it provides an overview of the project’s inception, the critical decisions made along the way, and its impact.
  • Part 2: This article, tailored for engineers, will offer a technical deep dive into how we built the entire stack.

User flow

This is the high-level flow of the system we are creating:

  1. A user asks a question on a channel: Website, WhatsApp, Facebook, Instagram, etc.
  2. The question is received on Intercom, a SaaS platform for customer service.
  3. A webhook is triggered for every message or assignment to Shane.
  4. Our AI backend handles the webhook call and responds with a generated answer.
  5. Optionally, the conversation can be escalated to a customer service agent.
  6. The user receives a reply, and the conversation continues.
  7. In the case of an outbound message, our Sales representatives use Intercom to send out messages. User replies are then handled starting from step #1.

Technical Architecture

The high-level system architecture described below shows how the different components work together to achieve the targeted user flow. In the following sections, we explain the thinking behind each component.

Intercom

Intercom serves as the core hub for managing our customer interactions, chosen for its:

  • Extensive Integration Capabilities: Intercom’s strength lies in its ability to seamlessly integrate with a wide array of messaging channels. This flexibility allows us to connect with customers across various platforms efficiently.
  • Webhook Functionality: The webhook features of Intercom are particularly beneficial. They enable us to craft and deliver custom messages generated by our large language models (LLMs). This capability is crucial for providing personalized and responsive communication.
  • Streamlined Communication: Using Intercom as our central gateway, we streamline all customer messaging through a single, robust platform, ensuring consistency and reliability in our customer interactions.

Modular monolith

Our LLM-powered system is structured as a modular monolith because we anticipated multiple use cases for LLMs:

  • chat — the main chatbot module that powers Shane.
  • analysis — an internal-facing module that powers features like AI-generated financial statement analysis, invoice analysis, etc.

These modules rely on many similar patterns, such as Retrieval Augmented Generation (RAG), where we use common code for multiple different use cases. In this context, a modular monolith approach makes code sharing across various use cases more explicit and easier to author.
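
As a rough sketch (module and function names here are purely illustrative, not our actual code), the idea is that a single retrieval helper lives in a shared package and both use-case modules call it:

# common/rag.py -- shared RAG helper used by both modules (names are illustrative).
def retrieve_relevant_chunks(query: str, table: str) -> list[str]:
    """Embed `query`, search the given vector store table, and return matching text chunks."""
    return []  # retrieval details omitted in this sketch


# chat module (Shane): ground customer-facing replies in helpdesk knowledge.
def chat_context(question: str) -> list[str]:
    return retrieve_relevant_chunks(question, table="external_help")


# analysis module: reuse the same helper for internal documents.
def analysis_context(statement_text: str) -> list[str]:
    return retrieve_relevant_chunks(statement_text, table="internal_finance")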

Python for ML/AI Development

Python is an integral part of our machine learning and artificial intelligence tech stack for these key reasons:

  • Extensive ML/AI Library Support: Python is often the first language to receive support for new ML libraries, providing immediate access to the latest tools and innovations in the field.
  • Community and Accessibility: The language benefits from a vast, active community contributing to continuous improvements and support. Coupled with Python’s clear, readable syntax, this makes it both accessible for newcomers and efficient for experienced developers.
  • Data Handling: Python excels in data manipulation and visualization, essential for ML/AI projects. Libraries like numpy and pandas empower developers to effectively manage and interpret complex datasets.

Apify as a scraper

To keep our chatbot updated with the latest data, we needed an efficient method for scraping information from our public website and Intercom helpdesk. Building a scraper from scratch can be tedious and error-prone, especially considering the many nuances like rate limiting and selectors. To avoid reinventing the wheel, we opted for a tool called Apify:

  • Affordability: Apify fits within our budget, offering free services at our current scale. This cost-effectiveness is crucial for us to maintain resource efficiency.
  • Ease of Use: Setting up and managing scraping tasks with Apify is straightforward, making it a user-friendly option for our team.
  • Integration with Our Workflow: We use Apify to scrape data, which is then exported and processed into embeddings for our vector store (a brief sketch follows this list). This seamless integration is vital for the smooth operation of our chatbot.
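
As a minimal sketch of how such a crawl can be triggered with the Apify Python client (the actor name and start URL are assumptions for illustration), the dataset ID it produces is what our ingestion step later consumes:

import os
from apify_client import ApifyClient  # pip install apify-client

client = ApifyClient(os.getenv("APIFY_API_TOKEN"))

# Trigger a crawl of our public pages. The actor name and start URL are illustrative assumptions.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://fundingsocieties.com"}]},
)

# The crawl output lands in an Apify dataset; this ID is what the ingestion step consumes.
dataset_id = run["defaultDatasetId"]
print(dataset_id)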

Azure OpenAI Service for our LLM provider

In our pursuit to integrate large language models into our tech stack, we prioritized compliance, reliability, and the quality of the AI models. Major cloud providers like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure were all considered. Our decision to go with Microsoft Azure for accessing OpenAI’s services was driven by several key factors:

  • Compliance and Security: Azure provides robust compliance and security features that align with our stringent standards. In an industry where data protection and regulatory compliance are paramount, Azure’s strong emphasis on security and compliance frameworks makes it a reliable choice.
  • High-Quality AI Models: Our decision was heavily influenced by the performance of OpenAI’s GPT-4, available through Azure. GPT-4 stands out for its ability to generate high-quality, contextually accurate responses, which is crucial for our applications. Its advanced capabilities in understanding and generating natural language text align with our goal of delivering superior AI-driven solutions.
  • Support and Ecosystem: Microsoft’s commitment to supporting OpenAI’s development, together with its own ecosystem of AI-powered products ensures that we have access to the latest advancements in AI and support for our implementation.

LanceDB as a vector store

We needed a vector store to embed chunks of knowledge that could be fed back to Azure OpenAI for RAG. In summary, the flow works like this:

  1. During ingestion, scraped information is transformed into embeddings and stored in a vector store.
  2. During inference, a user’s question is also transformed into an embedding.
  3. This embedding is used to search the vector store for the closest matching embeddings.
  4. We then retrieve the corresponding chunks of knowledge text and include them in the prompt sent to Azure OpenAI.

A subsequent section on Knowledge Base will discuss the ingestion and inference process in greater detail.

When considering options for vector stores, we explored various commercial solutions like Pinecone and Weaviate. However, we encountered certain challenges with these options:

  • Cost Concerns: Many commercial vector stores come with a high price tag, making them less feasible given our budget constraints.
  • Complexity in Setup: Getting started with these commercial solutions often involves a steep learning curve and complex setup processes, which can delay implementation.

In contrast, LanceDB stood out for its straightforward approach and alignment with our technology strategy:

  • Ease of Use: LanceDB’s simplicity in setup and operation significantly reduced our time to deployment. Its user-friendly interface and documentation allowed our team to quickly integrate it into our existing systems.
  • Serverless Architecture: A key advantage of LanceDB is its serverless foundation, utilizing Amazon S3 for storage (a connection sketch follows this list). This aligns perfectly with our overarching strategy to increasingly adopt serverless technologies. Serverless architectures offer scalability, cost-efficiency, and reduced operational overhead, which are essential for our evolving tech infrastructure.
  • Cost-Effective: Being backed by a serverless infrastructure like S3, LanceDB offers a more economical solution compared to other commercial vector stores. This cost-effectiveness is vital in managing our resources efficiently.
  • Strategic Alignment: LanceDB’s serverless nature and ease of integration align well with our long-term technology roadmap, which emphasizes agility, scalability, and cost-effectiveness.
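
Because LanceDB stores its data directly in S3, connecting to the vector store is essentially a one-liner. A minimal sketch, assuming a placeholder bucket path:

import lancedb  # pip install lancedb

# Connect straight to object storage -- no database server to provision or manage.
# The bucket path below is a placeholder, not our actual bucket.
db = lancedb.connect("s3://my-knowledge-base-bucket/vector-store")
print(db.table_names())  # one table per content domain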

DynamoDB for structured data storage

We’ve opted for Amazon DynamoDB for structured data storage, as its serverless nature ensures scalability and reduces maintenance, aligning with our agile infrastructure needs.

By adopting single-table design techniques for our partition and sort keys, we are able to support multiple access patterns:

  • Stores chat metadata for direct (1–1) associations.
  • Manages conversation-to-knowledge base links for one-to-many (1-M) lookups, crucial for our chatbot’s context handling. Every line item is a question asked in a conversation, so we can query all questions asked in a conversation by looking up PK=Conversation#908 (a query sketch follows below).
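
A minimal sketch of that one-to-many lookup using boto3 (the table name is a placeholder; the key follows the PK convention described above):

import boto3
from boto3.dynamodb.conditions import Key

# Fetch every question asked in conversation 908 (single-table design: PK = "Conversation#<id>").
# The table name "chatbot-metadata" is a placeholder for illustration.
table = boto3.resource("dynamodb").Table("chatbot-metadata")
response = table.query(KeyConditionExpression=Key("PK").eq("Conversation#908"))
for item in response["Items"]:
    print(item)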

Testing using Promptfoo

Testing our chatbot, which relies on large language models (LLMs), presents unique challenges due to the non-deterministic nature of LLM responses. We chose Promptfoo for this task because:

  • Non-Deterministic Testing: Promptfoo is specifically designed to handle the unpredictability of LLM responses, which is crucial for our chatbot’s performance.
  • Variety of Testing Strategies: It offers a range of testing strategies, allowing us to comprehensively evaluate different aspects of our chatbot’s behavior.
  • Open Source Solution: Being open source, Promptfoo not only fits our budget but also offers the flexibility to adapt to our specific testing needs.

The test framework catches errors and helps ensure the quality of our responses, flagging regressions whenever we change prompts or code.

Internal tools

To expedite deployment initially, we did not develop a lot of tooling in our ingestion flow. All we had were some scripts that we ran locally to populate the vector store with embeddings. After we deployed the system into production, we saw a greater need to build various tools to manage our chatbot:

  • Conversation debugger: Shows which knowledge base articles our chatbot referred to when generating a response. This helps us ensure the chatbot isn’t producing irrelevant or incorrect responses, a phenomenon referred to as ‘hallucination.’
  • Prompt Configuration: A read-only tool for developers and agents to understand how the chatbot is configured to answer questions.
  • Knowledge base refresher: Tools for developers to trigger a knowledge refresh / scrape so that the vector stores are populated with fresh embeddings of any information present on our websites / helpdesk articles.

Alternatives and considerations

AWS Lambda

The project initially started life as a pure serverless, AWS Lambda-powered system, but we ran into several limitations:

  • Dependency Size Issues: AWS Lambda’s maximum deployment package size of 250 MB is restrictive for AI projects using Python with large dependencies like lancedb, pyarrow, langchain, etc.
  • Container Solutions: While considering Lambda containers, we faced the constraint of using Amazon ECR, which is more costly than Docker Hub. This inconsistency with our other services utilizing Docker Hub was a deterrent.
  • Cost and Performance Concerns: The longer inference times (up to 10 seconds) for OpenAI, combined with the high memory demands for extensive knowledge base queries, significantly inflated Lambda costs.

We switched to AWS Fargate with FastAPI to run our microservice, bypassing Lambda’s limitations.
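
For illustration, here is a stripped-down sketch of what the FastAPI webhook endpoint could look like (the route name and payload handling are simplified assumptions, not our production handler):

from fastapi import FastAPI, Request

app = FastAPI()  # served from a container on AWS Fargate

@app.post("/webhooks/intercom")  # route name is illustrative
async def handle_intercom_webhook(request: Request):
    # Intercom sends a notification payload for each new message or assignment to Shane.
    payload = await request.json()
    conversation_id = payload.get("data", {}).get("item", {}).get("id")
    # ...generate an answer with the LLM and reply through the Intercom API (omitted)...
    return {"received": True, "conversation_id": conversation_id}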

Langchain

  • Update Lag and Flexibility Issues: Langchain often falls behind the rapid updates of OpenAI, and its rigid code structure leads to bloated, inflexible implementations.
  • Performance and Security Risks: Langchain’s agents are not only slow and complex but also challenging to debug. They pose potential security risks, which require vigilant oversight.

We eliminated Langchain from our inference endpoint, opting to use the native OpenAI SDK directly. However, we retained Langchain for data ingestion, leveraging its loaders, which we discuss in the next section.

Knowledge Base

The chatbot’s knowledge base is sourced from two main channels:

  • External content: our public website and Intercom helpdesk articles, crawled using Apify.
  • Internal content: Confluence pages containing internal knowledge, retrieved via the Confluence API.

The information is stored in the vector store, as mentioned earlier, using LanceDB. Each change in the knowledge base prompts the creation of a new version of the knowledge base stored in the bucket.

The vector store is organized into multiple tables, each representing a specific content domain. These include internal Confluence content on financing, investing, and secure account information, as well as general information about Funding Societies. The separation of internal and external content into distinct tables makes the search process within LanceDB easier to reason about, ensuring that documents are retrieved from the appropriate knowledge sources.

Ingestion process

The ingestion process for document embeddings involves several steps:

Load the Documents —

  • The external help pages are crawled using Apify, as discussed earlier, and the resulting documents are loaded from the Apify dataset using Langchain’s ApifyDatasetLoader.
  • For internal Confluence documents, native Confluence APIs are used to retrieve the HTML content, which is then converted to markdown to preserve formatting.

Text processing is applied to clean the documents, removing headers, footers, special characters, unnecessary text, etc.

Creating the embeddings from the loaded documents —

  • Our general strategy is not to have overlapping content across vector DB records, so that each record preserves the best possible context.
  • For Apify web-crawled documents, each chunk corresponds to one document. The cleaned help documents are small (less than 2,000 tokens), and each has a distinct context.
  • For Confluence pages, each heading corresponds to a different topic, so the documents are chunked with CharacterTextSplitter using the # (header) character as the separator. Chunk sizes are generally around 1,500 tokens.
  • OpenAI’s text-embedding-ada-002 model is employed to create embeddings.
import os

import markdownify as md
from atlassian import Confluence
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document
from langchain.text_splitter import CharacterTextSplitter


def get_apify_datasets_create_chunks(dataset_id) -> list[Document]:
    # Each Apify dataset item becomes one Document, cleaned by process_text_apify
    # (an internal text-cleaning helper defined elsewhere).
    loader = ApifyDatasetLoader(
        dataset_id=dataset_id,
        dataset_mapping_function=lambda dataset_item: Document(
            page_content=process_text_apify(dataset_item["text"], dataset_item["url"]),
            metadata={"source": dataset_item["url"]},
        ),
    )
    docs = loader.load()
    return docs


def get_text_from_confluence_pages(cf_wiki_url, cf_api_key, cf_username, cf_page_id):
    # Fetch the page HTML via the Confluence API, convert it to markdown to preserve
    # formatting, then collapse excess whitespace.
    confluence = Confluence(url=cf_wiki_url, username=cf_username, password=cf_api_key)
    documents = []
    page_data = confluence.get_page_by_id(cf_page_id, expand="body.storage.value")
    html_content = page_data.get("body").get("storage").get("value")
    md_content = (
        md.markdownify(html_content)
        .replace("\n\n\n", "\n")
        .replace("\n\n", "\n")
        .replace("\n\t\n\t", "")
    )
    documents.append(Document(page_content=md_content, metadata={}))
    return documents


def get_confluence_docs_create_chunks(page_id, chunk_size=1500, chunk_overlap=150) -> list[Document]:
    # Split each Confluence page on the "#" (header) character so every chunk maps to one topic.
    docs = get_text_from_confluence_pages(
        os.getenv("CF_WIKI_URL"),
        os.getenv("CF_API_KEY"),
        os.getenv("CF_USERNAME"),
        page_id,
    )
    text_splitter = CharacterTextSplitter(
        separator="#",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = text_splitter.split_documents(docs)
    return chunks
  • The Langchain implementation of LanceDB is used to insert these documents with embeddings into the vector store.
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import LanceDB


def add_docs_to_vector_store(vector_db, table_name, documents):
    # Embed the documents with OpenAI and write them into a LanceDB table.
    embeddings = OpenAIEmbeddings(model=os.getenv("OPENAI_EMBEDDING_ENGINE"), chunk_size=1)
    table = vector_db.create_table(
        table_name,
        schema=custom_schema,  # custom_schema is defined elsewhere in our codebase
    )
    LanceDB.from_documents(documents, embeddings, connection=table)

Inference process

The document retrieval process is an integral part of the chatbot’s question-answering mechanism.

  • When a user submits a question, the chatbot determines if knowledge retrieval is necessary. If so, the question is turned into an embedding, and this embedding is used to query the LanceDB tables.
  • The matching documents are retrieved and sorted by ascending vector distance, and the four documents nearest to the query are selected.
  • The text of these documents is then supplied as context in the prompt to OpenAI, and OpenAI’s response serves as the answer the bot provides to the user. A simplified sketch of this retrieval step is shown below.
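
Here is a simplified sketch of that retrieval step, assuming the question should be answered from an external help table. The bucket path, table name, and the "text" column are assumptions (the column name matches what the Langchain LanceDB integration writes by default, though our actual schema is custom):

import lancedb
from langchain.embeddings import OpenAIEmbeddings

def retrieve_context(question: str, k: int = 4) -> list[str]:
    # Embed the question with the same model used at ingestion time.
    embedding = OpenAIEmbeddings(model="text-embedding-ada-002").embed_query(question)

    # Search the relevant domain table and keep the k nearest chunks.
    # Bucket path and table name are placeholders for illustration.
    db = lancedb.connect("s3://my-knowledge-base-bucket/vector-store")
    table = db.open_table("external_help")
    results = table.search(embedding).limit(k).to_list()

    # The retrieved chunks are then included in the prompt sent to Azure OpenAI.
    return [row["text"] for row in results]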

Prompt Engineering

Prompt engineering is essential to creating effective conversational AI. We have a specific set of prompts and guidelines that we pass into every conversation with Shane, building Shane’s persona as a customer support and sales executive operating within the constraints of a mini chat widget.

Here are samples of some of our guidelines (a condensed, illustrative prompt follows the lists below):

Communication style

  • How to respond when the customer greets or asks to speak to Shane.
  • Offering assistance in a concise and friendly manner.
  • Using the same language as the user to enhance the conversational experience.
  • Using multiple sentences in new lines with bullets or numbered lists for clarity.
  • Steering clear of negative aspects like late repayment consequences.

Information shared

  • Emphasizing the persona’s role as an informative entity, not a financial advisor.
  • Clear guidelines on steering clear of discussions related to illegal activities.
  • Avoiding sharing contact information and directing customers to the company website.
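
For illustration only, here is a heavily condensed sample of the kind of system prompt guidelines like these translate into. The wording below is made up for this article, not our production prompt:

# A made-up, condensed sample of the kind of system prompt we assemble for each conversation.
SYSTEM_PROMPT = """
You are Shane, a customer support and sales executive for Funding Societies | Modalku,
replying inside a small chat widget.
- Greet customers warmly and keep replies concise; use bullets or numbered lists for clarity.
- Reply in the same language the customer uses.
- Share information only; you are not a financial advisor.
- Do not discuss illegal activities or share contact details; direct customers to our website.
- If the customer is rude or asks for a human, escalate the conversation to an agent.
"""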

Prompt engineering plays a pivotal role in shaping the behavior and effectiveness of conversational AI systems. It is a continuously evolving process: we regularly tweak and add prompts to enhance the user experience and to support new features. We use the Promptfoo testing tool mentioned earlier to check for regressions whenever we introduce changes to our prompts.

OpenAI function calling

We use OpenAI function calling to describe functions to the model so that it can intelligently choose to call one or more of them. The implementation using Azure OpenAI is similar, as mentioned here.

We have a few functions in our application, such as:

  • Product Application function — Used for customers interested in applying for financing, seeking loans, or exploring suitable products.
  • Escalate to human agent — Provides an effective way to escalate to a human agent when users display rude behavior or request human assistance for specific issues.
  • End conversation — Designed to gracefully end conversations when users express disinterest, ensuring a courteous farewell.

There are other functions for Loan Interest Calculation, Knowledge Base Query, etc.
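
A trimmed-down sketch of how such functions can be described to the model using the 2023-era OpenAI Python SDK against Azure OpenAI. The function schemas, environment variable names, and API version are simplified assumptions:

import os
import openai

openai.api_type = "azure"
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_version = "2023-07-01-preview"

# Simplified function schemas -- the real definitions carry more parameters and descriptions.
functions = [
    {"name": "product_application", "description": "Start a financing application for the customer.",
     "parameters": {"type": "object", "properties": {}}},
    {"name": "escalate_to_human_agent", "description": "Hand the conversation over to a human agent.",
     "parameters": {"type": "object", "properties": {}}},
    {"name": "end_conversation", "description": "End the conversation with a courteous farewell.",
     "parameters": {"type": "object", "properties": {}}},
]

response = openai.ChatCompletion.create(
    engine=os.getenv("AZURE_OPENAI_DEPLOYMENT"),  # Azure uses a deployment name
    messages=[{"role": "user", "content": "I'd like to apply for a business loan."}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a function
)
print(response["choices"][0]["message"])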

Conclusion

At the time of writing, the system setup we’ve discussed represents the best solution to our challenges. However, the generative AI landscape is constantly evolving, and we anticipate that some of the ideas presented here will develop further over time. We encourage you to tailor these insights to best fit your unique setup.

At Funding Societies, our Engineering team is continuously exploring innovative ways to enhance our service to customers. We’re committed to staying at the forefront of technological advancements, ensuring that we not only keep up with the latest trends but also contribute to shaping the future of AI in our industry.
