Closing the Gap Between Available Information and Accessible Knowledge: A Multi-Modal Approach to Learning

Engineering at Zafin
14 min read · Aug 28, 2023

By Sebastian Sergnese, Ankita Lal, Esther Apata, Ali Awil, Software Engineering Interns at Zafin, Branavan Selvasingham, Head of AI at Zafin, and Shahir A. Daya, Chief Technology Officer at Zafin

The Challenge: Navigating a Sea of Enterprise Information

We have all experienced the frustration of being surrounded by a wealth of information but struggling to find what we need. It’s like the famous line from The Rime of the Ancient Mariner, “Water, water everywhere, and not a drop to drink,” adapted to an enterprise context by Branavan as

“info, info everywhere, and not a way to know.”

This is a common problem we face when searching through extensive documentation, previous artifacts, and various file types. Often, we give up on the search and start from scratch, or settle for a sub-optimal starting point.

Both customers and employees waste valuable time searching for answers and eventually must rely on company personnel for assistance. This not only ties up several people’s time for a single query or retrieval activity but also creates unnecessary dependency on others.

In the Zafin context, this project aims to tackle these challenges by leveraging the latest advancements in Large Language Models (LLMs), embeddings, vector databases, relevant context fetching, and prompt engineering. We will explore how these technologies can be used to improve the efficiency and effectiveness of information retrieval.

Overview

We extract essential data from two primary sources: the Zafin Documentation Center and Zafin University. These data sources encompass both textual and video-based content, forming the foundational elements of our solution. The multi-modal data extraction is a multi-step process to ensure meaningful and high-quality text output. This text is then processed through OpenAI’s Text Embedding Ada model and converted into embeddings. These embeddings possess the distinctive capability to encapsulate intricate semantic nuances and similarities. They are stored within a secure PostgreSQL Vector Database, hosted on the Azure cloud infrastructure. This database functions as the contextual neural core, enhancing the information retrieval capabilities of our system.

Driving our system’s capabilities are OpenAI’s GPT models (3.5 and 4). Through the application of predefined parameters and the use of similarity searches, our system generates responses that are contextual, Zafin-specific, accurate, and user-friendly. The system keeps the knowledge base current by continuously monitoring documentation updates and modifications.

High-level architecture of the project, comprising a chat interface interacting with an API that leverages OpenAI GPT and embedding models. Scraper scripts create embeddings from the Zafin Documentation and video content.
Figure 1: High Level Architecture diagram for the project

Gathering Data/Scraping

A crucial step in developing the solution involves gathering a wide range of carefully selected information for seamless integration with the LLM. This collection of knowledge forms the foundation on which the assistant constructs relevant and insightful responses to user questions.

Within the scope of our project, our primary goal is to make the most of the valuable documentation data intended for clients and employees. This process started with carefully scraping the content of two key data sources: the Zafin Documentation Center, an encompassing repository of Zafin product documentation, and Zafin University, a specialized platform designed to help our team and customers master the use of Zafin products.

Now, let us delve into the methods we have used to gather and organize information from both sources.

Zafin Documentation Center

Here at Zafin, all our customer-facing documentation is stored within the Zafin Documentation Center. This serves as the primary source of information for all Zafin product documentation. If a customer requires assistance in configuring any aspect of our software, they can search for relevant documentation within the Zafin Documentation Center to guide them through the process. The first step is to extract all the documentation and store it within the PostgreSQL Vector Database we have deployed on Microsoft Azure. Zafin has chosen the Document360 Knowledge Base Platform to power their core documentation center, and luckily for us, Document360 has its own RESTful APIs. This simplifies the process of extracting all the articles written by the documentation team.

Retrieving Article Content

The first step to storing all the articles in the Vector Database is to retrieve a list of all the articles within Document360. We then iterate through the list of articles to extract the following key pieces of information: article ID, article URL, article title, and article content.
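To illustrate, here is a minimal sketch of this retrieval step in Python, assuming the Document360 REST API is called with an API token. The endpoint paths, header names, and response fields shown are illustrative rather than taken verbatim from the API reference (see reference 4):

```python
import requests

# Endpoint paths, header names, and response fields here are illustrative;
# consult the Document360 API reference for the exact contract.
BASE_URL = "https://apihub.document360.io/v2"
HEADERS = {"api_token": "<YOUR_DOCUMENT360_API_TOKEN>"}

def list_articles(project_version_id: str) -> list[dict]:
    """Fetch the list of articles for a project version."""
    resp = requests.get(
        f"{BASE_URL}/ProjectVersions/{project_version_id}/articles",
        headers=HEADERS, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]

def fetch_article(article_id: str) -> dict:
    """Fetch one article, keeping only the fields we store."""
    resp = requests.get(f"{BASE_URL}/Articles/{article_id}",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    data = resp.json()["data"]
    return {"id": data["id"], "url": data["url"],
            "title": data["title"], "content": data["content"]}
```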

Splitting Articles into Equal Chunks

One of the key decisions we made here was to split each article into equal chunks of 512 tokens, with 100 tokens of overlap, before creating the embeddings. Since an embedding is a single vector value that represents the semantic meaning of a large chunk of text, it was important that all text chunks remained similar in size. Even though the OpenAI embedding model accepts inputs of up to 8,191 tokens, we chose a chunk size of 512 tokens because smaller chunks can be represented more accurately as embeddings. Additionally, retaining 100 tokens of overlap between chunks ensures that context from prior chunks carries over into subsequent chunks, and that no chunk is smaller than 100 tokens. To differentiate between chunks from the same article, each chunk is assigned a value representing which section of the article it came from (i.e., the first chunk is assigned value 1, the second chunk value 2, and so on).
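A minimal chunking sketch, assuming the tiktoken library and the cl100k_base tokenizer used by the ada-002 embedding model:

```python
import tiktoken

CHUNK_SIZE = 512   # tokens per chunk
OVERLAP = 100      # tokens shared between consecutive chunks

# cl100k_base is the tokenizer used by text-embedding-ada-002
encoding = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str) -> list[tuple[int, str]]:
    """Split text into 512-token chunks with 100 tokens of overlap.

    Returns (section_number, chunk_text) pairs, numbered from 1.
    """
    tokens = encoding.encode(text)
    chunks = []
    step = CHUNK_SIZE - OVERLAP
    for section, start in enumerate(range(0, len(tokens), step), start=1):
        window = tokens[start:start + CHUNK_SIZE]
        chunks.append((section, encoding.decode(window)))
        if start + CHUNK_SIZE >= len(tokens):
            break
    return chunks
```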

Creating Embeddings

We then harness the capabilities of OpenAI’s `text-embedding-ada-002` model to create embeddings which capture the semantic meaning of each article chunk.
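A sketch of this step, written against the current openai Python client (the 2023-era openai.Embedding.create call is equivalent):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Create one 1,536-dimension embedding per article chunk."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=chunks,  # the API accepts a batch of strings
    )
    # Results come back in the same order as the inputs.
    return [item.embedding for item in response.data]
```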

Storing to Vector Database

Finally, we can store all relevant information within the PostgreSQL database in Microsoft Azure. If the following table does not already exist within the database, then it will be automatically created by our scraping software.

Figure 2: Command to create the Zafin Documentation Center content database
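Since the figure is an image, here is a representative sketch of what such an auto-created table might look like, assuming the pgvector extension. The table and column names are our assumptions based on the fields described above, not the exact DDL from Figure 2:

```python
import psycopg2

# Representative schema -- column names are assumptions based on the
# fields described above, not the exact DDL from Figure 2.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documentation_center (
    id UUID PRIMARY KEY,              -- one row per article chunk
    article_id TEXT NOT NULL,         -- original Document360 article ID
    article_url TEXT NOT NULL,
    title TEXT NOT NULL,
    section INTEGER NOT NULL,         -- chunk position within the article
    content TEXT NOT NULL,
    embedding VECTOR(1536)            -- ada-002 embedding of the chunk
);
"""

with psycopg2.connect("<AZURE_POSTGRES_CONNECTION_STRING>") as conn:
    with conn.cursor() as cur:
        cur.execute(CREATE_TABLE_SQL)
```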

Since articles, which already have a unique ID assigned by Document360, need to be stored as multiple chunks, a new UUID (Universally Unique Identifier) is assigned to each section of the article so that each entry in the table can be individually identified. Storing the original Document360 ID is essential for comparing the content in the table against the content in Document360, so that the PostgreSQL database can be continuously updated.

What we end up with is a resilient Python script capable of creating, populating, and updating our PostgreSQL database.

Zafin University

Zafin University serves as a comprehensive learning platform that has been meticulously designed to meet the learning needs of both our employees and clients. The platform offers a wide range of video-based courses that provide users with a comprehensive understanding of company policies and an in-depth exploration of Zafin products.

In the context of implementing the enterprise knowledge assistant, Zafin University emerges as a reservoir of valuable knowledge that can enrich the capabilities of the knowledge engine. However, OpenAI’s GPT-3.5 and GPT-4 are currently limited to processing textual inputs and lack compatibility with video through their APIs.

The challenge at hand prompts an intriguing question: How can we effectively assimilate the content from Zafin University’s videos into our AI assistant? To address this, we employ two distinct methods for converting video content into textual form: audio transcription and image transcription.

Audio Transcription

At the core of our comprehensive video transcription service lies audio transcription. This technique plays a pivotal role in accurately extracting spoken content from Zafin University’s video courses. The process involves extracting the audio track from each video using the Pydub library. The audio is then divided into segments, each less than 25 MB, to adhere to the OpenAI Whisper API’s input size limit. These segments are transcribed using the Whisper API, resulting in SubRip Subtitle (SRT) transcripts that capture the presenters’ spoken words.
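A hedged sketch of this pipeline, assuming the pydub and openai packages; the size-based splitting strategy and file names are illustrative rather than the service’s exact logic:

```python
import math
import os
from pydub import AudioSegment
from openai import OpenAI

client = OpenAI()
MAX_BYTES = 25 * 1024 * 1024  # Whisper API upload limit

def transcribe_video(video_path: str) -> list[str]:
    """Extract the audio track, split it under 25 MB, transcribe each part."""
    audio = AudioSegment.from_file(video_path)
    audio.export("full_audio.mp3", format="mp3")

    # Size-based estimate of how many segments keep each file under 25 MB
    # (the original service may split differently).
    n_parts = math.ceil(os.path.getsize("full_audio.mp3") / MAX_BYTES)
    part_ms = math.ceil(len(audio) / n_parts)  # len(audio) is in milliseconds

    transcripts = []
    for i in range(n_parts):
        part_path = f"segment_{i}.mp3"
        audio[i * part_ms:(i + 1) * part_ms].export(part_path, format="mp3")
        with open(part_path, "rb") as f:
            # response_format="srt" returns the transcript as SRT text
            transcripts.append(client.audio.transcriptions.create(
                model="whisper-1", file=f, response_format="srt"))
    return transcripts
```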

One imperative step remains after transcription: temporally synchronizing each segment’s transcript. Because every sub-audio segment is transcribed independently, its timestamps restart at 00:00, disrupting the chronological flow. To restore continuity, each segment’s timestamps are shifted by the combined duration of all segments that precede it in the main audio, and the shifted transcripts are then merged into a single, complete textual representation of the original audio content.
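A minimal sketch of this synchronization step, assuming the third-party srt parsing library; the function name and inputs are our own:

```python
import datetime
import srt  # third-party SRT parsing helper (pip install srt) -- an assumption

def merge_srt_segments(segment_srts: list[str], durations_ms: list[int]) -> str:
    """Shift each segment's timestamps by the total duration of everything
    before it, then merge and renumber into one continuous transcript."""
    merged = []
    offset = datetime.timedelta(0)
    for text, duration_ms in zip(segment_srts, durations_ms):
        for sub in srt.parse(text):
            sub.start += offset
            sub.end += offset
            merged.append(sub)
        offset += datetime.timedelta(milliseconds=duration_ms)
    return srt.compose(merged)  # compose() renumbers entries sequentially
```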

The figure shows a sequence diagram for the interactions between components for the audio transcription process.
Figure 3: Audio transcription process sequence diagram

Image Transcription

The image transcription process is a multi-step transformation from video to images and subsequently to text.

The figure shows the sequence diagram for the image transcription process.
Figure 4: Image transcription process sequence diagram

The Video to Image Service uses the OpenCV-Python and scikit-image libraries to convert videos into sets of distinct images. These images are then analyzed, and near-identical frames are omitted, leaving behind a collection of unique frames that represent the video’s slides. We determine each frame’s appearance time from two factors: the video frame rate (fps) and a frame counter. The occurrence time of each frame is calculated as follows:

Frame occurrence time = frame counter/video frame rate
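A sketch of the frame extraction and de-duplication logic, assuming structural similarity (SSIM) from scikit-image is used to detect near-identical frames; the similarity threshold is an assumed value:

```python
import cv2
from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.95  # assumed cutoff; higher scores mean "same slide"

def extract_unique_frames(video_path: str):
    """Return (occurrence_time_seconds, frame) pairs for visually new frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    keepers, prev_gray, counter = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Keep the frame only if it differs enough from the last kept frame.
        if prev_gray is None or ssim(prev_gray, gray, data_range=255) < SSIM_THRESHOLD:
            keepers.append((counter / fps, frame))  # time = counter / fps
            prev_gray = gray
        counter += 1
    cap.release()
    return keepers
```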

The Image to Transcript Service then uses libraries such as Pillow and Python-tesseract to transcribe the frames, extracting text from each image through Optical Character Recognition (OCR).
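A minimal OCR sketch using Pillow and pytesseract:

```python
import cv2
import pytesseract
from PIL import Image

def transcribe_frame(frame) -> str:
    """Run OCR on one video frame (a BGR OpenCV array)."""
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # Pillow expects RGB order
    return pytesseract.image_to_string(Image.fromarray(rgb))
```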

The figure illustrates how the image transcription process leverages timestamping.
Figure 5: Image transcription along with timestamping

Following transcription, the Cleaning Service employs the OpenAI GPT-3.5 API to remove nonsensical symbols, enhancing transcript organization and comprehensibility.
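A sketch of such a cleaning call; the prompt wording is illustrative, not the service’s actual instruction:

```python
from openai import OpenAI

client = OpenAI()

def clean_transcript(raw_ocr_text: str) -> str:
    """Ask GPT-3.5 to strip OCR noise; the prompt wording is illustrative."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": "Remove nonsensical symbols and OCR "
             "artifacts from the text. Keep the meaning and wording intact."},
            {"role": "user", "content": raw_ocr_text},
        ],
    )
    return response.choices[0].message.content
```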

Database and Schema

Upon extracting both audio and visual transcriptions from each video, they are associated through timestamps. If the audio transcript exceeds the token limit threshold, it is subdivided into sections of the maximum allowable size under the limit. As in the Document360 scraping process, each segment remains under 512 tokens and maintains a 100-token overlap. These segments are saved alongside their respective visual transcriptions. The audio transcript is then converted into embeddings, and all pertinent information is stored within a PostgreSQL table, following the schema delineated below.

Figure 6: Command to create Zafin University video content database
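As with Figure 2, the exact DDL is shown as an image; a representative sketch of the schema, with column names assumed from the description above:

```python
# Representative schema for the Zafin University content table; the column
# names are assumptions mirroring the description above, not the DDL in
# Figure 6.
CREATE_VIDEO_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS zafin_university (
    id UUID PRIMARY KEY,
    video_title TEXT NOT NULL,
    section INTEGER NOT NULL,          -- audio chunk position within the video
    audio_transcript TEXT NOT NULL,    -- <= 512 tokens, 100-token overlap
    visual_transcript TEXT,            -- OCR text of slides shown in this span
    start_time REAL,                   -- seconds; links audio chunks to slides
    embedding VECTOR(1536)             -- embedding of the audio transcript
);
"""
```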

In conclusion, while integrating Zafin University’s extensive educational content into the knowledge engine may present challenges, we have implemented a combination of audio and image transcription techniques to make it accessible. This strategic approach ensures that the valuable insights from Zafin University’s videos are effectively harnessed to enhance the capabilities of the enterprise knowledge solution.

Merging Components

Now that we’ve successfully developed software for the ongoing population of our PostgreSQL database on Microsoft Azure with Zafin-related content, our next objective is to establish a seamless connection between this enriched content and an LLM. This connection will enable us to efficiently retrieve relevant information as per user requests.

The enterprise knowledge engine is further enhanced through connection to two additional services developed by our colleagues:

  1. NL to SQL — connects the user to their Zafin database.
  2. NL to REST — connects the user to Zafin external APIs.

For a more in-depth exploration of these services, see our colleagues’ blog post, Bridging the Gap: Exploring use of Natural Language to interact with Complex Systems (reference 5).

Connections to both services, along with the documentation content, are enabled through function calling, ensuring a cohesive user experience and efficient communication between the natural language processing capabilities and the underlying technical components.

Function calling

OpenAI has introduced a new capability to GPT-3.5 and GPT-4, allowing these models to request function calls instead of responding directly with text from their training data. When prompted, the models can intelligently generate a JSON object containing the arguments needed to invoke a specific function. By adhering to the function descriptions, queries can seamlessly be transformed into function calls.

Connecting to the data sources

These models have undergone refinement to identify situations where invoking a function is appropriate based on user input. Subsequently, they respond with JSON outputs that align with the expected function format. This function calling feature provides a structured way of obtaining data from the models. For instance, consider the function “doc_search,” which, given its JSON description, retrieves relevant documents from Zafin University and the Zafin Documentation Center databases to answer questions.

The JSON description of doc_search is:

Figure 7: JSON description of doc_search function
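The figure is an image, so here is an illustrative reconstruction of what a doc_search description of this kind looks like in OpenAI’s function-calling schema; the parameter names are our assumptions:

```python
# Illustrative reconstruction of the doc_search function description
# (the exact schema in Figure 7 is not reproduced in the text).
doc_search_description = {
    "name": "doc_search",
    "description": "Retrieve relevant documents from the Zafin Documentation "
                   "Center and Zafin University databases to answer a question.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The phrase to search for in the documentation.",
            }
        },
        "required": ["query"],
    },
}
```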

The model adeptly determines both when and which function to call. For instance, if we ask the model a question like “what is temporality?” in the context of documentation, it will recognize the relevance and invoke the “doc_search” function. The initial response includes the function name and its corresponding arguments.

Figure 8: OpenAI function call response for sample query
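In code, the request passes the function description alongside the messages, and the first response carries the function call rather than an answer. A sketch using the original functions/function_call parameters (since superseded by tools); the exact arguments are illustrative:

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "what is temporality?"}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    functions=[doc_search_description],  # from the sketch above
    function_call="auto",                # let the model decide
)

message = response.choices[0].message
# For this query the response resembles:
#   role="assistant", content=None,
#   function_call={"name": "doc_search",
#                  "arguments": '{"query": "temporality"}'}
```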

From this response, we extract the function name and arguments to actually execute the function with the provided input. In the case of “doc_search,” this function takes a phrase as input and queries a database (Zafin Documentation Center and Zafin University database) to fetch relevant documents using cosine similarity. The returned information is structured in JSON format.

Figure 9: doc_search function
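A hedged sketch of such a doc_search implementation, reusing the table from the earlier sketch and pgvector’s `<=>` cosine-distance operator; the number of chunks retrieved (k) is an assumption based on the test configurations described later:

```python
import json
import psycopg2
from openai import OpenAI

client = OpenAI()

def doc_search(query: str, k: int = 14) -> str:
    """Embed the query and fetch the k most similar chunks via pgvector."""
    emb = client.embeddings.create(
        model="text-embedding-ada-002", input=query
    ).data[0].embedding

    with psycopg2.connect("<AZURE_POSTGRES_CONNECTION_STRING>") as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT title, content
                FROM documentation_center
                ORDER BY embedding <=> %s::vector  -- cosine distance
                LIMIT %s
                """,
                (json.dumps(emb), k),
            )
            rows = cur.fetchall()
    return json.dumps([{"title": t, "content": c} for t, c in rows])
```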

Subsequently, we incorporate the output of the “doc_search” function into the messages, labeling it as a function response.

Figure 10: doc_search function call response formatted as a message

After integrating this function response, the entire message package is sent back to the model for summarization, resulting in a suitable and coherent response to the user’s query.

Figure 11: NL summary of doc_search response
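Putting these last steps together, a sketch (continuing the earlier sketches) that executes the requested function, appends its output as a function message, and asks the model to summarize:

```python
import json

# Execute the function the model asked for...
args = json.loads(message.function_call.arguments)
result = doc_search(**args)

# ...and hand the output back, labeled as a function response.
messages.append(message.model_dump())  # keep the assistant's function call
messages.append({
    "role": "function",
    "name": "doc_search",
    "content": result,
})

# The model now summarizes the retrieved chunks into a coherent answer.
final = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
answer = final.choices[0].message.content
```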

Connecting to NL to REST and NL to SQL

Harnessing the capabilities of the GPT-3.5 and GPT-4 models, we translate natural language into the appropriate API calls, establishing a robust bridge between human communication and technical interfaces.

In addition to the doc_search function, two additional functions facilitate API calls within this integrated system. Each is accompanied by a comprehensive JSON description that delineates its functionality.

Figure 12: JSON description of NL to REST and NL to SQL functions

The two additional functions are executed as POST requests to the two services. Each takes the user query (the last message in the messages list) as the body of the request and returns the result as a JSON object: an NL summary of the response to the user’s query.
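A minimal sketch of one such wrapper, with a placeholder service URL and request body, since the real endpoints are internal:

```python
import requests

# Placeholder URL -- the real NL to SQL / NL to REST endpoints are internal.
NL_TO_SQL_URL = "https://<internal-host>/nl-to-sql"

def nl_to_sql(messages: list[dict]) -> dict:
    """Forward the latest user query to the NL to SQL service via POST."""
    user_query = messages[-1]["content"]  # the last message in the list
    resp = requests.post(NL_TO_SQL_URL, json={"query": user_query}, timeout=60)
    resp.raise_for_status()
    return resp.json()  # an NL summary of the response to the query
```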

The overall process is easier to visualize through a comprehensive sequence diagram that captures the steps involved in the function calling framework.

The figure illustrates the sequence diagram for the natural language to enterprise knowledge solution.
Figure 13: NL to enterprise knowledge solution sequence diagram

Prompt engineering

Prompt engineering involves creating and refining prompts to get the most out of LLMs. Its effectiveness hinges on the skillful design of those prompts, which plays a pivotal role in the success of the endeavor. Well-crafted prompts can markedly improve the AI model’s proficiency at particular tasks. By supplying pertinent, precise, clear, and methodically organized prompts, the model’s grasp of context can be amplified, resulting in more precise and dependable responses.

Prompt engineering plays a pivotal role in harnessing the power of OpenAI’s API. Constructing well-structured and concise instructions or queries is paramount for function calling and chat completion.

In our scenario, we outline, through the system prompt, the specific role the knowledge engine will undertake. We also provide instructions regarding the expected quality and format of responses and furnish guidance on utilizing the Function Calling feature to achieve desired outcomes.

Figure 14: System prompt
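The actual prompt in Figure 14 is not reproduced here; a hypothetical system prompt along the lines described might read:

```python
# Hypothetical system prompt illustrating the instructions described above;
# the actual prompt in Figure 14 is not reproduced in the text.
SYSTEM_PROMPT = """You are Zafin's enterprise knowledge assistant.
Answer questions about Zafin products using only the information returned
by the available functions. Use doc_search for documentation and training
content, nl_to_sql for questions about the user's Zafin database, and
nl_to_rest for requests that require calling Zafin APIs. If the retrieved
context does not contain the answer, say so instead of guessing. Keep
answers concise, accurate, and specific to Zafin."""
```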

Test Results

After implementing the enterprise knowledge solution, we conducted a targeted evaluation tailored specifically to the Zafin University context, designed to assess the efficacy of the knowledge solution when handling Zafin product-related inquiries. To achieve this, we administered a Zafin Certification assessment, a realistic way to evaluate its performance.

The certification assessment consisted of 30 well-structured multiple-choice questions. Each question presented four possible options, with a passing threshold of 80%. To test consistency, we ran the same test 10 times.

It’s worth highlighting that the model’s performance is influenced by various parameters. Our analysis revealed several configuration keys that significantly impact the model’s capacity to deliver accurate responses to user queries. Parameters such as temperature, data sources, model selection, and context size emerged as critical factors affecting the model’s overall performance.

To shed light on this aspect, we meticulously executed a series of rigorous tests, consistently maintaining a temperature setting of 0 throughout the process. This approach ensured the reliability and consistency of our findings as we delved into the consequences of varying parameter configurations.

To ensure that the similarity search used to retrieve content from our vector database did not interfere with our results, we first tested the GPT-4 model while relying exclusively on Zafin University content. With this best-case configuration, we consistently achieved an accuracy rate of 90%.

Given that our system defaults to the GPT-3.5 model and draws from diverse data sources and services, including the Zafin Documentation Center content, NL to REST, and NL to SQL, it was imperative to evaluate its performance with the actual configuration. We conducted tests using the GPT-3.5 model with approximately 8,000 tokens of context (split into 16 chunks), resulting in a score of 33.33%.

In response, we performed another round of testing with a reduced count of 7,000 tokens (split into 14 chunks) to evaluate whether 8,000 tokens of context was too much information for the LLM to handle in a single query. This adjustment yielded an improved score of 36.67%.

This highlights a slight increase in performance with a smaller context. However, it is not enough information to conclude whether context size is the only issue. Further testing is required to determine what may be attributed to gaps in function calling, similarity searches, and/or the available content.

To uncover the root of the issue, we conducted manual testing, carefully reviewing the answers provided by the Knowledge Engine. It is important to note that for automated testing, the model was instructed to provide single-letter responses to the multiple-choice questions; this was not the case in manual testing. Manual testing of the GPT-3.5 model with the same configurations as above yielded 70% correct responses with 8,000 tokens of context and 73% with 7,000 tokens. Additionally, manual testing revealed that the GPT-3.5 model occasionally struggled to accurately comprehend questions that required more logical reasoning.

Consequently, we upgraded to GPT-4 while retaining the same configuration settings (7,000 tokens of context). This transition led to a substantial performance boost: 67.68% in automated testing and 87% in manual testing.

Conclusion

This project showcases the foundational capabilities that are readily available for usage throughout the organization. We’re excited for the roadmap ahead and for helping drive real value for our users.

References

  1. OpenAI. Available at: https://openai.com/ (Accessed: 21 August 2023).
  2. Function calling and other API updates. Available at: https://openai.com/blog/function-calling-and-other-api-updates (Accessed: 21 August 2023).
  3. Cosine similarity — an overview | ScienceDirect Topics. Available at: https://www.sciencedirect.com/topics/computer-science/cosine-similarity (Accessed: 21 August 2023).
  4. Core concepts (document360.com). Available at: https://apidocs.document360.com/v1-api/apidocs (Accessed: 21 August 2023).
  5. Bridging the Gap: Exploring use of Natural Language to interact with Complex Systems. Available at: https://medium.com/engineering-zafin/bridging-the-gap-exploring-using-natural-language-to-interact-with-complex-systems-11c1b056cc19 (Accessed: 21 August 2023).
  6. What are Prompts? Available at: https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/ (Accessed: 18 August 2023).
  7. Pytesseract. Available at: https://pypi.org/project/pytesseract/ (Accessed: 18 August 2023).
  8. Pydub. Available at: https://pypi.org/project/pydub/ (Accessed: 18 August 2023).
  9. OpenCV. Available at: https://pypi.org/project/opencv-python/ (Accessed: 18 August 2023).
  10. Artificial Intelligence Definitions. Available at: https://hai.stanford.edu/sites/default/files/2020-09/AI-Definitions-HAI.pdf (Accessed: 21 August 2023).
  11. PostgreSQL. Available at: https://www.postgresql.org/ (Accessed: 21 August 2023).

Definitions

As the enterprise knowledge solution bridges the gap between AI, data storage, and user interaction, the following terms form the building blocks of its functionality, enabling seamless access to information and fostering an enhanced learning experience.

  • AI (Artificial Intelligence): Refers to the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (acquiring information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions), and self-correction.
  • Token: In the realm of language processing, a token is a fundamental unit of text. It can represent a single character, word, or even a group of words. Tokens are crucial for breaking down text into manageable pieces that can be analyzed and processed by AI models. Tokens enable the engine to understand and work with textual data effectively.
  • Embedding: An embedding is a numerical representation of textual or data-based information. Embeddings are generated using advanced AI models like OpenAI’s Text Embedding Ada and these embeddings capture the semantic essence of text, enabling the engine to understand and compare the meaning of different pieces of information.
  • Vector Database: A vector database is a repository that stores numerical vectors, which represent several types of data in a multi-dimensional space. In the enterprise knowledge solution, a PostgreSQL Vector Database is utilized to store the embeddings generated from Zafin’s documentation.
  • NLP (Natural Language Processing): NLP is the branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP allows the engine to comprehend user queries and craft responses in a way that mirrors human communication.
  • LLM (Large Language Model): LLM refers to an advanced artificial intelligence model designed to process and generate human-like language patterns. In the engine, an LLM platform is employed to enable the engine to generate coherent and contextually accurate responses based on the user’s queries and the knowledge stored within the system.
