Bridging the Gap: Exploring the Use of Natural Language to Interact with Complex Systems

Engineering at Zafin
Aug 15, 2023

By Ameya Chawla, Ansh Arora, Mahalakshmi Muthuvel, Memo Al-Dujaili, Peter Sheryl, Software Engineering Interns at Zafin; Branavan Selvasingham, Head of AI at Zafin; and Shahir A. Daya, Chief Technology Officer at Zafin

The Tectonic Shift from GUI to NUI

Applications have been steadily moving along the vector of increased usability and democratized access for users of all skillsets, specializations, and roles. This is evidenced by the historical migrations from terminal user interfaces (TUIs) to graphical user interfaces (GUIs), and from desktop towers to mobile devices. We are now at the inflection point of the migration from GUI to Natural User Interface (NUI), where applications, even enterprise applications, become increasingly natural to use and accessible to users of varying skillsets and specializations.

At Zafin, we are on the path to bringing the underlying core banking product capabilities to a greater breadth of users, and one of the ways to achieve this is to thoughtfully add a natural language-based interaction channel. A channel that allows our products to meet users in the way they naturally think and communicate. A channel that is a first-class citizen in terms of its ability to access and utilize the power that is readily available under the hood, ready to consume in the form of REST APIs, massive datasets, analytics, and beyond.

In this article, we share the details and findings of one of our research projects as we explore how best to integrate a Natural User Interface into our products.

Project Overview

The business capabilities of our products are exposed through RESTful APIs, and our primary data store is a relational database. As such, we decided to explore the effectiveness of going from Natural Language (NL) to a SQL (Structured Query Language) query and from NL to a REST API invocation.

  • NL to SQL — streamlines database interactions by automatically converting natural language into precise SQL statements and summarizing the query response back into NL. It provides user-friendly data retrieval suited to both professionals and novices.
  • NL to REST — simplifies API interactions for the end user by translating the user’s intent into an appropriate REST API invocation.

These NL to SQL and NL to REST capabilities provide a natural language interface to complex underlying technology. The following figure illustrates the high-level architecture of the project.

Figure 1: High Level Architecture diagram for the project

Before we go into the details of our exploration, let’s start by defining some relevant terms.

Some definitions

AI, Generative AI, Natural Language Processing (NLP), and Large Language Models (LLM) are common terms heard lately. Machines have improved life, from the wheel that transformed agriculture to the screw that held complicated construction projects together to today’s robot-enabled manufacturing lines. Artificial Intelligence is the machine’s capability to perform tasks associated with intelligent beings, such as understanding, interpreting, or learning from experience.

  • Artificial Intelligence (AI) is like teaching a computer to think and learn, almost like a human. Imagine a robot that can do things like recognize pictures, understand speech, or play games. It’s like programming the computer to figure things out on its own and make decisions based on the information it has. It’s a way of making machines smarter and more useful.
  • Generative AI is a specific part of AI that focuses on creating new things. Imagine an artist who paints pictures or writes music. Generative AI is like teaching a computer to be that artist. For example, a Generative AI could create new pictures that look like real paintings or compose new music that sounds like it was made by a human musician. It can take a few examples or rules and then make something entirely new and unique from them.
  • Natural Language Processing (NLP) is like teaching a computer to understand and use human language. Imagine having a conversation with your computer, and it can understand what you’re saying and respond back, just like talking to a friend. NLP helps computers read, understand, and make sense of human languages, like English. It’s used in things like voice assistants on your phone, translating languages, and chatbots. It’s all about making computers better at communicating with us, using our own words and language.
  • An LLM is an AI model designed to process and understand human language. The power of an LLM stems from its lengthy pre-training procedure, in which it is exposed to massive volumes of text from the internet. During this phase, the model learns to predict the next word in a sentence based on the context provided by the preceding words. This enables the model to learn grammar, semantics, and diverse linguistic patterns. The LLM can then be fine-tuned on specific tasks or datasets for specialized applications.

Natural Language to REST

Our approach combines several components: OpenAPI specification extraction, function creation, K-means clustering, and semantic embedding. Together, these enable efficient communication between human language and API endpoints.

The solution we built leverages NLP techniques and OpenAI function calling capabilities. By enabling users to interact directly with APIs through natural language, this service eliminates the need for laborious User Interface (UI) development, paving the way for more inclusive and efficient API interaction. The adaptability of the service to API changes ensures seamless integration with evolving API specifications.

OpenAPI Specification Extraction

At the core of the solution lies a process that seamlessly integrates the API functionality from OpenAPI specifications. We use a blend of scraping and API querying to create comprehensive metadata that encapsulates details, including endpoint descriptions, parameter prerequisites, response structures, and response codes. This structured information is the foundational building block for the following transformative phases.
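As an illustration, a minimal extraction sketch in Python might look like the following. The SwaggerHub URL is a placeholder, and the exact fields retained by the production pipeline differ.

```python
import requests

# Placeholder spec location; the real SwaggerHub organization, API, and version differ
SPEC_URL = "https://api.swaggerhub.com/apis/example-org/example-api/1.0.0"

def extract_endpoint_metadata(spec_url: str) -> list[dict]:
    """Flatten an OpenAPI document into one metadata record per endpoint."""
    spec = requests.get(spec_url).json()
    endpoints = []
    for path, operations in spec.get("paths", {}).items():
        for method, details in operations.items():
            if method not in {"get", "post", "put", "patch", "delete"}:
                continue  # skip non-operation keys such as shared "parameters"
            endpoints.append({
                "path": path,
                "method": method.upper(),
                "description": details.get("description") or details.get("summary", ""),
                "parameters": details.get("parameters", []),
                "responses": details.get("responses", {}),
            })
    return endpoints
```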

Semantic Endpoint Embedding Using Hugging Face

We harness the capabilities of Hugging Face’s open-source all-MiniLM-L6-v2 model to encode each API endpoint description into a detailed 384-dimensional vector representation. This detailed embedding process ensures that even the most nuanced aspects of each endpoint’s functionality are captured, enabling intent recognition that transcends the limitations of simplistic keyword-based approaches.
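In code, the embedding step reduces to a few lines with the sentence-transformers library; the endpoint descriptions below are illustrative stand-ins for the ones our extraction step produces.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative endpoint descriptions; in practice these come from the OpenAPI metadata
descriptions = [
    "Find a Pokemon by its name",
    "List all Pokemon abilities",
    "Get the details of a berry by its id",
]

# encode() returns an (n, 384) NumPy array; normalizing the vectors makes the
# dot-product comparison used later equivalent to cosine similarity
embeddings = model.encode(descriptions, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)
```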

Function Creation and Cluster Formation

We then transform the OpenAPI specification into intelligently crafted function definitions. These functions are extracted and structured from the OpenAPI specification, encapsulating essential attributes such as the function name, description, and parameter details. We distill this information into functional representations through a dynamic backend process, facilitating intent detection and parameter recognition.
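A sketch of this transformation is shown below; the function-naming scheme is our own illustrative convention, not a fixed rule.

```python
def endpoint_to_function(ep: dict) -> dict:
    """Map one endpoint metadata record to an OpenAI-style function definition."""
    properties = {
        p["name"]: {
            "type": p.get("schema", {}).get("type", "string"),
            "description": p.get("description", ""),
        }
        for p in ep["parameters"]
    }
    required = [p["name"] for p in ep["parameters"] if p.get("required")]
    # Derive a name such as "get_pokemon_name" from the method and route (illustrative)
    raw = f"{ep['method']}_{ep['path']}".lower()
    name = raw.replace("/", "_").replace("{", "").replace("}", "").strip("_")
    return {
        "name": name,
        "description": ep["description"],
        "parameters": {"type": "object", "properties": properties, "required": required},
    }
```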

K-means Clustering for Enhanced Insight

The pivotal innovation in our approach derives from the strategic application of K-means clustering to the meticulously embedded endpoint vectors. This transformative step clusters API endpoints based on their semantic resemblances, ingeniously grouping functions with common characteristics. Each cluster is exemplified by a centroid — an abstract representation that encapsulates the collective characteristics of API endpoints within the cluster.
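With scikit-learn, this step is a direct application of KMeans. The heuristic below for choosing the number of clusters is an assumption based on the five-endpoints-per-cluster cap discussed later.

```python
from sklearn.cluster import KMeans

# embeddings: the (n_endpoints, 384) array produced in the embedding step
k = max(1, len(embeddings) // 5)  # aim for roughly five endpoints per cluster
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings)

labels = kmeans.labels_              # cluster id assigned to each endpoint
centroids = kmeans.cluster_centers_  # one 384-dimensional centroid per cluster
```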

Semantic Cluster Matching for Precise API Selection

Our solution’s strength lies in efficiently matching a natural language query against these clusters. The system calculates the semantic proximity between the embedded query and the centroid of each cluster. The cluster whose centroid most closely aligns with the essence of the query emerges as the key to fulfilling the user’s intent, leading to precise and intelligent API endpoint recommendations.
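A minimal matching sketch, assuming normalized embeddings so the dot product acts as cosine similarity:

```python
import numpy as np

def closest_cluster(query: str, model, centroids: np.ndarray) -> int:
    """Embed the query and return the index of the most similar cluster centroid.

    `model` is the SentenceTransformer from the embedding step.
    """
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = centroids @ q  # one dot-product similarity score per centroid
    return int(np.argmax(scores))
```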

Streamlined Function Retrieval and Execution

With the optimal cluster identified, our solution retrieves the functions associated with that cluster from our curated database. These functions encapsulate the cluster’s intent, effectively embodying a nuanced understanding of user needs. The chosen cluster is then sent to the OpenAI GPT-3.5-turbo large language model, which detects the most suitable function to call within the cluster and performs parameter extraction. This process guarantees accurate and efficient API calls driven purely by natural language.

API Data Extraction

The following diagram depicts a pipeline in which the Swagger specification is extracted from SwaggerHub via its API. The saved specification is then used to gather the key API elements for clustering and function generation. These results are stored in a PostgreSQL database.

Figure 2: ETL pipeline workflow diagram

Clustering

Benefits of Clustering

The decision to employ clustering as a solution stems from the significant challenge of determining the appropriate API to invoke, primarily concerning API detection and intent recognition. The conventional strategy involves preparing datasets based on APIs and fine-tuning LLMs, which is labour-intensive, scales poorly, and requires repeated model adjustments for new API endpoints. We instead propose a clustering approach based on endpoint descriptions, parameters, and request body descriptions. This strategy directs attention to clusters rather than individual endpoints, effectively mitigating the risk of selecting an incorrect endpoint. The method enhances reliability by leveraging user queries, which encompass user intentions and relevant keywords aligning with API details. Additionally, it reduces search overhead by providing a smaller set of endpoints from which to resolve the NL query.

Clustering Procedure

The clustering procedure begins by sourcing data from the SwaggerHub API and transforming it into 384-dimensional embeddings. Clustering is performed within this 384-dimensional space. Notably, the number of clusters, K, is chosen to minimize intra-cluster variance, keeping data points close to their centroids.

This optimization helps locate the cluster of endpoints most similar to a given user query. Each cluster is anchored by a centroid, calculated as the mean of the embeddings of the cluster’s points. The centroid’s significance surfaces in its role as a representative entity for prediction. Initial clustering divides all APIs into clusters of five or fewer endpoints, a stipulation dictated by OpenAI’s token limit (16,000) for function calls. Any cluster containing more than five endpoints after clustering is subdivided into two sub-clusters to adhere to the limitation, as sketched below.
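One way to enforce the five-endpoint cap is to split oversized clusters recursively; the sketch below is our illustration of that idea rather than the exact production logic.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_oversized(points: np.ndarray, max_size: int = 5) -> list[np.ndarray]:
    """Recursively split a cluster until every sub-cluster has at most max_size points."""
    if len(points) <= max_size:
        return [points]
    halves = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)
    result = []
    for label in (0, 1):
        result.extend(split_oversized(points[halves.labels_ == label], max_size))
    return result
```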

The subsequent step involves ascertaining which API to invoke from the designated cluster, achieved through Named Entity Recognition.

Figure 3: Visualization of the clusters, with the 384-dimensional embeddings projected into 2-dimensional space

The similarity between a cluster centroid c and the embedding q of a new prompt is computed as the dot product: similarity(c, q) = c · q = Σᵢ cᵢqᵢ. All these similarity scores are sorted to find the closest cluster.

This approach exhibits a robust framework. Incorporating a new API entails replicating the established procedure: creating its sentence embedding and predicting the most similar cluster. If that cluster holds fewer than five APIs, the new API is assimilated into it; otherwise, a new cluster is formed to accommodate it. This adaptive scalability ensures that the approach remains effective and efficient in handling API expansion.

OpenAI Function Calling

OpenAI’s gpt-3.5-turbo model includes function calling capabilities: rather than responding to a query directly with NL, the model can indicate a function to call. Function calling makes ChatGPT’s outputs systematically consumable as JSON, and the model can produce the arguments that a custom function in the source code expects. This opens new possibilities for building intelligent and customizable applications with flexibility and advanced reasoning capabilities.

The model can intelligently choose to output a JSON object containing the arguments for a function call. It takes the message history alongside a new property named “functions,” provided as a JSON list that includes each function’s name, description, and parameters. The model can then interpret the end user’s natural language and select a function to call, and the source code handles invoking that function accordingly. The figure below shows the list of functions provided by the developer:

Figure 4: OpenAI function call example for Pokemon APIs
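As a sketch, a function definition for the Pokemon example used later in this article might look like the following; the exact fields shown in the figure may differ.

```python
functions = [
    {
        "name": "find_pokemon_by_name",
        "description": "Retrieve details about a Pokemon given its name",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "Name of the Pokemon, e.g. charizard",
                },
            },
            "required": ["name"],
        },
    },
    # ...the remaining function definitions from the selected cluster
]
```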

When the functions are passed to the OpenAI LLM, the model can determine the user intent alongside the arguments extracted from the user prompt. The response looks like the following code snippet:

Figure 5: OpenAI function call response for sample query
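Using the 2023-era openai Python SDK (v0.x; newer SDK versions expose a different interface), the call and the shape of the response look roughly like this:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Can you tell me about Charizard?"}],
    functions=functions,   # the list defined above
    function_call="auto",  # let the model decide whether to call a function
)
message = response["choices"][0]["message"]
# When the model opts to call a function, message resembles:
# {"role": "assistant", "content": null,
#  "function_call": {"name": "find_pokemon_by_name",
#                    "arguments": "{\"name\": \"charizard\"}"}}
```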

OpenAI function calling eliminates the need for exhaustive mapping of specific user prompts or creating large datasets for fine-tuning, which drives development in a more intuitive direction. The method is inherently scalable, accommodating the addition of new functions to the list without modifying the model. The approach empowers developers to compile a comprehensive list of functions complete with descriptions and parameters. This flexibility enables the seamless handling of a wide range of intents, facilitating the development of dynamic conversational applications.

Calling API

We implemented a two-stage approach for API calls. Initially, we used a data extraction pipeline to gather critical data from SwaggerHub. This data was then transformed into an optimized JSON representation for accurate function calls. Our process concluded by entrusting the chosen functions to OpenAI for parameter extraction; OpenAI’s response includes the endpoint name alongside the required payload. We then execute the request, translating the intended functionality into action.
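Continuing the Pokemon example from the function-calling sketch above, the final execution step might look like this; the endpoint lookup table is a hypothetical stand-in for the records in our PostgreSQL database.

```python
import json
import requests

call = message["function_call"]  # from the ChatCompletion response above
# PokeAPI expects lowercase resource names
args = {k: str(v).lower() for k, v in json.loads(call["arguments"]).items()}

# Hypothetical mapping from function names back to their endpoint templates
ENDPOINTS = {
    "find_pokemon_by_name": ("GET", "https://pokeapi.co/api/v2/pokemon/{name}"),
}

method, url_template = ENDPOINTS[call["name"]]
api_response = requests.request(method, url_template.format(**args))
print(api_response.json()["name"])  # "charizard"
```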

The following sequence diagram illustrates the end-to-end flow going from natural language to a REST API.

Figure 6: NL to REST sequence diagram

Results

We present the outcome of our streamlined approach for efficiently retrieving information in response to user queries.

Figure 7: End-to-end example for Natural Language to REST API conversion

Query Analysis and Clustering

Given the NL query “Can you tell me about Charizard?”, we compared the sentence embedding of the query with the set of predefined cluster centroids and successfully categorized it into Cluster 16. This cluster was visually represented as the distinct red cluster located in the upper-right corner of our clustering diagram.

Relevant Function Identification

Within Cluster 16, we retrieved the functions belonging to the cluster from our database. OpenAI then evaluated the most appropriate function for our task: “find_pokemon_by_name.”

Named Entity Recognition (NER)

OpenAI employed Named Entity Recognition (NER) on the query to proceed with the selected function. This step aimed to extract the essential parameters required by the function. The function call correctly identified “Charizard” as the parameter, showcasing the precision of the approach.

Formulating The Request

We construct a request with all necessary details using the extracted parameter to retrieve comprehensive information about Charizard.

API Execution and Result

The constructed request is then executed, yielding information about Charizard. This final output is a testament to the efficiency and accuracy of our approach. The response is then summarized using the LLM so that natural language is returned to the user.

Figure 8: API response after hitting the API request formed from OpenAI Function Call

Importantly, this approach offers significant advantages. By grouping similar endpoints into clusters and selecting the relevant functions, we have reduced the burden of dealing with many API endpoints. Instead of navigating through 62 API endpoints, our streamlined approach allows us to focus on just five or fewer. This reduction is particularly valuable given the 16,000-token limit. Our approach optimizes the information retrieval process, enhancing the user experience through simplicity and effectiveness.

Natural Language to SQL

The primary objective of the NL to SQL component was to enable interacting with SQL databases using natural language. We devised an end-to-end approach that combines metadata extraction with OpenAI’s LLM, simplifying database interactions by eliminating the need to query the database programmatically. Our API provides metadata extraction, SQL generation, and result summarization capabilities that serve as the foundation for constructing SQL queries that reflect the user’s intention.

Database Metadata Enrichment

Our solution involves extracting metadata from the database schema. We programmatically extract the database structure in real time, including tables, columns, constraints, and relationships. This schema guides the generation of SQL statements that align with the database’s underlying structure. The generated metadata is passed as part of the system prompt. The following class diagram shows the metadata extracted for query generation, covering all the critical information in the database while keeping the prompt under the 16,000-token limit.

Figure 9: Class Diagram for the Metadata Schema
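A minimal extraction sketch using SQLAlchemy’s inspector, assuming a PostgreSQL connection to the Chinook sample database; the metadata shape we actually use matches the class diagram above rather than this simplified dictionary.

```python
from sqlalchemy import create_engine, inspect

# Illustrative connection string; we used PostgreSQL with the Chinook sample database
engine = create_engine("postgresql://user:password@localhost/chinook")
inspector = inspect(engine)

schema = {}
for table in inspector.get_table_names():
    schema[table] = {
        "columns": [
            {"name": c["name"], "type": str(c["type"]), "nullable": c["nullable"]}
            for c in inspector.get_columns(table)
        ],
        "primary_key": inspector.get_pk_constraint(table)["constrained_columns"],
        "foreign_keys": inspector.get_foreign_keys(table),
    }
# `schema` is serialized and embedded in the system prompt for SQL generation
```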

Zero-shot Prompting

Zero-shot prompting is a method where we guide a powerful LLM to perform specific tasks without training it extensively beforehand. This is achieved by giving the model carefully crafted instructions that include details about the task, how the input looks, and the expected output. It is much quicker than the traditional approach of preparing data for training.
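An illustrative system prompt for the NL to SQL task might read as follows; the exact wording of our production prompt differs.

```python
SYSTEM_PROMPT = """You are an expert SQL assistant for a PostgreSQL database.

Database schema:
{schema_json}

Given a user's question, respond with a single valid PostgreSQL query that
answers it. Return only the SQL, with no explanation."""
```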

SQL Generation

By leveraging the metadata, we integrate OpenAI’s LLMs to generate valid SQL from the user’s natural language.
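A generation sketch using the 2023-era openai SDK and the prompt template above; the sample question assumes the Chinook database.

```python
import json
import openai

def generate_sql(question: str, schema: dict) -> str:
    """Ask the LLM to translate a natural language question into SQL."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output is preferable for SQL generation
        messages=[
            {"role": "system",
             "content": SYSTEM_PROMPT.format(schema_json=json.dumps(schema))},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"].strip()

sql = generate_sql("Which artist has the most albums?", schema)
```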

Summarization

Furthermore, we utilize the language model to summarize SQL query results, offering users an overview of the data in natural language.
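The summarization step can be sketched the same way; the prompt wording is illustrative.

```python
import openai

def summarize_results(question: str, rows: list[dict]) -> str:
    """Turn raw query results back into a natural language answer."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Summarize these SQL query results in plain English."},
            {"role": "user", "content": f"Question: {question}\nResults: {rows}"},
        ],
    )
    return response["choices"][0]["message"]["content"]
```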

Alternative Solutions

The comprehensive, open-source LangChain framework allows developers to create language-powered apps that leverage LLMs. Chaining components from different modules enables the development of specialized applications around LLMs. Summarization, generative question-answering (GQA), and chatbots are a few applications that can be built with LangChain.
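For comparison, the LangChain baseline we tested can be set up in a few lines. This sketch follows the LangChain API as it existed in mid-2023 (SQLDatabaseChain has since moved to the langchain_experimental package), and the connection string is illustrative.

```python
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase
from langchain.chains import SQLDatabaseChain

db = SQLDatabase.from_uri("postgresql://user:password@localhost/chinook")
llm = OpenAI(temperature=0)

# verbose=True prints the intermediate SQL the chain generates and checks
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
answer = chain.run("Which artist has the most albums?")
```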

The following sequence diagram illustrates the end-to-end flow going from natural language to a SQL query execution.

Figure 10: NL to SQL sequence diagram

Results and Performance

After testing the prompt engineering approach and the LangChain framework with numerous questions of varying complexity, the prompt engineering solution performed much better. Although the LangChain method could answer all the questions, it took longer than prompt engineering. To support this claim, the graph below depicts ten natural language inputs given to both approaches, showing a significant time difference.

Figure 11: Response-time comparison of the LangChain and prompt engineering approaches

Summarizing the results from the graph, LangChain takes an average of 7.53 seconds per query, while the prompt engineering method averages 4.84 seconds. Prompt engineering therefore outperforms the LangChain framework by a significant margin. Runs three and ten show a noteworthy time difference: in run ten, LangChain takes 7.74 seconds to generate the output versus 4.85 seconds for the prompt engineering approach, and in run three, LangChain takes 4.66 seconds versus just 2.19 seconds. The maximum time the prompt engineering method takes to answer a question is 6.29 seconds, less than the average time the LangChain framework takes to generate an output.

Furthermore, both approaches give similar natural language output, as shown in the following figures.

Figure 12: Comparison of Prompt Engineering vs. LangChain for given question
Figure 13: Comparison of Prompt Engineering vs. LangChain for given question

LangChain’s comparatively slower performance can be attributed to its distinct schema: it employs iterative query generation and multiple syntax checks, which contribute to the observed lag. In contrast, our prompt engineering solution generates a precise query in a single pass.

Acknowledgements

We would like to extend our gratitude to Varghese Cottagiri, Jenston B’Vera, Kunal Mahajan, and Kyra Azzopardi for their support throughout this project. Special thanks also go to Simon Kalechestein and Charbel Safadi for their extremely helpful review and comments on this article.

References

  1. Artificial Intelligence Definitions. Available at: https://hai.stanford.edu/sites/default/files/2020-09/AI-Definitions-HAI.pdf (Accessed: 09 August 2023).
  2. OpenAPI (2023) OpenAPI Initiative. Available at: https://www.openapis.org/ (Accessed: 12 August 2023).
  3. OpenAPI Specification v3.1.0 | Introduction, Definitions, & More. Available at: https://spec.openapis.org/oas/v3.1.0 (Accessed: 12 August 2023).
  4. Hugging Face — the AI community building the future. Available at: https://huggingface.co/ (Accessed: 12 August 2023).
  5. Education Ecosystem (2018) Understanding K-means clustering in machine learning, Medium. Available at: https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1 (Accessed: 09 August 2023).
  6. Build, Collaborate & Integrate APIs, SwaggerHub. Available at: https://app.swaggerhub.com/home (Accessed: 09 August 2023).
  7. Cosine Similarity — an overview, ScienceDirect Topics. Available at: https://www.sciencedirect.com/topics/computer-science/cosine-similarity (Accessed: 09 August 2023).
  8. Function calling and other API updates, OpenAI. Available at: https://openai.com/blog/function-calling-and-other-api-updates (Accessed: 10 August 2023).
  9. Tiu, E. (2021) Understanding Zero-shot learning — making ML more human, Medium. Available at: https://towardsdatascience.com/understanding-zero-shot-learning-making-ml-more-human-4653ac35ccab (Accessed: 10 August 2023).
  10. Prompt engineering and LLMs with LangChain, Pinecone. Available at: https://www.pinecone.io/learn/langchain-prompt-templates/ (Accessed: 10 August 2023).
  11. OpenAI. Available at: https://openai.com/ (Accessed: 10 August 2023).
  12. LangChain. Available at: https://docs.langchain.com/docs/ (Accessed: 10 August 2023).
  13. Documentation, PokéAPI. Available at: https://pokeapi.co/docs/v2 (Accessed: 10 August 2023).
  14. Chinook sample database for PostgreSQL, GitHub. Available at: https://github.com/morenoh149/postgresDBSamples/blob/master/chinook-1.4/Chinook_PostgreSql.sql (Accessed: 10 August 2023).
  15. Mckaywrigley, chatbot-ui: An open source ChatGPT UI, GitHub. Available at: https://github.com/mckaywrigley/chatbot-ui (Accessed: 10 August 2023).
