The Role of GenAI and LLMs at Kustomer

Víctor Peinado
Kustomer Engineering
9 min read · Jan 16, 2024

Generative AI and LLMs

Generative AI and LLMs are arguably the most popular terms in tech news and social media this year. They have come to revolutionize not only the world of Artificial Intelligence and how we build software, but also how we interact with machines across all sectors and industries, and in our daily lives.

Generative AI refers to a category of artificial intelligence that is capable of creating new content that is similar, in statistical properties, to the data it was trained on. It learns from a vast amount of data, internalizing the underlying patterns, structures, and features. Using this learned knowledge, generative AI can then generate novel data samples that were not in the training data but bear a resemblance to it in terms of structure and style. The generated content can range from text, images, and audio, to more complex data like video or even 3D models.

[Image: AI-generated illustration in an orange-red palette of a bold letter K surrounded by books, a glowing light bulb, and gears, with a subtle smile evoking the Kustomer logo.]

An LLM, which stands for large language model, is a Generative AI tool that shows astonishing language capabilities. More specifically, it is the result of training a massive neural network on a vast amount of text data. The training process consists of showing the machine millions of text examples with the goal of predicting the next word in a sequence, given the words that have come before (a task known as language modeling). By processing countless sentences, phrases, and words during training, an LLM captures intricate patterns and relationships within the language. It essentially learns a statistical representation of the language which includes grammar, syntax, and even some level of semantic understanding. It also seems to capture some general understanding of how our world works and, to some extent, what we usually call common sense.
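To make that training objective a bit more concrete, here is a minimal sketch of next-word prediction using the small, open-source GPT-2 model through the Hugging Face transformers library. The model and the example prompt are purely illustrative; production LLMs are vastly larger, but the objective is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small, older open-source model, but it was trained with the exact
# objective described above: predict the next token given the previous ones.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The customer asked for a refund because the"  # made-up example
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probability distribution over the whole vocabulary for the next token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")
```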

Language models are not new

Indeed, at Kustomer, more than 3 years ago, we started leveraging what today sound like prehistoric language models (BERT, RoBERTa, and Universal Sentence Encoder) to build language understanding features and provide intent detection, message categorization, suggested actions, and semantic search. However, both the size and the capabilities of the newer models have grown dramatically over the last few months. The large in large language model refers to the size of the training data and the sheer number of parameters the model has, often in the billions or even trillions. And this scale has been shown to contribute to their ability to comprehend and generate text in a coherent, contextually appropriate, and often insightful manner.

A key feature of modern LLMs such as ChatGPT, LLaMA, or Claude is their ability to handle a variety of natural language processing tasks: they can answer questions, summarize text, translate languages, and much more, right out of the box. These capabilities open up exciting avenues for creating more intelligent, understanding, and helpful dialog systems. They can understand instructions and queries in natural language, provide informative responses, and maintain a coherent conversation over multiple turns, which are crucial features for effective conversational assistants.

LLMs are not perfect

In any case, LLMs are not perfect magical tools. The more we use them, the more we understand their limitations. Some of their drawbacks are already very well known:

  1. They don’t know about recent events or real-time information. LLMs are trained on static datasets, meaning they have a knowledge cut-off date beyond which they are unaware of new events, discoveries, or changes in information.
  2. They don’t know about domain-specific, private and custom data. LLMs can only be aware of facts and data they have been trained on. They can’t access or retrieve personal or company-specific information.
  3. They sometimes provide misinformation and made-up answers, something commonly referred to as hallucinations.
  4. They are prone to show the same biases, societal prejudices and stereotypes present in their training data.
  5. Their behavior is not deterministic, and they are often seen as black boxes whose internal workings are not easily interpretable.

That said, an imperfect tool can still solve many tasks and be very useful. Let’s dive into how we use modern LLMs to build smart products at Kustomer.

LLMs for Conversational Assistants

Because of their language capabilities, LLMs look like the essential building block of a smart conversational assistant. However, as mentioned above, while they shine at providing accurate responses about general world knowledge, they are unable to answer questions about domain-specific, private, and custom data. How can we customize an LLM’s knowledge so it provides factual and accurate answers in a specific domain, such as customer support for your business?

Kustomer’s AI-generated Responses

AI Responses is a new type of interaction that Kustomer offers within KIQ Customer Assist: it can understand even the most complex inquiries expressed in natural language and provide precise answers, as long as those answers can be found in a knowledge base.

These responses can effectively handle messages with typos and variations, something hard to achieve with traditional chatbots, and offer elaborate, conversational answers.

And the answers can be easily customized by selecting a target collection of articles. There is no need to get involved in the costly and lengthy process of training a new foundation model from scratch or fine-tuning an existing LLM.

Retrieval Augmented Generation (RAG) System

One of the most common architectures meant to augment or customize the knowledge of an LLM with additional private or custom data consists of building a Retrieval Augmented Generation (RAG) system. A RAG system combines semantic search technology and the language understanding and language generation skills of an LLM.

Such a system encompasses three main components:

  1. Indexing: an offline step that loads the input documents, transforms them, and ingests their semantic representations into suitable storage, typically a vector database.
  2. Retrieval: a real-time process that takes the user’s input query, transforms it into the same representation used for the documents ingested in step 1, and, using a similarity-based algorithm, retrieves a set of relevant documents likely to contain an answer.
  3. Answer generation: a real-time step that takes the retrieved documents and instructs an LLM to generate an accurate answer to the user’s input query, supported by those documents.

Indexing

In the context of search, indexing refers to the process of storing the documents we want to search in a suitable database. Both the representation of these documents and the database are designed to optimize speed and performance at query time.

In the field of Natural Language Processing (NLP), we call embeddings any numerical representation of the meaning of a word, a sentence, or a longer piece of text. These embeddings can be generated with hand-crafted statistical techniques or, more recently, with a language model. When you look carefully at an embedding, all you see is what looks like a nonsensical bunch of numbers. The key idea to remember is that these embeddings encode and compress the information in such a way that words or sentences with similar meanings also have similar numbers.
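As a small illustration of that idea, the sketch below (the model name and the sentences are just examples) embeds three sentences with an open-source sentence-embedding model and compares them with cosine similarity: the two sentences about refunds end up much closer to each other than either is to the unrelated one.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Illustrative open-source embedding model; any sentence-level encoder works.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I get a refund for my order?",
    "I want my money back for a purchase I made.",
    "What are your store's opening hours?",
]
embeddings = model.encode(sentences)  # one vector per sentence

# Sentences with similar meanings end up with similar vectors,
# and therefore a higher cosine similarity.
print(cos_sim(embeddings[0], embeddings[1]))  # refunds vs. refunds: high
print(cos_sim(embeddings[0], embeddings[2]))  # refunds vs. opening hours: lower
```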

So, we have access to modern LLMs that have proved very good at generating semantic representations of text (there are excellent alternatives out there, from vendors such as OpenAI, Anthropic, or Cohere, to open-source models specifically trained for semantic similarity, such as the ones published on HuggingFace), and we can use them to easily transform any document (e.g., a knowledge base article) into an embedding.

And that’s precisely what we do in our knowledge base: every time a user composes and publishes a new article, we analyze its structure (extracting the title, the body, the tags, and some metadata), transform its contents, and store different representations of it (one of them being a semantic embedding) in a vector store.

Notice that, because the LLMs typically used to create embeddings have limits on the length of the input text, and because you don’t want your embeddings to encode too many unrelated pieces of information, it is common practice to split a long document into smaller snippets. There are multiple strategies for doing so, but the overall goal is to make your chunks as concise and cohesive as possible.
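To make the indexing step more tangible, here is a minimal sketch assuming a naive paragraph-based chunking strategy, an illustrative open-source embedding model, and FAISS standing in for a vector database; none of these choices reflects our actual production stack.

```python
import faiss  # stand-in for a vector database in this sketch
from sentence_transformers import SentenceTransformer


def split_into_chunks(article: str, max_chars: int = 500) -> list[str]:
    """Naive chunking: group paragraphs until a size limit is reached.
    Real strategies are more elaborate, but the goal is the same:
    keep each chunk concise and cohesive."""
    chunks, current = [], ""
    for paragraph in article.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

article = (
    "How to reset your password\n\n"
    "Go to Settings > Security and click 'Reset password'. "
    "You will receive an email with further instructions.\n\n"
    "If you still cannot log in after resetting, contact our support team."
)
chunks = split_into_chunks(article)

# Embed every chunk and ingest the vectors into the index.
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)
```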

Retrieval

Once we have a collection of document embeddings in a vector store, we can easily leverage them using semantic search. Semantic search is a modern approach to fuzzy search that promises to deliver more accurate results because it attempts to match queries and retrieved documents according to their intents, meanings, or concepts, instead of simply matching literal words. Imagine a search engine that is able to understand the intent and handle the contextual meaning of complex questions, instead of simple isolated keywords.

There are two basic requirements to use semantic search:

  1. We need to encode the input query using the same representation as the documents. In our particular use case, this implies creating query embeddings using the same LLM used to create the embeddings of the knowledge base articles.
  2. We need to compute the distance between the query embedding and the rest of the document embeddings in the vector space, and find the top-k most similar ones (using, e.g., some version of the k-nearest neighbors (k-NN) algorithm).

Additionally, at this point, in order to maximize precision, we may want to use the language understanding powers of the LLM to re-rank the retrieved results or to keep only the documents that are likely to contain the answer. The main goal is to identify which documents (and only those) contain the information required to properly answer the user’s question.
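Continuing the indexing sketch above (and reusing its model, index, and chunks variables), retrieval boils down to encoding the query the same way and running a nearest-neighbor search; the LLM-based re-ranking step is only hinted at in a comment.

```python
# Continuing the indexing sketch above: `model`, `index`, and `chunks` already exist.
query = "I forgot my password, how do I log back in?"

# 1. Encode the query with the *same* model used for the documents.
query_embedding = model.encode([query], normalize_embeddings=True)

# 2. Nearest-neighbor search in the vector space; with normalized embeddings
#    and an inner-product index, the scores are cosine similarities.
top_k = min(3, index.ntotal)
scores, ids = index.search(query_embedding, top_k)

retrieved = [chunks[i] for i in ids[0]]
for score, chunk in zip(scores[0], retrieved):
    print(f"{score:.3f}  {chunk[:60]}...")

# An optional extra step (not shown) would ask an LLM to re-rank or filter
# these chunks, keeping only the ones that actually contain the answer.
```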

Answer generation

Lastly, once we have retrieved a set of articles likely to support the answer, it is time to leverage the language capabilities of the LLM again. In this case, we want the model to understand the user’s question, and we want to constrain it to generate a factual answer.

This can be accomplished by instructing the model to respond to the question using only the information provided in the supporting documents, which represents a new paradigm of interacting with machines, a new way of coding in natural language, known as prompt engineering. It is still a young discipline that involves a great deal of trial and error, but to some extent an LLM can be tamed to provide accurate, reproducible, and consistent behavior.
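As a rough sketch of that idea, the example below grounds the answer in the retrieved chunks by instructing a chat model to use only the provided articles. The SDK, model name, and prompt wording are illustrative, not our actual production setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "I forgot my password, how do I log back in?"
retrieved_chunks = [  # in practice, these come from the retrieval step above
    "To reset your password, go to Settings > Security and click 'Reset password'.",
    "If you still cannot log in after resetting, contact our support team.",
]

# The instruction constrains the model to answer only from the supporting
# documents; this wording is illustrative, not a production prompt.
system_prompt = (
    "You are a customer support assistant. Answer the user's question using ONLY "
    "the information in the provided articles. If the answer is not there, say you don't know."
)
context = "\n\n".join(f"Article {i + 1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks))

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name; any capable chat model works
    temperature=0,        # a low temperature favors reproducible, consistent answers
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```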

Building LLM-based apps is challenging

Most of the time, a RAG system will outperform a traditional Conversational Assistant. When everything goes well, the user experience is unbeatable. But in such a new field, whose tools and practices are rapidly evolving, there are a number of interesting problems yet to be solved.

In general, shipping LLM-based products is very complex. Evaluating their performance, testing them end to end, interpreting and explaining their behavior, and scaling up the whole system are hard tasks, mostly due to the probabilistic behavior of the LLMs. Not every unexpected behavior or incorrect answer corresponds to a bug in the code that needs to be fixed.

We are still learning and investigating new ways to improve the overall experience of our Conversational Assistants. Do you have experience productizing LLMs? Are you interested in working on some of the challenges mentioned above? It turns out we are hiring… Come join the Krew!!

Conclusions

In this post, we have delved into the transformative impact of Generative AI and Large Language Models (LLMs). Generative AI, with its ability to produce new content mirroring the statistical properties of its training data, ranges from creating texts and images to more complex outputs like videos and 3D models.

At Kustomer, we’ve progressed from early models like BERT and RoBERTa to modern, more capable LLMs. These models excel in a variety of natural language processing tasks, enhancing the functionality of conversational assistants. They can handle complex language tasks, maintain coherent conversations, and offer insightful responses. However, they are not without limitations, such as their static knowledge base, inability to access real-time or domain-specific information, potential for misinformation, and inherent biases.

To address these limitations, we have developed AI-generated Responses within our Conversational Assistants. These responses understand complex inquiries and answer them accurately, drawing on a knowledge base. Moreover, our implementation of the Retrieval Augmented Generation (RAG) system combines semantic search with LLMs’ language capabilities, improving the precision and relevance of responses in customer support contexts. This system involves indexing, retrieval, and answer generation processes, each leveraging modern LLMs and semantic search technology.
