Revolutionizing Business with Large Language Models Powered Systems

Neta Barkay
Mar 3, 2023


Co-authored by Itamar Friedman

[Image by Tom Parandyk with Midjourney]

The introduction of ChatGPT ignited people’s imagination and dramatically increased interest in Large Language Models (LLMs) and generative AI. From revolutionizing customer service to powering virtual assistants and writing code, the potential of LLMs is exciting. Beyond individual users and generative-AI-focused startups, many mature companies want to adopt this emerging technology and its disruptive capabilities by integrating language models into their products, whether to add new features or to enhance existing NLP capabilities. Whether we look at new products enabled by generative AI or at existing products gaining features, the language model cannot stand alone; it must be combined with additional capabilities.

What do developers need to leverage Large Language Models and create robust intelligent business solutions?

This post discusses the components that enable the creation of a complete LLM-powered product. We start with fundamental building blocks, including prompt design, model fine-tuning, database retrieval, and API integrations. In the second part, we argue that customer-facing systems, and those in critical domains, require more advanced capabilities: adhering to company policy and tone of voice, ensuring user safety, protecting against malicious activity, managing privacy, and assuring quality.

Creating an LLM-powered System

Like other software solutions, a system powered by LLMs should address a specific need and accomplish distinct tasks. It should be linked to your business operations, use your data, and integrate with your existing systems. It should provide accurate and valuable outputs, along with references that support its answers. Here are some concepts that help achieve this.

Prompt design: crafting inputs for task-specific outputs

Prompt design, or prompt engineering, has received vast attention as a method for customizing the LLM’s output. In a few lines of text, you assign the language model a role, a writing style, and interaction objectives. For example, it can act as a customer success representative and write content in a brand’s style. Many lists of recommended prompts are available. For question-answering use cases, you can use few-shot learning: provide input-output pairs within the prompt as examples. Through the prompt, you perform in-context learning, as you “teach” the model to solve your specific tasks.

Under the hood, prompt design takes the language model, which was trained on vast amounts of data, and directs its attention to the specific context it should use for the response. Because the effect of a prompt on the model’s network is unknown, manual prompting often becomes an iterative process, taking longer for more complicated scenarios.
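
To make this concrete, here is a minimal sketch of a few-shot prompt for a customer support assistant. Everything here is illustrative: the brand name, the example pairs, and the call_llm placeholder stand in for whatever provider and prompt content you would actually use.

```python
# A minimal sketch of few-shot prompt design for a question-answering task.
# `call_llm` is a hypothetical stand-in for your provider's completion API.

FEW_SHOT_EXAMPLES = [
    ("How do I reset my password?", "Go to Settings > Security and click 'Reset password'."),
    ("Can I change my plan mid-cycle?", "Yes, upgrades apply immediately; downgrades at renewal."),
]

def build_prompt(question: str) -> str:
    """Assemble a role, style instructions, and input-output examples."""
    header = (
        "You are a customer success representative for Acme Inc.\n"
        "Answer briefly, in a friendly and professional tone.\n\n"
    )
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in FEW_SHOT_EXAMPLES)
    return header + shots + f"Q: {question}\nA:"

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your provider's completion call here.
    return "<model completion>"

print(call_llm(build_prompt("How do I export my data?")))
```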

Introducing additional data: enhancing LLM knowledge

Pre-trained language models are trained on enormous amounts of data, but not on your data. For example, if you build a customer support solution, the relevant material could be in documentation recently added to your website or in private data sources containing customer interactions. There are several ways to introduce new data to a language model, including:

  1. Hardcoded prompts — including your information directly as part of the interaction. This method suits short texts, as prompts are limited to a few thousand tokens.
  2. Fine-tuning the model — taking an open-source model and retraining it on your data, or creating a fine-tuned model available through an API. In this process, you update the model’s parameters, obtaining a new LLM tailored to your data and objectives. Since continuous training of LLMs is still unavailable, this will be an intermittent process.
  3. Database querying — store your data in a database, indexed by vector embeddings or more traditional methods. Given a query, search the database and present the top matches in the LLM prompt as resources the model uses to compose a response (a minimal sketch follows this list). This option suits large, frequently updated data, such as a Q&A knowledge base or recent news.
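
As a rough illustration of option 3, the sketch below embeds a handful of documents, retrieves the closest ones to a query by cosine similarity, and pastes them into the prompt as context. The embed function is a hypothetical placeholder; in practice an embedding model or API, and a real vector database rather than an in-memory array, would fill these roles.

```python
# A minimal retrieval sketch: embed documents, find the closest ones for a
# query, and include them in the prompt as context the model should use.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: replace with a real embedding model or API call.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

DOCS = [
    "Shipping takes 3-5 business days within the EU.",
    "Refunds are processed within 14 days of return.",
    "Support is available 24/7 via chat and email.",
]
DOC_VECTORS = np.stack([embed(d) for d in DOCS])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    return [DOCS[i] for i in np.argsort(-scores)[:k]]

context = "\n".join(retrieve("When will my order arrive?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: When will my order arrive?"
```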

Architecture design and integrations: building an LLM-powered system

We have already discussed two parts of an LLM system: prompts and databases. As you develop an LLM-powered application, you quickly see that you need an architecture with various components on top of the LLM.

Imagine multiple prompts, LLMs, databases, internal and external API integrations, and pre- and post-processing of text, all chained together. For example, several prompts can be used to adjust the tone of an interaction as it progresses. APIs can handle user authentication, database retrieval, fetching previous purchases for personalization, and external calls such as searching Google. An additional component in such architectures should determine if and when to invoke each part.
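
Here is a toy sketch of such an architecture, assuming a simple keyword-based router: one component classifies the user’s intent, and the router invokes retrieval, an internal API, or a plain LLM reply accordingly. All component functions are hypothetical stubs.

```python
# A toy chained architecture: a router decides which component handles a turn.

def classify_intent(user_input: str) -> str:
    # In practice this could itself be an LLM prompt or a trained classifier.
    if "order" in user_input.lower():
        return "order_lookup"
    if "?" in user_input:
        return "question"
    return "chitchat"

def lookup_order(user_input: str) -> str:
    return "Order #1234 ships tomorrow."  # stand-in for an internal API call

def answer_with_retrieval(user_input: str) -> str:
    return "<LLM answer grounded in retrieved docs>"  # stand-in

def small_talk(user_input: str) -> str:
    return "<LLM reply using the brand's tone prompt>"  # stand-in

ROUTES = {"order_lookup": lookup_order, "question": answer_with_retrieval, "chitchat": small_talk}

def handle_turn(user_input: str) -> str:
    return ROUTES[classify_intent(user_input)](user_input)

print(handle_turn("Where is my order?"))
```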

Staying grounded: avoiding hallucinations

A central fear when using LLMs is producing erroneous content: factually incorrect outputs that the model “hallucinates” by combining phrases out of context. Examples include wrong answers to questions, as in the demo of Google’s chatbot Bard, made-up code packages and functions, and false promises during a customer service interaction. Hallucinations happen because language models are trained to predict the next words in a text, with no guarantee of the “correctness” of the ideas presented in complete sentences.

To address this issue, some app developers split the process into smaller tasks that demand less creativity, such as paragraph summarization, and build multi-component architectures. Others restrict the output to a fixed set of responses and use LLMs only for intent classification. Another option is to query a database instead of relying on the pre-trained model’s knowledge, or to verify the correctness of the output, for example by compiling generated code (see the sketch below). These measures mitigate but do not fully resolve the groundedness problem, especially for precision-focused applications.
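
The code-verification idea is easy to illustrate. The sketch below checks that LLM-written Python at least parses and compiles before it is used; llm_generated_code is a placeholder for a real model completion. This catches syntactically broken or partially hallucinated code, though not logical errors.

```python
# Verify LLM-written Python by checking that it parses and compiles.
import ast

def llm_generated_code() -> str:
    # Placeholder for a model completion that is supposed to be Python.
    return "def add(a, b):\n    return a + b\n"

code = llm_generated_code()
try:
    ast.parse(code)                 # syntax check; does not execute the code
    compile(code, "<llm>", "exec")  # catches a few additional compile-time errors
    print("Output passed the syntax check.")
except SyntaxError as exc:
    print(f"Rejecting broken or hallucinated code: {exc}")
```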

Supporting decisions: enabling references and explainability

Given that language models can hallucinate, and in other cases as well, such as controversial topics, domains rife with fake news, and plagiarism avoidance, we would like to know the source of the presented information. LLMs produce text without providing a reference: a webpage, news article, piece of code, or part of a video. Worse, they make up plausible-looking references and citations.

Referencing is challenging because language models are trained on data without source pointers, and at inference time they mix data from multiple sources. Checking the similarity between model outputs and potential references offers some mitigation, but without guaranteed quality.
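
A rough sketch of that mitigation: score candidate sources against the model’s output and attach the best match as a tentative reference. For self-containment this uses a standard-library string ratio; embedding similarity would be the more realistic choice, and neither guarantees the reference is the true source.

```python
# Attach the most similar candidate source as a tentative reference.
from difflib import SequenceMatcher

CANDIDATE_SOURCES = {
    "docs/returns.md": "Refunds are processed within 14 days of return.",
    "docs/shipping.md": "Shipping takes 3-5 business days within the EU.",
}

def best_reference(model_output: str, threshold: float = 0.5):
    scored = {
        src: SequenceMatcher(None, model_output, text).ratio()
        for src, text in CANDIDATE_SOURCES.items()
    }
    src, score = max(scored.items(), key=lambda kv: kv[1])
    return (src, score) if score >= threshold else (None, score)

print(best_reference("You will be refunded within 14 days of returning the item."))
```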

Making the LLM-powered System Ready for Business

The capabilities listed in the previous section provide the basic functionality of an LLM-powered system. We expect business applications, especially those without humans in the loop, to require additional capabilities for handling output quality. Specific examples of such applications include:

  1. Customer-facing applications, like sales and support bots, where the language model is the front line and must uphold the brand’s image and policy.
  2. Information-transfer systems that perform updates via API, such as a CRM, in which the output must be accurate.
  3. Critical domains like healthcare, where output quality is crucial for safety and well-being.

Let’s dive into those advanced capabilities.

Brand consistency: adhering to company policy and tone of voice

Companies with user-facing applications want to control their brand image by placing constraints on the content and form of responses during an interaction: for example, defining how the model should answer inquiries about competitors, or setting how apologetic the responses should be. Setting the tone in the prompt may be enough to maintain it throughout the interaction. Company guidelines and specific answers can be supplied as additional data to the LLM-powered system, but they can be hard to follow and “remember” throughout a long interaction.
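
As a small illustration, a policy can be encoded in the system prompt and backed by a cheap post-generation check. The policy text, banned phrases, and brand name below are all made up for the example; a trained classifier would be more robust than keyword matching.

```python
# A minimal sketch of encoding tone-of-voice and policy constraints.

SYSTEM_PROMPT = (
    "You are a support agent for Acme Inc.\n"
    "Tone: warm, concise, apologize at most once per conversation.\n"
    "Policy: never disparage competitors; if asked to compare, describe "
    "Acme's features factually and suggest the user evaluate both."
)

BANNED_PHRASES = ["our competitor is worse", "they are terrible"]

def passes_policy(reply: str) -> bool:
    """Cheap post-generation check; a classifier would be more robust."""
    return not any(p in reply.lower() for p in BANNED_PHRASES)
```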

User safety: protection from harmful content

Language models can give harmful recommendations, for example in domains like health, and can produce offensive or biased content about countries, races, politics, or subjective topics. Safety is a major concern for foundation model providers, and some try to restrict outputs as they create the model. Some use cases require additional in-app safeguards, such as output moderation by classifiers (sketched below). In some domains, applications may remain assistants to professionals for some time.
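
The moderation-gate pattern is straightforward to sketch: every candidate reply passes through a safety classifier before it reaches the user. Here safety_score is a hypothetical hook; a hosted moderation endpoint or a local toxicity classifier would typically sit behind it.

```python
# Output moderation gate: block replies that a safety classifier flags.

SAFE_FALLBACK = "I'm sorry, I can't help with that. Let me connect you to a human agent."

def safety_score(text: str) -> float:
    # Placeholder returning a harm probability in [0, 1];
    # wire a real moderation model or endpoint here.
    return 0.0

def moderated_reply(candidate: str, threshold: float = 0.5) -> str:
    return candidate if safety_score(candidate) < threshold else SAFE_FALLBACK
```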

System safety: protection from malicious user behavior

People engaging with the model may have hostile intentions. They may try to make the model write false information into a CRM, extract false promises from a chatbot support representative, or steer an interaction toward a public-relations disaster. Malicious behavior can also come from “curious” users exploring the system’s limits. Technically, this is done with phrases familiar from ChatGPT, such as “imagine that” or “write a function,” or with jailbreak-style hacks. A protection layer with stopping conditions for the interaction should be implemented to prevent these scenarios (a rough sketch follows).
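
A rough sketch of such a protection layer, assuming simple pattern matching: scan user input for known jailbreak phrasings and cap how many suspicious turns are tolerated before stopping the interaction. The patterns and limits are illustrative; real systems combine many signals.

```python
# Stopping conditions against malicious or probing inputs.
import re

SUSPICIOUS_PATTERNS = [
    r"ignore (all|your) (previous|prior) instructions",
    r"imagine that you (are|have) no (rules|restrictions)",
    r"pretend (you|to) (are|be) .*without (limits|filters)",
]

class Guardrail:
    def __init__(self, max_strikes: int = 2):
        self.strikes = 0
        self.max_strikes = max_strikes

    def check(self, user_input: str) -> bool:
        """Return True if the interaction may continue."""
        if any(re.search(p, user_input, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS):
            self.strikes += 1
        return self.strikes < self.max_strikes
```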

Failing gracefully: fault tolerance and escalation policy

The previous points highlighted problematic scenarios for an LLM-powered system: going off the brand script, outputting biased data, or receiving inappropriate user input. In such cases, in other unsupported scenarios, or upon user request, the system needs to alert, stop the interaction, and continue in another medium. Depending on the case, escalation to humans, in real time or through a ticketing system, may be necessary. Paying attention to the customer experience in these moments is important.

Quality assurance: meeting service standards

Quality assurance processes are essential to verify the system’s functionality, performance, and output quality. When working with LLMs specifically, three aspects are important to evaluate: quality of the content, safety, and groundedness. Ideally, unit tests and scenario testing would handle these. However, this is challenging because there is no straightforward way to evaluate these aspects, and because LLM output is non-deterministic, so more sophisticated output comparisons are needed (see the scenario-test sketch below).
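
One way to test non-deterministic output is to assert on properties rather than exact strings. The sketch below, with run_system as a hypothetical entry point to the application, checks that a required fact appears and a forbidden promise does not.

```python
# Scenario testing for non-deterministic output: assert on properties,
# not exact strings. `run_system` is a hypothetical app entry point.

def run_system(user_input: str) -> str:
    return "Refunds are processed within 14 days."  # placeholder reply

def test_refund_scenario():
    reply = run_system("How long do refunds take?")
    assert "14 days" in reply                # required fact (groundedness)
    assert "guarantee" not in reply.lower()  # forbidden promise (policy)

test_refund_scenario()
```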

Protecting user privacy: managing private data

During an LLM-powered interaction, Personally Identifiable Information (PII), such as names and emails, might be revealed by either the user or the language model through its API calls. Handling this information properly is crucial to protect the user’s privacy, comply with regulations, and ensure data is used only for its intended purposes. Managing privacy can include masking PII during processing, ensuring it is not stored in the system, and deleting or anonymizing it after the interaction.
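
Here is a minimal sketch of PII masking before text is logged or passed onward. The two regex patterns below cover only emails and phone-like numbers and are purely illustrative; production systems typically rely on dedicated PII-detection or NER tooling.

```python
# Mask common PII patterns before storing or forwarding text.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(mask_pii("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
```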

Real-time adjustments: adjusting to changing business needs

As a language model runs in production, it might require real-time changes. For example, a chatbot may need to inform customers about a promotion, a delay in shipment times, or a problem on the website. The system’s architecture can support these real-time modifications through options ranging from hardcoded outputs to prompt adjustments, database updates, or deliberately triggered fallback scenarios.
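
One simple implementation path is a live configuration value, editable without redeploying, that gets injected into every prompt. The in-memory dict below stands in for a config service or database.

```python
# A live "announcement" value injected into every prompt at build time.

LIVE_CONFIG = {"announcement": ""}

def set_announcement(text: str) -> None:
    LIVE_CONFIG["announcement"] = text  # e.g., updated from an ops dashboard

def build_prompt(user_input: str) -> str:
    note = LIVE_CONFIG["announcement"]
    prefix = f"Current notice to share if relevant: {note}\n\n" if note else ""
    return prefix + f"Customer: {user_input}\nAgent:"

set_announcement("Shipments are delayed by 2 days due to weather.")
print(build_prompt("When will my order arrive?"))
```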

Powerful analytics: visibility and insights

An analytics layer gives visibility into the system’s operations. Its structure depends on the application, but as with any analysis of textual data, it could include raw interaction data, intent classification, sentiment analysis, trend detection, and more. A key difference from a regular analytics system is that it should also analyze the LLM’s side of the conversation: measuring how well the language model performed across tasks and scenarios, and serving as a data source for optimization.

Future Thoughts

The LLM ecosystem is flourishing, with AI infrastructure, foundation models available via API or open source, model hubs, generative AI startups, and mature companies creating LLM-powered offerings. Startup funding has increased over the past two years across domains, from infrastructure and engineering to business applications. Industry and academic leaders are making rapid progress on fundamental challenges in the LLM space. All of this makes it difficult to predict how the ecosystem will evolve and which players will dominate the market.

Nevertheless, the usability and potential value of LLMs are evident. As language models are perceived as passing the Turing test, they have matured into a new interface between humans and machines. And because LLMs can be added to almost any software quickly, we expect LLM-powered systems to be integrated into a wide range of applications in the coming years, transforming many of our textual interactions.

Acknowledgments: Thanks to all the people behind the posts, articles, and papers mentioned in this post.
