Challenges of building LLM apps, Part 1: Simple features

Aditya Challapally
Data Science at Microsoft
7 min read · Sep 19, 2023
Photo by Ilya Pavlov on Unsplash.

Let’s start with the basics that many folks know about LLMs at this point, especially if you’re a regular Data Science at Microsoft reader:

  • Large Language Models (LLMs) are super cool!
  • They can be finicky and can hallucinate.

Those observations are fairly typical, so we won’t rehash them in this article. However, we note that things get exponentially trickier when you also must:

  • Make an LLM work in a specific use case that runs more than 10 million times an hour.
  • Adhere to responsible AI principles.
  • Follow standard enterprise engineering principles (e.g., privacy, security).
  • Integrate this feature into other infrastructure.
  • Combine the LLM signal with other sources of data to rank and discard suggestions.

Your author and members of his team have been working with large language models for a few years, since almost before they were called “Large Language Models,” and yet some of these things still stump us. So, we want to share some tips about putting LLMs into production.

Instead of jumping straight into complicated projects like our Copilot work (which we can save for a later article), in this article we pick a small LLM feature, explain how it alone can get really complicated, and then slowly build on that.

The feature

The feature is relatively simple. Microsoft has a new product called Answers in Viva that allows people to ask and answer questions. For the implementation of this feature within Microsoft (i.e., not for the public implementation), questions are routed to experts whom we identify within the company (and it’s pretty cool).

When a user posts a question, we want them to tag Topics (kind of like hashtags) that would allow us to route their question even more efficiently to the right expert. As part of that, we wanted to build a feature that suggests the closest Topics to the question text. This feature provides a great example to understand the benefits and challenges of using LLMs.

So, for our base approach, we explore the implications of a feature whereby the user’s question is sent to an LLM (GPT 3.5/4), which then returns some themes that we then match to Topics.
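
To make that base approach concrete, here is a minimal sketch of what the pipeline could look like, using the 0.x-era openai Python package. The prompt wording, model choice, and comma-separated parsing are illustrative assumptions, not the production implementation.

```python
import openai  # 0.x-era SDK; assumes an API key is already configured

# Illustrative prompt; the production prompt is not shown in this article.
PROMPT = (
    "Extract three to five short themes from the question below. "
    "Return them as a comma-separated list.\n\n"
    "Question: {question}"
)

def suggest_themes(question: str) -> list[str]:
    """Ask the model for themes; a later step maps these onto existing Topics."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
        temperature=0.2,
    )
    text = response["choices"][0]["message"]["content"]
    return [theme.strip() for theme in text.split(",") if theme.strip()]
```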

The challenges

The challenges we discuss below concern which model to pick, what to do when LLM responses don’t match the desired output, what happens when edge cases are pointy, and the importance of implementing responsible AI and other enterprise standards. We relay our experiences with each one in turn.

Which model to pick?

Picking a model should be as easy as running the same prompt through multiple models and seeing which does better, right? In other words, “just pick GPT 3.5 Turbo!”

But because LLMs produce such different output every time, you need to do this comparison against a very large set of runs (more than 1,000) per model. And you’ll need others to judge these samples to determine what is better, or else you’ll bias the outcome.

So, you need to test out the output with tons of other people to pick a model (e.g., Babbage, Davinci, GPT 3.5 Turbo, and internal models).

What we learned: Using an LLM to pick the best output from different models can be super effective and time saving.
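
One way to do that at scale is an “LLM as judge” setup: a stronger model compares two candidates’ outputs for the same question and you tally wins across the evaluation set. Here is a minimal sketch, again with the 0.x-era openai package; the judge prompt and the choice of GPT-4 as judge are our illustrative assumptions, not a prescribed setup.

```python
import openai

JUDGE_PROMPT = """You are comparing topic suggestions from two models for the same question.

Question: {question}
Candidate A: {output_a}
Candidate B: {output_b}

Reply with exactly one letter, A or B, for the more relevant and specific set of topics."""

def judge(question: str, output_a: str, output_b: str) -> str:
    """Return 'A' or 'B' according to a stronger judge model."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, output_a=output_a, output_b=output_b)}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()[:1].upper()

# Run this over 1,000+ questions per model pair, and swap the A/B positions on
# half of the comparisons to control for position bias; humans should still
# spot-check a sample of the judge's verdicts.
```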

Then, once you’ve picked a model, you need to change the prompt, some of the parameters, and even the post processing steps.

What we learned: Each model requires different prompts! The best prompt for GPT 3.5 isn’t going to be the best prompt for GPT 4 (and so on for Babbage, Davinci, and others).

Ironically, the more niche your LLM application, the wider your set of models becomes. If you’re picking chat, GPT 3.5 Turbo is the easy choice. But if you’re extracting content from text? Not so simple. It took us 30 percent of the time simply to decide what model we should pick, which was way more than we expected.

What we learned: Budget at least 10 percent of your overall timeline for picking a model.

At this point, many folks might say “just go with GPT-4, it’s awesome and they’ve reduced the cost.” But those costs are still significant at the scale we’re working on. To put it into perspective, if 300 million Office users each used this just once daily, we would rack up $40 million in cost per day. It’s critical to choose wisely to avoid unintended costs.
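
As a rough sanity check on that figure, here is a back-of-envelope calculation; the token counts and per-token prices are our illustrative assumptions (roughly 2023-era GPT-4 8K list pricing), not the product’s actual numbers.

```python
# Hypothetical usage: one call per Office user per day.
calls_per_day = 300_000_000
prompt_tokens, completion_tokens = 4_000, 200  # assumed tokens per call
usd_per_prompt_token = 0.03 / 1000             # assumed $0.03 per 1K prompt tokens
usd_per_completion_token = 0.06 / 1000         # assumed $0.06 per 1K completion tokens

cost_per_call = (prompt_tokens * usd_per_prompt_token
                 + completion_tokens * usd_per_completion_token)
print(f"~${cost_per_call:.2f} per call, ~${cost_per_call * calls_per_day / 1e6:.0f}M per day")
# -> roughly $0.13 per call, which works out to about $40M per day
```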

When LLM responses don’t match the output you want

Some of the time, we found that the LLM’s output would not fit the JSON format we wanted.

What we learned: Without Function Calling or other mechanisms to keep the output correctly formatted, it takes 10 percent of the time to get to 80 percent good results, and 90 percent of the time to get to the remaining 20 percent.

The recently released Function Calling feature solves most of this (a sketch appears after the list below), and TypeChat from Microsoft tackles it in a clever way using TypeScript types as well. There are also a few other ways around this:

  1. Ask the model to explain step by step how it reaches an answer, known as Chain-of-Thought (CoT). Keep in mind, however, that using CoT might make the process slower and more expensive due to the need to generate more output.
  2. If you have a large prompt, break it down into smaller, simpler prompts.
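
Here is the Function Calling sketch mentioned above: defining a function schema steers the model toward returning well-formed JSON that can be parsed directly. It again uses the 0.x-era openai package; the function name suggest_topics and its schema are illustrative assumptions.

```python
import json
import openai

functions = [{
    "name": "suggest_topics",                    # hypothetical function name
    "description": "Suggest topics that match a user's question.",
    "parameters": {
        "type": "object",
        "properties": {
            "topics": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Candidate topic names extracted from the question.",
            }
        },
        "required": ["topics"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",                  # a model version that supports function calling
    messages=[{"role": "user", "content": "What is better? Apples or oranges?"}],
    functions=functions,
    function_call={"name": "suggest_topics"},    # force the model to 'call' our function
)

# The arguments string is (almost always) valid JSON matching the schema above.
arguments = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
topics = arguments["topics"]
```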

What we learned: To solve these problems, people often say we should fine-tune these models. But in the vast majority of cases, fine-tuning a model is not worth it.

When edge cases are pointy (ouch!)

The bane of every large software app in production becomes a bit — no, becomes significantly — worse with LLMs.

Let’s consider a specific scenario where the LLM returns topics that do not directly match any existing topics. For instance, the user asks a question like “What is better? Apples or oranges?” and the LLM returns “apples, oranges, comparison questions” as possible Topics. But “fruit” is already an established Topic, and the LLM hasn’t returned “fruit.”

One approach to solving this issue would be to pass the entire Topics directory each time we make a call to the LLM. However, this approach would be impractical due to absurd latency and cost increases.

An alternative approach is to take the topics from the LLM output and then run a semantic search against our existing Topics database. This keeps the initial LLM call small, saving significant latency. However, semantic matching can be time consuming and requires some setup effort, making this option less straightforward.

What we learned: Vector databases could be a solution here. But be careful about this! People are super quick to recommend vector databases. As much as you can avoid it, keep your entire LLM pipeline as simple as possible. Vector search is awesome and we do use it in certain cases but we’ve found that it’s best to avoid it if you see only marginal results (i.e., improvement of about 5 percent).

We found that if we stick to lexical matching, we create unnecessary Topics. So, we ultimately settled on a mix of lexical and semantic matching for certain use cases — but we wouldn’t have had all these problems if we had directly chosen an embeddings-based approach.
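
Here is a minimal sketch of that mix, assuming Topic embeddings are precomputed and held in memory; the thresholds and the difflib-based lexical pass are illustrative choices, not the production matcher.

```python
from difflib import SequenceMatcher

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_topic(candidate: str, candidate_vec: np.ndarray,
                topic_vecs: dict[str, np.ndarray],
                lexical_threshold: float = 0.85,
                semantic_threshold: float = 0.80) -> str | None:
    """Map one LLM-suggested topic onto an existing Topic, or return None."""
    # 1. Cheap lexical pass: catches near-duplicates such as "fruits" vs. "Fruit".
    def lexical(topic: str) -> float:
        return SequenceMatcher(None, candidate.lower(), topic.lower()).ratio()

    best = max(topic_vecs, key=lexical)
    if lexical(best) >= lexical_threshold:
        return best

    # 2. Semantic fallback: nearest existing Topic by embedding similarity,
    #    so "apples" can still land on "Fruit".
    topic, score = max(((t, cosine(candidate_vec, v)) for t, v in topic_vecs.items()),
                       key=lambda item: item[1])
    return topic if score >= semantic_threshold else None  # None -> consider a new Topic
```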

What we learned: It’s straightforward to get an LLM to 70 percent of a functioning product, but at that point you might realize that you need to replace it with a different Machine Learning (ML) mechanism (or combine it with one) to get all the way to 100 percent.

Adhering to Responsible AI and other enterprise standards

The typical challenges with most enterprise products around privacy, security, and general appropriateness are significantly heightened with LLMs. What happens if the LLM starts extracting “layoffs” when people talk about financial distress? Not great!

At Microsoft we take our commitment to Responsible AI very seriously. It is one of our most significant commitments to our customers and our features go through rigorous checks to meet our standards for Responsible AI.

To meet this bar, we implement significant infrastructure on top of OpenAI’s already powerful features for Responsible AI, such as a service that filters out specific words, among many others.

Conclusion

Although none of these issues are too complicated individually, taken together they can easily double your initial estimates — and that’s just for the most basic LLM feature! Now imagine building a sophisticated experience that automatically creates full Excel worksheets, PowerPoint presentations, and more.

Chatbots and Copilot experiences are expected to handle a wide range of user requests beyond mere data retrieval or simple tasks. As such, they must be capable of understanding complex commands such as “schedule a new meeting for me.” But that’s for another article.

On a closing note, although broad use of AI is just in its infancy, what we’ve learned already is that putting AI into production has its own challenges. We hope this article has provided some ideas about overcoming some of the ones that we’ve encountered.

Aditya Challapally is on LinkedIn.

Check out the additional articles in this series:
