Challenges of building LLM apps, Part 2: Building Copilots

Aditya Challapally
Data Science at Microsoft
5 min read · Oct 31, 2023

In our previous article on the challenges involved in shipping even a simple LLM feature, we explored the formidable task of crafting LLM (Large Language Model) features in enterprise settings. In this article, we dive even deeper into the intricacies of creating an entirely new LLM product, such as our Microsoft Copilots.

Microsoft features a rich lineup of Copilots, including GitHub Copilot, Dynamics Copilot, PowerApps Copilot, Microsoft Office Copilot, and more. In this discussion, we continue drawing insights from various Copilots, but our primary focus is on one remarkable companion — the Viva Engage Copilot.

We’re still building and iterating on our Copilots with our customers, so if you’re interested, please reach out to your Microsoft representative! We’re happy with the current experience, and many of our users tell us they love it too.

A new journey

It was a long journey to get here. With tools still under development and the UX (User Experience) patterns not yet crystallized, navigating the uncharted waters of LLM feature development has been a considerable challenge. In fact, based on our experience, as a rule of thumb we believe you should anticipate doubling your estimated development time for an LLM feature compared to a non–LLM-based counterpart.

Now, let’s explore a few specific hurdles we’ve encountered on this journey, along with the invaluable lessons we’ve gleaned.

Picking the right UX

The most challenging part of launching the LLM product was constructing the UX for how the user should interact with it. We grappled with a fundamental choice: Should we adhere to familiar patterns like chat interfaces, or should we seamlessly embed the experience within our product?

Lesson: The amount of iteration needed to get this right is considerable. We redid our entire design a few times to make sure we were wowing users. If you’re launching a net-new LLM experience in your product, budget for at least two to three UX iterations.

Key questions loomed over us: How should we position it? How intrusive should it be? Which skills should we allow? The most important question of all: How do we add value on top of ChatGPT? That question is big enough that we’ll come back to it later.

Typically, we would answer these questions with user research, or by shipping a first version and quickly gathering feedback. The challenge is that an LLM product is hard to test with users: until it’s great, users won’t use it, because they’ll simply use ChatGPT instead. So you can’t just release an MVP, because you won’t get meaningful usage or test results. UX Research (UXR) helps but doesn’t answer the full question, because people are generally reacting to design mocks or prototypes, while building out a V1 that adds a lot on top of ChatGPT (and isn’t just a thin layer) takes a lot of time. The high bar set by users and the enterprise environment makes it challenging to quickly release a V1 and then iterate.

Lesson: To navigate this challenge, we conducted internal testing and made informed decisions based on established design patterns. Don’t rely solely on user research feedback, as most users don’t yet know what the possibilities are. Instead, test with real experiences, which can take some time.

For Viva Engage Copilot, we iterated across multiple UX versions, from a simple wizard to a full chatbot. We found that people generally like chat, so we kept chat. However, we constrained it to keep the experience super simple for users.

Lesson: Users tend to prefer one distinct function per interface. Avoid overwhelming users by trying to encompass multiple functions within a single chat. For scenarios that necessitate multiple functions, prioritize a hero scenario.

Adding value on top of ChatGPT

The challenge loomed large: How could we augment the already potent capabilities of ChatGPT? We really had to ask ourselves: How do we add value in a way that’s not simply the convenience of having ChatGPT inside an app?

The solution lies in incorporating app-specific context into the experience. However, merely feeding app context to the LLM doesn’t significantly enhance its value. The real breakthrough comes from combining the LLM with the app’s context and integrating it with another Machine Learning (ML) model.

Lesson: The combination of LLM, app context, and additional ML models is a game-changer. The magic unfolds when you seamlessly integrate LLM with other ML capabilities.

Consider our approach in PowerPoint, where a custom ML model crafts complete presentations, images included, augmenting the text generated by ChatGPT. Similarly, in Excel, our proprietary model combines data comprehension with LLM functionality. For Viva Engage, we surface trending topics within a tenant, offering users valuable insights, including features like conversation starters built on a blend of embeddings and a separate LLM. In the Copilot experience, users can inject these threads into the conversation to provide immediate context.
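To make the pattern concrete, here is a minimal sketch, not our actual Copilot implementation: a hypothetical auxiliary ranker (rank_relevant_threads) selects the most relevant app context before the prompt ever reaches the LLM. All names here, including the call_llm stub, are illustrative assumptions.

```python
# Minimal sketch of LLM + app context + auxiliary ML model, not the actual
# Copilot implementation. All names here are illustrative.

from dataclasses import dataclass

@dataclass
class Thread:
    title: str
    body: str

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an Azure OpenAI chat request)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

def rank_relevant_threads(query: str, threads: list[Thread], top_k: int = 3) -> list[Thread]:
    """Hypothetical auxiliary model: score each thread against the query.

    A real system might use embeddings or a custom ranker; naive keyword
    overlap is used here purely to show the control flow.
    """
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set((t.title + " " + t.body).lower().split())), t)
        for t in threads
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:top_k]]

def answer(query: str, threads: list[Thread]) -> str:
    # Inject only the most relevant app context, not everything the app has.
    relevant = rank_relevant_threads(query, threads)
    context = "\n\n".join(f"## {t.title}\n{t.body}" for t in relevant)
    prompt = (
        "You are a copilot inside a community app. Use the context below.\n\n"
        f"{context}\n\nUser request: {query}"
    )
    return call_llm(prompt)
```

The design point is that the auxiliary model, not the LLM, decides which context earns a place in the prompt.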

Parsing user queries

When incorporating multiple distinct ML models into a Copilot’s capabilities, careful consideration must be given to how they are activated. For instance, within PowerPoint Copilot, distinguishing between user requests such as creating a presentation or adding text to a slide is crucial. To address this complexity, especially when dealing with multiple skills, we implement a pre-processing step to determine the appropriate action before involving the LLM. We have developed an innovative framework called Office Domain Specific Language, which is specifically designed to map user queries to the correct actions. Additionally, we maintain a centralized library of skills that this engine can access. This intricate framework, which we can delve into more extensively in a later article, is our secret ingredient for ensuring precise skill activation.
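We can’t share that framework here, but the general shape of such a pre-processing step looks roughly like the sketch below. The skill names and keyword-overlap routing are illustrative assumptions, not Office Domain Specific Language itself; a production router would use a trained classifier or an LLM-based planner.

```python
# Illustrative skill router, not Office Domain Specific Language: map a user
# query to one skill from a central library before any LLM call happens.

from typing import Callable

# Central library of skills: name -> (trigger terms, handler).
SKILLS: dict[str, tuple[set[str], Callable[[str], str]]] = {
    "create_presentation": ({"presentation", "deck", "slides"},
                            lambda q: f"[create a new deck for: {q}]"),
    "add_text_to_slide": ({"add", "insert", "text"},
                          lambda q: f"[edit the current slide: {q}]"),
}

def route(query: str) -> str:
    """Pick the skill whose trigger terms best match the query."""
    terms = set(query.lower().split())
    best_skill, best_score = "fallback_chat", 0
    for name, (triggers, _handler) in SKILLS.items():
        score = len(terms & triggers)
        if score > best_score:
            best_skill, best_score = name, score
    return best_skill

def handle(query: str) -> str:
    skill = route(query)
    if skill in SKILLS:
        return SKILLS[skill][1](query)
    return f"[fall back to open-ended chat for: {query}]"

print(handle("create a presentation about Q3 results"))  # -> create_presentation
```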

Parsing GPT responses

When it comes to dealing with GPT responses, we’ve read numerous articles discussing the challenges of parsing them accurately. Function calling does lend a hand in this process, but even it isn’t flawless. It’s a journey that takes weeks, if not months, to truly nail down.
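To give a flavor of what “not flawless” means in practice, here is a hedged sketch of the defensive parsing this implies: extract the JSON the model was supposed to return, validate it against the fields you expect, and treat anything malformed as a retryable failure. The schema and helper names are ours for illustration.

```python
# Defensive parsing of an LLM response that is supposed to be JSON.
# The expected schema ({"action": ..., "arguments": ...}) is illustrative.

import json

REQUIRED_FIELDS = {"action", "arguments"}

def parse_llm_response(raw: str) -> dict | None:
    """Return a validated dict, or None so the caller can retry or fall back."""
    # Models often wrap JSON in prose or markdown fences, so extract the
    # outermost braces before parsing.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_FIELDS <= parsed.keys():
        return None
    return parsed

# Callers treat None as a retryable failure: re-prompt the model, or fall
# back to a safe default instead of showing the user a broken result.
print(parse_llm_response('Sure! {"action": "summarize", "arguments": {"id": 7}}'))
```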

Applying LLM to consumption

Now, when it comes to applying LLMs to consumption scenarios, we know users place immense value on this aspect, whether it’s summarizing trending topics in a Viva Engage tenant or condensing lengthy Word documents. Users love to absorb information and then take action, but summarizing insights without losing crucial nuance? That’s the real challenge.

Lesson: We’ve found it incredibly helpful to lean on other models, like in Viva Engage, where we use embeddings to group similar threads together and extract a theme from each group.
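A rough sketch of that grouping step, assuming you already have an embedding model (the embed stub below is a stand-in, as is call_llm): cluster the thread embeddings, then ask the LLM for a theme per cluster.

```python
# Sketch: group threads by embedding similarity, then ask an LLM for a theme
# per group. Both embed() and call_llm() are hypothetical stand-ins.

import numpy as np
from sklearn.cluster import KMeans

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; deterministic fake vectors."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return "[theme summary]"

def extract_themes(threads: list[str], n_themes: int = 3) -> list[str]:
    # Assumes len(threads) >= n_themes.
    vectors = np.stack([embed(t) for t in threads])
    labels = KMeans(n_clusters=n_themes, n_init=10).fit_predict(vectors)
    themes = []
    for cluster in range(n_themes):
        members = [t for t, lbl in zip(threads, labels) if lbl == cluster]
        prompt = "Summarize the shared theme of these threads:\n" + "\n".join(members)
        themes.append(call_llm(prompt))
    return themes
```

Summarizing per cluster, rather than over the whole tenant at once, is what preserves the nuance mentioned above: each LLM call only has to compress one coherent theme.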

While there are still plenty of hurdles ahead in our quest to deliver the best possible user experience, we’re absolutely thrilled with the progress we’ve made so far. If you’re curious and eager to give any of the Copilots a try, don’t hesitate to reach out to your Microsoft representative! We genuinely hope you’ll enjoy the experience too!

Aditya Challapally is on LinkedIn.
