Challenges of building LLM apps, Part 3: Building platforms

A journey to create a platform for generating summaries and themes from lists of threads using Large Language Models

Aditya Challapally
Data Science at Microsoft
9 min read · Jan 2, 2024

In the first article in this series, we delved into the formidable task of crafting LLM (Large Language Model) features within enterprise settings and the challenges involved in shipping even a simple LLM feature. In the second article, we went deeper into the intricacies of creating an entirely new LLM product, such as our Microsoft Copilots.

In this article, I share the story behind Summarization as a Service, a platform built by the team I’m on for Microsoft 365 and Viva Copilot. I explain why we needed this platform, how it works, and the main challenges we faced along the way.

Photo by nadi borodina on Unsplash.

Why Summarization as a Service?

Viva Engage is a product that aims to enhance the employee experience by providing personalized and relevant content, insights, and recommendations. One of the key features of Viva Engage is the Copilot, a conversational assistant that helps employees discover and interact with the most important topics and people in their network. The Copilot uses natural language processing and Artificial Intelligence (AI) to understand the user’s intent and context in order to provide the best possible responses and suggestions.

One of the challenges we faced while developing the Copilot was how to provide concise and meaningful summaries and themes for the top threads in the user’s network. Threads are conversations that happen in Yammer, a social networking service that is part of Viva Engage. Threads can contain multiple messages, attachments, reactions, and comments, and can span different groups and topics. Threads are a rich source of information and knowledge, but they can also be overwhelming and time-consuming to read and digest. We wanted to help users quickly grasp the main idea and sentiment of each thread, as well as identify the most relevant and interesting themes across their network.

To achieve this goal, we decided to leverage the power of Large Language Models (LLMs), such as GPT-3: neural networks that can generate natural language text based on a given input or prompt. LLMs have shown impressive results in various natural language tasks, such as summarization, text generation, question answering, and sentiment analysis. We thought that LLMs could help us create high-quality summaries and themes for threads in a scalable and efficient way.

However, we also realized that using LLMs for this purpose was not a trivial task. We had to deal with several challenges, such as:

  • How to design effective and consistent prompts for different types of threads and scenarios?
  • How to ensure that the summaries and themes are accurate, relevant, and grounded in the original thread content?
  • How to handle privacy, security, and compliance issues related to the thread data and the LLM outputs?
  • How to manage the performance, cost, and availability of the LLM service?
  • How to integrate the LLM outputs with the Copilot and other Viva Engage features?

To address these challenges, we decided to build Summarization as a Service, a platform that provides a general purpose solution for processing and returning summaries and themes from lists of threads using LLMs. Summarization as a Service is meant to be a reusable and extensible platform that can support different use cases and scenarios across Viva Engage and other products that require summarization of threads. The next section describes how Summarization as a Service works and its main components.

How Summarization as a Service works

Summarization as a Service is a platform that exposes a set of APIs and services for requesting and retrieving summaries and themes for lists of threads. The platform consists of three main components: a job management system, a single-thread summary cache system, and a multi-thread summary system.

The job management system is responsible for managing the job requests that are sent by the service customers, such as the Copilot or the network analytics feature. A job request contains an arbitrary list of thread IDs that need to be summarized and themed, as well as some parameters that specify the type and format of the output. The job management system stores the job requests in a persistent queue and ensures that they are processed in a FIFO (first in, first out) order. The job management system also handles retries, failures, and timeouts of the job requests, and ensures that no job is lost or duplicated.
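To make this concrete, here is a minimal sketch of what a job request and a FIFO queue with deduplication and bounded retries could look like. The field names, the SummaryJob and JobQueue classes, and the in-memory storage are illustrative assumptions only; the actual service persists its queue durably and handles timeouts and notifications that this sketch omits.

```python
from dataclasses import dataclass, field
from collections import deque
from typing import Optional
import uuid

@dataclass
class SummaryJob:
    """Hypothetical job request: an arbitrary list of thread IDs plus output parameters."""
    thread_ids: list[str]
    output_format: str = "summary_and_themes"  # assumed parameter name
    model_version: str = "latest"              # assumed parameter name
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    attempts: int = 0

class JobQueue:
    """Illustrative FIFO queue with deduplication and bounded retries.
    The real platform uses durable storage; this sketch keeps everything in memory."""
    MAX_ATTEMPTS = 3

    def __init__(self) -> None:
        self._queue: deque[SummaryJob] = deque()
        self._seen: set[str] = set()

    def enqueue(self, job: SummaryJob) -> None:
        if job.job_id in self._seen:     # ensure no job is duplicated
            return
        self._seen.add(job.job_id)
        self._queue.append(job)          # FIFO: new jobs go to the tail

    def dequeue(self) -> Optional[SummaryJob]:
        return self._queue.popleft() if self._queue else None  # FIFO: oldest job first

    def retry(self, job: SummaryJob) -> bool:
        """Re-enqueue a failed or timed-out job until the attempt budget is exhausted."""
        job.attempts += 1
        if job.attempts >= self.MAX_ATTEMPTS:
            return False                 # give up and surface the failure to the caller
        self._queue.append(job)
        return True
```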

The single-thread summary cache system is responsible for generating and storing summaries for individual threads. A summary is a short text that captures the main idea and sentiment of the thread starter message, which is the first message in a thread. The single-thread summary cache system acts as a cache, meaning that it checks whether a summary already exists for a given thread ID, network ID, model type, and model version. If it does, it returns the summary immediately, without making any downstream calls. If it does not, it generates a new summary by calling the LLM service with a suitable prompt, and stores the summary in CosmosDB, a database service that is used by the platform. The single-thread summary cache system ensures that the summaries are up to date by subscribing to message update and delete events and by setting a time-to-live (TTL) for the summaries in the database.
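The cache lookup can be pictured roughly as in the sketch below, which keys entries on (thread ID, network ID, model type, model version) and uses an in-memory dictionary with a timestamp-based TTL in place of CosmosDB. The class, method, and field names are assumptions for illustration, as is the TTL value.

```python
import time
from typing import Callable

SUMMARY_TTL_SECONDS = 7 * 24 * 3600  # assumed TTL; the real value is a platform setting

class SingleThreadSummaryCache:
    """Illustrative cache keyed by (thread_id, network_id, model_type, model_version).
    A plain dict stands in for CosmosDB; expiry and invalidation are simplified."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._store: dict[tuple, tuple[str, float]] = {}
        self._summarize = summarize  # callable that prompts the LLM service

    def get_summary(self, thread_id: str, network_id: str,
                    model_type: str, model_version: str,
                    thread_starter_text: str) -> str:
        key = (thread_id, network_id, model_type, model_version)
        cached = self._store.get(key)
        if cached is not None:
            summary, stored_at = cached
            if time.time() - stored_at < SUMMARY_TTL_SECONDS:
                return summary  # cache hit: no downstream LLM call
        summary = self._summarize(thread_starter_text)  # cache miss: call the LLM
        self._store[key] = (summary, time.time())
        return summary

    def invalidate(self, thread_id: str) -> None:
        """Drop cached entries when a message update or delete event arrives."""
        self._store = {k: v for k, v in self._store.items() if k[0] != thread_id}
```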

The multi-thread summary system is responsible for generating and storing summaries and themes for groups of threads. A summary is a short text that captures the main idea and sentiment of the group of threads, and a theme is a keyword or phrase that represents a common topic or aspect of the group of threads. The multi-thread summary system can support three different scenarios: returning a summary and themes for a single thread, returning a summary and themes for a group of threads, and returning multiple summaries and themes for a group of threads using semantic clustering to find natural subgroups. The multi-thread summary system uses the single-thread summary cache system to obtain the summaries for each thread in the group, and then concatenates them into a single text. It then calls the LLM service with a suitable prompt to obtain a summary of summaries and a list of themes for the group of threads. It also performs some optional post-processing operations, such as filtering out negative content or customizing the output based on the use case. It then stores the summary and themes in CosmosDB and publishes an event to notify the service customers that the output is ready.
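A stripped-down version of the "summary of summaries" step might look like the sketch below. The prompt wording, the output parsing, and the function name are assumptions for illustration; the production prompts, clustering, and post-processing are more involved than this.

```python
from typing import Callable

def summarize_group(thread_summaries: list[str],
                    call_llm: Callable[[str], str]) -> dict:
    """Concatenate per-thread summaries and ask the LLM for one group summary
    plus a list of themes. Illustrative only; not the production prompt."""
    joined = "\n".join(f"- {s}" for s in thread_summaries)
    prompt = (
        "Below are one-sentence summaries of several discussion threads.\n"
        f"{joined}\n\n"
        "Write a short summary of the group as a whole, then list up to five themes, "
        "one per line, each prefixed with 'Theme:'."
    )
    raw = call_llm(prompt)
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    themes = [l.removeprefix("Theme:").strip() for l in lines if l.startswith("Theme:")]
    summary = " ".join(l for l in lines if not l.startswith("Theme:"))
    return {"summary": summary, "themes": themes}
```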

The service customers use the APIs and services provided by the platform to request and retrieve summaries and themes for lists of threads. They can request a single-thread summary, which is returned immediately if it is already cached or calculated in real time if it is not, or a multi-thread summary, which is returned asynchronously after a batch process completes. They can also request a multi-thread summary with semantic clustering, which returns a set of IDs for multiple subgroups of threads, each with its own summary and themes. Customers can specify parameters that affect the output, such as the model type and version, the output format, and the negative content filter, and they can monitor the status and progress of their job requests and receive notifications when the output is ready.
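As a rough picture of the calling pattern, here is a hypothetical request-and-poll flow for a multi-thread job. The base URL, endpoint paths, field names, and response shape are all invented for illustration and are not the platform's actual internal API.

```python
import time
import requests

BASE_URL = "https://summarization.example.internal"  # placeholder, not a real endpoint

# Hypothetical multi-thread job request; parameter names mirror those described above.
job_request = {
    "thread_ids": ["thread-101", "thread-102", "thread-103"],
    "model_type": "gpt",                      # assumed values
    "model_version": "latest",
    "output_format": "summary_and_themes",
    "filter_negative_content": True,
    "semantic_clustering": False,
}

resp = requests.post(f"{BASE_URL}/jobs", json=job_request, timeout=30)
job_id = resp.json()["job_id"]

# Poll for completion; in practice a customer could instead subscribe to the
# "output ready" event that the platform publishes.
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job_id}", timeout=30).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(5)

if status["state"] == "completed":
    print(status["summary"], status["themes"])
```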

What were the main challenges we faced?

Building Summarization as a Service was an exciting and rewarding journey, but it also came with some challenges and trade-offs that we had to overcome. Here are some of the main ones:

  • Prompt design: One of the most important and difficult aspects of using LLMs for summarization and theme extraction was designing effective and consistent prompts for different types of threads and scenarios. A prompt is the text used as input to the LLM service; it guides the model toward the desired output. A good prompt should be clear, concise, and informative, and should elicit a high-quality, relevant output from the LLM. However, finding the optimal prompt for a given task and domain is not easy, and often requires a lot of trial and error, experimentation, and evaluation. We had to iterate on several versions of prompts and test them on different samples of threads to find the ones that worked best for our use cases. We also had to weigh the trade-off between generic prompts that work for any thread or scenario and specific prompts that tailor the output to a particular thread or scenario (a minimal illustration of this trade-off appears in the first sketch after this list).
  • Grounding and accuracy: Another challenge was ensuring that the output was grounded in the original thread content and did not contain errors, inaccuracies, or inconsistencies. LLMs are very powerful and flexible, but they are also prone to generating text that is not based on the input, or that contradicts or misrepresents it. For example, an LLM might generate a summary that includes a fact or detail that is not present in the thread, or that omits or changes a key piece of information. This can lead to confusion, misunderstanding, or misinformation for the users who consume the summaries and themes. To mitigate this issue, we had to carefully design the prompts to include grounding signals, such as references to the thread content, and to evaluate the output using various metrics and methods, such as ROUGE scores, human ratings, and error analysis (a simple grounding check along these lines is sketched after this list).
  • Privacy, security, and compliance: A third challenge was handling the privacy, security, and compliance issues related to the thread data and the LLM outputs. Thread data is sensitive and confidential, and it belongs to the customers who use Viva Engage. We had to respect customers’ privacy and data ownership and comply with relevant laws and regulations, such as GDPR and CCPA. We also had to protect both the thread data and the LLM outputs from unauthorized access, modification, or leakage, using secure and encrypted channels and storage for data transmission and persistence. Finally, we had to follow best practices and guidelines for using LLMs, such as applying responsible AI principles, monitoring the LLM outputs for potential harms or biases, and providing transparency and accountability for the users who consume the summaries and themes.
  • Performance, cost, and availability: A fourth challenge involved managing the performance, cost, and availability of the LLM service. LLMs are complex, resource-intensive models that require substantial computing power and time to generate text, so using them for summarization and theme extraction can significantly affect the performance, cost, and availability of both the platform and the product. We had to optimize the platform architecture and data flow so that LLM calls were made efficiently and summaries and themes were generated and delivered in a timely and reliable manner. We also had to balance the trade-offs between the quality and quantity of the LLM outputs, and between their frequency and freshness. Finally, we had to plan for risks and contingencies in the LLM service, such as downtime, failures, or model changes, and handle them gracefully and robustly.
  • Integration and adoption: A fifth challenge involved integrating the LLM outputs with the Copilot and other Viva Engage features, and ensuring the adoption and satisfaction of the users who consume the summaries and themes. LLMs are a new and emerging technology that is not yet widely used or understood by the general public. We had to present and deliver the LLM outputs in a user-friendly and intuitive way that is aligned with users’ expectations and needs. We also had to make the outputs compatible with the existing features and functionality of the Copilot and the rest of Viva Engage, so that they enhance and complement the product’s user experience and value proposition. Finally, we had to collect and analyze user feedback and behavior, and use it to improve and iterate on the LLM outputs and the platform.
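To illustrate the generic-versus-specific prompt trade-off described above, here is a small sketch of two prompt templates and a selector. The templates and scenario names are assumptions for illustration; they are not the production prompts.

```python
# Illustrative prompt templates only; the production prompts are not public.

GENERIC_PROMPT = (
    "Summarize the following workplace discussion thread in one or two sentences. "
    "Use only information that appears in the thread.\n\nThread:\n{thread_text}"
)

ANNOUNCEMENT_PROMPT = (
    "The following thread is a company announcement. Summarize what is being "
    "announced, who it affects, and any dates mentioned, using only information "
    "from the thread.\n\nThread:\n{thread_text}"
)

def build_prompt(thread_text: str, scenario: str = "generic") -> str:
    """Pick a generic or scenario-specific template; a real system would also
    truncate long threads to fit the model's context window."""
    template = ANNOUNCEMENT_PROMPT if scenario == "announcement" else GENERIC_PROMPT
    return template.format(thread_text=thread_text)
```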
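And to illustrate the kind of automated grounding check mentioned above, the sketch below uses ROUGE overlap between a generated summary and its source thread as a rough signal for ungrounded content. It assumes the open-source rouge-score Python package; the 0.8 threshold and the choice of ROUGE-1 precision are illustrative, not the team's actual evaluation pipeline.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def grounding_score(thread_text: str, generated_summary: str) -> float:
    """ROUGE-1 precision of the summary against the source text: roughly, the
    fraction of summary words that also appear in the thread."""
    scores = scorer.score(thread_text, generated_summary)
    return scores["rouge1"].precision

# Example: the clause about a "two-week delay" is not in the source, so the
# score drops below the (illustrative) 0.8 threshold and the summary is flagged.
flagged = grounding_score(
    "We shipped the Q3 report on Friday.",
    "The Q3 report shipped on Friday after a two-week delay.",
) < 0.8
print("needs human review:", flagged)
```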

Conclusion

The team hopes that this platform provides a valuable and useful service for Microsoft 365 Copilot and Viva customers and users, and that it enables them to discover and interact with the most important topics and people in their network. We also hope that it serves as a reusable and extensible solution for other products and scenarios that require summarization of threads. We are excited about and proud of what we have built, and we look forward to hearing your feedback and suggestions.

Aditya Challapally is on LinkedIn.

Earlier articles in this series:
