Leveraging LLMs at scale for real-time business critical decisioning

By — Saumil Shah (Engineer, Marketplace)

UC Blogger · Urban Company – Engineering
7 min read · Jun 3, 2024


Urban Company (UC) professionals deliver services across a wide spectrum of categories. Over the course of a request journey, right from a booking being made to a service starting, the request may get cancelled under several circumstances —

  1. Professional is running late from a previous service that got extended
  2. Professional or customer has a medical emergency
  3. Customer has mistakenly booked service under an incorrect address
  4. Professional has exhausted required product / kit inventory
  5. Service requirements are out of scope upon further inspection on site

In this situation, either the professional or the customer can initiate a request cancellation citing an appropriate reason. It is, however, challenging to validate this cancellation context and accurately identify who is responsible. Hence, it is crucial for UC to establish a cancellation attribution system that prevents unjust cancellations and reduces discomfort for both parties involved.

Fundamental Challenge

So, why is it challenging to attribute cancellations? Because nearly all the context pertaining to a cancelled request lives in the call recordings and chat history between the professional and the customer. Traditionally, agents would manually review these unstructured data signals to identify the cause of cancellation: the customer, the professional, or the system.

Of course, the current AI boom and the advent of powerful, accessible Large Language Models (LLMs) have changed the status quo, enabling us to build our first LLM-powered decisioning system for cancellation attribution.

Cancellation Attribution: Then vs Now

Design Principles

We pre-determined a set of key design goals for our new infrastructure to satisfy —

  • Should be general purpose: A central decisioning system that consumes both unstructured and structured data signals to holistically serve any inference use case within UC
  • Should scale well: Meets SLA requirements at scale
  • Is accurate: Exhibits high precision and recall for all auto-inferred decisions
  • Is transparent: Leads with a trust-based approach where a reason is tagged against every decision and can be challenged by the recipient (if need be)

LLM-powered Decisioning System

To fulfil the above requirements, we built a platformised decisioning system with three major components: the Audio Translation Engine, the LLM Inference Layer and the Decision Engine.

System Overview

1. Audio Translation Engine

We needed an Automatic Speech Recognition model that could accurately translate audio in Indian languages. We compared Speech-to-Text models across various vendors, with OpenAI emerging as the clear choice for us —

  • Lowest word-error-rate (WER) on Indian languages
  • Low to almost zero hallucinations depending on audio language; resistant to noise in audio

Speech-to-Text Vendor Comparison

OpenAI’s translation API leverages its open-source large-v2 Whisper model, so we were faced with two choices: use the API or self-host the model. Benchmarking both revealed the open-source model to be superior in both translation accuracy and timestamp estimation for each dialogue segment.

With the Whisper model, we use a two-phase approach: the model first identifies the spoken language in the given audio, which is then fed as an input to its translation phase. A closer inspection of Whisper’s source code also revealed potential enhancements we could implement to further boost translation accuracy.

In-house Whisper Translation (w/ enhancements highlighted in yellow)

Enhancement 1: We tweaked Whisper to use the 10-to-40-second audio segment, instead of the first 30 seconds, to determine the spoken language. This was based on the observation that the first 10 seconds of an audio often contain noise, short greetings followed by pauses, or attempts to establish a common language for communication — all of which can lead to incorrect language detection.

Enhancement 2: Whisper generates a probability distribution over all languages it supports and identifies the most probable one as the spoken language. However, UC operates only in specific regions, so ideally only a subset of those languages should be detectable, or a correction logic should be in place. For instance, Whisper might mistake Hindi audio for Urdu, as the two languages are similar; it is therefore logical to check whether the next most probable language for the given audio is Hindi and rectify accordingly.

Instance of noisy audio: Whisper API mistakenly detects audio language as Urdu instead of Hindi and suffers from hallucination. In contrast, in-house Whisper transcript is closer to actual audio content
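To make the two enhancements concrete, here is a minimal sketch using the open-source whisper package. The allowed-language subset and the Urdu-to-Hindi correction heuristic are illustrative assumptions, not UC’s production logic:

```python
import whisper

SAMPLE_RATE = whisper.audio.SAMPLE_RATE  # Whisper operates on 16 kHz audio

# Assumption: an illustrative subset of languages spoken in UC's regions
ALLOWED_LANGUAGES = {"hi", "en", "bn", "ta", "te", "kn", "mr"}

model = whisper.load_model("large-v2")

def detect_language(audio):
    # Enhancement 1: detect language on the 10-40 s window rather than the
    # first 30 s, skipping the noise and greetings at the start of a call
    window = whisper.pad_or_trim(audio[10 * SAMPLE_RATE : 40 * SAMPLE_RATE])
    mel = whisper.log_mel_spectrogram(window).to(model.device)
    _, probs = model.detect_language(mel)

    # Enhancement 2: fold the Urdu probability into Hindi, since Whisper
    # often mistakes Hindi call audio for Urdu
    if "ur" in probs:
        probs["hi"] = probs.get("hi", 0.0) + probs.pop("ur")

    # Only consider languages UC actually operates in
    return max(ALLOWED_LANGUAGES, key=lambda lang: probs.get(lang, 0.0))

audio = whisper.load_audio("call_recording.wav")
language = detect_language(audio)
result = model.transcribe(audio, task="translate", language=language)
```

The detected language is passed explicitly to the translation phase, so a single noisy opening segment can no longer derail the entire transcript.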

Hence, we decided to self-host Whisper. We chose g4dn.xlarge instances to serve the model, with a focus on overall cost-to-serve and performance per dollar, and added monitoring over GPU utilisation and consumer lag (the pipeline is async) to fine-tune serving performance within SLA requirements.

Whisper: Self-host vs API
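The post does not name the monitoring stack, so purely as a sketch of the idea: GPU utilisation on such an instance can be sampled with NVIDIA’s management library and exported to whatever metrics store is in use —

```python
import pynvml  # bindings from the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single T4 GPU on a g4dn.xlarge

def gpu_utilisation_percent() -> int:
    # Sampled periodically to track serving headroom against SLA targets
    return pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
```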

This solution only addresses half of the translation challenge. A correct translation would answer the question: who said what? So far, we’ve only concentrated on the ‘what’ aspect. The ‘who’ component is referred to as speaker diarisation.

Speaker diarisation involves assigning each piece of dialogue to a specific speaker. This is a complex issue to tackle when dealing with mono channel audio. However, there are a few methods we can attempt when working with stereo audio —

  • Split audio by left & right channels and run translation on each. The issue with this approach is that Whisper loses out on half the conversational context leading to low translation accuracy
  • Translate the entire stereo audio while simultaneously processing each audio channel to determine when the left and right channels were speaking. The difficulty arises in aligning the approximate timestamps from the Whisper transcript with the timestamps from the speaker data

So close yet so far! However, if we assume the approximate Whisper transcript timestamps to be correct (which is usually the case), the issue becomes much simpler. Once we receive the Whisper transcript, for each audio segment we just need to determine whether the left or the right channel was active, and break the tie in case of any overlap. This can be achieved with a finely tuned silence-detection algorithm.

Speaker Diarisation in conjunction with Whisper Translation
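A minimal sketch of this channel-attribution step using pydub’s silence detection (the thresholds and helper are illustrative, not UC’s tuned values):

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def active_ms(intervals, start, end):
    # Total milliseconds within [start, end] where the channel was non-silent
    return sum(max(0, min(e, end) - max(s, start)) for s, e in intervals)

def diarise(stereo_path, segments):
    """Tag each Whisper segment (with 'start'/'end' in seconds) with a speaker."""
    left, right = AudioSegment.from_file(stereo_path).split_to_mono()
    left_talk = detect_nonsilent(left, min_silence_len=300,
                                 silence_thresh=left.dBFS - 16)
    right_talk = detect_nonsilent(right, min_silence_len=300,
                                  silence_thresh=right.dBFS - 16)
    for seg in segments:
        s, e = int(seg["start"] * 1000), int(seg["end"] * 1000)
        # Attribute the segment to whichever channel spoke longer within it;
        # ties (overlapping speech) break towards the left channel
        left_t, right_t = active_ms(left_talk, s, e), active_ms(right_talk, s, e)
        yield {**seg, "speaker": "left" if left_t >= right_t else "right"}
```

Here, segments would be the segments list returned by Whisper’s transcribe call, so the full stereo conversation is translated with complete context while each line still receives a speaker.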

2. LLM Inference Layer

Responsible for analysing unstructured text data, this layer uses LLMs for classification and text-completion tasks via a prompt framework. Key elements of a prompt include —

  • Persona: Describes who the prompt represents and who it interacts with
  • Task: Outlines the task at hand with step by step instructions on how to approach solving it (chain-of-thought reasoning)
  • Examples: Mutually exclusive & exhaustive case examples are included in the prompt. This aids the LLM in understanding and generalising the problem well, such that accuracy metrics measured over a limited sample size hold in production as well
  • Input data & Output schema: Input data can be any form of unstructured text data, such as audio transcripts or chat data. Output schema defines inference attributes: classification case, reason for classification, etc.

In reference to request cancellations, we consulted various SMEs inside UC to understand their SOP for attributing cancellations based solely on audio & chat data. The key objective was to discern patterns. Typically, SMEs would categorise each request cancellation into one of many pre-set scenarios, such as a professional’s vehicle breaking down, a professional’s unwillingness to travel long distances for a job, a customer not being at home, or a customer requesting the professional to arrive outside the booking time slot.

We decided to adopt their method of reasoning, integrated over 25 scenarios into the prompt, and tested iteratively over sample audit data comprising ~1k cancellations. The key focus was on understanding the LLM’s reasoning behind each classification, which helped us identify cases that were ambiguous to the LLM and required additional context, as well as instances where the LLM had difficulty distinguishing between similar cases. This strategy helped us build a comprehensive yet concise prompt, ensuring high precision & recall rates.

Prompt Anatomy for Cancellation Attribution
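As a minimal sketch, a prompt for this use case might be wired up as follows; the scenario names, examples and model choice are illustrative assumptions, since the post does not publish UC’s actual prompt:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical prompt skeleton showing the four elements described above
SYSTEM_PROMPT = """\
Persona: You are an auditor at a home-services marketplace, reviewing the
conversation between a service professional and a customer.

Task: Classify why this request was cancelled. Reason step by step:
1. Note who raised the cancellation and the reason each party stated.
2. Match the conversation against exactly one scenario below.
3. Return the classification with a one-line justification.

Scenarios (mutually exclusive & exhaustive), with examples:
- PROFESSIONAL_VEHICLE_BREAKDOWN: "My scooter broke down on the way."
- CUSTOMER_NOT_AT_HOME: "I rang the bell twice, nobody answered."

Output JSON: {"case": "<scenario>", "reason": "<justification>"}
"""

def classify_cancellation(transcript_and_chat: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: the post does not name the model used
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript_and_chat},
        ],
    )
    return response.choices[0].message.content
```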

3. Decision Engine

This is the central layer that orchestrates the analysis and collation of system signals to arrive at a final, accurate decision. Rule tables are used here, wherein each entry represents a unique combination of system-variable values with a decision mapped against it. These tables are easily extensible to any use case or any number of system variables, and are an asset co-owned by the business, product and engineering teams.

Let us understand this better in the context of attributing cancellations —

Rule Table For Cancellation Attribution

Suppose, for a given request cancellation, the LLM Inference Layer extracts intent from the audio transcripts and chat data and determines the cancellation case as: “professional denied entry into customer’s society”. This intent is then validated against the professional’s location history to confirm whether they indeed reached the customer’s location. If location data is not available, who cancelled the request and at what time can provide further insight. Thus, system signals supplement LLM inference to ensure the final verdict is accurate and trustworthy.
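A rule table of this kind reduces to a simple lookup. In the sketch below, None acts as a wildcard, and the rows are illustrative rather than UC’s actual rules:

```python
# Hypothetical rule table: each row maps a combination of system-variable
# values to a decision; None matches any value of that variable
RULES = [
    # (llm_case, professional_reached_location, cancelled_by) -> decision
    (("professional_denied_entry", True,  None),       "attribute_to_customer"),
    (("professional_denied_entry", False, None),       "attribute_to_professional"),
    (("professional_denied_entry", None,  "customer"), "manual_review"),
]

def decide(llm_case, reached_location, cancelled_by):
    for (case, reached, by), decision in RULES:
        if (case == llm_case
                and reached in (None, reached_location)
                and by in (None, cancelled_by)):
            return decision
    return "manual_review"  # no rule matched; fall back to a human agent
```

Because each row is just data, business and product teams can extend the table without touching engine code.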

Request Cancellation Journey

Impact & Scale

Presently, LLMs govern roughly 2500 cancellation attributions daily within UC, achieving precision and recall rates of over 85% for cancellations caused by professionals.

Sounds like fun?
If you enjoyed this blog post, please clap 👏(as many times as you like) and follow us (@UC Blogger). Help us build a community by sharing on your favourite social networks (Twitter, LinkedIn, Facebook, etc).

If you are interested in finding out about opportunities, visit us at http://careers.urbancompany.com
