Lessons Learned from Our First Chatbot

Mengyu Xie
Salesloft Engineering
11 min read · Feb 21, 2024

written by Mengyu Xie and Sneha Subramanian

Last year, the data science team at Salesloft built and released our first chat interface using retrieval augmented generation (RAG) on large language models (LLMs). It was a fun experience — we were solving an incredible problem, the space is new, the tech is cool — and it taught us a lot. We want to share our experience building this and what we learned from it.

The Problem

Melissa Perri, in her book Escaping the Build Trap, describes the craft of product management as:

“the domain of recognizing and investigating known unknowns and of reducing the universe around the unknown unknowns.”

The voice of our customers is not just a crucial part of this investigation; it is also how our product team at Salesloft lives our “Put Customers First” core value every day.

Our product managers, product leaders, and UX researchers spend countless valuable hours speaking to our customers, understanding their processes and pain points, and brainstorming ways to bring value. They also spend countless hours sifting through product feedback data and conversation history to understand trends, topics, recommendations, and other important insights. This work is important yet tedious, and given its manual nature, it’s easy for things to slip through the cracks.

We thought: is there a better way to allow our product team members to sift through product feedback data without it being a huge time crunch?

The Solution

Why Use LLMs To Solve This?

Given all the hype around large language models and generative AI, it’s easy to answer the above question with an emphatic “Yes! LLM!” and solve everything using the shiny new gadget. But below are the reasons why this problem is actually a good use case for an LLM and RAG-based chat interface.

Reason #1: Product practitioners look for a wide variety of signals in the product feedback data. This means that quantifying each of these signals into neat analytics will be a pain, especially since those signals change over time.

For example, Salesloft released Rhythm in the summer of last year. After that release, our product practitioners were naturally sifting through product feedback to understand customers’ reactions to Rhythm and their ideas about prioritization, signals, focus zones, and so on. But those questions were not pertinent before that release.

Reason #2: Reading through tens of thousands of pieces of text to gauge high-level insights is a painstaking process (and can sometimes lead to inaccurate conclusions).

Think about this in terms of Amazon reviews. If your task were to buy the best coffee cup on Amazon, and to make your decision you were given pages and pages of reviews but no review scores to go with them, that would be a difficult and time-consuming task. Even with review scores, we’ve all been in the situation where multiple products have approximately the same score and you’re trying to discern from the reviews which coffee cup gets too hot to touch or cracks too easily. (Which is why Amazon’s AI-generated summary of reviews is such a useful feature!)

Reason #3: Great product practitioners (and Salesloft is full of them!) will want to dig a level deeper into every insight.

Going with the Amazon coffee cup example above, it’s not enough for our product managers and UX researchers to know that a specific coffee cup cracks easily; we’d also want to know when it cracks, what causes it to crack, how often it happens, and what impact it has.

Our product practitioners also need easy access to the original data points that support a certain hypothesis so they can vet the customer’s voice for themselves.

Reason #4: The product review data keeps changing and increasing. New reviews keep coming in, and older ones start mattering less and less.

In other words, we needed a solution that understands language and supports easy, low-cost appending and updating of the data.

And Why RAG?

At this point, you’ve probably realized that a simple static summary of product feedback and/or manually uploading feedback data into GPT models to get answers is not the right solution.

But why is a RAG-based chat interface the right solution in this case?

To answer this question, we’ll go into what RAG is and what use cases it solves. We won’t go into a ton of details here, as you’ll find plenty of very well-written articles out there that go into this in detail.

Retrieval augmented generation (or RAG) is a way to give an existing LLM context from sources that aren’t part of its training data. Generative AI providers such as OpenAI are typically tight-lipped about the training data behind their advanced LLMs, but regardless of what that data contains, models such as GPT-4 cannot answer questions from specific domains that rely on knowledge they aren’t privy to.

As an example, if you give ChatGPT the command: “Describe the Riemann hypothesis to me.” — it’ll do a good job explaining the famous unsolved problem, and that’s because the data it was trained on has plenty of context about the Riemann hypothesis.

Instead, if you give ChatGPT the command: “Tell me about the recent challenges faced by Salesloft’s Forecasting users.” — it’ll either say it doesn’t know or hallucinate. This is because feedback data on Salesloft’s Forecasting product is not a part of its training data nor a part of the data it has access to at query time (at least not by default).

On the flip side, advanced LLMs such as GPT-4 or Claude are really good at deriving contextual information from data they do have access to. At the same time, it’s not a realistic option to build an LLM from scratch that’s trained on product feedback data alone and expect that LLM to have the semantic understanding and language capabilities that our favorite LLMs have. Even if one could pull that off, as new product feedback data comes in, it would be alarmingly expensive to continuously retrain the model.

RAG gives us the best of both worlds. It lets us use the advanced language capabilities of LLMs such as GPT-4 or Claude, while giving them relevant context from data those LLMs do not have access to. If you’re interested in learning more about RAG and how to implement it, this article is a good place to start.

Architecturally, our RAG-based chat interface looks like this:

Product Feedback Bot Architecture

At a high level, the process involved:

  1. Extracting the relevant data. In our case, this came from APIs where we could retrieve product feedback data from different sources.
  2. Transforming it into a format that made sense. This involved cleaning the data and standardizing it (since the raw format differed by source).
  3. Applying an embedding model to get embeddings, which are numerical representations that capture the semantic meaning in the data. We used OpenAI’s text-embedding-ada-002 model for this work.
  4. Storing the documents and embeddings into a vector database. We used ChromaDB, which is an open-source vector database.
  5. Using a semantic similarity method to retrieve relevant documents when a new question is asked through the chat interface.
  6. Leveraging those relevant documents and an LLM to answer that question, and subsequent questions on that topic. We used OpenAI’s GPT-4 for the first question and GPT-3.5 for subsequent questions.
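
To make the flow concrete, here is a minimal sketch of steps 3 through 6, assuming the OpenAI Python client and the chromadb package; the collection name, sample feedback, and question are illustrative, not our production code:

```python
# A rough sketch of steps 3-6: embed feedback, store documents and embeddings in
# ChromaDB, retrieve relevant documents for a question, and answer with GPT-4.
# Assumes OPENAI_API_KEY is set; variable and collection names are illustrative.
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./feedback_db")
collection = chroma_client.get_or_create_collection(name="product_feedback")

def embed(texts):
    """Return text-embedding-ada-002 embeddings for a list of strings."""
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in resp.data]

# Steps 3-4: embed the cleaned feedback texts and store them with their embeddings.
feedback_texts = [
    "We wish we could filter the recordings by account or participant.",
    "Looking for more than just prioritizing tasks ...",
]
collection.add(
    ids=[f"fb-{i}" for i in range(len(feedback_texts))],
    documents=feedback_texts,
    embeddings=embed(feedback_texts),
)

# Steps 5-6: retrieve semantically similar feedback and let GPT-4 answer with it.
question = "What do customers wish to have in Conversations?"
results = collection.query(query_embeddings=embed([question]), n_results=2)
context = "\n".join(results["documents"][0])

answer = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only the provided customer feedback."},
        {"role": "user", "content": f"Feedback:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```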

The Challenges

At first glance, the idea of RAG is appealing and seemingly straightforward: just embed all the unstructured data, toss it into the vector database, and you’re all set! However, reality begs to differ. The devil, as they say, is in the details, and developers will likely need to address several technical challenges. In our experience, every challenge is unique and demands a tailored approach.

Are we loading and tagging the right data into the vector databases?

The success of a chatbot relies heavily on how we handle data ingestion pipelines, how efficiently we categorize data, and how we handle data segmentation (i.e., text chunking). Customer feedback texts are often short enough to let us bypass the chunk-splitting overhead, but they introduce other problems: many snippets are noisy, either too vague (for example, feedback that only says “comfort and convenience”) or riddled with Salesloft-specific jargon. Take a look at the following examples:

Feedback #1 — “Looking for more than just prioritizing tasks like analyzing text from previous emails, and notes from phone conversations to make suggestions on what to say, issues they brought up in the past, etc.”

Feedback #2 — “We wish we could filter the recordings by account or participant.”

When the user asks:

Query — “what do customers wish to have in Rhythm?”

The semantic similarity results (measured by L2 distance) are:

+---------------------+----------+
| Pair                | Distance |
+---------------------+----------+
| Query x Feedback #1 | 0.6      |
| Query x Feedback #2 | 0.8      |
+---------------------+----------+

In reality, #2 is about Conversations, and #1 is in fact valuable feedback for Rhythm that we do not want the user to miss. In other words, we want to give the most relevant information to the LLM. Although the GPT-4 Turbo model we are using in our implementation supports a context window of 128k tokens, reducing the noise from irrelevant context not only improves the results but also reduces cost.
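
For reference, here is a small sketch of how such an L2 distance can be computed by comparing ada-002 embeddings directly (numpy assumed; the strings are copied from the examples above):

```python
# Sketch: how the L2 (Euclidean) distances in the table above can be computed,
# comparing ada-002 embeddings of the query and a piece of feedback directly.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item.embedding) for item in resp.data]

query_vec, feedback_vec = embed([
    "what do customers wish to have in Rhythm?",
    "We wish we could filter the recordings by account or participant.",
])

# Smaller distance = more semantically similar, but not necessarily more relevant.
print(np.linalg.norm(query_vec - feedback_vec))
```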

Our workaround is to use metadata filters. We introduced a metadata field called “product surface” to tag each piece of feedback with one of the six Salesloft product surfaces: Cadence, Rhythm, Deals, Conversations, Coaching, and Forecasting. Unrelated comments like “comfort and convenience” are left untagged. The classifier is powered by a blend of rules and machine learning algorithms and achieves 89% accuracy.

Product Surface Classifier

The end result of this is a more focused retrieval process. For instance, when using the “product surface” filter, unrelated feedback such as a comment pertaining to Conversations gets automatically excluded from retrieval for a query related to Rhythm. This strategy solves a chunk of the puzzle. But then, how does the model discern the context of Feedback #2? We’ll delve into that later in this post.
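
As a rough illustration of the filtered retrieval, assuming the collection and embed helpers from the pipeline sketch above (the metadata key name and values are illustrative):

```python
# Sketch: tag each document with a "product surface" metadata field at ingestion,
# then restrict retrieval to the relevant surface with a `where` filter at query
# time. Assumes the `collection` and `embed` helpers from the pipeline sketch;
# the metadata key and values are illustrative.
feedback_texts = [
    "Looking for more than just prioritizing tasks ...",                      # Feedback #1
    "We wish we could filter the recordings by account or participant.",      # Feedback #2
]
collection.add(
    ids=["fb-rhythm-1", "fb-conversations-1"],
    documents=feedback_texts,
    embeddings=embed(feedback_texts),
    metadatas=[
        {"product_surface": "Rhythm"},         # classifier output for Feedback #1
        {"product_surface": "Conversations"},  # classifier output for Feedback #2
    ],
)

# A Rhythm question now only searches feedback tagged as Rhythm, so the
# Conversations comment is excluded from retrieval entirely.
results = collection.query(
    query_embeddings=embed(["what do customers wish to have in Rhythm?"]),
    n_results=5,
    where={"product_surface": "Rhythm"},
)
```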

We retrieved the top k candidates, but are they relevant?

When we use a vector database, the usual retrieval method is Approximate Nearest Neighbor (ANN) search, which quickly fetches the top k candidates that are semantically similar to the query. OpenAI’s text-embedding-ada-002 model stands out in performance, but it’s not always the magic solution. Especially for a Q&A bot like our Product Feedback Bot, semantic similarity doesn’t necessarily mean relevance to the query.

For instance, a query like “summarize top topics in the feedback” might bring up a response such as, “I have a number of product enhancement requests, feedback on the current product and potential bugs. Is there anyone I can speak to about this?” It’s easy to see why this response was chosen — it’s about giving feedback. However, it’s not the actual product feedback we’re looking for, thus we aim to filter out entries that are similar but not relevant.

Cross-Encoder models, unlike ANN search, use a classification mechanism on data pairs, rating the relevance between two sentences on a scale from 0 to 1. In our strategy, we combine the two methods to benefit from the strengths of both!

First, ANN search retrieves the top k potential candidates. Then, a Cross-Encoder sifts through this list, pinpointing the most relevant results. This approach leverages the quick retrieval capability of ANN search and the precise relevancy evaluation of Cross-Encoders, making it well suited for large-scale datasets. For our project, we chose ms-marco-MiniLM-L-6-v2 from the plethora of pre-trained cross-encoders on Huggingface.
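
Here is a minimal sketch of that two-stage retrieval, assuming the sentence-transformers package and the collection and embed helpers from the pipeline sketch above; the top-k values are illustrative:

```python
# Sketch of the two-stage retrieval: fast ANN search for recall, then a
# cross-encoder reranking pass for precision. Assumes the `collection` and
# `embed` helpers from the pipeline sketch above; the k values are illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query, ann_k=100, final_k=5):
    # Stage 1: approximate nearest neighbor search in the vector database.
    candidates = collection.query(query_embeddings=embed([query]), n_results=ann_k)
    docs = candidates["documents"][0]

    # Stage 2: score (query, document) pairs for relevance and keep the best.
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

top_docs = retrieve("summarize top topics in the feedback")
```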

The addition of the Cross-Encoder reranking process significantly cuts down on irrelevant documents. While this addition does increase retrieval latency, the added time scales with the value you choose for top k (in our experience, about 3s for top 5 and 6s for top 100). It’s not a huge impact, however, as it is less than 20% of GPT-4’s response time (~35 seconds). So, in essence, we’re trading a bit of speed for a lot of accuracy.

VectorDB search system of the chatbot

Is the chatbot well-versed enough in our domain to effectively assist users?

This question brings us back to the initial challenge. Providing the Large Language Model (LLM) with the right context from a vector database is only part of the solution. The real task is teaching the LLM to understand and communicate in the unique vernacular of Salesloft. This is where the art of prompt engineering comes into play.

Incorporating OpenAI’s recommended prompt engineering tactics, we first asked the LLM to adopt a persona, telling our chatbot: “You are Salesloft’s UX and Product assistant; your role involves analyzing feedback provided by Salesloft’s customers.”

We then used bullet points to list out definitions of our six different product surfaces, including the names, functionality, and key features, followed by clear instructions such as “you need to differentiate which surface each feedback is likely to belong to”, “do not fabricate information that is not present in the context”, and “put more focus on the information that has more context support.”

Think of this process as onboarding and mentoring a new hire at Salesloft. This method has significantly uplifted the quality of responses, tailoring them to address the queries that our users truly care about.
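
As a rough sketch of what assembling such a system prompt can look like (the surface descriptions and exact wording below are placeholders, not our production prompt):

```python
# Sketch: assemble a system prompt from a persona, definitions of the six
# product surfaces, and guardrail instructions. Descriptions are placeholders.
SURFACE_DEFINITIONS = {
    "Cadence": "<one-sentence description, functionality, and key features>",
    "Rhythm": "<one-sentence description, functionality, and key features>",
    "Deals": "<one-sentence description, functionality, and key features>",
    "Conversations": "<one-sentence description, functionality, and key features>",
    "Coaching": "<one-sentence description, functionality, and key features>",
    "Forecasting": "<one-sentence description, functionality, and key features>",
}

system_prompt = "\n".join([
    "You are Salesloft's UX and Product assistant. Your role involves analyzing "
    "feedback provided by Salesloft's customers.",
    "Product surfaces:",
    *[f"- {name}: {description}" for name, description in SURFACE_DEFINITIONS.items()],
    "Instructions:",
    "- Differentiate which surface each piece of feedback is likely to belong to.",
    "- Do not fabricate information that is not present in the context.",
    "- Put more focus on the information that has more context support.",
])

# This system prompt is sent ahead of the retrieved feedback in every conversation.
```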

The Feedback

For a chatbot to truly excel, empowering users is just as crucial as the technical implementation. It all begins with how users phrase their queries: even the most sophisticated technical setup can only do so much if the queries are suboptimal. Over the past year, our team partnered closely with our product partners to help them hone their skills in crafting effective prompts and learn the nuances of interacting with LLMs. Those skills now carry over to our product feedback chatbot.

Moreover, we display the retrieved source documents in the UI. This not only aids fact-checking but also helps users understand what works and what doesn’t. In the near future, we hope to refine and scale up the chatbot applications to better serve a broader customer base, so we took a step further and implemented event logging for our chatbot. The logging data will help us study how users interact with it.

The feedback from our product partners has been phenomenal. They’ve found a lot of value in this tool and really appreciate that they’re able to dig into the source documents from the summarized answer.

The Wrap-Up

In summary, our experience has vividly illustrated that despite its challenges, Retrieval Augmented Generation (RAG) as an emerging architecture harbors immense potential in real-world business scenarios. We foresee a significant surge in the adoption of this technology, both in the near future and beyond. In this ever-evolving landscape, the ability to deftly navigate and leverage proprietary data will set apart trailblazing RAG applications from the rest, paving the way for innovative and impactful solutions in the business world.
