From idea to reality: Elevating our customer support through generative AI

How we prototyped and enhanced the Vimeo Help Desk through rigorous testing.

Jon Ruddell
Vimeo Engineering Blog
13 min read · Aug 11


In the ever-evolving digital age, the implementation of artificial intelligence (AI) across business operations is becoming the norm. Here at Vimeo, we’re deeply invested. We’ve recently launched several AI-powered features that dramatically lower the barriers to creating and editing videos, enabling anyone to record in one take and edit in seconds. Our users can generate a script from a text prompt, record from the camera or screen using a smart teleprompter, and instantly remove unwanted content such as ums, uhs, and long pauses.

This article explores a project at Vimeo where the primary scope was to test and demonstrate the power of generative AI, leveraging existing support content as a data source. Specifically, this took the form of a help desk chat prototype that combines AI technology with the knowledge found in Vimeo’s Help Center articles to improve customer support with efficient, useful, and personalized answers to questions.

The opportunity

In our exploration of AI capabilities, we used support content as our testbed because it provided a clear use case with an available public dataset. It’s key to understand that our primary focus was on the broader applications of AI and not on revolutionizing our customer support system. That said, it was important for us to delve into the real-world challenges faced in the support domain.

Customers demand swift and accurate solutions to their queries. Traditional customer support systems often fall short, grappling with overwhelming volumes of queries, delays in response time, and a lack of personalized solutions. This inefficiency can lead to dissatisfied customers and missed business opportunities.

The impact of a poorly functioning customer support system extends beyond business operations — it affects the customers directly. Frustration from delayed responses and inaccurate solutions can lead to customer churn. Hence, enhancing customer satisfaction and retention is paramount, and an efficient support system is key.

Today, customers with an issue or question about Vimeo have a few options. They can either open a ticket with the Vimeo support team, search for a related article in the Vimeo Help Center, or chat with a bot that leverages a third-party platform to understand the intent of a question and, based on that, either provide guided workflows or specific answers parsed from existing help content.

For instance, consider a user looking for information about Vimeo’s feature for restricting the embedding of a video to their domain. Figure 1 displays what the user finds when searching for domain restrict embed.

Figure 1. A search query for “domain restrict embed” in Vimeo’s Help Center.

Unfortunately, none of these options provides any relevant information at a glance. Querying the chatbot with this phrase doesn’t yield immediately obvious answers, and it may end up directing the user to submit a ticket to the support team.

The solution

Vimeo’s prototype bot has the potential to improve an experience like this by leveraging the latest in generative AI. Specifically, by integrating AI technology with our existing resources, we can develop a more responsive and effective support system, where customers input their questions and receive immediate, accurate, and helpful responses, as shown in Figure 2 with my favorite subject.

Figure 2. A question asking about domain restrict embeds in Vimeo’s AI help desk chat prototype. A complete response with actionable instructions for this precise query appears in the chat.

Here’s how it works.

Indexing Zendesk articles in a vector store, or “What’s our vector, Victor?”

The Vimeo Help Center is powered by Zendesk, so we began the process by indexing our Zendesk articles in a vector store. A vector store is a storage space for vector representations of text, also known as embeddings. An embedding is an array of numbers, with semantically similar phrases mapping to numerically closer vectors. In a simplified example, the embedding for the word animal could be [1,2,3], dog could be [1,2,4], and coffee could be [5,8,1]. You can see that the embeddings for animal and dog are closer together than that of the unrelated word coffee.

Filling our vector store with the embeddings for our Help Center articles enables high-speed retrieval and comparison of the embeddings, which is ideal for matching customer queries with relevant articles. See Figure 3 below for a visual of the process.

Figure 3. The scraping, splitting, and ingestion of the help articles into the vector store. Starting from the scraped articles, the process begins by parsing the document, splitting it into chunks containing several lines of text, transforming the chunks into embeddings, then finally storing those embeddings in the vector store.

The first step was to scrape all of the published articles from Vimeo’s Help Center instance through Zendesk’s Help Center API. The Zendesk API exposes all the needed information and metadata, including the article’s content, title, tags, and full URL.

While it’s possible to download and index into the vector store directly without an intermediate file, the downloaded articles can be handy to help debug responses later on. The articles are saved in the standard format shown in the code block below, which enables easier ingestion from a variety of sources other than Zendesk, such as GitHub or Confluence:

"body": "A trademark is a brand name, slogan, or logo that serves to identify...",
"metadata": {
"title": "What is a trademark?",
"html_url": "",
"label_names": [
"last_modified": "2023-03-08T16:16:31Z"

The next step was to load the JSON from the above and split it into chunks, using the HTML tags as delimiters. This enables querying for the specific sections of a relevant article, instead of returning the entire document. We converted the document chunks into embeddings through an AI provider’s API, then indexed these embeddings into the vector store. The metadata was added in plain text to the entry in the vector store, so that when an article was returned as a result, we had the full link, title, and any tags from the data source. The metadata also came in handy for handling updates to the documentation — a webhook from Zendesk to our backend enabled adding or removing articles, as well as replacing an indexed document with the latest version.
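The chunking step can be illustrated with a simplified splitter that breaks an article body on HTML heading tags, so each chunk covers one section. The real pipeline uses an HTML-aware splitter; this regex version and the sample article are just for illustration.

```javascript
// Simplified sketch of the chunking step: split an article body before
// each HTML heading tag so every chunk covers one section.
function splitByHeadings(html) {
  return html
    .split(/(?=<h[1-6][^>]*>)/) // split *before* each heading tag
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}

// Hypothetical article body for illustration.
const article =
  '<h2>Embedding basics</h2><p>Copy the embed code...</p>' +
  '<h2>Domain restrictions</h2><p>Limit playback to your site...</p>';

const chunks = splitByHeadings(article);
console.log(chunks.length); // → 2, one chunk per section
```

Each resulting chunk would then be embedded and stored alongside its article’s metadata.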

Which vector store provider is the best? That really depends on the use case. The simplest implementation is something on a local disk like HNSWLib, which served our purposes for the initial prototype with a dataset of fewer than a thousand articles. Using a local disk also keeps any sensitive information out of the hands of third parties, avoiding the risk of a third-party security breach. This is especially important for data such as internal documentation, but matters less for us here, since the help articles are already public.

Once the articles were ingested into the vector store, we were ready to chat.

Chatting with our data

This is where the fun began. We used the ConversationalRetrievalQAChain class from Langchain to connect the vector store to the provider of our large language model (LLM). Langchain provides a simple interface for interacting with embedding and LLM APIs from various AI providers, and it coordinates the multiple prompts sent to an LLM as part of a single request; that coordination is the chain portion of the conversational retrieval QA chain. This is what powers a large part of the backend business logic.

First, we took any available chat history from the current session and combined it with the latest question from the user. The chat transcript was sent to the LLM to rephrase the input as a standalone question, which enabled us to use the context from the existing conversation to provide the best answers. This also helped to fix issues such as misspellings. For example, if someone asks about embedding a video, then asks about live videos in a followup question, they presumably are wondering about embedding a live video. If we take the second question at face value, without reference to the previous question, we likely won’t get the expected results when searching for the relevant help articles. Figure 4 gives an example of how this is generated.

Figure 4. The generation of the standalone question from the chat history and a followup question. The question and any available chat history are sent to the LLM to provide it context to output a standalone question.
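The rephrasing step boils down to formatting the history and the new question into a single prompt. The sketch below mirrors the kind of condense-question prompt ConversationalRetrievalQAChain uses internally; the wording here is illustrative, not Langchain’s exact template.

```javascript
// Sketch of the condense-question step: format the chat history and the
// follow-up question into one prompt asking the LLM for a standalone question.
function buildCondensePrompt(chatHistory, followUpQuestion) {
  const history = chatHistory
    .map(([q, a]) => `Human: ${q}\nAssistant: ${a}`)
    .join('\n');
  return [
    'Given the following conversation and a follow up question,',
    'rephrase the follow up question to be a standalone question.',
    '',
    `Chat History:\n${history}`,
    `Follow Up Input: ${followUpQuestion}`,
    'Standalone question:',
  ].join('\n');
}

const prompt = buildCondensePrompt(
  [['How do I embed a video?', 'Use the embed code from the Share menu.']],
  'What about live videos?'
);
// The LLM would then respond with something like:
// "How do I embed a live video?"
```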

Next, the standalone question was transformed into its embedding representation using the same APIs as before when ingesting the articles into the vector store. This enabled us to query the vector store for articles with similar content to the question. The vector store returned the matching chunks from the relevant articles, along with the associated metadata, as shown in Figure 5.

Figure 5. The process transforms a standalone question into an embedding that’s used to query the vector store, which returns relevant document chunks.
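Under the hood, the query step is a nearest-neighbor search over the stored embeddings. Here’s a toy in-memory version with made-up three-dimensional embeddings; in the prototype, HNSWLib does this at scale with approximate nearest-neighbor search.

```javascript
// Toy in-memory vector store: rank stored chunks by cosine similarity
// to the query embedding and return the top k.
function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function similaritySearch(store, queryEmbedding, k) {
  return store
    .map((entry) => ({ ...entry, score: cosine(entry.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Hypothetical chunks with toy embeddings.
const store = [
  { text: 'Embedding videos on your site...', embedding: [1, 2, 4] },
  { text: 'Uploading from your phone...', embedding: [5, 8, 1] },
  { text: 'Domain-restricting embeds...', embedding: [1, 3, 4] },
];

// The two chunks about embedding rank above the unrelated upload chunk.
const results = similaritySearch(store, [1, 2, 3], 2);
console.log(results.map((r) => r.text));
```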

Finally, the relevant documents were passed along together with the standalone question to the LLM, to generate the final answer. Figure 6 shows how this process is wrapped up.

Figure 6. The generation of the final answer, given the standalone question and relevant sections from Help Center articles. The process begins with the standalone question and related documents, which are sent to the LLM to output the final answer.

What does all of this look like in the code? Thanks to Langchain, it’s an incredibly small implementation, requiring just a few lines to accomplish all of the above. The only difference for a followup question is that the chat history of the current conversation is loaded from a database and added into the flow — that’s all there is to it.

Here’s a simplified example of the code for Langchain logic using Vertex AI to answer questions over Zendesk articles:

const question = "How can I embed a video?";
const chatHistory = [];
const chainOptions = {
  returnSourceDocuments: true,
};
const vectorstore = await HNSWLib.load('./vectorstore', new OpenAIEmbeddings());
const aiModel = new GoogleVertexAI({
  verbose: true,
  timeout: 20000,
  temperature: 0,
  maxOutputTokens: MAX_TOKENS,
});
const chain = ConversationalRetrievalQAChain.fromLLM(
  aiModel,
  vectorstore.asRetriever(),
  chainOptions
);
const response = await{ question, chat_history: chatHistory });
// {
//   text: "Go to the video page and ...",
//   sourceDocuments: [
//     { title: "Embedding videos", html_url: "" }
//   ]
// }

The versatility of our solution

Langchain’s flexibility also enables several other nice-to-have features for our help desk chat prototype, detailed below.

Switching out models

We can easily switch out different LLM and embedding APIs based on the specific needs and complexity of the questions asked. This ensures that our prototype can handle a wide array of queries, providing the most accurate and relevant responses possible. It also helps us compare the performance between different LLMs and APIs, and provides redundancy during outages, as the AI provider can be swapped out seamlessly.

Switching out vector stores

In addition to switching AI models, we can also switch out the vector store that is queried during a request. This capability enables us to tailor our data retrieval system to suit different types of queries and datasets. For example, one vector store can be used to index internal developer documentation from its various sources, such as GitHub, Confluence, Google Docs, Zendesk, and so on, which could give employees a one-stop search for all the information they seek.

Comparing AI models

As mentioned above, the ability to switch between LLMs provided us with a way to compare the models. While all models performed decently when given the standalone question and relevant article chunks, there were some differences. The four models we tested were Google Vertex AI Chat Bison, OpenAI ChatGPT 3.5 Turbo, OpenAI ChatGPT 4, and Azure OpenAI ChatGPT 3.5 Turbo.

Here’s how they fared in our comparison.


One aspect in which the Google Vertex AI Chat Bison model excels is its concise answers. Rather than generating paragraph after paragraph to answer simple questions, Bison provides a shorter answer, utilizing bullet points. This shows that it follows the instruction prompt better than OpenAI’s ChatGPT models, which generate longer responses. It also reduces the time taken for Bison to generate an answer compared to ChatGPT, as the number of generated characters is much lower. This leads to some cost savings as well, as pricing is based on the number of characters in an input and output.

A benefit to using Google Vertex AI models while deploying an application to Google Cloud Platform is that the application can use APIs such as Workload Identity. This enables Kubernetes containers and other deployments to automatically authenticate with Vertex AI instead of having to pass around an API key like with OpenAI.

Another difference between the Bison model and ChatGPT is that Bison waits to generate the entire answer before returning any information. While the user is waiting a bit longer for a response, the answer comes back all at once, similar to what users have gotten used to with standard non-AI-related API requests.

On the other hand, OpenAI’s ChatGPT models have a feature called streaming, where the tokens are streamed to the UI upon generation rather than waiting for the complete answer to generate before returning information. This gives the user a sense that the LLM is typing a response. The benefit with streaming is that the user receives immediate feedback to indicate that their question is in the process of being answered. However, one downside is that when OpenAI’s APIs are receiving heavy usage, the streaming and overall response speeds seem to slow to a crawl. In standard cases, it responds quickly, but you can see an example of the slowness in the following demo video.

A video showing the slowness of GPT APIs when the API is under heavy load.

When comparing OpenAI’s models against each other, ChatGPT 4 delivers stronger and more concise answers than ChatGPT 3.5 Turbo, but the response speed is dramatically reduced and the price per token is more than doubled.

We also tested the option of deploying OpenAI’s ChatGPT models to a custom Azure instance. This has higher costs than only using the OpenAI API, but provides more reliability, security, and privacy over your data. It offers roughly the same performance as the public OpenAI API, but unlike the heavy-load slowdowns mentioned above, other users of the API shouldn’t affect your Azure deployment.


Pricing is always a topic of discussion when comparing models. At the time of this writing, the Google Vertex AI Chat Bison model costs $0.0005 per 1,000 characters for both input and output, while OpenAI ChatGPT 3.5 Turbo charges $0.0015 per 1,000 tokens of input and $0.002 per 1,000 tokens of output. At a glance, this appears to be nearly 3.5 times the cost, but there’s an important distinction between tokens and characters: a token typically spans two to five characters, depending on the words and language used. So, in the end, the prices are generally close to each other for all options using a managed API.
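The effect of the tokens-versus-characters distinction is easy to see with some back-of-the-envelope arithmetic, using the output prices quoted above. The chars-per-token figure is the key variable, since GPT pricing is per token while Bison pricing is per character.

```javascript
// Prices quoted above, as of this writing.
const bisonPerKChar = 0.0005;     // $ per 1,000 characters (input or output)
const gptOutputPerKToken = 0.002; // $ per 1,000 output tokens

// Convert GPT's per-token price to a per-character price.
function gptCostPerKChar(charsPerToken) {
  return gptOutputPerKToken / charsPerToken;
}

// At 4 characters per token, GPT-3.5 output costs the same as Bison:
console.log(gptCostPerKChar(4)); // → 0.0005
// At 2 characters per token, it is 4x Bison's price:
console.log(gptCostPerKChar(2)); // → 0.001
```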

Comparison verdict

The Google Vertex AI Chat Bison model excels in terms of its concise response generation, adhering closely to the instruction prompt, which in turn leads to cost effectiveness and efficient processing. Additionally, the seamless integration with Google Cloud Platform and automatic authentication contributes to its practicality in real-world applications.

Ultimately, the choice of model depends on the specific requirements of the application, and it’s important to consider the trade-offs between response quality, response time, pricing, and features like streaming. We’re going with Google Vertex AI Chat Bison for now, but as we continue to experiment to build the best user experience, we may choose a different LLM or even a combination of multiple different providers.

Interested in speed? Check out the following speed comparison video that includes all the models we tested.

A video comparing the response speed of various AI models for our use case.


The challenges

What’s a project without a few challenges? While building this prototype, we encountered some interesting aspects of LLMs.

Training data

One of the reasons we tag the metadata with the source URL is to provide links to the documents below the chat response. We attempted to have the AI model provide its own links to the documents, but ChatGPT would regularly return outdated or nonexistent links that were unrelated to the reference docs. After some debugging, we found out that ChatGPT contains an old copy of Vimeo’s Help Center in its training data! So even without providing any relevant docs, the model can still return somewhat correct information based on the version of Vimeo’s Help Center available in late 2021.

Quality assurance

Another challenge is ensuring the responses meet a certain level of quality. When relying on the AI to provide a response, the output can vary drastically. One workaround is setting the temperature parameter in the API request to 0, which reduces the creativity of the responses and makes the output far more consistent for the same question. Even then, the number of possible responses to an unlimited number of possible questions is challenging to verify by a quality assurance team.

One protection used to prevent questions unrelated to Vimeo is to include detailed instructions in the prompt, directing the AI to refuse to answer any questions not related to Vimeo and its features. This helps for unrelated questions and also dangerous questions such as, “What’s the recipe for dynamite?” In this case, the generated response informs the user that the AI can only answer questions about Vimeo and to avoid asking inappropriate questions.
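The guardrail amounts to a set of instructions prepended to the answer-generation prompt. The sketch below shows the shape of the technique; the exact wording used in the prototype differs.

```javascript
// Illustrative guardrail instructions prepended to the answer-generation
// prompt. The real prototype's wording differs; this shows the technique.
const systemInstructions = [
  'You are a help desk assistant for Vimeo.',
  'Only answer questions about Vimeo and its features,',
  'using the provided Help Center excerpts.',
  'If a question is unrelated to Vimeo, or is inappropriate or unsafe,',
  'politely refuse and remind the user you can only answer Vimeo questions.',
  "If the excerpts do not contain the answer, say you don't know",
  'rather than guessing.',
].join(' ');

function buildAnswerPrompt(context, question) {
  return `${systemInstructions}\n\nHelp Center excerpts:\n${context}\n\nQuestion: ${question}\nAnswer:`;
}

const prompt = buildAnswerPrompt(
  'To embed a video, copy the embed code...',
  'How can I embed a video?'
);
```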

Both AI providers offer safety and moderation features as well. Vertex AI has safety filters for flagging prompts that are considered harmful. For example, for the prompt about dynamite, the response correctly notes that the question is related to firearms and weapons, enabling us to flag those types of troublesome questions. OpenAI offers a separate moderation API endpoint with similar capabilities, but it requires extra integration efforts, since it’s not built into their LLM responses.

What the future holds

Our journey in building the Vimeo AI help desk chat prototype application underlines the transformative role of AI in enhancing customer support operations. The integration of our help articles into a vector store, combined with Langchain’s seamless interplay with embedding and LLMs, has resulted in a resilient system that’s adept at handling a broad range of customer inquiries.

Through careful evaluation, we went with the Google Vertex AI Chat Bison model for this use case, for its exceptional ability to provide concise responses quickly and its effortless authentication process. Despite initial hurdles relating to training data and quality assurance, we turned these challenges into learning opportunities that contributed to the evolution of our AI help desk chat prototype as a true proof of concept.

As we look ahead, we’re filled with anticipation about the possibilities that AI holds. This initial prototype is just a stepping stone to our larger vision. We’re keen on advancing the capabilities of AI models, exploring more use cases, and pushing the boundaries of what we can build.

Our goal is to introduce applications of this cutting-edge technology to our end users in various contexts in the near future, since we believe it can revolutionize customer experiences — whether in the context of video tools, support, or beyond. This project demonstrates the potential to offer our customers immediate, precise, and contextually relevant solutions to their inquiries, thereby streamlining their Vimeo experience. Stay tuned for more exciting updates on this project!



Jon Ruddell
Principal Engineer at Vimeo, member of the Office of the CTO team