Moving GPT from Cool to USEFUL — Part 3: From Playgrounds to Production

keira · Published in Ekohe · 8 min read · Mar 25, 2024

Image from Dlabs.ai webinar

Previously in our series, we demonstrated WHY custom GPT integrations are essential to the future of businesses by showing a few emerging capabilities of GPT models. We also shared our tips on HOW to find your niche: envisioning a successful GPT pilot app that addresses real business needs and promotes later adoption.

In this part of the series, we go one level down to practicality, discussing a few key steps that you and your collaborators won't want to miss when transforming an idea, carefully curated and tested for feasibility in playgrounds, into a reliable, consistent application in production.

Choose the right model

The first and foremost step is to choose the best LLM for the job. The landscape of LLMs is evolving almost too fast to keep up with, thanks to the many leading providers: OpenAI, Anthropic, Cohere, Google, Meta, and Mistral. Models for various purposes are trained and released in multiple ways, and their specifications differ in parameter count, training datasets, fine-tuning strategies, context length limits, and licenses of use. To cut through the technical fog, here are a few tips to guide you in choosing the right model.

1. The chosen model should first and foremost meet your quality criteria.

LLMs perform differently on different tasks. LLM leaderboards come in really handy for gauging the general performance of different models across tasks and benchmark datasets, but it's more important to test whether a model works on your particular task. LMSYS.org also provides an arena to compare results from different models side by side. So not only can you check whether a model is powerful enough across tasks, you can also find one that suffices for your specific use case.

There is also research on visualizing LLM evaluations to make choosing the right model easier. LLM Comparator from Google is a good one to keep an eye on.

2. Open-source vs proprietary

Additionally, LLMs can be accessed differently.

Some models, such as OpenAI's GPTs and Anthropic's Claude, can only be accessed via a proprietary API: prompts containing your data are sent to the model provider, and you pay as you go by the token. This is the easiest way to get on board from the start, and it is usually how the most powerful models are served. However, in terms of data privacy, we do need to make sure we are not sending sensitive data to the model APIs.

For better data compliance, GPT models hosted on platforms like Azure can be a better option, where you get both managed infrastructure and more robust Service Level Agreements (SLAs).

If you don't want to rely on large corporations and want total control over the versions of both the LLMs and the data, open-source LLMs hosted on your chosen infrastructure offer more flexibility. For very particular tasks, it's also possible to fine-tune your own variation of an LLM for better performance! However, fine-tuning often costs many hours and dollars, so it's only recommended on the rare occasions when it's truly necessary.

3. Input length & function calling

LLMs come with different limits on the maximum number of tokens that can be processed within each call, so the context length your task needs is also a factor when choosing the right model. A hidden contributor to context length comes from intermediate steps with function calling: even though their results are not designed to be shown upfront, they still need to be included in the context during generation. In such cases, you might need a model with a larger context window.
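As a quick sanity check, a token budget like this can be sketched in a few lines of Python. The ~4 characters per token heuristic and the function names here are illustrative assumptions; a real app should count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # In practice, use the model's own tokenizer instead.
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, function_results: list[str],
                    context_window: int, reserved_for_output: int = 1024) -> bool:
    """Check whether the prompt plus intermediate function-call results
    leave enough room in the context window for the model's response."""
    used = estimate_tokens(prompt) + sum(estimate_tokens(r) for r in function_results)
    return used + reserved_for_output <= context_window

# A short prompt with no tool results easily fits an 8k window...
print(fits_in_context("Summarize this article.", [], 8192))             # True
# ...but bulky intermediate function results can silently blow the budget.
print(fits_in_context("Summarize this article.", ["x" * 40000], 8192))  # False
```

The point of the sketch: intermediate results count against the same window as the visible prompt, so budget for them explicitly.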

4. Estimate cost

The cost of different models is important to consider too! If you go with an open-source model, the cost comes only from the infrastructure for hosting. Proprietary models, on the other hand, are usually paid for by the tokens sent to them. In general, smaller models are cheaper to query, as are models with shorter context windows. Other factors matter for estimating cost as well, such as the expected data payloads, whether there are additional steps for function calling, and how many requests will be sent at what frequency.
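A back-of-the-envelope estimate for a pay-per-token API might look like the sketch below. The function and the prices in the example are hypothetical; always check the provider's current pricing page.

```python
def estimate_monthly_cost(requests_per_day: int,
                          input_tokens_per_request: int,
                          output_tokens_per_request: int,
                          price_in_per_1k: float,
                          price_out_per_1k: float) -> float:
    """Rough monthly cost for a pay-per-token API (30-day month)."""
    daily = requests_per_day * (
        input_tokens_per_request / 1000 * price_in_per_1k
        + output_tokens_per_request / 1000 * price_out_per_1k
    )
    return round(daily * 30, 2)

# Hypothetical prices per 1k tokens; substitute your provider's real rates.
print(estimate_monthly_cost(1000, 2000, 500, 0.01, 0.03))  # 1050.0
```

Plugging in your expected payload sizes and request frequency this way makes it easy to compare a proprietary API against the fixed hosting cost of an open-source deployment.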

Understand the components of the GPT app to build

Shown below are all of the possible components of a GPT app. Some are essential while others are optional. It's important to be aware that it isn't just the prompts that need to be engineered. On the contrary, it's the whole workflow: data aggregation, data preprocessing, data extraction (often referred to as retrieval), prompt engineering, output parsing, task creation, and execution. All of these steps need to be carefully engineered. We will delve into the two most important ones today.


Prompt Tuning

We touched on what prompts are and what they consist of in our last article. Today, from the perspective of building an overall GPT app, there are mainly two things to consider for effective prompt engineering.

First of all, aside from the original task prompts, consider designing contextual prompts to handle more complicated tasks. Contextual prompts are invisible to users but can help with the overall quality of the results.

For example, if the app is to answer user questions about an article, the task prompt would be something like "Based on the presented information, answer the user's question." But we can also add contextual prompts like "Come up with search terms based on the user's question" and "Do we have enough information to answer the user's query? If not, what is missing?" These prompts guide the model to handle various situations step by step, and they usually work like a charm.
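As a sketch of how such invisible contextual prompts might be wired in, the snippet below assembles a message list in the common chat-completions format (`system` and `user` roles). The helper name and the exact prompt wording are illustrative assumptions, not a prescribed API.

```python
def build_messages(article: str, user_question: str) -> list[dict]:
    """Assemble a chat request in which contextual prompts steer the model
    without ever being shown to the user."""
    return [
        # The visible task prompt:
        {"role": "system",
         "content": "Based on the presented information, answer the user's question."},
        # Contextual prompts, invisible to the user:
        {"role": "system",
         "content": "First, come up with search terms based on the user's question."},
        {"role": "system",
         "content": "Check: do we have enough information to answer? "
                    "If not, state what is missing."},
        {"role": "system", "content": f"Article:\n{article}"},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("GPT-4 was released in March 2023.",
                          "When was GPT-4 released?")
print(len(messages))  # 5
```

Only the final `user` entry reflects what the user typed; everything else is engineered context.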

Secondly, do a comparative analysis between prompt iterations across various cases. When you improve the prompt for one case, there's a chance you might break another. Keep an eye on all of the cases while you are tuning prompts, so that you can continuously push for better results without causing new problems.

Building an external knowledge base and retrieving relevant information

Of all the constraints that GPT applications come with, hallucinations and limited context windows are the most common, and a working retrieval mechanism backed by an external knowledge base can greatly alleviate both. That's why there is so much research on, and so many solutions for, Retrieval Augmented Generation (RAG) today.

One key factor affecting the quality of retrieval is data embedding. Embedding (or vectorization) is the process of converting a word or sentence into a point in a coordinate system where similarity can be easily measured. At its core, embedding semantically links questions to answers. And it's not only applicable to text, but also to images and audio. It is currently the gold standard for constructing a knowledge base.
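Under the hood, similarity between embedding vectors is typically measured with cosine similarity. A minimal sketch with toy three-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings": the question should sit closer to its answer
# than to an off-topic sentence.
question = [0.9, 0.1, 0.0]
answer   = [0.8, 0.2, 0.1]
offtopic = [0.0, 0.1, 0.9]

print(cosine_similarity(question, answer) > cosine_similarity(question, offtopic))  # True
```

This is the geometric intuition behind "semantically linking questions to answers": related texts end up pointing in similar directions in the embedding space.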

Then let's move to the actual process of retrieval. One of the trickiest parts we've found is that it can take a lot of trial and error to find the right threshold for separating relevant content once you've calculated similarities, as relevancy is itself quite a relative concept. Often, we need to go back to the prompt tuning step to generate more accurate search terms, or to involve additional rounds of retrieval to refine the context. Whatever it takes, though, the quality of the retrieved results directly affects the quality of the final outputs, and retrieval is without a doubt a crucial component where a lot of business logic resides.
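One way to make that trial and error more systematic is to sweep candidate thresholds over chunks with precomputed similarity scores and inspect what each cutoff keeps. The chunks and scores below are hypothetical:

```python
def retrieve(scored_chunks: list[tuple[str, float]], threshold: float) -> list[str]:
    """Keep only the chunks whose similarity score clears the threshold."""
    return [text for text, score in scored_chunks if score >= threshold]

# Hypothetical similarity scores from an earlier embedding step,
# for the query "How do refunds work?".
scored = [("pricing table", 0.91), ("refund policy", 0.78),
          ("company history", 0.42), ("office address", 0.15)]

# Sweeping thresholds makes the precision/recall trade-off visible:
for t in (0.2, 0.5, 0.8):
    print(t, retrieve(scored, t))
```

Too low a threshold drags in noise like "company history"; too high a threshold drops the genuinely relevant "refund policy". Inspecting sweeps like this against labeled examples is usually faster than guessing a cutoff.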

Test, test, test

The three "test"s aren't in the title just for emphasis; there are actually three areas that are particularly important when testing GPT applications.

Form a well-rounded question bank

A well-rounded testing question bank should include questions covering all of the following aspects, and it's recommended that it be constructed collectively by different teams.

  1. Whether the GPT agent is telling the truth, i.e., not hallucinating or straying from the referenced knowledge
  2. Whether the GPT agent is giving accurate answers, in terms of both the content and the format of the answers
  3. How the GPT agent handles edge cases for data compliance or data privacy needs
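A minimal harness over such a question bank might look like the sketch below. `ask_gpt` is a stub standing in for the real application call, and the questions, checks, and category names are illustrative assumptions:

```python
def ask_gpt(question: str) -> str:
    # Stub standing in for the deployed GPT application.
    return "Our refund window is 30 days."

QUESTION_BANK = [
    # (question, check on the answer, category)
    ("What is the refund window?",
     lambda a: "30 days" in a,                 # accuracy of content
     "accuracy"),
    ("What is my colleague's salary?",
     lambda a: "salary" not in a.lower(),      # privacy edge case
     "privacy"),
]

def run_bank() -> dict[str, list[bool]]:
    """Run every banked question through the app, grouped by category."""
    results: dict[str, list[bool]] = {}
    for question, check, category in QUESTION_BANK:
        results.setdefault(category, []).append(check(ask_gpt(question)))
    return results

print(run_bank())
```

Because each team contributes questions and checks in the same shape, the bank grows naturally as new hallucination, accuracy, and compliance cases are discovered.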

Keep the records of intermediate results

We learned this the hard way in one of our first projects. When refining for some edge cases found after deployment, we put a lot of effort into tuning the prompt rather than checking the trace of results through all of the intermediate steps. It turned out that one of the functions was not being triggered under a particular way of asking certain questions.

A lot of time would have been saved if we had designed the system to save intermediate results that could be traced back. The records would also have facilitated inter-team communication at the time, ensuring the team had a common understanding of the problem. So this is highly recommended!
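One lightweight way to get that traceability is to wrap each intermediate step so its inputs and outputs are recorded as they happen. A sketch, where the step names and functions are illustrative:

```python
import time

TRACE: list[dict] = []

def traced(step_name: str):
    """Decorator that records each intermediate step's inputs and outputs,
    so failures can be traced back after deployment."""
    def wrap(fn):
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            TRACE.append({"step": step_name, "args": args,
                          "result": result, "at": time.time()})
            return result
        return inner
    return wrap

@traced("search")
def search(query: str) -> list[str]:
    return ["doc-1", "doc-2"]   # stubbed retrieval step

@traced("answer")
def answer(docs: list[str]) -> str:
    return "Answer based on " + ", ".join(docs)  # stubbed generation step

answer(search("refund policy"))
print([t["step"] for t in TRACE])  # ['search', 'answer']
```

With a trace like this, a function that silently fails to trigger shows up as a missing or malformed entry, instead of days of blind prompt tuning.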

Test again when integrating the GPT app into systems.

It is not only about building a standalone GPT feature, but more about how the app fits into your broader business ecosystem. Usually, the core functions of a GPT app take a fraction of the time to build, and a ton more to integrate and to test after integration.

For this phase of testing, we may need to test response time, response format, and different edge cases relating to the whole system. We will need to collaborate and coordinate closely with front-end, back-end, and product teams to ensure a smooth user experience for the end product. It's also worth considering scaling the tests up into a more systematic approach during this phase: tools and frameworks such as LangSmith can help build automated testing pipelines. And most importantly, as the app is rolled out to heavier usage and a broader audience, we should stay as adaptive as possible to handle future changes!
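A response-time check, for instance, can start as simply as the sketch below. `call_app` is a stub standing in for the deployed endpoint, and the latency budget is an assumed figure to be agreed on with the product team:

```python
import time

def call_app(question: str) -> str:
    # Stub for the deployed GPT endpoint; the sleep simulates latency.
    time.sleep(0.01)
    return "ok"

def within_latency_budget(question: str, budget_seconds: float) -> bool:
    """Flag responses that blow the latency budget agreed with the
    front-end and product teams."""
    start = time.perf_counter()
    call_app(question)
    return time.perf_counter() - start <= budget_seconds

print(within_latency_budget("What is the refund window?", 2.0))  # True
```

Checks like this slot naturally into an automated pipeline, alongside assertions on response format and system-wide edge cases.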

Voilà!

There you have it! A detailed walkthrough of taking a GPT idea to production, with important tips gathered from our past projects, including both successful tries and lessons learned.

As with all technologies, GPTs unlock unexpected value while exposing unexpected challenges. We are very curious about where GPT and Generative AI will take us, and we can't wait to practice more and find out!
