Quizaic — A Generative AI Case Study

Part 5— Lessons Learned

Marc Cohen
Google Cloud - Community
11 min read · Jul 15, 2024


This is the fifth and final article in a series about a demo application I created called Quizaic (rhymes with mosaic), which uses generative AI to create and play high-quality trivia quizzes.

Here’s a table of contents for the articles in this series:

In the previous article in this series, we covered some ways to automatically verify and quantify the accuracy of generated quizzes. In this article, we reflect on our experience building Quizaic, with a focus on some of the lessons we’ve learned while harnessing this strange new technology. This list was compiled in collaboration with my good friend and colleague Mete Atamel.

It’s surprisingly easy to do hard things with GenAI

An anecdote will help illustrate this point. While preparing for a talk on this app in Stockholm, I decided it would be much more interesting to that particular audience if I could generate a quiz in Swedish. I added the words “in Swedish” to my prompt and, voilà, Quizaic suddenly spoke fluent Swedish. This was an eye-opener for me because I effectively enabled support for another language by adding two words to a text file.

Of course, proper multilingual support required a user interface enhancement (to let the quiz creator select the desired language for a given quiz), API and storage support for the chosen language, and templating a language parameter in the generator prompt. However, the guts of the feature came for free by simply adjusting the prompt provided to a large language model, which still feels like magic to me.

But it’s also hard to do things well and consistently

One of the tricky challenges we encountered was extracting JSON encoding from an LLM. When asking for something “in JSON”, I’ve received responses that violated strict JSON encoding rules in various ways:

  • prefixed with ```json and postfixed with ```
  • single quotes instead of double quotes around keys and string values
  • substrings not properly escaped, which terminated the JSON document prematurely
  • extraneous material before or after the JSON document, for example: “Here’s the JSON document you requested…”, which is great for a human but undesirable for a software consumer.

Since I had this experience, more reliable ways to generate JSON output have been implemented (see the new controlled generation feature of Vertex AI), but this example speaks to a broader point: you won’t always get the format you expect.
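
For reference, here’s a rough sketch of what controlled generation looks like with the Vertex AI Python SDK. This isn’t the code Quizaic shipped with; the project and location values are placeholders, and the exact parameters depend on your SDK version:

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-pro")

response = model.generate_content(
    "Generate 5 trivia questions about science.",
    generation_config=GenerationConfig(
        # Ask the model to emit raw JSON, with no prose or markdown fences.
        response_mime_type="application/json",
        # A response_schema can also be supplied to further constrain the structure.
    ),
)
quiz_json = response.text  # should now parse cleanly with json.loads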

Lots of issues can crop up when trying to get an LLM to generate something for consumption by a program. The approach I’ve settled on is a modern variation on Postel’s Law, which advises: “be conservative in what you send, and be liberal in what you accept”.

My translation of this principle into the LLM world is “be as specific as possible in your prompt, and be as tolerant as possible when parsing the results”. So, as we saw in Part 3 in this series, my prompt is fairly specific about the JSON format I want to see:

RULES:

- Accuracy is critical.
- Each question must have exactly one correct response, selected from the responses array.
- Output should be limited to a json array of questions, each of which is an object containing quoted keys “question”, “responses”, and “correct”.
- Don’t return anything other than the json document.

This gives me a relatively high probability of obtaining a usable response; however, I don’t assume the model listened to my requirements. I use the following conventional programming logic to make sure the response fits my specification, and to fix it if needed:

import json
import re

import requests

# 'url' is assumed to be defined elsewhere (the quiz-generation endpoint).
response = requests.get(url)
prediction = response.text.strip()

# Strip any leading markdown code fence such as ```json.
prediction = re.sub('.*``` *(json)?', '', prediction)

# Discard anything before the opening bracket of the JSON array.
prediction = prediction[prediction.find('['):]

# Keep only the balanced outermost JSON array, dropping any trailing text.
parsed = ""
level = 0
for i in prediction:
    if i == "[":
        level += 1
    elif i == "]":
        level -= 1
    parsed += i
    if level <= 0:
        break
prediction = parsed

print("prediction=", prediction)
quiz = json.loads(prediction)
print(f"custom quiz: {quiz=}")

In this example I’m ignoring anything other than a JSON document starting and ending with square brackets ([ and ]).

I call this “belt and suspenders” programming — I tell the model “Don’t return anything other than a json document” and then I make sure it listened to me.

Accept the uncertainty of LLMs

LLMs are like people. They keep changing their minds. Sometimes the same model with the same prompt you used yesterday will suddenly give you a different answer. Not surprisingly, upgrading to a new model will often produce different results than you’ve been getting in the past, even for the same prompt. And, of course, switching to an entirely different model will often result in a different response to the same prompt. To summarize:

  • Same prompt, same model ⇒ different output
  • Same prompt, same model gets updated ⇒ different output
  • Same prompt, different model ⇒ different output

Your code has to be written in such a way that it can detect when a response doesn’t match its expectations. In other words, it’s important to spend some time designing validation code, which verifies LLM results in real time, and takes appropriate recovery actions when the validation step fails.
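
Here’s a simplified sketch of the kind of validation-and-retry loop I mean. The generate_quiz_text function is a hypothetical wrapper that sends a prompt to the model and returns raw text, and the checks mirror the quiz schema described earlier in this series:

import json

def generate_validated_quiz(generate_quiz_text, prompt, max_attempts=3):
    """Call the model, validate the result, and retry when validation fails."""
    for attempt in range(max_attempts):
        raw = generate_quiz_text(prompt)
        try:
            quiz = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if not isinstance(quiz, list) or not quiz:
            continue  # empty or wrong shape: try again
        if all({"question", "responses", "correct"} <= set(q) for q in quiz):
            return quiz  # passed validation
    raise ValueError(f"no valid quiz after {max_attempts} attempts")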

Pin model versions

When making LLM requests, you can refer to a model family, a specific release of a model, or even a specific version of a release. To minimize future surprises, I recommend pinning your requests to the most specific model name you have access to.

To be more specific, it’s tempting to use “gemini-1.5-pro” as your model name because that will always provide access to the latest and greatest version of Gemini 1.5 Pro without requiring any code changes. But if, for any reason, a newer model doesn’t work as well with your particular app, you may be in for a negative surprise. Even worse, your users will likely experience this surprise before you know about it. You never want your end users to also be your testers.

By using a model name like “gemini-1.0-pro@002”, you know your users will have the experience you and your automated test suites have exercised. When you try a newer version and decide you want to deploy it, you can update your code to use the new model name and introduce changes to your app on your terms.
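
In the Vertex AI Python SDK, that choice comes down to the model string you pass when constructing the model object. A minimal sketch (project and location are placeholders, and the exact version-naming syntax varies a bit across API surfaces):

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")

# Pinned: behavior changes only when we deliberately change this string.
model = GenerativeModel("gemini-1.0-pro-002")

# Unpinned: silently tracks whatever the latest release happens to be.
# model = GenerativeModel("gemini-1.5-pro")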

Free upgrades with new/updated models

One of the coolest things about using an LLM is that when the model gets better, your results can automatically improve without needing to change your code (apart from which model/version you use). We saw this first-hand when we found the quizzes generated by Gemini 1.0 to be of so-so quality. When we switched to Gemini 1.5, our quiz quality suddenly improved drastically.

By leveraging a new and improved model, our application quality got much better, without us needing to do anything special, other than select the new model. This is the AI version of “a rising tide lifts all boats”. Of course the opposite can also be true — a new version of a model can reduce the quality of your application results. This is why you need to have an automated regression testing mechanism and you should also pin your model/versions, as mentioned in the previous section.

Do you even need an LLM?

When you start working with LLMs, they feel like magic and it’s easy to fall into the “when you learn how to use a hammer, everything looks like a nail” trap. It’s just so easy to do so many things that it feels like the solution to every problem. But this power comes with the cost of dealing with breakage, uncertainty, and hallucinations, among other things.

As an example, we considered adding free-form quizzes to Quizaic, where instead of generating multiple-choice questions, a quiz creator could request free-form response questions. An interesting challenge is then how to grade such responses. Multiple-choice grading is trivial (did the user select the correct one of four options?), but imagine the following free-form question:

Who was the first US President?

Some possible responses might be:

  • George Washington
  • Washington
  • President Washington
  • George
  • GW
  • Joe Biden

Some of these are clearly correct, others are clearly wrong, and some are in the fuzzy space in the middle. This is not a trivial thing to automate. My first idea, while enthralled with the power of LLMs, was to use an LLM to grade the response with a prompt like this:

Is {response} a reasonably correct answer to this question: {question}?

While this would probably work most of the time, my collaborator noted that this solution might be overkill. For one thing, it would require an LLM round trip for every single player response, adding significant cost and latency. A better solution turned out to be a well-known Python library that implements fuzzy matching. That algorithm is well established, fast, computationally inexpensive, and good enough for the problem at hand.
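
To make the idea concrete, here’s a sketch of what such grading might look like using thefuzz, one popular fuzzy-matching library (the specific library and threshold shown here are illustrative choices, not necessarily the ones Quizaic uses):

from thefuzz import fuzz  # one common fuzzy-matching library; rapidfuzz is similar

def grade_free_form(player_answer, correct_answer, threshold=80):
    """Accept a free-form answer if it fuzzily matches the reference answer."""
    score = fuzz.token_set_ratio(player_answer.lower(), correct_answer.lower())
    return score >= threshold

# grade_free_form("Washington", "George Washington")  -> True
# grade_free_form("Joe Biden", "George Washington")   -> False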

LLMs are so powerful that you may find yourself tempted to use them to obtain a “perfect” but unreliable solution to any given problem. But sometimes perfect is the enemy of good enough, and you’d get more done, with greater reliability and better performance, by using a good old tried and true approach.

Manage prompts like code

There is sometimes a tendency to embed a prompt as a multiline string in your source code. Having everything in one file may make things a bit simpler while testing a prototype, but your prompt will rapidly expand, and eventually be templated as well. Having prompts embedded in code makes them hard to find and easy to accidentally break.

So keep your prompts in separate files so that they can be easily found and managed as separate units. If you have a collection of prompts, organize them into an aptly named folder. For example, a structure like this might work well for an app like Quizaic that generates quizzes and images:

prompts/
  quizgen.txt
  imagegen.txt

In short, treat prompts like first-class citizens in your source code.
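
A minimal sketch of loading and templating a prompt file from that folder (the parameter names are illustrative, not Quizaic’s actual template variables):

from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name, **params):
    """Read a prompt template from the prompts/ folder and fill in its parameters."""
    template = (PROMPT_DIR / f"{name}.txt").read_text()
    return template.format(**params)

# e.g. quizgen.txt might contain "... {num_questions} questions about {topic} in {language} ..."
prompt = load_prompt("quizgen", topic="Science", num_questions=5, language="Swedish")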

Code defensively

We’ve already explained how LLM calls can fail. Anyone who has written database code knows that bad things often happen. While completing a transaction, the server you’re connected to could have a power outage, or the network connecting you could temporarily fail. You need to account for such failures in your code.

The same philosophy applies here, except the failure modes are more extensive and more common. Assume that at any time an LLM request can fail in a “hard” way (e.g. timing out waiting for a response) or in a “soft” way (e.g. returning a response that makes no sense). Following age-old engineering practices from the database world, add logic to detect LLM response failures, retry when appropriate, and keep the requestor/user informed at all times.

Some examples from our app:

  • LLMs sometimes gave us malformed JSON. We handle this by dynamically parsing the returned JSON to validate it in real time, and we refuse to store or return any content that fails this validation step.
  • LLMs sometimes returned empty results. In our design phase, we consider the severity of an empty result and act accordingly (a sketch of this pattern follows the list). For example, an empty quiz is useless, while an empty image is non-fatal. So when we detect an empty quiz, we abort the request, whereas if we detect an empty image, we simply substitute a default image (the Quizaic logo works fine here).
  • LLMs can be too cautious, assuming your topic or the generated quiz includes sensitive information. Often this is because the safety settings in your requests are too restrictive. Consider relaxing your safety settings if that improves your success rate without introducing too much risk to your user experience.
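
Here’s a sketch of how the empty-result handling above might look in code. The generate_quiz and generate_image functions stand in for Quizaic’s actual model wrappers, and the default image path is a placeholder:

DEFAULT_IMAGE = "static/quizaic-logo.png"  # placeholder path for the fallback logo

def create_quiz_assets(topic):
    quiz = generate_quiz(topic)  # hypothetical quiz-generation wrapper
    if not quiz:
        # An empty quiz is fatal: abort the request.
        raise RuntimeError(f"quiz generation failed for topic {topic!r}")

    image = generate_image(topic)  # hypothetical image-generation wrapper
    if not image:
        # An empty image is non-fatal: fall back to a default image.
        image = DEFAULT_IMAGE
    return quiz, image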

Consider using a higher level library like LangChain

We started Quizaic relatively early in the evolution of the LLM ecosystem. LangChain existed, but it didn’t offer much value at the time versus using the Google Cloud client library directly, and using the client library avoided some unnecessary complexity.

But since then, the number and range of models, and the complexity of ways to use those models, have rapidly increased. For example, RAG (Retrieval Augmented Generation) is now a popular grounding technique. Adding RAG support is easier when the library you’re using is built specifically to make things like that easy.

Quizaic was built as a demonstration vehicle for Google services, so we never needed support for other providers. But in the real world, you may want to test or deploy your app with the ability to work with LLMs from multiple sources, or to switch sources over time. Industry-standard libraries like LangChain are inherently provider-agnostic, so they make it easy to switch models or use multiple models at the same time, even from different providers.

By the way, LangChain is no longer the only player in this space, though it still predominates. Just as we’ve seen an explosion of web programming frameworks over the years, we’re seeing an ever-growing range of options in the LLM abstraction space. That’s a good thing, because it means there will be competition, and market forces will shake out the best options over time. But in the meantime, you’ll need to do your homework to choose the best framework for your needs.

Whether you end up using a framework or not, consider adding Google search grounding or your own custom grounding to improve the accuracy of your results.

Good old software engineering tricks still work

One of the biggest current problems with embedding LLMs in our apps is that they tend to be slow. Of course, that’s a relative statement; “slow” in this context means noticeable by humans, i.e. on the order of seconds or tens of seconds. For readers old enough to remember the early internet, I like to say that we’re in the “2400 baud modem” period of LLM development. It’s really cool, really exciting technology, but also quite slow. We can mitigate this latency’s effect on the user experience by employing the tools we’ve been using for decades to address other forms of latency, for example:

  • batching combinable requests to minimize number of round trips
  • parallelizing LLM requests (for example, quiz and image generation can run in parallel in Quizaic; see the sketch after this list)
  • caching common responses
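
For example, here’s one way to overlap quiz and image generation using Python’s standard library; generate_quiz and generate_image are again hypothetical wrappers around the two model calls:

from concurrent.futures import ThreadPoolExecutor

def create_quiz_and_image(topic):
    """Run both model calls concurrently so total latency is roughly the slower of the two."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        quiz_future = pool.submit(generate_quiz, topic)
        image_future = pool.submit(generate_image, topic)
        return quiz_future.result(), image_future.result()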

Automated testing is as important as ever

Traditionally, it’s been a best practice in the software industry to run automated test suites every time we change our software, and especially before we deploy a new version. This is more important than ever, because now we’re also subject to the vagaries and performance of the LLMs used by our apps. Regular unit, functional, integration, and regression tests are critical for ensuring continuous, reproducible, quality behavior for our end users.
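
As a small illustration, a regression suite for a quiz generator might include structural checks like these (pytest-style; generate_quiz is a hypothetical wrapper around the deployed prompt and model):

import pytest

@pytest.mark.parametrize("topic", ["Science", "History", "Music"])
def test_quiz_structure(topic):
    quiz = generate_quiz(topic, num_questions=5)
    assert len(quiz) == 5
    for q in quiz:
        assert {"question", "responses", "correct"} <= set(q)
        assert q["correct"] in q["responses"]           # the correct answer is one of the options
        assert q["responses"].count(q["correct"]) == 1  # and it appears exactly once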

Check out some of the emerging frameworks designed specifically to help in this area. For example, DeepEval looks promising. Here’s an example of how to use DeepEval with Vertex AI.

Testing quality and accuracy is more difficult

Because our query results can take the form of text or imagery or other forms of media, we can’t apply the same quantitative or logical tests we’ve been using for decades to assess the quality of a result. We have to be more creative. For more on this, see Part 4 — Assessing Quiz Accuracy in this series, where we show how to use an LLM to assess the accuracy of LLM output.

Closing thoughts

This completes our five-part series on how we built a modern, cloud-native, engaging and fun AI app called Quizaic. We’ll be open-sourcing the code shortly, and I’ll link to the GitHub repo in these articles as soon as it’s available. In the meantime, if you have any questions or comments, feel free to reach out to us at quizaic@google.com, and thanks for reading!
