Using Generative AI in Custom Applications

Rob Tabiner
KPMG UK Engineering
6 min read · Aug 8, 2023
Image by macrovector on Freepik

After ChatGPT burst onto the scene several months ago, the race to embed anything ‘AI’ into custom applications is gathering pace. Like many, I was totally blown away by ChatGPT the first time I used it. After a few interactions, however, I soon learned that I needed to structure my questions (or prompts) in the right way to get the results I was hoping for. Prompt Engineering is a hot topic at the moment, and one whose importance for the future workforce KPMG have recently underlined.

So the obvious question is: how could I use this technology inside the applications that I’ve built, using application-specific data to tailor the experience? Furthermore, how can I achieve this in a performant and consistent way that pleases end users?

The answer isn’t simple, but it’s also not as complicated as you might first think. Most examples and demonstrations of this pivot around the concept of Retrieval Augmented Generation (RAG), which again isn’t as complicated as it sounds. The idea is to enrich the prompts that you pass to your generative AI model by first querying a data source that is local to the application. This means there is no requirement to ‘train’ your model on private data; instead, you treat the model purely as an interpreter that takes a prompt containing useful datapoints and creates something that looks and sounds as if it were written by a human. I guess it’s a bit like a new apprentice joining your team: they are literate and intelligent but may lack deeper understanding of topics without guidance or supervision. One activity for an apprentice might be to summarise a lengthy document in exactly 500 words. Both the apprentice and a generative AI model can do this, but only one of them can perform the task on any topic, in multiple languages, in a matter of seconds.

Disclaimer: I am certainly not implying that generative AI models will replace apprenticeship schemes, but hopefully you get the point I am making…
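
Before diving into the tooling, here is a minimal sketch of that loop in Python. Everything in it (the function names, the hard-coded fact) is a hypothetical placeholder rather than any particular SDK; it just makes the retrieve-then-generate shape explicit.

```python
# A minimal sketch of the RAG loop: retrieve, build a prompt, generate.
# lookup_user_facts() and call_model() are hypothetical placeholders for
# your own data access layer and your chosen generative AI SDK.

def lookup_user_facts(user_id: str) -> list[str]:
    # Retrieval: query a data source that is local to the application.
    return ["Rob lives in Manchester."]

def build_prompt(facts: list[str], question: str) -> str:
    # Prompt engineering: enrich the question with the retrieved datapoints.
    context = "\n".join(facts)
    return f"{context}\n\nUsing only the facts above, answer: {question}"

def call_model(prompt: str) -> str:
    # Generation: hand the enriched prompt to a generative AI model.
    # Stubbed out here; plug in the SDK call of your choice.
    raise NotImplementedError

def answer_with_rag(user_id: str, question: str) -> str:
    facts = lookup_user_facts(user_id)
    return call_model(build_prompt(facts, question))
```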

Getting started

Microsoft’s Azure OpenAI Service provides some great functionality, and the playground in Azure OpenAI Studio makes experimentation really easy. I will use this to run through a few examples of RAG. Let’s start with a simple example:

Asking “Where does Rob live?” with no other information gets us nowhere; the model (in this case gpt-35-turbo) obviously doesn’t know where I live. In order to get a meaningful answer, we need to first provide some context for the model to interpret:
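
The same exchange can be reproduced outside the playground through the API. Here is a sketch using the openai Python package (the 0.x style of the SDK) configured for Azure OpenAI; the endpoint, key, API version and deployment name are placeholders you would swap for your own:

```python
import openai

# Point the openai package at an Azure OpenAI resource (placeholder values).
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = "<your-api-key>"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # the name of your model deployment
    messages=[
        # Context retrieved from our user store...
        {"role": "system", "content": "Rob lives in Manchester."},
        # ...and the question we actually want answered.
        {"role": "user", "content": "Where does Rob live?"},
    ],
)

print(response.choices[0].message.content)  # e.g. "Rob lives in Manchester."
```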

This is of course a very basic example, but one that demonstrates the idea of RAG. The retrieval aspect here concerns the querying of our user store to get Rob’s location (in this case, Manchester) and the model parses and interprets the prompt to generate the output.

Taking this concept, it’s easy to extend this functionality using the power of generative AI, for example by asking the model to do something more creative with the same retrieved data:
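
As an illustration (the prompt wording here is my own), only the messages need to change; the retrieved fact stays the same while the model does the creative work:

```python
messages = [
    {"role": "system", "content": "Rob lives in Manchester."},
    # Same retrieved datapoint, but now we ask for something generative.
    {"role": "user", "content": "Write a short, friendly welcome message for Rob "
                                "that mentions something interesting about his city."},
]
```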

Now that we understand the concept of RAG, let’s take a closer look at its constituent parts, which are:

  1. Retrieval
  2. Building your prompt (or Prompt Engineering)

For this post I’ll focus primarily on my experiences of Prompt Engineering, but before that I’ll cover some basics on Retrieval.

Retrieval

In my analogy above, retrieval is represented by the apprentice thoroughly reading and comprehending a lengthy document in order to craft their own summary.

In software engineering, retrieval concerns the querying of a data source to obtain something meaningful to enrich your prompt. This data source can be absolutely anything, ranging from a simple relational database to a fully-fledged vector store. Microsoft recommends leveraging its Azure Cognitive Search service to get the best results; it extends a traditional search index with functionality such as semantic search, entity extraction and sentiment analysis.
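
As a rough sketch of what that retrieval step could look like against a Cognitive Search index, using the azure-search-documents Python package (the endpoint, index name, key and the ‘content’ field are placeholders for your own setup):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder connection details for a Cognitive Search index of documents.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-query-key>"),
)

# Retrieve a handful of the most relevant documents to enrich the prompt with.
results = search_client.search(search_text="Where does Rob live?", top=3)
snippets = [doc["content"] for doc in results]  # assumes an index field named 'content'
```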

The key message here is that the complexity of the retrieval stage depends on the problem you are solving, and a simple query may be enough.

Prompt Engineering

Being a software engineer, I’m used to compilers, linters and code reviews. There are a lot of controls and guardrails around writing code; for example, you can’t fabricate keywords or operators, and you’ve got the internet at your fingertips when you get stuck. Tools like GitHub Copilot take this even further, providing valuable code snippets and guidance throughout the development experience and adding yet more rigour to programming.

Prompt Engineering, however, is certainly more of an art. It is the inverse of writing code: there are basically no rules. On the one hand, it’s amazing, because all of a sudden ‘everyone is a developer’ and the ‘programming’ language is plain English. On the other, there are so many subtleties in how a prompt is written that the same request can produce a perfect result or one that totally misses the mark. Accuracy is affected by tone, context, instructions about what to do and what not to do, grammatical quirks and the order of the requests, and even then there is the possibility that the model goes totally off-piste and provides responses that are entirely made up (commonly referred to as ‘hallucinations’).

OpenAI themselves have acknowledged this and published “Best practices for prompt engineering with the OpenAI API”, which contains some useful pointers. Microsoft’s Azure OpenAI SDK also helps with some of these recommendations by exposing the distinction between ‘System’ prompts and ‘User’ prompts, which lets you provide context (“Rob lives in Manchester.”) and questions (“Where does Rob live?”) without having to worry about the science of knitting these together yourself.
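
In practice, the ‘System’ prompt is a natural home for the retrieved context together with instructions about what the model should and shouldn’t do. Here is a sketch of how that split might look; the wording of the instructions is my own rather than anything prescribed by the SDK:

```python
messages = [
    {
        "role": "system",
        "content": (
            # Guardrail instructions plus the retrieved facts live in the system prompt.
            "You are an assistant for our website. Answer using only the facts "
            "provided below. If the answer is not in the facts, say you don't know "
            "rather than guessing.\n\n"
            "Facts:\n"
            "- Rob lives in Manchester."
        ),
    },
    # The end user's actual question goes in the user prompt.
    {"role": "user", "content": "Where does Rob live?"},
]
```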

Adding additional information into the prompt poses an obvious risk of overinflating the request payload. For example, if there are many potential answers to the question posed, a reliable pre-filtering mechanism would be required to first filter and retrieve only the relevant data (e.g. using Cognitive Search). The gpt-35-turbo model has a context limit of 4,096 tokens, which covers both the tokens in the prompt and the tokens that make up the answer the model provides (the max_tokens parameter controls how much of that budget the answer can use). Models with larger limits are available, but I suspect overloading the prompt would only dilute its accuracy, not to mention increase the time it takes to reach an answer.
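
One practical safeguard is to estimate the size of a prompt before sending it. The tiktoken package (the tokeniser used by OpenAI’s models) makes this easy; the 3,500-token threshold below is an arbitrary figure I’ve chosen to leave headroom for the answer, not an official limit:

```python
import tiktoken

# gpt-3.5-turbo (gpt-35-turbo on Azure) uses this tokeniser.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = "Rob lives in Manchester.\n\nWhere does Rob live?"
prompt_tokens = len(encoding.encode(prompt))

# Leave headroom within the 4,096-token context for the model's answer.
if prompt_tokens > 3500:
    raise ValueError(
        f"Prompt is too large ({prompt_tokens} tokens); filter the retrieved data first."
    )
```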

Then, there’s the content and structure of the output, or what the AI community refers to as steerability. Let’s imagine we want to provide some interactivity on our website, for example:

Here, the ChatGPT UI renders a clickable link. In our custom app we’d likely want to do something similar, but receiving a plain-text output from the model doesn’t give us the ability to do this.

The best approach to this appears to be providing more instructions in the prompt, for example by requesting the response in JSON format:
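
A sketch of how that can look in practice: the instruction to respond with JSON lives in the ‘System’ prompt, and the application then parses the reply like any other payload (the key names are ones I’ve invented for illustration):

```python
import json

messages = [
    {
        "role": "system",
        "content": (
            "Rob lives in Manchester.\n"
            "Respond only with valid JSON using the keys "
            '"answer" (a string) and "link" (a string or null). No other text.'
        ),
    },
    {"role": "user", "content": "Where does Rob live?"},
]

# The kind of content the model might return as plain text:
reply_text = '{"answer": "Rob lives in Manchester.", "link": null}'

reply = json.loads(reply_text)
print(reply["answer"])  # -> Rob lives in Manchester.
```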

Great, we’ve now got a structure to our data and can do whatever we want on our UI. Let’s go to the pub…or maybe not.

I don’t know about you, but this makes me uncomfortable. We’re in that grey area again of no rules, and a reliable contract between an API and a UI is critical, so allowing this to be determined at runtime does make me nervous. Furthermore, it’s almost impossible to test this as the responses from the model cannot be guaranteed.
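
One way to soften (though not eliminate) that risk is to treat the model’s reply as untrusted input and validate it before it ever reaches the UI. A minimal sketch of that kind of guard, using a hypothetical expected shape:

```python
import json

EXPECTED_KEYS = {"answer", "link"}  # the contract our hypothetical UI relies on

def parse_model_reply(reply_text: str) -> dict:
    # Treat the model's output as untrusted input: validate it, and
    # degrade gracefully to plain text if it doesn't honour the contract.
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        return {"answer": reply_text, "link": None}
    if not isinstance(reply, dict) or not EXPECTED_KEYS.issubset(reply.keys()):
        return {"answer": str(reply), "link": None}
    return reply
```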

Taking things further

There’s clearly a long way to go. This investigation leaves obvious gaps in the retrieval space, and we haven’t explored long-term memory or having interactions with a model that learns over time. That said, we have come quite a long way. We’ve understood the concept of RAG, explored the challenges of Prompt Engineering and some of the hurdles that developers will need to overcome concerning steerability.

There is obviously a lot more to uncover, and I hope to expand on these topics as I learn more about the tech, but for now, why not have some fun and start thinking about how you might enhance the user experience of your website with the power of generative AI?


Rob Tabiner
KPMG UK Engineering

Rob is a Principal Software Engineer at KPMG, specialising in designing and building highly-scalable and resilient progressive web applications.