Prompt Prototype Tests: how to quickly validate AI feature ideas.

Introducing a technique for product teams to quickly validate AI ideas early in the discovery process.

Julian Connor
Atlassian Product Craft Blog
10 min read · Jun 28, 2024


[Illustration: lightbulbs being fed into a grinder, with confetti coming out into a bag labelled “Prompt Prototype Test”]

One of the core roles of product development teams is product discovery. Discovery entails understanding and addressing an idea’s usability, feasibility, viability, and value risks. We do this to maximise the potential of what we build to deliver customer and business value — or, put another way, minimise the chance of building features that no one wants or uses.

The release of ChatGPT fired the starting gun on a new AI wave. CEOs and investors are clamouring to leverage this new paradigm, and product development teams are excited to push the limits of new technology. Despite the hype, there remains significant uncertainty around GenAI. Anyone who has played with the technology has experienced wonder at what it can do, juxtaposed with bafflement at its inability to complete a simple task. This uncertainty raises an interesting question for product development teams: how can you quickly address core product risks when developing a new AI idea?

This blog outlines an approach, Prompt Prototype Tests, which helps to tackle the value and feasibility risks inherent in a new GenAI idea.

Why prompt prototype?

The goal of a Prompt Prototype Test is to quickly validate whether an idea can be successfully solved with GenAI. It aims to be a lightweight process for testing ideas — one that also helps spur new ones. When successful, Prompt Prototype Tests also provide real and tangible examples of the problem and possible solutions. These are invaluable for discussing with colleagues as you plan how to take the idea further. Being able to demo something working is much more engaging than pitching an idea in a slide deck or on a page. Finally, it is quick: it should be possible to run a Prompt Prototype Test in as little as 30 minutes.

Approach

The basic approach for Prompt Prototype Testing is as follows:

The idea

As with any testing, you start with an idea. Product teams generally have ideas coming out of their ears — stored in backlogs, scribbled on whiteboards, and rattling around in the dusty corners of their minds. There are a few quick screener questions that can be helpful as you assess whether an idea might be suitable for an LLM.

  • 💡 Is it a problem worth solving? GenAI is exciting, and many of us want to get our feet wet. Given this, we must make sure we don’t get swept up in the hype and fall in love with AI as a solution without first doing the necessary product work of validating that the problem is worth solving in the first place.
  • 👩🏽‍💻 Can it be solved using traditional software development techniques? GenAI is expensive and unpredictable: if a problem can be solved with conventional methods, it is likely not worth the time and effort to use GenAI.
  • ⛓️‍💥 Can you afford to be wrong frequently? GenAI is probabilistic. This means it can generate a different output, even with the same input. In addition, it can go wildly off-piste (“hallucinate”) and make things up. This chaotic nature precludes it from some use cases: you wouldn’t want a rogue AI handling your prescriptions.
  • ⚖️ Can your business model support it? GenAI can be expensive, with requests potentially costing dollars at a time. If your product has thin margins, it’s worth asking whether it can support the additional cost of goods sold inherent in an LLM.
  • 🙌🏾 Can your business and customers support it? GenAI is a legal and ethical minefield. Concerns around data privacy, copyright, and ethics (among others) can form a red line for organisations and their customers. It is important to understand these lines before you start.

These questions are an excellent guide to help manage expectations and understand some of the other risks inherent in AI before spending too much time attempting to validate the idea with a Prompt Prototype Test. Assuming you are comfortable with the answers, you can proceed with your idea by crafting your initial prompt.

Craft the initial prompt

With your idea in hand, it’s time to craft your prompt. When creating a prompt, there are three key elements:

🫥 Establish the role of the LLM: All LLMs have a default role built-in — usually grounded in being a helpful agent. Overriding this with a more specific role can help anchor the AI’s output to a more useful place. If I were to build a feature that generates job descriptions (a “JD Generator”), I might ground the AI as a talent acquisition professional. Alternatively, suppose I were building a product that reviewed and critiqued business plans (a “Strategy Reviewer”); I might ground it around the role of a chief strategy officer or an analyst like Ben Thompson.

🧭 Explain the data and its structure: Ideally, GenAI features should leverage some data or unique context from your existing product. The context/data you feed the model is central to what it can do: this might be a block of text you want to summarise, some documentation you want it to understand, or a table of data you want it to interpret. For the JD Generator, I might have some dot points for the next role I want to hire. For the Strategy Reviewer, it might be a recent strategy page.

When providing the data to the AI, it is worth explaining the data and how it is structured. This gives the AI the best possible chance of interpreting it — and doing a good job with the task you are giving it.

⚠️ Explain the request being asked: Finally, you need to tell the AI what it is supposed to do. For the JD Generator, this might be “Generate a job description based on these inputs. Limit it to 500 words. Include sections ‘Requirements’ and ‘Benefits’.” Try to keep this as simple as possible to start with — you can always add more information and guardrails as you iterate.
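To make these three elements concrete, here is a minimal sketch of how they might be assembled for the JD Generator and sent to a model via the OpenAI Python SDK. The role text, dot points, and model name are illustrative assumptions rather than a recommendation, and pasting the same three blocks into a chat UI works just as well for a quick test.

```python
# A minimal sketch of the three prompt elements for the JD Generator example.
# The role text, dot points, and model name are illustrative assumptions.
from openai import OpenAI

ROLE = (
    "You are an experienced talent acquisition professional who writes "
    "clear, inclusive job descriptions."
)

DATA = """The data below is a set of dot points describing the role I want to hire:
- Senior backend engineer, payments team
- 5+ years of experience with distributed systems
- Hybrid, Sydney-based"""

REQUEST = (
    "Generate a job description based on these inputs. Limit it to 500 words. "
    "Include sections 'Requirements' and 'Benefits'."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model will do
    messages=[
        {"role": "system", "content": ROLE},
        {"role": "user", "content": f"{DATA}\n\n{REQUEST}"},
    ],
)
print(response.choices[0].message.content)
```

From here, the iterating step below is largely a matter of editing these three strings and re-running.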

Iterating on the prompt

After creating and running the prompt with the AI, it’s time to tinker. Some judgment is needed here: can you gauge what a good response would look like? To help, you can do the activity without AI (or find some previous examples), so you have something to compare against. Then, when the AI inevitably falls short on its first try, there are a few ways you can try to tweak it:

Tailoring the role: There are various ways to tailor the role of the LLM (a short sketch follows this list). These include:

  • The background: You can adjust the background you give the LLM. If I were building an Agile coach for software teams, I could be more specific about the knowledge I expect the role to have—such as familiarity with Agile concepts. I might give it beliefs—“You believe the Agile Manifesto is more of a set of guidelines than hard rules,” for example. If you find it too opinionated or rigid, you can pare this back to a more general concept.
  • How it communicates: It is worth considering how you want the LLM to communicate. What tone do you want it to adopt? Do you want it to avoid jargon? Do you want it to be concise or verbose? These can all be specified as part of the role.
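As a rough illustration of the two points above, here is how the hypothetical Agile coach role might be tailored. The background, beliefs, and communication style shown are assumptions you would tune as you iterate.

```python
# Illustrative only: iterating on the role for the hypothetical Agile coach.
BASE_ROLE = "You are a helpful Agile coach for software teams."

TAILORED_ROLE = (
    "You are an experienced Agile coach for software teams. "
    # The background: knowledge and beliefs the role is expected to have.
    "You are familiar with Scrum, Kanban, and the Agile Manifesto, and you "
    "believe the Agile Manifesto is more of a set of guidelines than hard rules. "
    # How it communicates: tone, jargon, and length can all be specified here.
    "You communicate in a supportive, plain-English tone, avoid jargon, "
    "and keep answers concise."
)
```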

Playing with the data: I adjusted the data in three key ways: adding data, removing data, and changing the context (a short sketch follows this list).

  • Adding or removing data: Consider the JD Generator. I might be able to improve this by providing additional data, such as an example JD I have written before or some guidelines from my organisation, such as the tone/voice to use. Alternatively, I could give it the headings I want included. These will both help ground the AI around the output I am expecting. If this additional context improves the output, you can double down. What difference does providing three previous JDs make? What about 10? These are great ways to iterate.
  • Changing the context: When sharing data, it is helpful to explain the data and how the LLM should approach it. For example, if I were building a work estimation tool for software teams, I might explain what an Epic is and how customers use epics. This context can help inform how important that data is in the overall request. It can also be helpful to tell the LLM which data is more/less important if there are a few datasets.
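As a sketch of what playing with the data can look like for the JD Generator, the snippet below adds previous JDs as examples and explains how the data is structured; the example content is invented purely for illustration.

```python
# Illustrative only: adding example data and explaining its structure for the JD Generator.
PREVIOUS_JDS = [
    "Job description: Senior Backend Engineer ...",  # a JD written previously (placeholder)
    "Job description: Product Designer ...",         # another previous JD (placeholder)
]

# Changing the context: explain what the data is and how the model should treat it.
CONTEXT = (
    "Below are previous job descriptions from my organisation, separated by '---'. "
    "They are examples of the tone of voice and section headings to match; "
    "the dot points after them describe the role actually being hired."
)

DATA = CONTEXT + "\n\n" + "\n---\n".join(PREVIOUS_JDS)
```

Trying one, three, or ten previous JDs is then a one-line change, which keeps this kind of iteration cheap.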

Tailoring the request: Where I spent the most time was tailoring the request I sent to the LLM. There were three ways I did this (a short sketch follows this list):

  • Tinkering with the request: Most of this was reworking how I phrased the request I sent to the LLM. Sometimes, the LLM would do what I wanted. Other times, no matter how much I tried, I couldn’t get it to respond well. Tinkering included adding more context or conditions to the request and breaking it into a series of requests to help “steer” the LLM.
  • Guardrails to improve relevance: I provided guardrails and guides to help the LLM do a better job. For example, if I knew specific terms (project names, team names, etc.) were in the data, I might tell the LLM to look for them. These guardrails were often necessary to obtain a consistently good result from the LLM. They also surface the data preparation the feature would need to succeed if productionised.
  • Getting the desired output format: To simplify your job, specify the format in which the LLM should provide its output. This might mean asking for less information (e.g., provide me with the project titles, not a summary of each one). Other times, it might mean describing a set of headings under which you want the data (return the data structured like ‘Title’, ‘Acronyms’, ‘15-word overview’).
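To illustrate the guardrail and output-format points, here is a hedged sketch of a tailored request; the naming patterns and fields are hypothetical.

```python
# Illustrative only: a request with guardrails and an explicit output format.
REQUEST = (
    "Review the projects in the data above. "
    # Guardrail: terms assumed to appear in the data (hypothetical patterns).
    "Project names look like 'PROJ-<number>' and team names look like 'Team <colour>'. "
    # Output format: ask for structure rather than free-form prose.
    "Return one line per project, structured as "
    "'Title' | 'Acronyms' | '15-word overview'. "
    "Do not include a full summary of each project."
)
```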

Judging how the LLM did

Below is a rough 5-option scale for judging the results of a test. To help make the judgement, you can rely on your own experience. It also helps to select real test cases where you already have an answer, so you can compare the LLM output against an exemplar.

😁 Great — Found things I could not see. Perfect application of an LLM, worth taking forward.

🙂 Good — Performed as well as I would myself. Good application, worth taking forward.

😕 Inconsistent/hallucination prone — Performed well but gave a vastly different response each time it was prompted. Consider rescoping the idea to remove uncertainty or focus on more promising ideas.

😢 Not Good — The answer was weak/generic/not correct. Don’t continue to pursue.

😱 Bad — Failed/unable to provide an answer to the question. Don’t continue to pursue.

In addition, when judging the output from the LLM, I took notes on potential improvements and enhancements. The scores were rough but enough to signal whether the question was worth exploring further. The notes also helped me think through what might be needed to improve the output — such as additional data I could not source myself.
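If it helps, the judgement and notes can be captured in a very lightweight log. The structure below is just one possible way to do it, not part of the approach itself.

```python
# Illustrative only: one possible way to record a Prompt Prototype Test result.
from dataclasses import dataclass, field

@dataclass
class PromptPrototypeResult:
    idea: str
    prompt: str
    score: str                                      # "great", "good", "inconsistent", "not good", or "bad"
    notes: list[str] = field(default_factory=list)  # improvements, missing data, etc.

result = PromptPrototypeResult(
    idea="JD Generator",
    prompt="<final prompt from the last iteration>",
    score="good",
    notes=["Needs 2-3 previous JDs as examples", "Add tone-of-voice guidelines"],
)
```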

Learnings

As product managers working with GenAI, we can do a lot to evaluate how suitable the model is for a given prompt. Some key learnings from my own explorations are:

To improve the prompt, keep asking why: LLMs are a black box, and the results are not always as useful, clear, and consistent as you would like. The best way to overcome this is to ask, “Why do you think the LLM gave the answer it did?” By generating a hypothesis, you can adjust something in the prompt and see if it performs differently. In this way, you can glean insight into how the black box works — and do a better job with your prototyping.

Multiple questions are better than single questions: I found the LLM performed better when the task was broken into more manageable sub-tasks. For example, rather than asking the LLM to judge whether a piece of work is ready (1 question), I instead asked it to create scoring criteria based on previous issues, apply these criteria to the issues I was interested in, and then rank the results (3 questions). The output was more accurate and useful as the LLM understood the context better. It also allowed me to optimise individual elements of the job rather than trying to optimise all at once.
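As a rough sketch of that three-question split: the helper below and the exact wording of each question are assumptions, but the key point is that each request carries the previous answers forward.

```python
# Illustrative only: breaking "is this work ready?" into three chained requests,
# so each step builds on the previous step's output.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model will do
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

QUESTIONS = [
    "Based on the previous issues above, create scoring criteria for judging "
    "whether a piece of work is ready to start.",
    "Apply those scoring criteria to the issues of interest (listed above) and score each one.",
    "Rank the scored issues from most to least ready, with a one-line reason for each.",
]

context = "<previous issues and issues of interest go here>"  # placeholder data
for question in QUESTIONS:
    answer = ask(f"{context}\n\n{question}")
    context += f"\n\n{answer}"  # carry each answer forward into the next request
print(context)
```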

A little data is good. More data is usually better — but not always. It is worth considering what you can add and remove when prompt prototyping. Knowing that a comparable result can be achieved with less data reduces the chance of hitting token limits and can improve the performance of the LLM.

Be open to new ideas as you prompt: When exploring a new space, I started with a set of questions I thought an LLM might help with. As I explored each, I often uncovered new ideas and opportunities I hadn’t yet considered. I found that running tests with the LLM forced me to think more critically about why a particular problem was a good fit for the LLM. Several of the questions I’d hypothesised the LLM might help with turned out to be dead ends. Others were runaway successes. In these early stages, be prepared to throw away your first ideas as better ones replace them.

Timebox: I set a 30-minute timebox for my Prompt Prototype Tests. This hard limit kept me moving and iterating on ideas. It was appropriate for early-stage exploration, where I was optimising for breadth and whittling many possibilities down to a few candidates. It also mitigated the risk of Parkinson’s Law: work expanding to fill the time allotted. If I can’t get a good signal in 30 minutes, an early idea is likely not a great fit for an LLM.

Wrapping up

I hope this blog gives you the confidence to start testing your ideas in a lightweight way using GenAI. The value of this activity is more than helping to mitigate the inherent risks of AI-powered features (even though it helps with that). By immersing yourself in this new technology and pushing yourself to explore how suitable it is for solving real problems, you will equip yourself with the skills to make better product decisions in an AI world. Given the transformative nature of AI, I encourage all members of product design teams to prioritise these explorations.
