We’ve got the picks and shovels — so what’s stopping the gold miners?

It’s becoming increasingly clear that there’s a massive chasm between the amount being invested in AI infrastructure and foundation models versus the amount of end-customer value being created at the application layer.

Sivesh Sukumar
Balderton
6 min read · Nov 21, 2023

--

We try out dozens of new AI tools every month, but I can count on one hand the number I’m actually a weekly active user (WAU) of. After talking to a range of compute providers, we estimated that ~90% of GPU usage for LLMs today goes to training rather than inference, which isn’t going to be sustainable (assuming the trend towards consumption-based models continues).

We’re still incredibly early, but this post explores why so few products are making it into production and what could help catalyse a Cambrian explosion in AI products we’ll all actually use!

Prompt engineering is weird

There are many hurdles in building LLM-powered products, such as context window limits, unstable APIs and legal and privacy concerns, but one issue which doesn’t seem to be getting enough attention is prompt engineering. We recently co-hosted a hackathon with Anthropic where over 200 of London’s best engineers hacked away over the weekend to build LLM-powered products. Regardless of how technical their projects were, pretty much every single team I spoke to said prompt engineering was their hardest challenge.

It’s hard to get LLMs to produce desired outputs, but it’s even harder to get them to do so repeatedly, especially when you’re using LLMs to solve tasks (rather than just generate content) and require a constrained output. One team spent hours trying to get Claude to produce a binary output consistently. There are many techniques out there to help, e.g. few-shot prompting, chain-of-thought and self-prompting. Anthropic even published a 100-page deck on prompt engineering techniques, but it still feels like the Wild West, especially when you factor in that what works for one model probably won’t work for another.
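To make the binary-output struggle concrete, here is a minimal, model-agnostic sketch of two techniques the teams reached for: few-shot examples to pin the output format, plus a parse-and-retry guard so a stray word from the model doesn’t break downstream code. `call_llm` is a hypothetical stand-in for whichever provider API you use, and the prompt text is illustrative, not a recommended recipe.

```python
# Few-shot prompt that demonstrates the exact output format we want.
FEW_SHOT = """Classify whether the ticket describes a billing issue.
Answer with exactly one word: YES or NO.

Ticket: "I was charged twice this month."
Answer: YES

Ticket: "The app crashes when I open settings."
Answer: NO

Ticket: "{ticket}"
Answer:"""


def classify(ticket: str, call_llm, max_retries: int = 3) -> bool:
    """Return True for YES, False for NO; retry on malformed output."""
    prompt = FEW_SHOT.format(ticket=ticket)
    for _ in range(max_retries):
        # Normalise whitespace/case so "yes." or " NO\n" still parse.
        raw = call_llm(prompt).strip().upper()
        if raw.startswith("YES"):
            return True
        if raw.startswith("NO"):
            return False
    raise ValueError("model never produced a valid YES/NO answer")
```

Even this small guard removes the worst failure mode: instead of a free-text answer leaking downstream, malformed outputs are retried and eventually surfaced as an explicit error.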

Prompt Engineering Techniques by Anthropic

Prompt engineering is important and here to stay

It’s becoming clear just how valuable prompt engineering can be today. We’ve seen time and time again that a good prompt is the difference between a “cool demo” and something customers are willing to pay for. OneShot is a great example: they’re able to shine amongst competitors in the noisy AI-SDR space by spending the time to optimise prompts for each customer, rather than trying to implement something that works for everyone. The alternative is to let users alter prompts themselves, but it’s clear that the average person isn’t very good at prompt engineering and this isn’t going to change anytime soon.

Our portfolio company, Writer, has seamlessly embedded prompt engineering procedures into their sales motion to ensure customers can see repeatable value from the first demo — unlocking explosive growth in the enterprise segment. We’ve seen multiple companies grow to many millions of B2B ARR with nothing more than a few very good prompts as their differentiation (this isn’t sustainable but it could definitely make for a great wedge!).

Most people have underestimated just how hard prompt engineering is: six months ago, the consensus was that prompt engineering would become less important as models improved. However, a post from François Chollet, a leading AI researcher at Google and the man behind Keras, has shifted my thinking. If we treat LLMs as statistical computers and prompts as programs, we need to remember that AI can’t read our minds (yet): we have to provide enough information to the model (via a prompt) to achieve our goals. So as AI gets more powerful and the scope of problems it can solve increases, we’ll need to provide more and more information to the model, i.e. prompt engineering is only going to become harder.

LLM OS by Andrej Karpathy

Gap in tooling

There’s a wealth of tooling to help people get started (e.g. LangChain, Haystack, Weaviate). Some frameworks, like LangChain, try to abstract prompting away, leading to mediocre responses and difficulty making tweaks. There are also plenty of awesome tools out there to help optimise an application that already works, whether it be analytics platforms to understand what users are doing (Context.ai), defence mechanisms against prompt injection attacks (Lakera) or platforms to help scale compute when needed (Banana). These tools are great for scaling from 1 to 100 but don’t really help with the 0 to 1 problems.

We’ve noticed a gap in tooling for the “experimentation” phase needed to shift an application from demo to production. Most of the engineers at the hackathon spent hours with a blank box and a notepad iterating and testing prompts; some set up spreadsheets to help compare and evaluate prompts and models. A recent report found that 35.2% of engineers are manually evaluating prompts and 23.4% are not tracking anything. There’s a lot that automated AI evals can do to help people iterate prompts quickly, but very few engineers are leveraging this today.
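The spreadsheet workflow above can be collapsed into a few lines of code. Here is a hedged sketch of a minimal automated eval: run each prompt variant over a small labelled test set and rank by exact-match accuracy. `call_llm` is again a hypothetical stand-in, and a real harness would also want fuzzier scoring plus cost and latency tracking.

```python
def evaluate(prompts, cases, call_llm):
    """Score each prompt template by exact-match accuracy on (input, expected) cases.

    prompts: dict of {variant_name: template with an {input} placeholder}
    cases:   list of (input_text, expected_output) pairs
    Returns variants ranked best-first as (name, accuracy) tuples.
    """
    scores = {}
    for name, template in prompts.items():
        correct = sum(
            call_llm(template.format(input=x)).strip() == expected
            for x, expected in cases
        )
        scores[name] = correct / len(cases)
    # Highest-scoring prompt variant first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Even something this simple beats eyeballing outputs in a blank box: every prompt tweak gets an immediate, comparable score across the whole test set.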

A request for startups: DBT for Prompts

dbt became the go-to tool for data engineering and the heart of the modern data stack and the Analytics Engineering community. What started as an open-source side project is now a $4.2bn company. It brought analytics engineering out of the codebase and created a clear distinction between software engineers and analytics engineers. There’s a paradigm shift towards AI Engineering, and today it’s unclear whether AI engineers are data scientists, ML engineers or product managers. If a company can streamline prompt engineering today, helping teams iterate and evaluate prompts to ensure they’re building products of sufficient value, we believe it can become the heart of the AI Engineering ecosystem going forward.

It’s exciting to see a range of emerging projects in this space. Last week xAI released PromptIDE, which is a huge step in the right direction but is, for obvious reasons, tightly tied into the Grok ecosystem. LangChain is also thinking about this issue, having released LangChain Hub to share and discover high-quality prompts. Wordware, LMQL, Playfetch, Humanloop, BrainTrust and Composable Prompts are some of the companies attacking this space.

Screenshot of PromptIDE by xAI

Looking ahead

We’re optimistic that if we can give developers the right tooling to keep up with the pace of research and iterate on prompts, we’ll all be leveraging LLMs and extracting maximum value from these incredible models.

Today, the hardest problem for AI tooling companies is a lack of TAM due to the small number of companies actually pushing anything to production. A “DBT for prompts” solves for this, as it can sell to the masses of engineers in the experimentation phase. It’s unclear what a winner will look like, but some obvious features are version control, A/B testing and AI-eval construction. At OpenAI’s recent DevDay, it was made clear that the LLM application development cycle will always begin with prompt engineering, making this idea a great wedge for a broader LLM development platform incorporating RAG and fine-tuning as well.
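To make one of those features concrete, here is a sketch of deterministic A/B assignment for versioned prompts: hash a stable user id so each user always sees the same variant, keeping results comparable across sessions. The variant names and templates are illustrative assumptions, not any particular product’s API.

```python
import hashlib

# Illustrative versioned prompt variants under test.
VARIANTS = {
    "v1": "Summarise this email:\n{text}",
    "v2": "Summarise this email in one sentence:\n{text}",
}


def assign_variant(user_id: str, variants=VARIANTS) -> str:
    """Map a user id to a prompt variant, stable across sessions.

    Hashing (rather than random choice) means the same user always
    lands in the same bucket, so engagement metrics per variant
    aren't polluted by users flip-flopping between prompts.
    """
    keys = sorted(variants)  # deterministic bucket order
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(keys)
    return keys[bucket]
```

Pairing this with the version-controlled prompt store and an eval harness gives the skeleton of the experimentation loop described above.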

In the past 12 months we’ve backed many AI-native companies including PhotoRoom, Supernormal, Deepset and Writer. If you’re building in the space we’d love to speak with you — feel free to reach out to us at jwise@balderton.com, ssukumar@balderton.com
