Introducing Promptimizer – an Automated AI-Powered Prompt Optimization Framework
LLM practitioners often argue that prompt engineering is more of an art than a science: it takes gut feel, manual tweaking, and lots of practice to craft a prompt that conforms to your goals and expectations.
But what if… it didn’t have to?
Close your eyes, and imagine a world in which you could give an AI model a list of inputs and expected outputs, and it automatically generates the best possible prompt for your specific use-case.
Now open your eyes. You are now in this world.
The Automatic Prompt Optimizer
I created an open-source prompt optimization engine. Using genetic algorithms, the Promptimizer iteratively improves any arbitrary prompt towards the best performance it can find.
Practically, this means:
- Robust, accurate prompts: The automated approach searches for the best-performing prompts, so you can spend your effort on defining real use-cases and shaping how you want the model to respond
- Evaluation framework: Get concrete metrics on how your prompts are performing (in and out of sample)
- Unparalleled steerability: Control everything about how your AI responds, including the tone, response length, accuracy, or any quantifiable metric
In one example, the prompt’s accuracy increased from 70% to 84–85%, a dramatic improvement from a relatively small dataset (40 examples).
Here’s how you can achieve similar improvements with any prompt you can imagine.
How to use the Promptimizer
Step 1: Define Your Goal
When creating your system prompt, you need to understand the desired behavior of the LLM.
Most commonly, the goal is to create syntactically valid JSON that corresponds to the user input. For example, if I’m creating an LLM-Powered Stock Screener, I want to return a valid SQL query that I can run against my database.
However, the goal doesn’t always have to be returning data in a certain format. Perhaps you want the model to ask certain questions or get clarification before diving in.
For example, if I’m creating an LLM-powered legal assistant and the user asks a legal question, the first step is NOT to just start talking about how different jurisdictions have different laws and that they should consult a lawyer for legal advice…
The very first step is for the model to ask the user where they live. Then, a reasonable next step is for the model to fetch information about the laws in that jurisdiction.
Whatever you want your agent to do, it must have concrete, definable, and quantifiable goals and sub-goals. Then, you will create a list of system prompts that accomplish those goals.
Step 2: Create a list of approximately 5 (different) system prompts
This is the step that will undoubtedly take the most time, but if you’re familiar with LLM Applications, it shouldn’t take you more than an hour.
You must sit down and create a population of prompts. While it is possible to get an LLM to generate prompts for you, you will have much more success if you do the legwork of creating unique, distinct prompts that accomplish your goal.
Within this framework, a prompt is an object with the following 3 attributes:
- systemPrompt: Instructions that steer the model towards the desired behavior
- examples: A list of conversations that you’d want to have with the model
- model: The specific model you’re using, e.g. GPT-4o mini
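As a sketch, the three attributes above could be modeled like this in TypeScript (the type names and example values are illustrative, not the repo’s exact definitions):

```typescript
// Hypothetical shape of a prompt object; the fields mirror the three
// attributes described above, not necessarily the repo's exact types.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface Prompt {
  systemPrompt: string;      // instructions that steer the model
  examples: ChatMessage[][]; // list of example conversations
  model: string;             // e.g. "gpt-4o-mini"
}

const screenerPrompt: Prompt = {
  systemPrompt: "You translate natural-language stock screens into SQL.",
  examples: [
    [
      { role: "user", content: "Show me tech stocks under $50" },
      {
        role: "assistant",
        content: "SELECT * FROM stocks WHERE sector = 'Tech' AND price < 50;",
      },
    ],
  ],
  model: "gpt-4o-mini",
};
```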
Currently, only the system prompt and examples are changed during the optimization process. However, you can imagine a world in which the model that is used is also optimized automatically. Maybe Haiku is great at certain tasks and GPT-4o-mini is better at others!
After creating your list of prompts, get ready to define your model behavior.
Step 3: Create a list of inputs and populate their desired outputs
Similar to supervised learning, in order to steer the model towards the desired behavior, we need to know exactly how we want the model to respond to a wide range of inputs.
To do this, you will update the file input.ts. Add the filenames and inputs you want the model to understand; there’s already a concrete example populated in the repo.
Then, you will execute the script populateGroundTruth.ts, which allows you to create ground truths in a semi-automated way.
The script is likely more involved than you need. For example, it includes logic for querying a table in BigQuery and presenting the results, because my specific use-case required evaluating the outputs of queries. But again, this framework can be used to optimize any arbitrary prompt.
The more examples you include, the more accurate your results will be. But be careful: more examples also make the optimization more expensive, so you may have to get creative in balancing cost against accuracy.
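To make the ground-truth idea concrete, here is a hypothetical shape for the input/output pairs (the field names are my own; check input.ts in the repo for the real format):

```typescript
// Hypothetical shape of a ground-truth entry: a user input paired with
// the output we want the model to produce for it.
interface GroundTruth {
  input: string;          // what the user would say
  expectedOutput: string; // how we want the model to respond
}

const groundTruths: GroundTruth[] = [
  {
    input: "What biotech stocks have a market cap over $10B?",
    expectedOutput:
      "SELECT * FROM stocks WHERE sector = 'Biotech' AND market_cap > 10e9;",
  },
  {
    input: "Find dividend stocks yielding over 4%",
    expectedOutput:
      "SELECT * FROM stocks WHERE dividend_yield > 0.04;",
  },
];
```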
Step 4: Create a scoring heuristic for your model
Using some method (such as a large language model), you need to be able to quantify how close your output is to your desired output. You can do this using the LLM-based “Prompt Evaluator” within the repo.
The “Prompt Evaluator” takes the model’s output and the expected output and returns a score. While, in theory, scores can be unbounded, a good start is to score each answer on a scale from 0 to 1; ranges like 0 to 5 or -1 to 1 also work. As long as the scoring guide is consistent, the algorithm will optimize towards it.
This scoring mechanism gives our LLM a goal to strive towards: the closer the model’s output is to the desired output, the better the score it receives.
Just like in reinforcement learning, you can give the model positive reward for behaving like you want it to, and a punishment (or negative reward) for behaving how you don’t want it to.
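The repo’s “Prompt Evaluator” uses an LLM to grade semantic closeness. As a deterministic stand-in for illustration, here is a simple token-overlap scorer that maps onto the 0-to-1 scale described above (the function name and heuristic are my own, not the repo’s):

```typescript
// Deterministic stand-in for the LLM-based "Prompt Evaluator":
// returns a score in [0, 1] based on how many of the expected output's
// tokens appear in the actual output.
function scoreOutput(actual: string, expected: string): number {
  const tokenize = (s: string): Set<string> =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const actualTokens = tokenize(actual);
  const expectedTokens = tokenize(expected);
  if (expectedTokens.size === 0) return actualTokens.size === 0 ? 1 : 0;
  let overlap = 0;
  for (const t of expectedTokens) if (actualTokens.has(t)) overlap++;
  return overlap / expectedTokens.size; // fraction of expected tokens recovered
}
```

A real evaluator should judge meaning rather than word overlap, but any function with this signature can plug into the same scoring slot.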
After we have our scoring system, our final step is to use AI to improve the behavior of our prompt over time.
Step 5: Use AI to improve your prompt towards your goals
Using the genetic optimization algorithm in main.ts, optimize your prompts to bring them closer to the goal state.
I’m biased towards genetic algorithms and chose one as the optimization algorithm for a number of reasons. For one, I graduated from Cornell with a degree in biology… I like to think that my degree wasn’t a complete waste of money!
More importantly, the algorithm quite simply works well for nearly any problem: it generates a population of viable candidate solutions and improves them generation after generation.
Here are the 5 phases to genetic algorithms:
- Initialization: An initial population is generated
- Selection: Individuals in the population are “selected” to reproduce, with more fit individuals being more likely to be selected
- Crossover (or recombination): We combine the “genes” of our parents to create new offspring (or solutions)
- Mutation: We introduce random changes to our offspring that can have positive or detrimental effects on their fitness
- Evaluation: We then calculate the fitness of our offspring using the AI Stock Screener Prompt Evaluator or other quantifiable methods (like the length of the string)
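To make the five phases concrete, here is a minimal, self-contained genetic algorithm sketch in TypeScript. It evolves a string towards a fixed target instead of evolving a prompt against an LLM-based scorer, but the phases map one-to-one (all names and parameters are illustrative):

```typescript
// Toy GA: evolve a random string towards TARGET. Swap `fitness` for a
// prompt scorer and individuals for prompts to get the real setup.
const TARGET = "prompt";
const CHARS = "abcdefghijklmnopqrstuvwxyz ";

const fitness = (s: string): number => {
  let matches = 0;
  for (let i = 0; i < TARGET.length; i++) if (s[i] === TARGET[i]) matches++;
  return matches / TARGET.length; // 1.0 means a perfect match
};

const randomChar = () => CHARS[Math.floor(Math.random() * CHARS.length)];
const randomIndividual = () =>
  Array.from({ length: TARGET.length }, randomChar).join("");

// 1. Initialization: generate an initial population
let population = Array.from({ length: 50 }, randomIndividual);

for (let gen = 0; gen < 200 && fitness(population[0]) < 1; gen++) {
  // 5. Evaluation: rank the population by fitness (fittest first)
  population.sort((a, b) => fitness(b) - fitness(a));
  // 2. Selection: keep the top half as parents
  const parents = population.slice(0, 25);
  const offspring: string[] = [];
  while (offspring.length < 50) {
    const mom = parents[Math.floor(Math.random() * parents.length)];
    const dad = parents[Math.floor(Math.random() * parents.length)];
    // 3. Crossover: splice the two parents at a random point
    const cut = Math.floor(Math.random() * TARGET.length);
    let child = mom.slice(0, cut) + dad.slice(cut);
    // 4. Mutation: occasionally flip one character
    if (Math.random() < 0.2) {
      const i = Math.floor(Math.random() * TARGET.length);
      child = child.slice(0, i) + randomChar() + child.slice(i + 1);
    }
    offspring.push(child);
  }
  population = offspring;
}

population.sort((a, b) => fitness(b) - fitness(a));
console.log(population[0]); // prints the best individual found (often "prompt")
```

In the real framework, the expensive part is evaluation: each fitness call means LLM API requests, which is exactly why optimization cost grows with the number of ground-truth examples.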
While this article won’t go into how each step is implemented, you can check out my past article or browse the GitHub repo to see how it works.
The Promptimizer automatically handles most of the advanced data science work for you. For example, it splits the ground truths into a training set and a validation set, so we can measure how well our prompts generalize to unseen data.
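The train/validation split can be sketched as follows (assuming ground truths live in an array, as earlier; the repo’s exact split logic may differ):

```typescript
// Simple holdout split: shuffle, then cut at trainFraction.
function trainValidationSplit<T>(data: T[], trainFraction = 0.8): [T[], T[]] {
  const shuffled = [...data];
  // Fisher-Yates shuffle so the split is unbiased
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * trainFraction);
  return [shuffled.slice(0, cut), shuffled.slice(cut)];
}

const items = Array.from({ length: 40 }, (_, i) => i);
const [train, validation] = trainValidationSplit(items);
// The optimizer scores prompts on `train`; `validation` estimates how
// well they generalize to inputs the optimizer never saw.
```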
The end result of the optimization process is several prompts, each measurably better than the original.
Step 6: Graph the change in your prompt’s performance
When you’re done, you will likely be curious how much (if at all) your prompt improved over time. Did the prompt actually improve? Or did you waste $80 for nothing?
The repo contains utilities for helping you determine this.
First, there is a function within main.ts that formats the data into a JSON file.
Then, there is a Python script (graph.py) that generates graphs so you can see how the performance of your prompt changed over time.
Concluding Thoughts
We are leveraging the strength of different AI algorithms.
Large Language Models are great at generating text, specifically text that conforms to a certain specification. In contrast, old-school genetic algorithms are great at optimizing pretty much anything, because they don’t require gradient information the way neural networks do.
The combination of the two is extremely powerful. It creates a robust framework for optimizing any prompt, eliminating the need for tedious prompt engineering!
However, please be cautious when using this framework. Due to the number of API calls to OpenAI, the optimization process is surprisingly expensive. It absolutely saves you time and will improve the accuracy of your prompt, but it will cost you a pretty penny, even with relatively small sample sizes.
Overall, I’m happy to release my technique into the wild and let others look at it, copy it, and contribute to a world where manual prompt engineering is a thing of the past.
Contributions to the repo are welcome!
Thank you for reading! If you’re intrigued by the potential of AI in finance and want to see the results of optimized prompts, I invite you to explore NexusTrade, where this optimized AI Stock Screener is just one of many innovative features.
Follow me: LinkedIn | X (Twitter) | TikTok | Instagram | Newsletter