Navigating LLM Application Testing with Orangepro.AI

Aamir Siddiqui
Dec 10, 2023


Building AI applications is complex and nuanced. While it’s relatively simple to put together an impressive demo, the real challenge lies in developing an AI application that reliably serves real users in production. Traditional software development has established practices, such as continuous integration/continuous deployment (CI/CD) and automated tests, that keep software robust and make future changes easier. These practices do not translate directly to Large Language Model (LLM) applications, where responses are free-form text that can vary from run to run and the path to effective tests and processes is less clear.

The AI community is still working out the best methods for evaluating AI applications. Once a solid evaluation framework is in place, its value becomes evident: it significantly reduces the time and effort needed to test changes, resolve bugs, and decide whether an app is ready for launch.

Let’s walk through a typical AI app development trajectory and see where evaluations fit in:

1. Initial Prototype Development
Your first step is writing a simple TypeScript program that gathers your product documentation and loads it into a vector database using OpenAI’s Assistants API. Next, you create a basic backend endpoint that handles POST requests by calling an OpenAI generation function.

2. Initial Testing Phase
You start by testing the endpoint with a range of sample queries like “Where is the Q2 2022 sales report?” or “Show me the latest customer survey report.” Some of the responses are accurate and some are not.
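
A minimal sketch of this kind of ad hoc check, written in Python for brevity. It assumes a hypothetical `ask` callable that wraps the POST request to your endpoint. Here it is replaced by a stub (`stub_ask`, with made-up answers) so the snippet runs on its own; none of these names come from Orangepro or OpenAI.

```python
from typing import Callable

# Sample queries taken from the step above.
SAMPLE_QUERIES = [
    "Where is the Q2 2022 sales report?",
    "Show me the latest customer survey report.",
]


def run_smoke_test(queries: list[str], ask: Callable[[str], str]) -> list[dict]:
    """Send each query to the app and collect the responses for manual review."""
    results = []
    for query in queries:
        try:
            response = ask(query)
            error = None
        except Exception as exc:  # keep going so one failure doesn't hide the rest
            response = None
            error = str(exc)
        results.append({"query": query, "response": response, "error": error})
    return results


def stub_ask(query: str) -> str:
    """Hypothetical stand-in for the real endpoint call (an assumption, not a real API)."""
    if "sales report" in query:
        return "The Q2 2022 sales report is in the Finance shared drive."
    raise RuntimeError("no matching document found")


if __name__ == "__main__":
    for row in run_smoke_test(SAMPLE_QUERIES, stub_ask):
        print(row)
```

Passing the endpoint call in as a function keeps the loop independent of any particular HTTP client, so the same script can later be pointed at the real backend.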

3. Involving Colleagues for Feedback
After developing a rudimentary frontend, you invite your colleagues to try it out. They provide feedback, requesting additional features and pointing out issues like inaccurate company-related answers or excessively long responses.

4. Enhancing and Debugging
You delve into the task of incorporating new features and rectifying issues, adjusting the prompts and trying different data embedding methods. Each alteration requires you to manually rerun the app with various inputs to evaluate the changes.

5. Realizing the Need for Systematic Evaluations
The repetitive nature of manual testing and the need for consistent quality checks lead you to the realization that a systematic evaluation process is essential. You decide to automate this process using an evaluation script, comparing the app’s responses to a predetermined set of correct answers.
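
Such an evaluation script might look like the sketch below. It is written in Python and is not Orangepro’s API: the test cases, the `stub_app` function, the token-overlap F1 score, and the 0.5 pass threshold are all illustrative assumptions. In practice you would swap `stub_app` for a call to your real endpoint and pick a scoring function suited to your answers.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical test cases: each pairs an input with a predetermined correct answer.
TEST_CASES = [
    {"input": "Where is the Q2 2022 sales report?", "expected": "Finance shared drive"},
    {"input": "Show me the latest customer survey report.", "expected": "Customer survey 2023"},
]


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens so formatting differences don't matter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(response: str, expected: str) -> float:
    """Token-overlap F1 between the response and the expected answer (0.0 to 1.0)."""
    resp, exp = tokenize(response), tokenize(expected)
    if not resp or not exp:
        return float(resp == exp)  # both empty counts as a match
    overlap = sum((Counter(resp) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def evaluate(app: Callable[[str], str], test_cases: list[dict], threshold: float = 0.5) -> dict:
    """Run every test case through the app, score it, and summarize the pass rate."""
    rows = []
    for case in test_cases:
        response = app(case["input"])
        score = token_f1(response, case["expected"])
        rows.append({
            "input": case["input"],
            "response": response,
            "score": score,
            "passed": score >= threshold,
        })
    pass_rate = sum(row["passed"] for row in rows) / len(rows) if rows else 0.0
    return {"pass_rate": pass_rate, "rows": rows}


def stub_app(query: str) -> str:
    """Hypothetical stand-in for the real app; replace with a call to your endpoint."""
    if "sales report" in query:
        return "It is in the Finance shared drive."
    return "I don't know."


if __name__ == "__main__":
    report = evaluate(stub_app, TEST_CASES)
    print(f"pass rate: {report['pass_rate']:.0%}")
    for row in report["rows"]:
        print(row)
```

Rerunning this script after each prompt or embedding change gives a single pass rate to compare, instead of a round of manual spot checks.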

This is where Orangepro steps in, streamlining the entire process. Orangepro offers libraries in both TypeScript and Python for running evaluations and applying prebuilt scoring functions, along with a user-friendly web interface for analyzing and inspecting results. Orangepro also helps manage and log test cases, so setting up your evaluation workflow takes under ten minutes. This shift allows you to focus on the more creative aspects of AI app development.

6. Product Launch 🚀
With the new features refined and their performance validated through Orangepro, your team gains the confidence to release the new AI feature to your user base. Post-launch, you continue to rapidly incorporate user feedback into further enhancements.

In Summary

This story represents a typical trajectory in AI app development. Setting up an efficient evaluation system is key to conserving team resources. Orangepro provides a comprehensive suite for testing LLM apps, including evaluation tools, datasets, tracing capabilities, prompt playgrounds, and more, making for a smoother development journey.
