How to Measure AI Product Performance the Right Way

Use KPIs, proxy metrics and OKRs to quantify the value of an AI product

Simon Schreiber
The Startup
6 min read · Jun 29, 2020


Like any digital product, an AI product’s success should be determined by its profit contribution. Business performance indicators such as operating cash flow (OCF) or the monthly recurring revenue (MRR) are suitable for measuring the product’s contribution to the company’s success.

But business indicators alone cannot quantify the accuracy of the underlying algorithm or the quality of the user's interaction with the AI product. This calls for operational KPIs and proxy metrics that enable data-driven decision making:

Operational KPI

Reminder: what is a key performance indicator? In short, a KPI is a quantifiable measure used to evaluate progress toward a specific business objective.

Introducing and iteratively developing AI products is capital-intensive. It is therefore important to measure product performance continuously and to make data-driven investment decisions in roadmap planning.

The goal is to determine the influence of AI capabilities on operational product KPIs such as cost per acquisition, new sessions, retention rate, conversion rate, or number of trial sign-ups. The hypothesis: the more pronounced the AI capabilities of a product, the greater their influence on operational KPIs:

For a product with high AI capability, it is legitimate to attribute KPI changes directly to it. For example, Google Home devices combined with the Google Assistant have a high AI capability. Hence the hypothesis: rising sales figures or numbers of active users correlate with the quality of the Google Assistant.

The other extreme: in an email client, an ML algorithm classifies new emails into “work”, “private”, “social”, and “marketing”. The AI capability of the product is low, so operational KPI changes cannot be attributed to AI features.

AI and Machine Learning Proxy Metrics

Regardless of the degree of AI capability, the operational KPIs mentioned above are not suitable for assessing the AI aspects of a product in isolation, let alone for improving the product iteratively in a build-measure-learn rhythm. The accuracy of a spam filter or a recommendation engine does not correlate directly with new sessions or the cost per acquisition, and where it does, isolating the effect takes great effort.

AI proxy metrics are better suited to answering questions about the accuracy of the algorithm or the contribution of AI to the user experience. Proxy metrics show only part of the big picture, but ideally correlate positively with the high-level KPIs. The following proxy metrics are suitable from a product perspective:

Objective Function

A common approach to measuring success is an objective function. I consider it a proxy metric because it does not necessarily have an impact on financial KPIs.

An objective function is a mathematical formula that needs to be minimized or maximized in order to solve a specific problem using AI. If the goal of the algorithm is to optimize a travel connection, the objective function must be minimized in order to achieve a short travel time. If the objective function quantifies accuracy, the goal is to maximize the function:
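The travel-connection example above can be sketched in a few lines of Python. The connection data and the `objective` helper are hypothetical, chosen purely for illustration:

```python
# Hypothetical travel connections; each leg's duration is made up.
connections = [
    {"name": "via A", "legs": [{"minutes": 30}, {"minutes": 45}]},
    {"name": "via B", "legs": [{"minutes": 25}, {"minutes": 35}]},
]

def objective(connection):
    """Objective function: total travel time in minutes (to be minimized)."""
    return sum(leg["minutes"] for leg in connection["legs"])

# Minimizing the objective function yields the fastest connection.
best = min(connections, key=objective)
print(best["name"], objective(best))  # via B 60
```

In a real routing algorithm the objective would of course be far richer (transfers, delays, cost), but the principle is the same: the function's value is the proxy metric being optimized.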

Classification Accuracy

A classification algorithm assigns a new data point to an existing category [2]. The question is how accurately the algorithm makes this classification. In the simplest case, it is a binary classifier with two target categories, for example the classification of emails into junk or relevant. Four results are possible:

- True positive: correct positive prediction.
- True negative: correct negative prediction.
- False positive: incorrect positive prediction.
- False negative: incorrect negative prediction.

The metrics True Positive Rate (TPR) and False Positive Rate (FPR) [3] are suitable for determining the classification accuracy:

TPR = TP / (TP + FN), where TP is the absolute number of true positive results and FN the absolute number of false negative results. Analogously, FPR = FP / (FP + TN).
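The two rates can be computed directly from the four confusion counts. A minimal sketch with made-up spam-filter numbers:

```python
def tpr(tp, fn):
    """True positive rate: share of actual positives correctly predicted."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """False positive rate: share of actual negatives wrongly flagged."""
    return fp / (fp + tn)

# Made-up confusion counts for a spam filter (positive = junk).
tp, fn, fp, tn = 90, 10, 5, 95
print(tpr(tp, fn))  # 0.9
print(fpr(fp, tn))  # 0.05
```

A TPR of 0.9 means 90% of junk mails are caught; an FPR of 0.05 means 5% of relevant mails are wrongly flagged as junk.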

Mean Absolute Error (MAE)

Mean Absolute Error is one of the simplest and most popular metrics for determining the accuracy of predictions from regressions.

The starting point is that an error is the absolute difference between the actual and the predicted value. MAE averages the absolute values of all errors and yields a result between 0 and infinity. Because all errors are weighted equally, MAE is best used in scenarios where large individual errors should not carry extra weight.

The mean absolute error (MAE) can be shown in a currency or other unit. For example, if the goal of an ML algorithm is to predict the development of property prices, the unit of the MAE is euros or dollars. One example of the result of the MAE calculation is: “The forecast of property price developments deviates by an average of EUR 50,000 from the actual value.”
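The property-price example translates into a short calculation; all prices below are made up for illustration:

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up property prices in EUR.
actual    = [300_000, 450_000, 520_000]
predicted = [340_000, 400_000, 560_000]
print(mean_absolute_error(actual, predicted))  # average deviation in EUR, ~43,333
```

Because the result stays in the original unit (euros here), it can be communicated directly to stakeholders.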

Root Mean Squared Error (RMSE)

Like the MAE, the RMSE is used to determine the accuracy of predictions. Because all errors are squared before they are averaged, the RMSE gives extra weight to larger errors. If the magnitude of the errors plays a role, RMSE is the better choice for determining the average model prediction error.
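A short sketch contrasting RMSE with MAE on the same made-up data shows how squaring emphasizes a single large error:

```python
import math

def rmse(actual, predicted):
    """Root of the mean squared error; squaring weights large errors more."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Two perfect predictions and one large outlier error:
actual    = [100, 100, 100]
predicted = [100, 100, 130]
print(rmse(actual, predicted))  # ~17.32, while the MAE is only 10
```

The single 30-unit miss pushes the RMSE well above the MAE, which is exactly the behavior you want when outliers matter.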

Sensibleness and Specificity Average (SSA)

SSA [4] is a metric developed by Google to quantify how natural a dialogue with a chatbot feels to people. The Sensibleness and Specificity Average (SSA) focuses on two central aspects of human communication: whether something makes sense (sensibleness) and whether it is specific (specificity).

How natural does a dialogue feel? // design.absurd

In the first step, the tester judges whether the chatbot’s response is reasonable in the context of the dialogue. In the second step, they judge whether the answer is context-specific. The SSA score is the average of these two results.
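The two-step averaging can be sketched as follows; the ratings are hypothetical human judgments, one `(sensible, specific)` pair per chatbot response:

```python
def ssa_score(ratings):
    """SSA = average of the sensibleness rate and the specificity rate.

    `ratings` holds one (sensible, specific) boolean pair per chatbot
    response, as judged by human evaluators.
    """
    n = len(ratings)
    sensibleness = sum(s for s, _ in ratings) / n
    specificity = sum(sp for _, sp in ratings) / n
    return (sensibleness + specificity) / 2

# Hypothetical judgments for four chatbot responses.
ratings = [(True, True), (True, False), (True, True), (False, False)]
print(ssa_score(ratings))  # 0.625
```

Here 3 of 4 responses were judged sensible (0.75) and 2 of 4 specific (0.5), giving an SSA of 0.625.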

AI and Machine Learning OKRs

Objectives and key results (OKRs) are ideal for measuring the success of an AI product. The aim of the OKR framework is to set ambitious goals; progress toward those goals is tracked with key results [5].

OKRs shouldn’t be top down // design.absurd

OKRs are so well suited because key results are created bottom-up: the operational team determines which metrics are best suited to continuously improving a product. In an AI/ML context, example OKRs are:


Objective

“To develop an AI product which generates predictions for our B2C users.”

Key result 1

“Generate over 10,000 B2C user sign-ups via the Android and iOS apps.”

Key result 2

“No more than 1% false positives on the ML prediction model.”

Key result 3

“Introduce continuous deployment to speed up ML model deployment.”

It is important to formulate ambitious goals (experience shows that 60–70% goal achievement is a good value). OKRs should always be transparent to the entire organization so that others gain insight into each team’s projects.


Measuring the success of an AI product is possible up to a point with the usual operational product KPIs such as sessions or number of sign-ups. To isolate the accuracy of, for example, a machine learning algorithm, proxy metrics are used. OKRs are ideal for higher-level progress measurement on the way to achieving goals.