Feedzai Techblog

Welcome to Feedzai Techblog, a compilation of tales on how we fight villainous villains through data science, AI and engineering.

Benchmarking LLMs in Real-World Applications: Pitfalls and Surprises

By Jean V. Alves and Ferran Pla Fernández

13 min read · Nov 25, 2025


Moving beyond binary classification provides novel insights.

In the real world, scams rarely present themselves in black and white. Fraudsters exploit nuance, impersonate legitimate brands, and mask malicious intent with seemingly ordinary behavior. In response to this growing challenge, Feedzai has launched ScamAlert (patent pending), a Generative AI-based system that innovates on the current paradigm of scam prevention.

Traditional detection systems treat the problem as a binary choice: scam or not a scam, often outputting an estimated “scam likelihood” measure. This value, even if accurate, doesn’t tell users why something is risky or what they should watch out for, leaving them with little guidance on how to stay safe.

[Image: A potential scam SMS]

The binary approach can often suffer from a lack of context. While a text message may look suspicious in a vacuum (e.g., a payment request via a less secure method), the user may have other reasons to believe in its legitimacy, such as a past history of such requests. Consequently, an incorrect risk estimate based on missing context may lead users to distrust the system’s abilities.

[Image: A traditional binary classification system outputs only a risk estimate]

ScamAlert, on the other hand, makes judgments on what it knows. Users submit a screenshot of the suspected scam, and ScamAlert identifies observable red flags: patterns or behaviors often associated with fraud, such as suspicious links or spelling errors. This approach empowers the user with interpretable insights into the detected risk signals, instead of a vague numeric value. It places the user in the driver’s seat by presenting them with the facts and enhancing their awareness and judgment.

To fully understand a system’s ability to perform this task, we pair this labeling approach with a rigorous evaluation and benchmarking protocol. We test the consistency of model outputs for the same input, the model’s ability to justify its decisions, and its overall performance, embodying the Robust, Transparent, and Tested pillars of Feedzai’s TRUST Framework.

[Image: By identifying red flags, the user is given insights to aid in their judgment]

ScamAlert is a good example of an application of GenAI for a purpose other than increasing productivity, focusing on improving fraud prevention. We will show how a systematic evaluation and benchmarking framework is crucial to faithfully measure the performance of the full system, mitigate known GenAI limitations (such as hallucinations) and increase the trust in AI-powered workflows.

In this post, we’ll walk you through:

  • Why binary scam detection fails in ambiguous or unverifiable cases.
  • How a red flag-based multi-label classification system works.
  • How we create a thorough benchmark to reliably evaluate ScamAlert.
  • How we evaluate ScamAlert and multimodal models using red flag recall, precision, and instruction adherence.
  • Practical trade-offs between accuracy, cost, and latency in scalable deployments.

Let’s explore how a more nuanced, interpretable system can raise the bar for scam prevention.

Red Flags

Rather than determining the likelihood of a message or an online listing being a scam, ScamAlert focuses on identifying which specific red flags are present. This shift leads to more interpretable outputs, helping users understand why something appears suspicious instead of collapsing everything into a binary decision or a score. It also introduces a more flexible framework, giving domain experts control over which red flags ScamAlert detects.

This structure also transforms how we evaluate ScamAlert’s accuracy. By breaking predictions down into individual red flags, we gain visibility into which behavioral patterns (e.g., urgency, impersonation, or financial requests) are being consistently recognized, and which ones are being missed. This kind of transparency enables more focused analysis, helping teams identify specific areas for improvement.

Just as importantly, it supports ongoing auditing of ScamAlert as fraud tactics shift. As new scam patterns emerge, analysts can evaluate whether ScamAlert’s current red flag coverage is sufficient, and introduce new flags as needed. This modular evaluation strategy ensures ScamAlert remains aligned with a constantly evolving threat landscape, without requiring a full reset every time scammers change tactics.

To show how this works in practice, let’s look at a fictitious example, used for evaluation purposes, that contains a few red flags.

In this case, four key behaviors often linked to scam attempts may be identified:

  • Unusual Channel: The sender claims to be a colleague but reaches out via text message. This is unusual, as formal internal communication typically happens through company email or approved platforms.
  • New Phone Excuse: The unfamiliar phone number is explained with the excuse of using a “new” phone. This is a common tactic used to lower suspicion and discourage questions, masking the fact that an attacker’s device is being used.
  • External Transfer: The sender asks for a significant money transfer to an external bank account. This is a strong indicator of a potential scam.
  • Heightened Urgency: The message stresses urgency, framing the request as time-sensitive and business-critical. Scammers often rely on this pressure to get quick, unquestioned responses.

To formalize these observations, we could codify the identified behaviors using the following labels: Unusual Communication Methods, Unknown Sender, Requests for Money, and Urgency Tactics and Pressure. Other common indicators, not used in this example, include Suspicious URL Shortening, frequently used in phishing attacks to obscure malicious domains, and Suspicious Attachments, which may contain malware capable of compromising the recipient’s device.
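
To make this concrete, here is a minimal sketch (in Python) of how such a taxonomy could be codified as a fixed label set. The flag names are the ones listed above; the enum and helper names are illustrative, not ScamAlert’s actual implementation.

```python
# Minimal sketch: codifying a red flag taxonomy as a fixed label set.
# The flag names come from the examples above; everything else is illustrative.
from enum import Enum


class RedFlag(str, Enum):
    UNUSUAL_COMMUNICATION_METHODS = "Unusual Communication Methods"
    UNKNOWN_SENDER = "Unknown Sender"
    REQUESTS_FOR_MONEY = "Requests for Money"
    URGENCY_TACTICS_AND_PRESSURE = "Urgency Tactics and Pressure"
    SUSPICIOUS_URL_SHORTENING = "Suspicious URL Shortening"
    SUSPICIOUS_ATTACHMENTS = "Suspicious Attachments"


# The set of allowed labels defines the label space for the multi-label task.
ALLOWED_FLAGS = {flag.value for flag in RedFlag}
```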

Evaluating System-level Performance

To systematically evaluate the performance of ScamAlert as a versatile system, we must test it under a wide variety of possible scam patterns and mediums, such as emails, listings, phishing websites, etc. Furthermore, since ScamAlert can be powered by different multimodal models, we also need a way to evaluate how the system performs under different models, and whether these meet our needs and constraints in terms of cost and latency.

Therefore, designing a comprehensive benchmark is crucial: one that includes a wide and diverse set of red flags reflecting real-world variation. With a standardized benchmark in place, we can evaluate how the system’s performance varies when changing not only the underlying AI model, but also its parameters, and other elements within the system.

A reliable benchmark also allows us to keep up with the rapid pace of new multimodal model releases. Maintaining a well-curated dataset allows us to quickly determine whether a new model can improve red flag detection performance, making it possible to iterate quickly and ensure ScamAlert evolves alongside the ever-expanding field of GenAI, while mitigating known limitations like hallucinations.

To tackle this challenge, we developed a benchmark dataset composed of a large and diverse collection of screenshots, including product listings, emails, and text message conversations. These span multiple languages and include attacks such as images unrelated to the use case. Each example in the dataset was manually annotated to indicate the presence or absence of specific scam-related red flags. As we created these annotations, we constructed a curated taxonomy of red flags, which together define the label space for our multi-label classification task.

[Image: Construction of the benchmarking dataset]
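
As an illustration of what one annotated example might look like, the sketch below pairs a screenshot with its manually assigned ground-truth flags; the field names and values are hypothetical, not the actual schema of our benchmark.

```python
# Hypothetical shape of a single annotated benchmark example (not the real schema).
example = {
    "screenshot_path": "examples/sms_colleague_transfer.png",  # illustrative path
    "medium": "text_message",  # could also be an email, product listing, website, etc.
    "language": "en",
    "ground_truth_flags": [
        "Unusual Communication Methods",
        "Unknown Sender",
        "Requests for Money",
        "Urgency Tactics and Pressure",
    ],
}
```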

Identifying Red Flags with ScamAlert

Using ScamAlert involves inputting a screenshot into the system, which has been set up to identify a wide variety of flags. Armed with expert-level knowledge on current scam tactics and how to identify them, ScamAlert begins by verifying the submitted user screenshot, employing a multimodal model to analyze the image and extract key information. The model response is then thoroughly validated by ScamAlert to ensure that the delivered insights are high-confidence and well-founded.

ScamAlert outputs a structured response with three parts: (1) a list of the detected red flags, using our predefined nomenclature, (2) a short explanation for why each red flag was identified in the given screenshot, and (3) a list of recommendations of next steps based on those detections.
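
For illustration, the three-part response could be modeled along the following lines; the class and field names are assumptions made for this sketch, not ScamAlert’s production schema.

```python
# Sketch of the three-part structured response described above (illustrative names).
from dataclasses import dataclass


@dataclass
class DetectedFlag:
    name: str         # one of the predefined red flag labels
    explanation: str  # why the flag was identified in the screenshot


@dataclass
class ScamAlertResponse:
    red_flags: list[DetectedFlag]
    recommendations: list[str]  # safe next steps for the user


response = ScamAlertResponse(
    red_flags=[
        DetectedFlag(
            name="Requests for Money",
            explanation="The sender asks for a transfer to an external bank account.",
        ),
    ],
    recommendations=[
        "Verify the request through a known channel before transferring any money.",
    ],
)
```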

[Image: How ScamAlert processes a screenshot]

The explanations serve two key purposes. First, since the red flags are generated by a Large Language Model (LLM), having the model explain its reasoning, similarly to Chain-of-Thought prompting, can improve the system’s accuracy. Second, the explanations provide a layer of interpretability, helping us determine whether the predictions are based on actual, observable features in the screenshot.

The recommendations provide the user with safe courses of action, focusing on guiding the user in how to avoid falling victim to a potential scam. These often include suggestions not to click any links or attachments, or to block further communications from a given source.

To produce a highly informative output, ScamAlert must balance two key goals: capturing all the red flags that are actually present, while minimizing the inclusion of incorrect or irrelevant ones. We can quantify these principles under our labeling framework by using red flag recall, the percentage of true red flags that ScamAlert successfully identifies, and red flag precision, the percentage of predicted red flags that are actually correct.


To measure performance across both dimensions, we then compute the red flag F1 score, which is the harmonic mean of precision and recall. This approach rewards models performing well on both metrics.

[Image: How we score the performance of ScamAlert on a given instance]
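
In code, the per-instance metrics can be computed directly from the two sets of flags; this is a straightforward sketch of the definitions above, not our exact evaluation code.

```python
def red_flag_metrics(predicted: set[str], ground_truth: set[str]) -> tuple[float, float, float]:
    """Per-instance red flag precision, recall, and F1 score."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    # Harmonic mean of precision and recall; zero when either is zero.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```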

Evaluating System-level Reliability

Our evaluation looks beyond just ScamAlert’s ability to detect relevant red flags. We also assess how reliable the underlying multimodal model is at following explicit instructions. This is a critical requirement for an AI model intended to operate in structured or automated workflows. Specifically, we treat failures to follow the expected output format as errors. We focus on two types of mistakes:

  1. Formatting errors, where the model’s output is not valid and cannot be parsed; and
  2. Invalid predictions (hallucinations), where the model includes red flags that aren’t part of the predefined list it was given.

By including these criteria, we aim to evaluate not only what the model predicts, but how consistently and correctly it communicates those predictions.
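
A simplified version of this validation step could look like the sketch below, which separates formatting errors from hallucinated labels; it assumes a JSON output format and is illustrative rather than ScamAlert’s actual parsing logic.

```python
import json


def validate_output(raw_output: str, allowed_flags: set[str]) -> dict:
    """Flag formatting errors and isolate hallucinated red flags in a model response."""
    try:
        parsed = json.loads(raw_output)          # assumes the model was asked for JSON
        predicted = set(parsed["red_flags"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Output could not be parsed into the expected structure.
        return {"formatting_error": True, "flags": set(), "hallucinated": set()}

    hallucinated = predicted - allowed_flags     # red flags outside the predefined list
    return {
        "formatting_error": False,
        "flags": predicted & allowed_flags,
        "hallucinated": hallucinated,
    }
```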

Evaluating Operational Costs

Beyond detection accuracy, it’s important to account for practical considerations that would impact real-world deployment of the ScamAlert system, especially when operating at scale for a large client base. In these scenarios, both cost and latency of model queries become critical performance factors. A model that delivers similar accuracy but with faster response times or lower inference costs is clearly more suitable for production use. Evaluating these trade-offs ensures that our system is not only effective, but also efficient and scalable in real-world environments.

In summary, our evaluation considers four key dimensions: (1) the system’s ability to detect red flags accurately, (2) how the underlying model adheres to the specified output format and label set, (3) the monetary cost associated with each query, and (4) the latency of its responses.

Benchmark Results

To evaluate how different multimodal models impact ScamAlert’s red flag detection capabilities, we tested a variety of such models on our benchmark dataset. Each model was evaluated by sending several screenshots, one at a time, through the ScamAlert pipeline, allowing us to observe performance under realistic usage conditions. For each screenshot, we repeated the evaluation three times to account for the variability inherent to generative models.

Here, we report the results for a selection of commercially available multimodal models (as of November 2025) in terms of average F1 score across the dataset.

DISCLAIMER: These results are only valid for the tested prompt and are specific to the task at hand. Model rankings vary significantly across benchmarks and should not be interpreted as absolute evaluations of each model’s image processing ability.

Overall Results

In the following plot, we display the results for ScamAlert’s performance and operational costs when using each of the multimodal models being tested. To calculate these results, we assume that each instance for which the model failed to output a valid response corresponds to an empty prediction (i.e., no red flags). If a model hallucinates red flags which are not present in our curated list, these are ignored.
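
Expressed as code, this scoring policy reads roughly as follows, reusing red_flag_metrics and the validation output shape from the earlier sketches; again, this is an illustrative sketch of the stated rules, not the exact benchmark code.

```python
def average_f1(validated_outputs: list[dict], ground_truths: list[set[str]]) -> float:
    """Average red flag F1 across the dataset, applying the policies described above."""
    scores = []
    for output, truth in zip(validated_outputs, ground_truths):
        # Invalid responses count as an empty prediction; hallucinated flags were
        # already separated out during validation, so only known flags remain.
        predicted = set() if output["formatting_error"] else output["flags"]
        _, _, f1 = red_flag_metrics(predicted, truth)  # defined in the earlier sketch
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0
```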

Several recent models (Gemini 3 Pro, Gemini 2.5 Pro, GPT-5, Claude 4, Claude 3.7) offer “hybrid” reasoning capabilities. These models allow users to set a “thinking budget” within which the model can generate intermediate reasoning steps to solve a problem, usually boosting performance at far greater cost and latency. To keep comparisons balanced, we set the thinking budget of all models to 0, as speed is an important aspect of this task.

However, Gemini 2.5 Pro, Gemini 3 Pro, and GPT-5 do not allow for completely disabling thinking, offering only ways to limit the reasoning effort: Gemini 2.5 Pro lets users define a maximum number of reasoning tokens, while 3 Pro and GPT-5 allow for the selection of pre-defined reasoning levels. We set these models at their minimum allowed thinking budget (128 tokens for 2.5 Pro, “Low” for 3 Pro, and “Minimum” for GPT-5), and differentiate them in the following plots by including a box around the model. We also highlight the models on the Pareto front, that is, models that are not dominated by any other in the cost-performance trade-off.
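
For reference, the sketch below shows how a thinking budget can be constrained with the public SDKs at the time of writing; parameter names, minimum values, and model identifiers may change, so treat this as an assumption about the provider APIs rather than the exact configuration we used.

```python
# Illustrative only: limiting the "thinking budget" via the google-genai and openai SDKs.
from google import genai
from google.genai import types
from openai import OpenAI

gemini = genai.Client()
gemini_response = gemini.models.generate_content(
    model="gemini-2.5-pro",
    contents="Identify the red flags in this screenshot.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=128),  # minimum for 2.5 Pro
    ),
)

openai_client = OpenAI()
gpt_response = openai_client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Identify the red flags in this screenshot."}],
    reasoning_effort="minimal",  # lowest available reasoning effort for GPT-5
)
```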

For a reference on how higher thinking budgets may impact these models, we ran the four GPT-5 reasoning settings: Minimum, Low, Medium (Default), and High; and the two Gemini 3 Pro reasoning settings: Low and High. We display the values for the default setting (“Medium”) for both GPT-5 nano and GPT-5 mini.

[Image: ScamAlert performance vs. operational cost for each tested model]

We highlight the following findings (note that the x-axis is logarithmic):

  • GPT-5 provides a significant performance advantage over competitors, but only when expending effort equal to or above the “Low” setting, incurring significant cost increases.
  • Gemini 3 Pro at its lowest reasoning budget performs better than GPT-5 at its lowest reasoning budget, while being significantly cheaper.
  • Notably, none of the Claude Sonnet versions match the performance of similarly priced OpenAI or Gemini models.
  • Gemini “Flash” and “Flash Lite” versions dominate the competing GPT mini and GPT nano offerings.

Plotting the models’ performance versus their latency, again on a logarithmic x-axis, only heightens these differences:

[Image: ScamAlert performance vs. response latency for each tested model]
  • Higher reasoning levels produce substantial increases in latency for Gemini 3 Pro and GPT-5.
  • Notably, at their lowest budget, Gemini 2.5 Pro is significantly faster than Gemini 3 Pro and GPT-5.
  • While GPT-5 mini and nano provide significant latency gains over the main variant, at the default effort, they are still slower than all other models we tested.
  • Surprisingly, Gemini 2.0 Flash Lite and 2.5 Flash Lite versions take approximately the same amount of time to provide answers as their Flash counterparts.

Detailed Analysis: Instruction Following Impacts on Performance

In the following plot, we display the results for all reasoning levels for the 3 variants of GPT-5: main, mini, and nano. At a fifth of the token cost, nano is expected to perform worse than mini, but we observe the opposite.

This strange pattern is compounded by the fact that, as seen previously, both GPT-5 mini and nano perform worse than their equivalent GPT-4.1 counterparts.

This discrepancy is explained by the fact that both GPT-5 nano and GPT-5 mini have a tendency to generate red flags which are not part of the input list, with the latter doing so at a much higher rate than the former.

[Image: Results across all reasoning levels for GPT-5, GPT-5 mini, and GPT-5 nano]

Many times, these are semantically equivalent to the correct option, such as “Suspicious Shortened Link” instead of “Suspicious URL Shortening.” However, as we wish to prioritize not only image processing abilities, but also the model’s instruction following, these predictions are ignored, and only case-insensitive matches are accepted.

Other models exhibit the same pattern seen previously, albeit at lower rates. In the following plot, we show the 8 models with the highest rate of unknown flag generation. This phenomenon is possibly responsible for the lower performance of GPT-4.1 nano when compared to Claude 3 Haiku, for example.

[Image: The 8 models with the highest rate of unknown red flag generation]

Detailed Analysis: Robustness of Results

In order to assess the consistency of these rankings, that is, whether the performance significantly changed across different runs, we ran the entirety of the benchmark three times. In the following plot, we demonstrate that considering either the worst or best trial for each model would not result in a significant re-ordering of the results.

[Image: Best and worst trial results for each model across the three benchmark runs]
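
As an illustration, such a stability check can be done by comparing per-model rankings built from the best and worst trial; the sketch below assumes per-trial average F1 scores are already available, and the numbers shown are placeholders, not our results.

```python
# Rough sketch of a ranking-stability check across repeated benchmark runs.
# `trial_f1` maps each model to its average F1 in each of the three runs (placeholder data).
trial_f1 = {
    "model_a": [0.81, 0.79, 0.80],
    "model_b": [0.74, 0.75, 0.73],
    "model_c": [0.69, 0.71, 0.70],
}


def ranking(score_by_model: dict[str, float]) -> list[str]:
    """Order models from best to worst by the given score."""
    return sorted(score_by_model, key=score_by_model.get, reverse=True)


best_case = ranking({model: max(scores) for model, scores in trial_f1.items()})
worst_case = ranking({model: min(scores) for model, scores in trial_f1.items()})

# If the ordering is identical (or nearly so), the ranking is robust to run-to-run variance.
print(best_case == worst_case)
```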

The Importance of Benchmarking

These results show the importance of having a consistent benchmark and the capacity to systematically evaluate the models for the task at hand.

The results we presented above may not be aligned with general public benchmarks, emphasizing the necessity of a rigorous and systematic evaluation for the specific task. At Feedzai Research, this kind of evaluation is a key step in our projects, as we have shown in our public benchmarks and multiple datasets.

It is not our intent to generalize any conclusion to other use cases, as these results show the performance of the models for a very specific task, with the same prompt, for a highly specialized dataset. Different tasks and datasets, even if similar on a surface level, require a specific evaluation, which may offer completely different results.

Conclusion

As scam tactics evolve to bypass increasingly sophisticated spam filters and exploit new communication channels, systems that rely on rigid definitions or static rules quickly fall behind. ScamAlert is designed to meet this challenge head-on, offering a flexible, interpretable framework that can evolve alongside the threat landscape.

By focusing on red flags rather than binary judgments, ScamAlert provides more transparency and control. It allows domain experts to define what suspicious behavior looks like, and gives analysts the tools to track exactly which patterns are being recognized and which are being neglected.

Tracking performance in this context is a multifaceted challenge. We need to know whether the system can detect red flags, but also whether the underlying multimodal model can follow instructions, produce valid outputs, and operate efficiently at scale. Evaluation, therefore, needs to cover both detection accuracy and operational reliability.

Our benchmark is designed with this in mind. It enables fast, consistent evaluation of new LLMs as soon as they are released, ensuring ScamAlert can quickly adopt improvements in model capabilities while maintaining a responsible and disciplined approach to system updates.

Update: This post was updated on the 26th of November to include the results relating to Gemini 3 Pro, currently in preview.
