Feedzai Techblog

Welcome to Feedzai Techblog, a compilation of tales on how we fight villainous villains through data science, AI and engineering.

Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks

6 min read · Jan 19, 2026


A simplified screenshot of an Open WebUI chat with GPT-4.1. The user uploads a scatterplot showing three distinct clusters and asks: “How many clusters are there in the scatterplot? Answer with a number in curly brackets, e.g., {4}.”. The AI responds “{3}”. The user input field shows the next prompt: “How many outliers are there in the scatterplot? Answer with a number in curly brackets, e.g., {3}.”.

When we need to visualize and interact with millions, or even just thousands, of individual points while analyzing data, we typically resort to rendering them in the browser using a canvas. The other common approach for the web, SVG, doesn’t scale when the number of individual elements increases to such quantities. However, while solving one problem, canvas charts introduce a new challenge: accessibility.

Although SVG charts are not accessible by default, they can be by design. Each part of an SVG chart has a corresponding element on the web page, allowing for a programmable, accessible experience for screen reader users. We can simply think of SVG as HTML. On the other hand, a canvas chart is just like a PNG image. If a screen reader user tries to learn more about a canvas chart, unless the developer has prepared a detailed description of it, they will just hear the word “image”. There’s no way to get an idea of what one of these charts represents, let alone extract any insights.

For static charts, the solution can be as simple as preparing a description and integrating it into the rendered chart. However, for platforms leveraging dynamic, large datasets, automatically generating these descriptions is not a simple task, especially for charts like scatterplots where data distributions can assume countless forms.

At Feedzai, we started exploring ways to generate data distribution-aware descriptions for scatterplots from their respective images using recent multimodal AI models. When the raw data is not available, or the datasets contain thousands or millions of instances, relying on chart images and these models becomes tempting. This combination has the potential to generate such descriptions and serve them alongside their respective charts, significantly improving the accessibility of canvas charts.

We focused on two main directions: using AI models to generate the descriptions directly, and using AI models to extract structured data (for example, a list of clusters and their respective center coordinates) to populate a predefined description template. The initial results, however, were mixed, and it wasn’t clear whether we were on the right track to ensure adequate descriptions.
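
The second direction can be sketched in a few lines: structured extraction results fill a predefined description template. This is a minimal, hypothetical example; the field names and wording are illustrative, not the actual schema or templates we used.

```python
# Hypothetical sketch of the template-based direction: structured output
# from an AI model (clusters with center coordinates, an outlier count)
# fills a predefined description template. Field names are illustrative.

def describe_scatterplot(clusters, outlier_count):
    """Turn structured extraction results into a textual description."""
    if not clusters:
        cluster_part = "The scatterplot shows no clear clusters."
    else:
        centers = "; ".join(
            f"one centered near ({c['x']}, {c['y']})" for c in clusters
        )
        cluster_part = (
            f"The scatterplot shows {len(clusters)} cluster(s): {centers}."
        )
    outlier_part = (
        f" There are {outlier_count} outlier(s)." if outlier_count else ""
    )
    return cluster_part + outlier_part

# Example with structured data an AI model might return:
print(describe_scatterplot(
    [{"x": 120, "y": 80}, {"x": 300, "y": 210}], outlier_count=3
))
```

The appeal of this direction is that the template guarantees consistent, predictable descriptions; the model only has to get the structured facts right.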

So, before moving on to further testing, we carefully reviewed the literature. Although benchmarks and other interesting findings are scattered all over the place, they rarely cover scatterplots and their related tasks. We therefore changed the plot and designed a dataset + benchmark to expand the general understanding of AI models when applied to charts, specifically scatterplots. We focused on evaluating the baseline performance of these models in identifying clusters and outliers at scale — both, if present, should be described properly.

The work culminated in the “Benchmark It Yourself (BIY): Preparing a Dataset and Benchmarking AI Models for Scatterplot-Related Tasks” paper, presented at the GenAI, Agents, and the Future of VIS workshop during IEEE VIS 2025. This blog post covers the main results and conclusions after running the benchmark in question.

Dataset and Benchmark

Examples for each of the data generators, ordered from top-left to bottom-right: Gaussian blobs with and without background noise, Gaussian blobs with outliers, random patterns (no clusters and outliers), relationships (no clusters and outliers), and geometric shape blobs.

The dataset consists of 18,921 synthetic scatterplot images created from 6 data generators (371 different data samples), 17 chart designs, and 3 image sizes. The number of clusters varies between 0 and 6, while the contamination level, when there are outliers, varies between 0.001 and 0.01. We chose to inject a relatively small number of outliers, keeping them well-distanced from the clusters to evaluate the detection of points that are clearly anomalous and relevant to report. In the end, each scatterplot was automatically annotated for clusters and outliers after converting the respective bounding box and point coordinates to screen coordinates (pixels).
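
To make the construction concrete, here is a minimal, hypothetical sketch of one generator: sample Gaussian blobs, inject a small fraction of well-separated outliers, and derive one bounding box per cluster in screen (pixel) coordinates. The real generators, chart designs, and annotation pipeline are more elaborate (see the paper); the parameter values below are illustrative.

```python
# Hypothetical sketch of one data generator: Gaussian blobs plus a small
# fraction of outliers kept far from the clusters, annotated with one
# pixel-space bounding box per cluster.
import random

random.seed(42)

def make_sample(n_clusters=3, points_per_cluster=150, contamination=0.01, size=600):
    clusters = []
    for _ in range(n_clusters):
        # Keep cluster centers away from the chart's margins.
        cx = random.uniform(100, size - 100)
        cy = random.uniform(100, size - 100)
        clusters.append(
            [(random.gauss(cx, 15), random.gauss(cy, 15))
             for _ in range(points_per_cluster)]
        )
    # Inject a few outliers near a margin, well-distanced from the clusters.
    n_points = n_clusters * points_per_cluster
    n_out = max(1, int(contamination * n_points))
    outliers = [(random.uniform(0, 20), random.uniform(0, 20))
                for _ in range(n_out)]
    # Annotate: one bounding box (x_min, y_min, x_max, y_max) per cluster.
    boxes = []
    for pts in clusters:
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return clusters, outliers, boxes

clusters, outliers, boxes = make_sample()
print(len(boxes), len(outliers))  # → 3 4
```

With a contamination level of 0.01 and 450 points, only 4 outliers are injected, which matches the paper's choice of a relatively small, clearly anomalous set of points.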

The 17 chart designs for one of the plotted data samples, based on the default Vega-Lite styling. The chart designs are, ordered from top-left to bottom-right: Y-axis only, double-sized points, square points, randomly shaped points, randomly colored points, points only, half-sized points, points at half the default opacity, points at full opacity, default (light theme), dark theme, colored clusters, 21:9 aspect ratio, 16:9 aspect ratio, 9:16 aspect ratio, 4:3 aspect ratio, and 3:4 aspect ratio.

The benchmark was run on a stratified sample of 1,725 scatterplots across 5 tasks, each defined by a prompt composed of an instruction and a response format:

  • Cluster counting for the number of clusters.
  • Cluster detection for the bounding boxes of each cluster.
  • Cluster identification for the point coordinates of each cluster center.
  • Outlier counting for the number of outliers.
  • Outlier identification for the point coordinates of each outlier.
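
The constrained response formats make the answers machine-checkable. For the counting tasks, for instance, the model is asked for a number in curly brackets (e.g., "{4}"), so a short regex suffices to extract the answer even when the model adds surrounding text. The sketch below is an assumption about how such parsing could look, not the benchmark's exact parsing rules.

```python
# Hedged sketch of parsing a counting-task response: the prompt asks for
# a number in curly brackets (e.g., "{4}"), so a regex can pull out the
# answer even if the model wraps it in extra prose.
import re

def parse_count(response: str):
    """Return the integer inside the last {...} in the response, or None."""
    matches = re.findall(r"\{(\d+)\}", response)
    return int(matches[-1]) if matches else None

print(parse_count("{3}"))                        # → 3
print(parse_count("I count three groups: {3}"))  # → 3
print(parse_count("no brackets here"))           # → None
```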

In addition to the different tasks, 10 proprietary models from Google and OpenAI were evaluated, along with 3 different prompting strategies: zero-shot, one-shot (prompt + 1 example), and few-shot (prompt + 6 examples).
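
The three strategies differ only in how many worked examples precede the target image. As an illustration, they can be framed as chat message lists in the OpenAI-style role/content shape; the exact prompts and example images used in the benchmark live in the GitHub repository, and the structure below is an assumption for clarity.

```python
# Illustrative sketch of the three prompting strategies as chat message
# lists (role/content dicts). "<example i>" and "<target image>" stand in
# for actual image attachments.
def build_messages(prompt, examples):
    """examples: list of (image_ref, answer) pairs.
    0 pairs = zero-shot, 1 = one-shot, 6 = few-shot."""
    messages = []
    for image_ref, answer in examples:
        messages.append({"role": "user", "content": [prompt, image_ref]})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": [prompt, "<target image>"]})
    return messages

prompt = ("How many clusters are there in the scatterplot? "
          "Answer with a number in curly brackets, e.g., {4}.")
zero_shot = build_messages(prompt, [])
few_shot = build_messages(
    prompt, [(f"<example {i}>", f"{{{i % 3}}}") for i in range(6)]
)
print(len(zero_shot), len(few_shot))  # → 1 13
```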

For the counting tasks, two performance metrics were computed: Accuracy and Mean Absolute Error (MAE). For the remaining tasks, Precision and Recall were considered with specific thresholds: an Intersection over Union (IoU) of 0.75 for bounding boxes and a Euclidean distance of 10px for point coordinates.
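
The two thresholds can be sketched directly: a predicted bounding box counts as correct when its IoU with a ground-truth box reaches 0.75, and a predicted point when it lies within 10 px of a ground-truth point. The greedy one-to-one matching below is a simplification of the actual evaluation, included only to make the metrics concrete.

```python
# Sketch of the matching metrics under the stated thresholds: IoU >= 0.75
# for bounding boxes, Euclidean distance <= 10 px for point coordinates.
import math

def iou(a, b):
    """a, b: (x_min, y_min, x_max, y_max) boxes in pixel coordinates."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def precision_recall_points(pred, truth, max_dist=10.0):
    """Greedy one-to-one matching of predicted to ground-truth points."""
    matched = set()
    tp = 0
    for p in pred:
        for i, t in enumerate(truth):
            if i not in matched and math.dist(p, t) <= max_dist:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))            # → 1.0
print(precision_recall_points([(5, 5), (100, 100)], [(6, 6)]))  # → (0.5, 1.0)
```

A 0.75 IoU is a fairly strict detection threshold, which is worth keeping in mind when reading the localization results below.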

Results

The overall results for each task are summarized in the charts below. Each is accompanied by its main highlight. Feel free to explore each chart or jump straight to the next section for the key takeaways.

This is a grouped vertical bar chart. Its title is “performance for the cluster counting task”. The y-axis legend is “Accuracy”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Cluster counting results (Accuracy). Few-shot prompting is particularly promising for several models from both providers (over 75% Accuracy). The full alternative text for each chart is available in the GitHub repository for this paper.
This is a grouped vertical bar chart. Its title is “performance for the cluster counting task”. The y-axis legend is “MAE”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Cluster counting results (MAE). MAE is generally low (a lower MAE is better). The highest value is below 2 for Gemini 2.5 Flash-Lite (Thinking).
This is a grouped vertical bar chart. Its title is “performance for the outlier counting task”. The y-axis legend is “Accuracy”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Outlier counting results (Accuracy). Gemini 2.5 Flash excels at the outlier counting task when one-shot (~87%) and few-shot (~90%) prompted.
This is a grouped vertical bar chart. Its title is “performance for the outlier counting task”. The y-axis legend is “MAE”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Outlier counting results (MAE). MAE is higher for outliers in general. It is particularly high for two OpenAI reasoning models (o3 and o4-mini).
This is a grouped vertical bar chart. Its title is “performance for the cluster detection task”. The y-axis legend is “Recall @ IoU75”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Cluster detection results (Recall @ IoU75). Performance is very poor: no model surpasses 25% Recall. The results are similar for Precision.
This is a grouped vertical bar chart. Its title is “performance for the cluster identification task”. The y-axis legend is “Recall @ 10px”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Cluster identification results (Recall @ 10px). No model surpasses 25% Recall in the cluster identification task either. The results are similar for Precision.
This is a grouped vertical bar chart. Its title is “performance for the outlier identification task”. The y-axis legend is “Recall @ 10px”. The x-axis legend is “model”. The chart is made up of 10 groups of bars: GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, GPT-4o, GPT-4o mini, o3, o4-mini, Flash, Flash-Lite, Flash-Lite (Thinking). Each group contains three bars: zero-shot prompt, one-shot, few-shot.
Outlier identification results (Recall @ 10px). Recall is also low, although Gemini 2.5 Flash, when few-shot prompted, seems promising (~65%). The results are similar for Precision.

Takeaways

Based on the results, the main considerations when combining scatterplot images and AI models are as follows:

  1. Give priority to few-shot prompting. This prompting strategy consistently outperformed zero-shot prompting across all models and tasks (top models achieved over 90% Accuracy in counting tasks). It is also useful for handling zero-answer scatterplots.
  2. Avoid localization tasks. OpenAI and low-cost Google models prompted with strategies similar to those evaluated are unreliable for localization tasks (e.g., detecting clusters) involving scatterplots.
  3. Invest in other components first, not chart design (see page 4 of the paper for more details). Chart design is fundamental for humans, but it’s a secondary factor when fed into AI models. Nevertheless, it can be beneficial to avoid chart designs with wide aspect ratios (16:9 and 21:9) or seemingly random colors.

Further Reading

Since this blog post doesn’t cover all the relevant aspects of the dataset and benchmark, feel free to read the paper, check out the associated GitHub repository, or watch a short presentation about it.

If you have any questions or feedback, let us know in the comments section.

Diogo Duarte, Pedro Bizarro, Rita Costa, and Sanjay Salomon: thank you so much for your thoughtful feedback that helped me shape this blog post!
