Stories by Mradul Dubey on Medium

The Failure Catalog

Mradul Dubey — Fri, 24 Apr 2026 14:20:22 GMT

Ghost entities and false accusations, when VLMs trust data over their own eyes!

Your VLMs may miss the proverbial “Elephant in the Room”

In VLM: the more you tell it, the less it sees, I defined anchoring bias in VLMs and measured it across seven controlled conditions.

The findings were quantitative:

Structured data suppresses visual reasoning, and the delivery channel modulates how badly.

This post is different. These are not controlled experiments. They are stress tests. I set up deliberately adversarial conditions (fabricated data, partial data, a weaker model) that I used during development to probe how the system breaks. The conditions are stacked: Gemini 2.5 Flash (a weaker baseline than Part 1’s Gemini 3 Flash Preview), higher text:image token ratios, and intentionally incorrect or incomplete tracking data.

I am not claiming these failures will occur under normal operating conditions.

I am cataloguing the types of failures that anchoring bias produces when pushed, because they are qualitatively different from what Part 1 describes.

Part 1 shows suppression, i.e. the model looks less, narrates more, and misses events. The failures below are not suppression. They are confabulation, misattribution, and fabrication. An engineer reading Part 1 would reasonably expect “the model might miss things.” They would not predict what follows.

Full details and raw responses in the GitHub repository.

Identity Misattribution — Failure Type 1

Setup: Same shoplifting clip as Part 1. Same temporal grid + center frame.

T1 the mannequin

I added tracking data for only one entity: T1, which is a mannequin standing in the background of the aisle.

Not a person. A mannequin.

No data for T2 (the actual shoplifter) or anyone else. Text:image token ratio ~1.65.

The prompt included the instruction:

“Use the visual frames as ground truth when signals seem ambiguous.”

The model saw the shoplifting. It got the action right but got the actor wrong.

“A woman (tracked as T1), wearing a red top and dark pants, is standing at a clothing rack. She reaches out and appears to be examining dark clothing items on the rack.”

T1 is not a woman in red. T1 is a mannequin.

In other conditions with full data, the same model family correctly identified T1 as a mannequin. But here, T1 was the only tracked entity and the tracking ID was powerful enough to override what the model could see.

It described a mannequin as a living woman performing a multi-step action. Then it cited T1’s stationary tracking data as confirmation:

“The tracking data confirms she remained stationary throughout this entire period of selection and concealment.”

Near-zero speed, a measurement of a mannequin being a mannequin, repurposed as evidence of “careful, deliberate stillness during concealment.”

“The primary notable event is the woman placing a clothing item from the rack directly into her personal bag around 91.3s. Given the context of a retail environment, this action, without any apparent intention to purchase… is highly suspicious and suggests possible shoplifting. The tracking data confirms she remained stationary throughout this entire period of selection and concealment.”

Every sentence is confident. The action description is correct. The identity is wrong. The data citation is real but irrelevant.

The conclusion accuses a mannequin of shoplifting.

What this failure type looks like: Tracking IDs act as identity magnets. When the model observes an action and needs to assign it to an entity, it reaches for the only tracked ID available, regardless of whether that entity is a person, a mannequin, or anything else. It then cites the entity’s data as supporting evidence, even when the data describes entirely different behaviour.

Entity Fabrication — Failure Type 2

Setup: Same clip. Same temporal grid and center frame.

T2’s (actual shoplifter) tracking data was replaced with fabricated data.

All coordinate fields were frozen to a single position (cx=307, cy=253), with speed zero and area change zero. The result is an internally consistent story, i.e. a person standing still, doing nothing, the entire time.

Text:image token ratio ~7.32.
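To make the fabrication concrete, here is a minimal sketch of how such a frozen track could be generated. The position (cx=307, cy=253) comes from the setup above and the area of 14000 is inferred from the model's quoted response; the field names `speed` and `area_change`, the one-sample-per-second rate, and the record layout are assumptions for illustration, not the actual pipeline format.

```python
def make_frozen_track(track_id, timestamps, cx=307, cy=253, bbox_area=14000):
    """Build a track whose position never changes (a person 'standing still')."""
    return [
        {
            "track_id": track_id,
            "t": t,
            "cx": cx,
            "cy": cy,
            "bbox_area": bbox_area,
            "speed": 0.0,        # hypothetical field name
            "area_change": 0.0,  # hypothetical field name
        }
        for t in timestamps
    ]


# Assumed sampling: one record per second across the 85s to 95s window.
timestamps = [85.0 + i for i in range(11)]
fake_t2 = make_frozen_track("T2", timestamps)
```

Every record is identical apart from its timestamp, which is exactly what makes the data internally consistent and visually false.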

The ground truth: T2 picks up a garment, conceals it in her bag, and walks away.

Placing the fake tracking data for T2 on the image

This was the model’s response:

“Foreground Activity (Woman in Red — Not explicitly tracked in the provided data): A woman wearing a red top… is actively engaged in browsing the clothes. She repeatedly picks up dark garments from the rack, examines them closely, and appears to place them back.”

“Mid-ground Activity (Track T2 — Woman in Pink): A woman (T2) in a light pink outfit is consistently visible standing still at a clothing rack on the right side of the aisle. The tracking data confirms her cx, cy, bbox_area… are all unchanging at 0.0 or 14000, indicating she remains perfectly stationary.”

The model invented two people. The real woman, the shoplifter, visible in every frame, was demoted to “not explicitly tracked.” Her shoplifting was rewritten as “appears to place them back.”

The frozen coordinates point to the center-left of the frame, closest to T4 (a red dress mannequin on the left).

But the model described its ghost as “Woman in Pink” on the “right side of the aisle”. That’s T1, the pink mannequin, 240 pixels away from where the data says T2 is. The model recruited the visual appearance of a mannequin from across the frame, gave it T2’s tracking identity, and produced an entity that matches neither the data’s coordinates nor visual reality.

Conclusion:

“There is no overtly suspicious or significant activity observed.”

What this failure type looks like: When data describes an entity that doesn’t match anything visible, the model doesn’t question the data. It fabricates a visual counterpart, stitching together appearance from one location and identity from another to produce a coherent but fictional entity. The real entity gets demoted; the fabricated one gets the track ID and narrative prominence.

Why catalogue these

Both failures share a root mechanism:

The model treats structured data as ground truth and adjusts its visual interpretation to match.

And that’s anchoring bias for you. It is magnified in these cases by the higher text:image token ratios.

But they produce qualitatively different outcomes:

Identity misattribution bends who to fit the data. Actions flow to whichever entity the data names.
Entity fabrication bends what exists to fit the data. The model invents a person rather than conclude the tracking data is wrong.

Neither failure is a simple miss. In both cases the model produces confident, internally coherent, data-cited narratives. A downstream system consuming this output would have no signal that anything is wrong.

These are stress test results, not production benchmarks. The conditions were deliberately adversarial.

But the failure types, misattribution and fabrication, are categories that production systems should test for, because they won’t surface in accuracy metrics that only measure “did the model detect the event.”

For engineering guidance on defending against anchoring bias (channel selection, data filtering, baseline testing), see Part 1. Full methodology, raw prompts, and all model responses: github repo.

Citable archive: DOI 10.5281/zenodo.19557723

VLM: The More You Tell it, The Less it Sees.

Mradul Dubey — Fri, 17 Apr 2026 12:01:02 GMT

Defining Anchoring Bias in Visual Language Models (VLMs) & an engineer’s field notes.

[~_~]

One of the lingering challenges in computer vision, the video search problem, has been around as long as surveillance systems have existed. These systems generate hundreds of hours of video data with barebones analytics. Searching through such recordings is a huge and mostly manual undertaking. As a computer vision engineer, it has always appealed to me to solve it.

It was only natural that in the era of VLMs & AI, the video search problem should be revisited with fresh tools. In this spirit, I showed a vision-language model a surveillance clip of a woman shoplifting a garment. I was not ready for the unexpected and unintuitive behaviour of such models.

With no additional context, the model described the event precisely:

At 90.0s, the woman looks directly at the surveillance camera while quickly stuffing the dark garment into her large tote bag… selecting an item, checking her surroundings, looking directly at the camera, and then quickly concealing the item — suggests a case of shoplifting.

Then I gave it the same frames, same model, same prompt, but additionally, the structured detection data from my object tracking pipeline. Here is the response:

She stands in the middle of the hallway and looks at a cell phone in her hands… No suspicious activity is observed during this window.

Same frames. Same model. More data. Worse outcome.

[O_O]

If you are working with Vision Language Models with multimodal capabilities, or technically speaking, Multimodal Large Language Models (MLLMs), you must know about the following, because understanding these may determine how you should utilise these models in your CV pipeline:

Anchoring Bias & the form it takes in VLMs
Delivery Channels and their impact on anchoring bias
How to engineer against the impacts of anchoring bias in a complicated ML pipeline

Some of these terms may be new to you, so let’s define them:

Anchoring Bias

Anchoring bias is a cognitive concept: the tendency to over-rely on an early piece of information when making subsequent decisions. Any structured data that is fed to VLMs as input can bring this bias into effect.

Delivery Channels

Structured data can reach the multimodal models through different modalities: embedded in text prompt, drawn on the image pixels or split across both. The delivery channel is “how” the data enters the model’s input.

[O_o]

If you are short on time, here are the three main things that I learned while designing the experiments to understand why the VLM’s results regressed with extra information:

The delivery channel modulates the anchoring effect.
We know any structured data that forms part of the input to the VLM models may trigger anchoring bias in the model. But the same bounding box information, delivered as text vs drawn on the image vs a cross-modal reference scheme, produced dramatically different severity. No prior work has held the information constant and varied only the channel. The differences are not reducible to token count.
Plausible metadata passes unchallenged.
Every published adversarial attack on VLMs operates at the pixel level. I found that fabricated entries in structured text: the kind your own pipeline’s false positives already produce, are enough to override visual reasoning of the VLMs. The model doesn’t verify metadata against the image. It only rejects metadata the image actively disproves. That’s a much lower bar.
Every field of metadata has a cost.
In my scene description task, the degradation curve is monotonic. Even minimal metadata: just track IDs and carrying labels, costs visual perception. There is no level where “a little data helps.” Whether this holds for other tasks is an open question, but for anything that depends on visual perception, every field you add trades perception for narration.

The rest of this post proves each of these claims.

If your interest is limited to mitigating these risks, skip directly to the “What to do about it” section at the end.

Motivation behind the experiment

The core idea was to use VLMs to “describe” a sequence of video in enough detail. If the details capture outliers and notable events, we can search through the “video description” to quickly find whatever we are looking for.

The architecture was layered: YOLOv8 runs object detection and tracking on every frame, a math layer computes motion derivatives (velocity, acceleration, approach/retreat signals, turning sideways etc), and then a VLM receives sampled frames plus this pre-computed context to produce semantic understanding for “what’s actually happening in this clip.”
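The math layer described above can be sketched as finite differences over a track's centroid positions. This is a minimal illustration under assumptions: the `(t, cx, cy)` sample layout and the output field names are hypothetical, and the real pipeline computes more signals (approach/retreat, turning) than shown here.

```python
import math


def motion_derivatives(samples):
    """samples: list of (t, cx, cy) sorted by time.

    Returns one dict per step with speed (pixels/second) and
    acceleration (pixels/second^2, None for the first step).
    """
    out = []
    prev_speed = None
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        accel = None if prev_speed is None else (speed - prev_speed) / dt
        out.append({"t": t1, "speed": speed, "accel": accel})
        prev_speed = speed
    return out


# Made-up track: moves 5 pixels in the first second, then stops.
track = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0), (2.0, 3.0, 4.0)]
derived = motion_derivatives(track)
```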

My naive attempt at solving the video search problem

The assumption was intuitive: give the VLM a head start. It knows who’s in the scene, what they’re carrying, where they’re moving. It has bounding boxes, track IDs, confidence scores. This should be strictly better than vision alone.

I was wrong.

The more structured data I fed the model, the less it looked at the actual frames.

I knew I had to design experiments to understand why.

The Experiment Methodology

I took a 10-second surveillance clip (85s — 95s) of a shoplifting event: a woman selects a dark garment from a clothing rack, conceals it in her shoulder bag around 90s, and walks away.

Every condition received the same two images:

A 3x3 temporal grid of 9 uniformly sampled frames from the window
The center frame (~90s) at full video resolution

Temporal grid to capture the sequence of visual information in lieu of tracking data from math layer

The center frame sent along the temporal grid

The temporal grid & the center frame remained the same across experiments. The text sent alongside the prompt carried all our detection (real + fake) and tracking data; the only variable was whether and how the bounding box data from the YOLO pipeline was delivered. See the table below:

All cases: {detections from YOLOv8L + BoT-SORT} x {three delivery channels for the bounding boxes}

Baseline (0): Clean images, no detection data. Vision only. The base prompt was the following:

You are analyzing a surveillance video clip.
You are given two images:
- Image 1: a 3x3 temporal grid of 9 uniformly-sampled frames from an 85.0s–95.0s window, left-to-right top-to-bottom, each timestamped.
- Image 2: the center frame (~90.0s) at full resolution.
Describe what is happening in this window. Include timestamps.
Focus on: what each person is doing, what objects are present, notable events, any suspicious or significant activity.

B (Text Real): Clean images + a JSON block in the prompt containing real YOLOv8-L + BoT-SORT detections and tracks: track IDs, classes, confidence scores, and bbox coordinates as pixel values [x1, y1, x2, y2].

C (Text Fake): Same as B, but the JSON also includes three fabricated entries: a fake person in the empty right aisle, a fake handbag on the floor, and a fake cell phone, near the shoplifter’s hand.

D (Visual Real): Center image has coloured bounding boxes drawn on it (green for person, blue for handbag, etc.) with class+confidence labels above each box (e.g., “person 0.86”). The prompt gets metadata but no bbox coordinates, the spatial info is only in the image.

E (Visual Fake): Same as D, but the drawn overlays include boxes for the three fake detections too.

F (Cross-modal Real): Center image has only ID labels drawn on it (“T1”, “O164” — no class, no box). The prompt contains a JSON with detections (no bbox field) plus an id_labels mapping table. The model has to cross-reference: see “T1” on image → look up T1 in text → learn it’s a person carrying a handbag.

G (Cross-modal Fake): Same as F, but with the three fake detections included in both the image labels and the text mapping table.

The Metrics

To measure what was happening, I scored each response on two dimensions:
1. Visual Detail (VD): observations only derivable from pixels: posture, gaze, hand actions, clothing, spatial relationships.
2. Data Narration (DN): sentences that reference, restate, or cite provided detection data: track IDs, bounding boxes, carrying labels, confidence scores.

The Visual Reasoning Ratio = VD/(VD+DN) tells you how much of the model’s output comes from actually looking at the image versus narrating the data it was given. A ratio of 100% means pure visual reasoning. A ratio of 30% means the model is mostly reading you back its input.

Visual Reasoning Ratio(VRR) = VD / (VD + DN)

I know what you are thinking: these metrics are not really objective. And I would agree. But the task was “scene description”, and these proxy metrics were rigorous enough for it. To get consistent numbers, I relied on the Claude Opus model from Anthropic and let it assign the scores based on the raw VLM outputs.
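As a small worked example, the ratio defined above can be computed from the two counts like this. The function name and the sample counts are made up for illustration; the actual VD and DN scores come from the rubric in the repository.

```python
def visual_reasoning_ratio(vd, dn):
    """VRR = VD / (VD + DN). Returns None when nothing was scored."""
    total = vd + dn
    if total == 0:
        return None
    return vd / total


# Made-up counts: 7 pixel-derived observations, 3 data-narration sentences.
vrr = visual_reasoning_ratio(vd=7, dn=3)
print(f"VRR = {vrr:.0%}")
```

A response with no data narration at all scores 100%, which is what the vision-only baseline should approach by construction.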

Pattern 1: The delivery channel modulates the anchoring effect

As we know, structured data in the prompt suppresses visual reasoning (anchoring bias). The magnitude of that suppression depends on how the data is delivered. Here are the results:

Anchoring Bias quantified across experiments

The same bounding box information, delivered three ways, produced dramatically different outcomes.

Text bounding boxes (condition B): where the bounding boxes from the object detection pipeline were given as structured text, dropped the VD ratio to 53%. The model narrated track IDs and bag classifications instead of watching the scene. Shoplifting description was hedged to “noteworthy behaviour.”

Visual bounding boxes (condition D): where the boxes were drawn directly on the input image, held at 69%. The model still described the concealment: “appears to place the item she just took into her large shoulder bag… consistent with shoplifting behaviour.” The boxes directed attention without replacing observation.

Bboxes overlaid on the image input

Cross-modal delivery (condition F): where ID labels were drawn on the image but resolved via a text mapping table given in the prompt, dropped to 47% and produced the worst outcome of any real-data condition. The model fabricated movement that never occurred: “She stops, turns around, and walks back toward her original position.” The person never walked back. The cognitive load of cross-referencing visual IDs against text labels consumed the model’s “attention budget”, leaving nothing for independent scene analysis.

Why does the delivery channel matter?

I think it comes down to this: visual overlays are attention cues. They say “look here.” The model processes them as part of the image and investigates what’s inside the box. Text bboxes are propositional claims. They say “this entity exists, with these attributes, at these coordinates.” The model narrates the proposition instead of verifying it. I have also identified other potential mechanisms from existing research work, which may explain this behaviour, but I will discuss them a little later.

You may have one natural objection: maybe this is just a token count effect. More text tokens in the prompt means less attention left for the image. I had the same doubt. Here’s the breakdown (image tokens are constant at 1,032 across all conditions):

Experiments ordered by increasing text:image token ratio

The channel effect is not reducible to token count. Condition F (crossmodal, 328 tokens) anchors harder than condition B (text bbox, 353 tokens) despite having fewer text tokens. Cross-modal reference resolution imposes cognitive load beyond what raw token count captures.
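The arithmetic behind that comparison is simple enough to check directly. The sketch below uses only the token counts quoted in this section (1,032 image tokens; 353 and 328 text tokens); the dictionary keys are just labels.

```python
IMAGE_TOKENS = 1032  # constant across conditions, per the breakdown above

text_tokens = {"B (text bbox)": 353, "F (cross-modal)": 328}

# text:image token ratio for each condition
ratios = {name: n / IMAGE_TOKENS for name, n in text_tokens.items()}

for name, ratio in ratios.items():
    print(f"{name}: text:image ratio = {ratio:.3f}")
```

Condition F has the lower ratio of the two, yet it is the one with the lower Visual Reasoning Ratio (47% vs 53%), which is the point being made.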

How this differs from prior work

Anchoring bias in text-only LLMs is established territory: models over-weight reference values in reasoning tasks (Jones & Steinhardt 2022). But that’s text-in, text-out. Nobody has tested what happens when anchoring data competes with visual perception in a multimodal model.

The closest related work comes from visual prompting. Set-of-Mark prompting (Yang et al. 2023) showed that visual overlays direct VLM attention. Contrastive Region Guidance (Wan et al., ECCV 2024) designed a better way to perform visual prompting to help VLM performance. A LessWrong analysis (“Biasing VLM Response with Visual Stimuli”) showed that visual highlighting shifts VLM answers toward marked options, even incorrect ones. But none of these held the information constant and varied the delivery channel. That’s the gap this experiment fills.

There’s also a possible mechanistic explanation:

“Seeing but Not Believing” (Liu et al. 2025) showed that VLM visual encoders capture the right information, but the language backbone overrides it during generation. Structured text data in the prompt may amplify this override.

Favero et al. (M3ID, CVPR 2024) showed that VLMs’ reliance on visual input decays as more output tokens are generated. My data suggests the suppression can begin before generation even starts: from the input side, by loading the prompt with structured text.

If you’re building a VLM pipeline and need to pass spatial data, visual overlays preserve more independent reasoning than text coordinates. Cross-modal reference schemes are the worst option.

Pattern 2: Plausible metadata passes unchallenged

VLMs don’t verify metadata against the image. They only reject what the image actively disproves. I tested this by injecting three fake detections alongside the real ones:

A fake person in the empty right aisle (confidence 0.52)
A fake handbag on the floor near a rack (confidence 0.47)
A fake cell phone near the subject’s hand (confidence 0.44)

T89 is a fake person, O412 a fake handbag & O87 a fake phone (all in red)

The fake person and fake handbag were ignored in every condition, by both models. No visual evidence supported them, and the models declined to adopt them.

The fake cell phone was adopted in every condition it appeared in. Both models. Every time. Here’s the progression:

Baseline (no data): “quickly stuffing the dark garment into her large tote bag”

Condition C (text + fakes): “she appears to hold a cell phone… her behavior of taking an item and quickly walking away while looking at the camera could be considered notable for security purposes”

Condition G (crossmodal + fakes): “she stands in the middle of the hallway and looks at a cell phone in her hands… No suspicious activity is observed”

One fabricated entry at 0.44 confidence turned a confident shoplifting detection into “checking her phone before walking away.”

The difference between the adopted phone and the rejected person/bag is positional plausibility. The phone’s bounding box overlaps the subject’s hand region, it’s spatially consistent with what the model can see, even if the phone itself isn’t visible at that resolution. The fake person sits in an empty aisle with no visual evidence. The fake bag sits on the floor away from any person.
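One rough way to think about positional plausibility is box overlap: does the detection's box intersect a person's box at all? The sketch below assumes boxes in the `[x1, y1, x2, y2]` pixel format used in condition B; the coordinates are invented for illustration and the function names are hypothetical. It is a proxy for reasoning about the data, not a description of what the model does internally.

```python
def intersection_area(a, b):
    """a, b: boxes as [x1, y1, x2, y2] in pixels."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def spatially_plausible(det_box, person_boxes):
    """True if the detection overlaps any person box at all."""
    return any(intersection_area(det_box, p) > 0 for p in person_boxes)


# Made-up coordinates, for illustration only.
person_boxes = [[300, 100, 400, 400]]
phone_near_hand = [380, 220, 410, 250]
bag_on_floor = [600, 380, 660, 420]

print(spatially_plausible(phone_near_hand, person_boxes))
print(spatially_plausible(bag_on_floor, person_boxes))
```

A check like this could be used to flag which low-confidence detections are most likely to be absorbed into the narrative, since those are the ones sitting on top of a real entity.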

The model’s bar for adopting a fabricated detection is not

“Can I see this in the image?”

It’s actually,

“Is there anything in the image that disproves this?”

That’s a much lower bar.

And the model can be critical when it wants to. In condition E, it correctly flagged that mannequins were misidentified as persons by the detection system: “the detection system misidentifies these as persons in the 90.0s frame.” But it applied that skepticism only where visual evidence was unambiguous (a clearly static mannequin) and not where it was merely ambiguous (a small phone bbox near a hand in a low-res region). The model’s skepticism is asymmetric: it challenges data that obviously contradicts the image, but accepts data that merely lacks support in the image.

How this differs from prior work

Every published adversarial attack on VLMs operates at the pixel level. AdvEDM (NeurIPS 2025) injects semantics via adversarial image perturbations. Shadowcast (NeurIPS 2024) poisons training data. Prompt injection work in oncology VLMs (Nature Communications 2024) embeds sub-visual triggers in medical images. All of these modify the image itself.

What I found is a simpler, arguably more realistic attack surface: fabricated entries in the structured text metadata that accompanies the image. You don’t need adversarial perturbations. You don’t need to poison training data.

Your own pipeline’s false positives, the detections you normally wouldn’t worry about because they’re below your confidence threshold, are already doing this, if they happen to land near a real entity.

Pattern 3: Every field of metadata has a cost

In a separate experiment on the same shoplifting scene, I held the delivery channel constant (text JSON) and varied how much tracking data was included: from nothing to dense per-frame sampling with all motion fields. Note: this particular experiment used Gemini 2.5 Flash.

For the scene description task, every unit of this structured data traded visual perception for data narration. Since I varied data density across five levels on the same shoplifting scene I was able to quantify the curve of this loss:

The cost of metadata on visual perception of VLMs

The curve is monotonic. Every additional field of structured data costs visual perception. There is no level where “a little data helps”. Even G1 (minimal metadata: just track IDs, time ranges, and carrying labels) drops the visual reasoning ratio by 15 points.

At G4, the model latched onto a data artifact (a carrying-label change from “handbag” to “backpack” at 93.7s, caused by a YOLO misdetection) and treated it as the primary event: “A significant change occurs: the tracking data indicates she is now carrying a ‘backpack’ instead of a ‘handbag’.” The actual shoplifting, visible in the same frames, was not mentioned. Conclusion: “No suspicious activity is identified.”

Prior work suggests complexity hurts VLM reasoning: Multi-Object Hallucination (Chen et al., NeurIPS 2024) showed hallucination increases with object categories, M3ID (CVPR 2024) showed visual reliance decays with output length. But those findings imply “be careful.”

For my task: scene description, where visual perception is the whole point, the data says there is no safe amount. The degradation starts at the very first field of metadata and never reverses.

Whether other tasks (object counting, structured extraction) have a different curve shape is an open question. But for anything where you need the model to look at the image, every field you add has a measurable cost.

This isn’t just one model

All the results discussed above used Gemini 3 Flash Preview (except the Pattern 3 gradient experiment). I also ran the same seven-condition experiment on Gemini 2.5 Flash. The anchoring pattern is identical in direction: same channel ranking (visual < text < cross-modal), same fake phone adoption in all injection conditions, same cross-modal fabrication.

The key difference is the baseline. Gemini 2.5 Flash’s vision-only response hedges: “no definitive evidence of theft or illicit activity.” It sees the garment, sees the bag, describes the posture, but can’t synthesize them into a shoplifting interpretation. Every data condition degraded further from an already-weaker starting point.

The most striking difference was in fake entity adoption. Gemini 3 Flash Preview adopted only the spatially-plausible phone and ignored the fake person and fake bag. Gemini 2.5 Flash, in the crossmodal+fake condition, adopted all three: inventing a “mannequin (T89)” to explain the fake person detection and an “empty red box (O412)” to explain the fake handbag. It never concluded the data was wrong. It fabricated objects to make the data true.

If you’re deploying a cost-optimized smaller model in your pipeline, anchoring effects will most likely be worse.

What to do about it

Here’s what you can do about this, today:

Use structured data to select, not to describe.

Tracking data is excellent at identifying which 10-second window contains interesting activity: velocity spikes, proximity events, new entity appearances. Use it to choose what the VLM looks at. Don’t put it into the prompt to “help” the VLM understand the window. Let the model look with its own eyes.
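A minimal sketch of "select, not describe": use a speed signal to pick which windows the VLM sees, and send only the frames for those windows. The `(t, speed)` sample layout, the threshold value, and the fixed 10-second bucketing are assumptions for illustration; a real selector would likely combine several signals.

```python
def select_windows(speed_samples, threshold, window=10.0):
    """speed_samples: list of (t_seconds, speed).

    Returns the sorted start times of the fixed-size windows that
    contain at least one sample above the threshold.
    """
    starts = set()
    for t, speed in speed_samples:
        if speed > threshold:
            starts.add((t // window) * window)
    return sorted(starts)


# Made-up samples: two spikes fall in the same window, one in another.
samples = [(12.0, 1.0), (47.5, 9.0), (91.3, 12.0), (93.0, 11.0)]
windows = select_windows(samples, threshold=8.0)
print(windows)
```

The tracking data decides where the model looks; none of it enters the prompt.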

If you must include data, use visual overlays, not text.

Visual bounding boxes preserved 69% visual reasoning compared to 53% for text-encoded bounding boxes with the same spatial information. The model treats drawn boxes as regions to investigate, not claims to narrate. This is a simple architectural choice with measurable impact.

Never use cross-modal reference schemes.

Numbered IDs on images with text mapping tables produced the lowest VD ratio (47%) and the most fabrication of any real-data condition. The cross-referencing task consumes the model’s attention budget before it gets to the actual scene. If you need to label regions, draw the labels directly on the image.

Always run a vision-only baseline.

Before adding any structured data to your pipeline, establish what the model can see on its own. If the baseline misses the event, data may not help: it’ll add narration without adding perception. If the baseline catches the event, adding data risks suppressing it. Depending on the use case, the baseline may be your ceiling, not your floor. If you must include the data from the upstream CV pipeline, my humble advice is to be clever about how to utilize anchoring bias in your favour.

Treat detection data as adversarial input.

Your own pipeline’s false positives at moderate confidence can rewrite the VLM’s entire narrative, especially if they happen to be spatially plausible. Filter aggressively. If a detection doesn’t clear a high confidence threshold, it’s safer to omit it than to include it and hope the VLM applies skepticism. It won’t. Alternatively, make sure your upstream results are as good as they can be.
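A minimal sketch of that filter, applied before anything is serialized into the prompt. The record layout and the 0.6 cut-off are assumptions; the confidences 0.52, 0.47 and 0.44 are the fake detections from Pattern 2, and 0.86 is the example label value from condition D.

```python
def filter_detections(detections, min_conf=0.6):
    """Keep only detections at or above min_conf; the rest never reach the prompt."""
    return [d for d in detections if d["confidence"] >= min_conf]


detections = [
    {"label": "person", "confidence": 0.86},
    {"label": "person (fake)", "confidence": 0.52},
    {"label": "handbag (fake)", "confidence": 0.47},
    {"label": "cell phone (fake)", "confidence": 0.44},
]

kept = filter_detections(detections)
print([d["label"] for d in kept])
```

The right threshold depends on your detector and scene; the point is that the decision is made upstream rather than left to the VLM.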

Going deeper

This post covers the controlled experiment and the engineering takeaways by taking one of the video examples.

I had also tested the hypothesis on another video of a busy street that has a collision at the end. The results were mostly the same, but the video was too chaotic and text:image token ratio was getting really skewed. Hence, it’s excluded from the report.

GitHub Companion Repo

Full methodology, all raw prompts, all model responses, images, and scoring rubrics are present in this companion repo: GitHub repository 🎄
Citable archive: DOI 10.5281/zenodo.19557723

But along the way, I found other kinds of failure modes that are harder to summarize in a table and more unsettling in their implications: a model that invented a fictional person to avoid contradicting its data. A model that accused the wrong person of shoplifting and cited irrelevant tracking data as proof. A model that watched a car crash happen and declared “no suspicious activity” because the 2,341 lines of tracking JSON didn’t mention it.

Upcoming Part 2: The Failure Catalogue

Some of those stories are in Part 2: The Failure Catalog: ghost entities and false accusations, which captures the other kinds of failures that I noticed while working on this. I am planning to publish this shortly.

[~_~]

If you’re building a system that feeds structured data to a VLM: surveillance, medical imaging, autonomous vehicles, document analysis, test for this. I’d like to be wrong about the generality. I don’t think I am.

Disclaimer ⚠️

This is an engineer’s investigation, not a peer-reviewed study. I tested two models (Gemini 3 Flash Preview as primary, Gemini 2.5 Flash as secondary) on one primary scene across seven controlled conditions.
Both are multimodal LLMs with reasoning capabilities, not minimal vision-language architectures, and the anchoring effects may interact with their reasoning layers in ways that differ from simpler models. The patterns I found are consistent and reproducible within that scope. They are not a proof of universality.
By “VLMs” I mean generative vision-language models: Gemini, GPT-4o, Claude, LLaVA, Qwen-VL, InternVL. Models that accept interleaved image and text input and produce free-form text output. Not contrastive encoders like CLIP. Open-source VLMs (LLaVA, InternVL) are likely equally or more susceptible; I’d welcome community replication.
All raw prompts, raw model responses, images, and scoring methodology are in the companion GitHub repository. I would encourage replication on other models, architectures, and domains before drawing production conclusions.
Citable archive: DOI 10.5281/zenodo.19557723

VLM: The More You Tell it, The Less it Sees. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Papers That Taught Machines to Comprehend

Mradul Dubey — Thu, 16 Apr 2026 01:15:16 GMT

And why the future is essentially, multimodal.

If you’ve been following my timeline of the 14 papers that changed Computer Vision, you know I really enjoy putting things on a timeline. It offers a chronological perspective that’s rare.

But before I was working on Computer Vision, working with pixels, I was a college student obsessed with NLP. My final year project attempted to solve “Sarcasm Detection in Text” using topic modelling. I worked for a year in the adjacent domains of text and audio AI. During this time, I optimized ASR for Android phones and implemented NLU modules for production.

Back then, we were just trying to get machines to “understand” sentiment. Now, we’re talking to them like colleagues.

To understand the jump from basic text math to GPT-4, you have to look at these 12 milestones.

TF-IDF (1972) — Karen Spärck Jones

The OG. Jones showed us that a word’s importance isn’t just frequency; it’s about how unique it is to a document. It’s the statistical bedrock that proved we could turn language into a searchable, weighted numerical problem.
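To make the weighting idea concrete, here is a minimal sketch of one common TF-IDF variant (term frequency times the log of inverse document frequency, no smoothing). The function name `tf_idf` and the three toy documents are my own illustration, not anything from the original paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (plain tf * idf, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the quantized model runs on the edge",
]
scores = tf_idf(docs)
print(scores[0]["the"], scores[0]["cat"], scores[0]["mat"])
```

“the” appears in every document, so its weight drops to zero no matter how often it occurs, while “mat”, which is unique to the first document, gets the highest weight there.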

Word2Vec (2013) — Mikolov et al.

The “Big Bang” for NLP. This paper proved words could be vectors in a high-dimensional space where “King - Man + Woman = Queen.” It turned NLP from a dictionary problem into a geometry problem. Similarity between two words was reduced to a trigonometric formula: the cosine of the angle between their vectors.
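A tiny sketch of that geometry, assuming hand-made 2-D vectors rather than real Word2Vec embeddings (which typically have hundreds of dimensions). The `vectors` table, `cosine` and `analogy` are hypothetical names for illustration only.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
# These are NOT trained embeddings.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}


def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))
```

With these toy numbers the arithmetic lands exactly on the “queen” vector; with real embeddings the result is only the nearest neighbour, not an exact match.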

FastText (2016) — Facebook AI Research (FAIR)

As a dev, this was the practical hero. By breaking words into subwords (character n-grams), FastText solved the “out-of-vocabulary” nightmare. It made NLP robust for the messy, typo-ridden real world. Even remembering “out-of-vocabulary” this many years later leaves me anxious.
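A rough sketch of the subword idea: wrap the word in boundary markers and collect its character n-grams (3 to 6 characters, as in the FastText paper). The helper name `char_ngrams` is mine, and this skips the hashing and embedding lookup that the real library does.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in < and >."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }


correct = char_ngrams("quantization")
typo = char_ngrams("quantizaton")  # a misspelling the model never saw
shared = correct & typo
print(len(shared), "shared n-grams, e.g.", sorted(shared)[:5])
```

Because the misspelled word shares many n-grams with the correct one, a vector built from n-gram vectors still lands near the right neighbourhood instead of falling out of the vocabulary.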

The Transformer (2017) — Vaswani et al.

“Attention Is All You Need.” This is the most important paper of the decade. By ditching recurrence (LSTMs) for self-attention, it allowed us to train massive models in parallel. Every LLM today is a descendant of this specific architecture. Also, what a title for a research paper! And, yes, the name is enough.
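For a feel of what self-attention computes, here is a stripped-down sketch in plain Python. It assumes a single head and uses the token vectors directly as queries, keys and values; the real architecture learns separate projection matrices for each, which I have left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens):
    print([round(v, 3) for v in row])
```

Every output row depends only on the full input, not on the previous output row, which is why all positions can be computed in parallel, unlike a recurrent network.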

BERT (2018) — Devlin et al.

Google’s masterpiece. BERT taught models to look at context from both directions (left and right) simultaneously. It stopped being about “predicting the next word” and started being about “understanding the whole sentence.”

GPT-3 (2020) — Brown et al.

The paper that proved Scale is a feature. OpenAI showed that if you make the model big enough and feed it enough of the internet, it starts to exhibit “few-shot” reasoning. It was the moment the world realized AI could write better than most humans.

T5 (2020) — Raffel et al.

The “Text-to-Text” Transfer Transformer. Google reframed every NLP task: translation, summarization, classification, into a simple text-to-text format. It unified the field into a single, clean pipeline.

LoRA (2021) — Hu et al.

Low-Rank Adaptation showed we could fine-tune massive models by only tweaking a tiny fraction of the parameters. It made “personalizing” an LLM possible on consumer hardware.
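A back-of-the-envelope sketch of why this is so cheap. LoRA freezes a weight matrix of shape d x k and trains two small matrices of shapes d x r and r x k instead. The layer size (4096 x 4096) and rank (8) below are assumed values for illustration, not figures from this article.

```python
def full_params(d, k):
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lora_trainable_params(d, k, r):
    """Trainable parameters in the two low-rank factors (d x r and r x k)."""
    return r * (d + k)


d, k, r = 4096, 4096, 8  # assumed layer shape and rank
full = full_params(d, k)
lora = lora_trainable_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions the trainable part is well under 1% of the original matrix.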

InstructGPT / RLHF (2022) — Ouyang et al.

The secret sauce behind ChatGPT. It wasn’t just about being smart; it was about being helpful.

Reinforcement Learning from Human Feedback (RLHF) aligned model outputs with human intent, curing (some of) the “AI sociopath” tendencies.

Llama / Llama 2 (2023) — Meta

The democratization of LLMs. By releasing a high-performing, open-weights model, Meta broke the “closed API” monopoly. It sparked the massive open-source AI explosion we’re living in right now.

Mistral 7B / Mixture of Experts (2023) — Jiang et al.

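Proved that small and smart beats big and bloated. Mistral 7B, a dense model, outperformed the larger Llama 2 13B. Its sparse “Mixture of Experts” (MoE) sibling, Mixtral 8x7B, then showed we can match or beat GPT-3.5 and Llama 2 70B on most benchmarks while only activating a fraction of the model’s parameters for any given token.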

The Converged Future (2024+) — PaliGemma / LLaVA

The timeline ends where my CV article ended as well.

Vision-Language Models (VLMs) have effectively erased the border between NLP and Computer Vision. Tokens and pixels now live in the same latent space.

If I had to pick the three “pillars” here: Word2Vec (vectorizing meaning), Transformers (scaling context), and Llama (opening the gates).

The transition from my college NLP projects to the CV/Edge AI work I do now feels like a straight line. We’ve gone from counting words to simulating intelligence.

The future is essentially, undoubtedly, multimodal.

The Papers That Taught Machines to See

Mradul Dubey — Wed, 15 Apr 2026 16:24:46 GMT

Chronology of the science of deep learning for computer vision

If you want to understand where computer vision is today, you need to know the papers that got us here, especially if you work in computer vision.

But sometimes it’s interesting to step back and analyse the situation from a chronological point of view, to track the leaps made and which breakthrough happened when.

Also, let’s reflect on the real story, what each paper actually broke, what surprised people, and why it mattered more than anyone expected at the time.

Here’s my timeline. Fourteen papers. Three decades. One thread connecting handwritten digits to vision-language models.

LeNet-5 (1998) — Yann LeCun et al.

The paper that proved CNNs work. LeCun built a network that could read handwritten digits on bank cheques. Most people treated it as a neat trick for a narrow problem.

They missed the real lesson: learned features beat hand-crafted ones.

Every modern vision model descends from this idea. LeCun would wait 14 years for the world to catch up.

AlexNet (2012) — Krizhevsky, Sutskever, Hinton

The Big Bang. Two grad students and their advisor entered the ImageNet competition with a deep neural network, which was, surprisingly, a fringe idea at the time, and obliterated the field by over 10 percentage points.

The secret ingredient wasn’t architectural brilliance. It was two NVIDIA GTX 580 GPUs. Alex Krizhevsky was just trying to delay getting a real job.

Hinton later said: “Ilya thought we should do it, Alex made it work, and I got the Nobel Prize.” Hinton won the 2024 Nobel Prize in Physics for this line of work. Before AlexNet, almost nobody used neural networks for vision. After it, almost everybody did.

VGGNet (2014) — Simonyan, Zisserman

The “just stack more 3x3 convs” paper. VGG proved that depth matters and you don’t need clever multi-scale tricks, just go deeper with a simple, uniform architecture. It didn’t win ImageNet that year (GoogLeNet did), but VGG became the default backbone everyone actually used. Sometimes the runner-up shapes the field more than the winner.

GoogLeNet / Inception (2014) — Szegedy et al.

The actual ImageNet 2014 winner. Inception modules ran multiple filter sizes in parallel and concatenated them. The first serious attempt at multi-scale feature extraction within a single layer. Architecturally clever, but the lasting contribution was the idea that you could design network topology itself as a learnable problem. The inception module was the ancestor of every “let the architecture do the thinking” approach that followed.

ResNet (2015) — Kaiming He et al.

The paper that broke the depth barrier. Before ResNet, deeper networks performed worse, which made no theoretical sense. Adding more layers should help, not hurt.

Kaiming He’s team at Microsoft Research Asia added skip connections: just let the input bypass layers and add itself to the output. Embarrassingly simple. But it let them train 152-layer networks where 20-layer ones used to fail.

Over the long term, the real impact wasn’t the ImageNet win. It was that skip connections showed up everywhere afterward, including in every single transformer model that exists today. Every time you use GPT or BERT, ResNet’s skip connections are in there.
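The skip connection really is a one-liner. A minimal sketch, using plain lists instead of tensors and two stand-in layers (`zero_layer` and `halve` are made-up placeholders, not real network layers):

```python
def residual_block(x, layer):
    """Return layer(x) + x: the input bypasses the layer and is added back."""
    return [fx + xi for fx, xi in zip(layer(x), x)]


def zero_layer(x):
    """A layer that has learned nothing useful and outputs zeros."""
    return [0.0 for _ in x]


def halve(x):
    """A stand-in for a learned transformation."""
    return [0.5 * v for v in x]


print(residual_block([1.0, -2.0], zero_layer))  # input passes through unchanged
print(residual_block([2.0, 4.0], halve))
```

If a layer contributes nothing, the block falls back to the identity instead of destroying the signal, which is one intuition for why adding such blocks does not have to hurt.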

YOLO v1 (2016) & SSD (2016) — Redmon et al. / Liu et al.

Before YOLO, object detection was a two-stage affair. First propose regions, then classify them. Slow.

YOLO reframed detection as a single regression problem: one forward pass, done. Real-time object detection became possible on commodity hardware.

The backstory is bittersweet. Joseph Redmon, YOLO’s creator, quit computer vision research in 2020. He said the military applications and privacy concerns became impossible to ignore. His YOLOv3 paper had already contained unusually sharp jabs at Google and Facebook over surveillance. There was no YOLOv4 from Redmon. The community forked his work and kept going. Now the YOLO name has been almost hijacked by Ultralytics, which is notorious for not releasing a research paper with its new model releases.

Feature Pyramid Networks (2017) — Lin et al.

FPN solved a quiet but critical problem: detecting objects at vastly different scales in the same image. It built a top-down pathway with lateral connections that let high-level semantic features flow down to high-resolution layers. Doesn’t sound glamorous. But FPN became the backbone of essentially every competitive object detector for the next five years. If you’ve used Faster R-CNN, Mask R-CNN, or RetinaNet, FPN was doing the heavy lifting underneath.

EfficientNet (2019) — Tan, Le

The “stop brute-forcing depth” paper. Mingxing Tan and Quoc Le showed that scaling width, depth, and resolution together with a compound coefficient beats scaling any single dimension. EfficientNet matched state-of-the-art accuracy at a fraction of the compute. The real contribution was a mindset shift: efficiency is a first-class design goal, not an afterthought. Edge deployment people (like me) owe this paper a lot.

Vision Transformer / ViT (2020) — Dosovitskiy et al.

The paper that said: what if we just throw away convolutions entirely? Chop the image into 16x16 patches, treat them as tokens, and feed them to a standard transformer.
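A small sketch of the patching step, using a nested list as a single-channel image. The 224x224 input size in the last line is an assumption for the arithmetic; only the 16x16 patch size comes from the text above.

```python
def patchify(image, patch):
    """Split an H x W grid into flattened, non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


# 4x4 toy image with pixel values 0..15, split into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), tokens[0])

# Assumed 224x224 input with 16x16 patches: number of tokens per image.
print((224 // 16) ** 2)
```

Each flattened patch is then linearly projected and treated like a word token by the transformer.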

The computer vision community was skeptical: transformers were an NLP thing. But ViT worked. With enough data, it matched or beat CNNs.

The deeper lesson was uncomfortable: maybe convolutions weren’t a fundamental inductive bias for vision. Maybe they were just a computational shortcut for the data-scarce era. The debate isn’t settled, but ViT forced it open.

DETR (2020) — Carion et al.

End-to-end object detection. No anchors, no NMS, no hand-designed components. Just a transformer encoder-decoder with learned object queries.

The simplicity was shocking. DETR was initially slow and struggled with small objects, but it proved that detection could be reformulated as a set prediction problem. That opened the door to everything that followed, including RT-DETR, which eventually made transformer-based detection fast enough for real-time use.

DINO / DINOv2 (2021 / 2023) — Caron et al.

Self-supervised learning for vision that actually worked. DINO trained a ViT with no labels at all, just self-distillation, and the resulting features were good enough to use as a general-purpose visual backbone. DINOv2 scaled this up and produced features that rival supervised models across tasks. The quiet revolution here: we might not need labeled datasets for vision anymore. The features emerge from structure alone.

Segment Anything / SAM (2023) — Kirillov et al.

Meta’s foundation model for segmentation.

Give it a point, a box, or a text prompt, and it segments any object in any image. SAM was trained on 11 million images and over 1 billion masks. The scale was the story.

But the real shift was conceptual: segmentation went from a task-specific model to a general-purpose tool. SAM is to segmentation what GPT was to text: a foundation model that makes the task feel solved (even when edge cases remain).

RT-DETR (2023) — Zhao et al.

Real-time DETR. Took the elegant set-prediction framework of DETR and made it actually fast enough for production.

Hybrid encoder, efficient cross-attention, no NMS needed. For people deploying detection models on real hardware, RT-DETR was the moment transformer-based detection became a practical option, not just a research curiosity.

PaliGemma / LLaVA / Qwen-VL (2023–2024)

Vision meets language. These models take a vision encoder (often a ViT variant), connect it to a large language model, and suddenly the model can describe images, answer visual questions, and reason about what it sees. This is where the vision and NLP timelines finally converge. The image becomes a token sequence. The boundary between “seeing” and “understanding” dissolves. We’re still early, but the trajectory is clear: the future of vision is multimodal.

From LeCun’s digit reader to models that see and speak. Thirty years. Each paper standing on the shoulders of the last.

If I had to pick the three that changed the most: AlexNet (proved deep learning works), ResNet (made it trainable at any depth), and ViT (proved convolutions were optional). Everything else is a refinement of those three shifts.

If you disagree with the list, or if you think I have missed something important, please do let me know.

This was a fun entry to write. It’s a different perspective :)

The Rabbit Hole that is Model Quantization

Mradul Dubey — Wed, 15 Apr 2026 00:28:12 GMT

I love rabbit holes and this is my motivation.

Funnelling down the rabbit hole

Model quantization is not the most exciting part of the ML model lifecycle for many people. It’s probably an item on a checklist, a script that needs to be run before moving on to deployment.

Part of my work at Apra Labs is on Edge AI. I deploy object detection models on edge hardware. Not cloud GPUs with 80 GB of VRAM, actual edge hardware with tight memory and real constraints.

A float32 YOLO model looks great on your dev machine. Then you try to deploy it on an edge accelerator and everything falls apart. Confidence scores collapse, latency spikes, layers silently fall back to CPU. Your 0.85 detection confidence becomes 0.06 after conversion.

I’ve spent a significant amount of time fighting this. And I realised, I have been making an often repeated mistake in the landscape of Machine Learning:

I was treating quantization like a checkbox.

Run a script, get a smaller model. Ship it 🚀. Right?

Once you realize that, it only takes a second to realize the gravity of your mistake. If you keep an ear out for the buzzwords, you’ll know that the pioneering work being done in the LLM space is pushing the limits of quantization.

Llama 3 70B in Q4_K_M, Qwen 3.5 running on a laptop through Ollama. Everyone’s crushing 70 billion parameter models down to 4-bit integers, running them locally and getting practical results. It sounds like magic. It’s not. It’s quantization.

It’s the same affine mapping, the same scale-and-zero-point math I’ve been doing. The difference is only this: when LLM quantization goes wrong, you get slightly worse text. When mine goes wrong, a camera misses a person.

So, I sat down and mapped out everything a practitioner actually needs to understand to deploy a quantized model on real hardware. The list got long. Fast.

Here’s a taste

It starts with the obvious stuff, IEEE 754. What float32 can represent vs INT8’s 256 levels. Affine mapping. Symmetric vs asymmetric. Per-tensor vs per-channel.
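For readers who have not seen the affine mapping written out, here is a minimal per-tensor, asymmetric INT8 sketch. The function names and the example range of [-1.0, 3.0] are assumptions for illustration; real converters add calibration, per-channel scales and other details that are skipped here.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Compute (scale, zero_point) for an asymmetric affine mapping."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map a float to the nearest integer level, clamped to the INT8 range."""
    return max(q_min, min(q_max, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate float."""
    return (q - zero_point) * scale


scale, zero_point = quant_params(-1.0, 3.0)  # assumed activation range
q = quantize(0.85, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

The whole float range is squeezed onto 256 levels, so any value can be off by up to half a step (`scale / 2`). The wider the calibrated range, the bigger that step, which is one way a narrow band of confidence scores can end up sharing only a handful of integer levels.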

Then, it gets real.

ONNX graphs and QDQ nodes. How PyTorch’s eager mode decomposition affects your TFLite output. Why torch.export with strict=False produces a different graph than torch.onnx.export. Why your ONNX Runtime quantization is correct but your TFLite converter throws it all away.

Then, it gets specific.

TFLite FlatBuffer internals. What BuiltinOperator codes actually mean. Why grappler fuses your softmax but your hardware accelerator can’t run the fused version. How signature keys differ from tensor names and how getting that wrong silently breaks calibration.

Then, it gets painful.

SiLU activations falling back to CPU. INT8 confidence degradation on narrow-range sigmoid outputs. The gap between what a converter claims to support and what actually runs on a hardware accelerator.

What I’m doing about it

I’m writing my way through this. Not in one shot, I will break it down into parts as I go deeper into each layer.

The goal is not to produce a textbook.

The goal is to produce a guide that I wish I had 6 months ago.

Practitioner first. Real failures, Real fixes.

Some topics I already do know. Others, I’m still learning. I will be very upfront about which is which.

If you work anywhere near the edge ML pipeline i.e. training, conversion, optimization, deployment, some of this may save you weeks of debugging.

More soon.

AR: the science behind Facehero II

Mradul Dubey — Thu, 24 Dec 2020 08:52:07 GMT

AR: the science behind Facehero — Part II

Face Hero: A mix of science and art with an element of surprise.

In Part I of the blog, we discussed the two problems that we needed to solve to make FaceHero work. We also discussed the tools we would require to solve the problems, namely, ARCore & ARKit. In this part of the blog, we will see how exactly the information received from our tools solves our problems.

Problem Status

The tools that we are using provide us with information about the user’s face location, orientation and facial gestures in real-time. Logically speaking, the next step would be to assimilate all this information to figure out the user’s face orientation and face gesture. Let me state our requirements more specifically at this point:

A pointer on the screen that tracks the user’s head movement.
To identify the following set of gestures: Angry, Smile, Pout, Neutral.

The first problem can be solved by some linear transformation on the head’s origin & axes, which is present in information that we directly get from CenterPose in ARCore and Face Coordinate System in ARKit. The point where the vector along the z-axis passing through the CenterPose of the user’s head intersects the screen, gives us the required pointer.
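A minimal sketch of that intersection, under assumptions of my own: the head pose has already been transformed into a frame where the screen is the plane z = 0 and the user sits at positive z. The function name and the sample numbers are hypothetical, and the actual game code may differ.

```python
def pointer_on_screen(origin, forward):
    """Intersect the ray origin + t * forward with the screen plane z = 0.

    origin:  (x, y, z) head center in screen-aligned coordinates (assumed frame).
    forward: (x, y, z) direction of the head's z-axis in the same frame.
    Returns (x, y) on the screen plane, or None if the ray never hits it.
    """
    ox, oy, oz = origin
    dx, dy, dz = forward
    if abs(dz) < 1e-9:
        return None  # looking parallel to the screen
    t = -oz / dz
    if t < 0:
        return None  # facing away from the screen
    return (ox + t * dx, oy + t * dy)


# Head 0.5 m in front of the screen, turned slightly to the right.
print(pointer_on_screen((0.0, 0.0, 0.5), (0.2, 0.0, -1.0)))
```

The resulting point would still need to be scaled from meters into pixel coordinates for the actual on-screen pointer.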

For the second problem, however, we cannot directly use the information provided by our tools to come to a conclusion. So, we need a module that can analyze the information that we have (BlendShapes or Face Mesh) and provide us with the exact information we need (facial gestures). That is where Machine Learning comes into the picture. If we select the information supplied by ARCore and ARKit smartly enough and process it to maintain the right balance of bias & variance, then, with enough samples, we might be on the right side of Hoeffding’s inequality[1] to make a Machine Learning model learn to differentiate between the required gestures.

Hoeffding’s Inequality[1]
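As a rough numerical feel for the inequality, here is a sketch of its simplest form: for one fixed hypothesis and N independent samples, the probability that the in-sample error differs from the true error by more than epsilon is at most 2 * exp(-2 * epsilon^2 * N). The function names are mine, and the single-hypothesis assumption is important: a real learning setup needs a looser bound that accounts for the size of the hypothesis set.

```python
import math


def hoeffding_bound(n_samples, epsilon):
    """Upper bound on P(|in-sample error - true error| > epsilon)."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * n_samples)


def samples_needed(epsilon, delta):
    """Smallest N for which the bound drops to delta or below."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


print(hoeffding_bound(1000, 0.05))
print(samples_needed(0.05, 0.05))
```

The takeaway is the shape of the curve: the bound shrinks exponentially with the number of samples, which is what “enough samples” buys you.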

ARKit Solution

Recollect that ARKit provided us with the Face Coordinate System, a 3D Face Mesh &, most importantly, BlendShapes. BlendShapes are provided for 52 trackers of different facial expressions, their values ranging from 0.0 (minimum expression) to 1.0 (maximum expression). With their 52 normalized values, BlendShapes fit perfectly as a “feature vector”.

A feature vector can be loosely described as the unbiased representative data containing enough information to conclude the result across all the data points.

Once we have a feature vector, we can train a classifier to classify these feature vectors as one of the required gestures. Since our data was not linearly separable and we had limited ourselves to classic ML algorithms, we tried Support Vector Machines with an RBF kernel trained by the one-vs-all method. We optimized the hyper-parameters, C and Gamma, using dynamic programming. And we got a decent enough classifier.

However, in our experiments, we found that a voting-based ensemble of multiple binary SVM models, each trained with one group being a given gesture and the other being all other gestures combined (including Neutral), worked really well. Therefore, we ended up with 3 SVM models:

Angry vs Others
Pout vs Others
Smile vs Others

According to the game play, if the user fails to make the appropriate gesture, i.e. remains Neutral, then we need to penalise them. But we did not train a separate model for Neutral vs Others because we didn’t need one. Instead, we rely on the voting of the three models. If a majority of the classifiers classify the frame as “Others”, we take it as Neutral. So, our final result was based on the voting by these classifiers across five frames.
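One plausible reading of this voting scheme, sketched in Python. The details here are my assumptions, not the shipped game logic: each “gesture vs Others” model is taken to return a signed decision score (positive means the gesture), the per-frame rule picks the strongest positive score and falls back to Neutral when none is positive, and the five-frame vote is a simple majority over a sliding window.

```python
from collections import Counter, deque


def frame_label(scores):
    """Pick a label for one frame from the three binary models' scores.

    scores: {"Angry": s, "Pout": s, "Smile": s}, where s > 0 means the model
    voted for its gesture and s <= 0 means it voted "Others" (assumed convention).
    """
    gesture, best = max(scores.items(), key=lambda item: item[1])
    return gesture if best > 0 else "Neutral"


def smooth(labels, window=5):
    """Majority vote over the most recent `window` frame labels."""
    recent = deque(maxlen=window)
    smoothed = []
    for label in labels:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


frames = [
    {"Angry": -1.2, "Pout": -0.3, "Smile": 0.8},
    {"Angry": -0.9, "Pout": -0.1, "Smile": 1.1},
    {"Angry": -0.7, "Pout": -0.4, "Smile": -0.2},  # a single flickering frame
    {"Angry": -1.0, "Pout": -0.6, "Smile": 0.9},
    {"Angry": -1.1, "Pout": -0.5, "Smile": 0.7},
]
per_frame = [frame_label(f) for f in frames]
print(per_frame)
print(smooth(per_frame)[-1])
```

The window absorbs the one-frame flicker to Neutral, which is the practical reason to vote across frames rather than trust a single prediction.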

This ensemble works really well. The data is consistently reliable from the Apple hardware and there is no information loss from any dimensionality reduction.

ARCore Solution

In the previous post, we noted that ARCore does not provide us with any equivalent of BlendShapes. We get a face mesh with 468 vertices in 3D space. This is in contrast to 1220 vertices in ARKit. Prima facie, this implies that we get less information from ARCore than ARKit.

This makes things more challenging with ARCore. We have 468 * 3 = 1404 data points for a given frame.

So, while the BlendShapes as a feature vector is a pleasant dream come true, the 1404 point long feature vector with ARCore is a nightmare.

The variance is just too high and we end up with overfit classifiers i.e. the classifiers tend to remember the exact training data rather than to identify patterns in them. This is especially true for the classical ML methods like SVM or Decision Trees.

First, we need to process the data to reduce variance using some method of dimensionality reduction, which is a standard problem in ML. The basic idea is to map the data from a high-dimensional space into a low-dimensional space. Classical algorithms use linear transformations to do this while trying to keep the information loss to a minimum. Methods like Principal Component Analysis (PCA), Singular Value Decomposition (SVD) or an Autoencoder can be used for this.

We used PCA to transform our 1404-value vector into a 20-value vector while preserving 98% of the data variability. This 20-value vector acts as our feature vector instead of the original vector.
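To show what “preserving 98% of the variability” means in practice, here is a sketch of the selection step only: given the variance captured by each principal component (the eigenvalues of the covariance matrix), keep the smallest number of components whose cumulative share reaches the target. The eigenvalues below are made up, and the function name is hypothetical; computing the components themselves would normally be left to a library.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest number of components whose cumulative variance share >= target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Made-up variance per component for a 5-dimensional toy dataset.
eigenvalues = [5.0, 3.0, 1.5, 0.4, 0.1]
print(components_for_variance(eigenvalues, target=0.9))
print(components_for_variance(eigenvalues, target=0.98))
```

The same logic, applied to the 1404-dimensional mesh data, is what would produce a cut-off like 20 components.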

Once we have got our feature vector, not unlike in the ARKit case, we trained three SVM models:

Angry vs Others
Smile vs Others
Pout vs Others

Our final result is based on the voting by these models across five frames. This works decently enough. Although less so than in the case of ARKit. And there are reasons for the same:

Firstly, as established earlier, we get less inherent information from ARCore, given the relative face mesh size. Secondly, the data coming from ARCore is itself the output of a Machine Learning pipeline, with no dedicated hardware involved. This makes the data less consistent. Thirdly, there is information loss due to dimensionality reduction. Not only that, dimensionality reduction works to preserve “variability” in the data, which does not entirely translate to preserving pertinent information. That is to say, a smarter dimensionality reduction method may improve the performance.

Despite all that, there is something really interesting happening in the case of ARCore. The face mesh was envisioned to enable developers to provide features like face filters that can decently track the user’s facial expressions and movements. We are deconstructing the information from another ML pipeline in a way it was not envisioned for.

What else did we learn ?

Some takeaway points should also be mentioned before concluding this series.

One of them concerns the Neutral gesture. It gave us a lot of trouble across both solutions. All the gestures seem to overlap with the Neutral gesture. This leads to a multi-label classifier predicting multiple classes with very high confidence, which defeats the whole purpose of having a classifier in the first place. Our working hypothesis is that across any population, somebody’s neutral face might resemble somebody else’s smile face or, more often, the pout face.

Two, we also quickly tried neural networks in the case of ARCore but ran into similar problems. The thing with standard Neural Networks is that you don’t have an “I don’t know” option[2]. This, combined with the fact that we often observed very high confidence classification for both Neutral and one of the other gestures, suggested we needed more experimentation with NNs to achieve a decent classification for the problem. Sadly, we ran out of time. This is one of the pathways the readers can explore further.

Anyway, it was such a unique solution that it deserved to be shared with more people. Hopefully, this gives them more ideas to try creative solutions and also build upon this solution.

References:

[1] A fantastic lecture on Hoeffding’s inequality and its relevance in Machine Learning. Highly recommended.

[2] Research Paper explaining Bayesian Neural Networks that can be used to model epistemic uncertainty in neural networks.

AR: the science behind Facehero II was originally published in Apra Labs Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

AR: the science behind Facehero

Mradul Dubey — Sat, 08 Aug 2020 18:47:24 GMT

Makings of a fun AR Game

Augmented Reality

Augmented reality (AR) is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, sometimes across multiple sensory modalities, including visual, auditory, haptic, somatosensory and olfactory.

- Wikipedia definition of Augmented Reality (AR)

This is, obviously, quite a broad definition of AR, and we are going to restrict ourselves to the visual modality in this blog series.

We come across this “visual” AR in the form of Google Lens, the widely popular game Pokemon Go and in utility based applications like Measure:

https://medium.com/media/6f3a9a2dd5f96525fc2ef215c2384152/href

This is just the tip of the iceberg. AR has also been used to develop Face Filters on social media apps and also to make your Webcam smarter. The point being that AR continues to be used in a diverse manner and for diverse applications.

The Application: Face Hero

Here, at Apra Labs, we utilized Augmented Reality in an unconventional way, unlike all the applications described above. Our requirement came in the form of Face Hero.

With Face Hero, we tried to develop a novel touch-free game that is not only engrossing and fun, but also very rewarding. Take a peep:

https://medium.com/media/52c74e52b09f75e534fbfacad3017c24/href

You are also encouraged to try this game for free. In any case, let me roughly describe the bare-bones version of the game play:

Face Hero tracks the face orientation of the player and that enables the player to move the pointer around the screen space. Once the player locks on to a target character, the player has to emote a facial gesture corresponding to the character which prompts a positive reaction from the character and the player gets to score. E.g. smile to the baby to make it giggle, pout to the puppy etc.

The Twin Problems

So, while developing the game, speaking AR-wise, we had to solve two problems:

Track the face orientation of the player
Identify the facial gestures of the player

And we had to solve them using efficient algorithms for a smooth game play.

In this blog series, we will slowly explore, in more detail, how we utilized AR in solving both of these problems.

But, in the details, lies the devil.

As we will soon find out there is no “global best” solution for either of the problems.

Tools & The Devils

At the heart of the solution to our problems, across multiple platforms, are two tools, namely: Apple’s ARKit and Google’s ARCore. Both of them contain a big set of features to track real world objects, but the ones that we were interested in were the face tracking features that both of them provide. That is not to say that they do not differ in their capabilities and performance, but that they both provide out-of-the-box support for tracking faces.

Apple’s ARKit

ARKit tracks the user’s face and provides a detailed topology of the face. There are three main components of this topology:

Face Coordinate System
3D Face Mesh
Blend Shapes

The face coordinate system is a way of positioning other elements relative to the face.

Units of face coordinate space are in meters, with the origin centered behind the face as indicated in the figure below.

ARKit: Face Coordinate System [source]

So, when you see an app being used to apply a face filter, it is this coordinate system that allows moving the filter along with the user’s head movements.

Secondly, ARKit provides us with a Face Mesh:

ARKit: 3D Face Mesh with 1220 vertices tracking Siddhant’s face in real time.

This face mesh consists of 1220 vertices in 3-D space, all relative to the origin of the Face Coordinate System described above. It tracks the user’s face in real time.

Thirdly, and more importantly, ARKit provides us with Blend Shapes.

Blend Shapes are a group of named coefficients representing the detected facial expression in terms of the movement of specific facial features.

Let’s take an example:

ARKit: jawOpen blend shape [source]

Here, “jawOpen” is one of the several blend shapes that ARKit provides. The value of each blend shape ranges from 0.0 (minimum) to 1.0 (maximum). ARKit provides us with 52 such blend shapes. In other words, all the Blend Shapes together provide an abstract representation of the user’s facial expressions.

Remember the Blend Shapes, they will be crucial in cracking our second problem.

Google’s ARCore

The Google counterpart to ARKit is called ARCore. They differ, both in capabilities and also in how they work under the hood.

ARKit utilizes the TrueDepth camera available on iPhones(X+)/iPads to generate the Face Mesh and Blend Shapes. Hardware variability across Android devices does not afford ARCore the same luxury.

ARCore relies entirely on Deep Learning based algorithms. This means ARKit is generally more precise and informative than ARCore but, at the same time, ARCore can function across Android devices with different hardware capabilities and even on iOS. However, it may not be the best solution for iOS currently.

Coming back to our two problems at hand, the ARCore solution comes in the form of Augmented Faces, which mainly provides us with:

Center Pose
3D Face Mesh

The center pose, located behind the nose, is the physical center point of the user’s head (in other words, inside the skull).

ARCore: Center Pose [source]

This is ARCore’s equivalent of ARKit’s Face Coordinate System.

3D Face Mesh

The 3D Face Mesh provided by ARCore consists of 468 vertices. This is in contrast to the 1220 vertices in the ARKit Face Mesh.

ARCore: 3D Face Mesh with 468 vertices [source]

And that’s it. ARCore doesn’t provide anything like Blend Shapes!

So, You’re Telling Me…

That we had two main problems: Tracking the user’s face orientation and Identifying the facial gestures.

The best tool that we can use to solve our problems varies platform-wise: ARKit and ARCore.

ARKit provides us with a Face Coordinate System, a real-time 3D Face Mesh with 1220 vertices that tracks the user’s face and an abstract representation of user’s facial gestures in the form of 52 Blend Shapes.

ARCore provides us with the Center Pose and a 3D Face Mesh with 468 vertices, but doesn’t provide us with any facial gesture abstractions like Blend Shapes.

A shrewd mind would wonder whether the lower number of vertices provided by ARCore and its inability to provide something like Blend Shapes are related. But, apart from the fact that the mesh tracked by ARKit contains more information (ARKit’s 1220 vs ARCore’s 468 vertices), the ARKit face mesh also feels more responsive than the ARCore counterpart.

Phew! That was a lot to take in.

In Part II of the blog, we will see how exactly the information that we get from our tools helps us resolve our twin problems. And that’s where the real fun begins!

See you next time!!

Want to explore more stuff? I have got a bunch of links for you to keep you occupied till the next time:

For the mathematically hungry: Quaternions

For the AR curious: arcore, arkit

For the hardware conscious: apple depth cameras, intel realsense depth cameras

AR: the science behind Facehero was originally published in Apra Labs Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.