Vision-Driven Data Chunking for RAG
“How we built chunking that sees, not just reads”
Introduction
In Amarsia’s early days, we hit a problem that demanded real innovation. Traditional text-based chunking methods weren’t a fit for our users: their source material ranged from standalone images to asset-heavy PDFs like brochures, so token-based chunking alone wouldn’t cut it, because the information wasn’t just text. We needed something that could understand the visual context and lay out the right information for optimized LLM querying.
“I remember the first time a customer uploaded a 40-page brochure full of images and our RAG pipeline simply ignored all the visuals.”
This isn’t really a new problem — big companies already have teams who clean up and format their data perfectly for RAG. Great AI engineers even build their own custom setups for it.
But our users aren’t AI engineers, and they don’t have the time (or patience) to prepare their data just to make it work.
Why Text-Based Chunking Fails in Vision-Heavy Documents
In many RAG systems, chunking means: extract the text, cut it into token windows (say 512 tokens), embed each chunk, and use them for retrieval (a minimal sketch of this baseline follows the list below). That works fine for clean, text-rich documents — but it breaks down when:
- The document has layout (multi-column, sidebars, captions, images) and flattening loses structure.
- The document is visual (image + caption, chart + explanation) and text alone doesn’t carry meaning.
- The document is scanned or a brochure — the text might be embedded as image, or the important items are visuals.
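For reference, here is a minimal sketch of that token-window baseline in Python. The whitespace split stands in for a real tokenizer, and the 512-token window with a small overlap is an illustrative default rather than our production setup.

```python
# Minimal sketch of naive fixed-window chunking (the baseline described above).
# The whitespace split is a stand-in for whatever tokenizer your stack uses;
# window and overlap sizes are illustrative defaults.

def chunk_by_tokens(text: str, window: int = 512, overlap: int = 64) -> list[str]:
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
    return chunks

# Everything visual (images, tables, multi-column layout) is invisible to this
# function: it only ever sees one flattened string of text.
```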
Our Approach: Vision-Driven Chunking
We realised early that we needed a new paradigm: vision-assisted data chunking. This is not the same as “just feed everything to an LLM and ask it to chunk for you” (which we tried, but the cost, latency, and instability were all issues). Instead, we built a pipeline that combines the pieces below (a simplified sketch of how they fit together follows the list):
- OCR (optical character recognition) to extract text from images and scanned documents.
- Computer vision / layout analysis to detect visual blocks: images, tables, captions, sidebars, columns, headings.
- An “agentic” decision layer that picks the chunking strategy based on document type and visual structure (e.g., brochure vs report vs scanned sheet).
- Then embedding those chunks and feeding them into the vector store for retrieval.
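To make the flow concrete, here is a simplified Python sketch of how these pieces could fit together once OCR and layout analysis have produced a list of visual blocks. The Block type, the DocKind categories, the 0.4 threshold, and the grouping rules are hypothetical stand-ins for illustration, not our actual internals.

```python
from dataclasses import dataclass
from enum import Enum, auto

class DocKind(Enum):
    BROCHURE = auto()   # image-heavy, caption-driven
    REPORT = auto()     # text-heavy, heading-structured
    SCANNED = auto()    # text lives inside images, OCR required

@dataclass
class Block:
    kind: str   # "heading", "paragraph", "image", "table", "caption", ...
    text: str   # extracted or OCR'd text ("" for a pure image)

def classify_document(blocks: list[Block]) -> DocKind:
    """Cheap heuristic 'agent': pick a chunking strategy from the visual structure."""
    if blocks and all(b.kind == "image" for b in blocks):
        return DocKind.SCANNED
    visual = sum(b.kind in ("image", "caption", "table") for b in blocks)
    return DocKind.BROCHURE if visual / max(len(blocks), 1) > 0.4 else DocKind.REPORT

def chunk_blocks(blocks: list[Block], kind: DocKind) -> list[str]:
    """Turn layout blocks into retrieval chunks according to the chosen strategy."""
    if kind is DocKind.BROCHURE:
        # Keep each image or table together with the caption/paragraph that follows it.
        chunks, pending = [], []
        for b in blocks:
            if b.text:
                pending.append(b.text)
            if b.kind in ("caption", "paragraph") and pending:
                chunks.append(" ".join(pending))
                pending = []
        if pending:
            chunks.append(" ".join(pending))
        return chunks
    # REPORT (and OCR'd SCANNED pages): start a new chunk at every heading.
    chunks, current = [], []
    for b in blocks:
        if b.kind == "heading" and current:
            chunks.append(" ".join(current))
            current = []
        if b.text:
            current.append(b.text)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping the decision layer as cheap heuristics (or a small classifier) rather than an LLM call is what keeps ingestion fast and inexpensive; the resulting chunks are then embedded and upserted into the vector store as usual.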
Balancing Performance, Quality & Cost
In building this system we had three metrics we cared deeply about: latency (how fast we can chunk and retrieve), quality (how relevant and reliable the retrievals are), and cost (data compute, storage, embedding calls). Traditional text-only or LLM-only chunking often traded off badly: high cost or low quality or slow.
With vision-driven chunking we managed to hit a better spot:
- Quality: The retrieved chunks kept their visual context, which improved end-user answer relevance; our relevance score (manual eval) rose from 72% to 88%.
- Performance: Preprocessing ran faster because we skipped LLM-based chunking and produced cleaner, lighter data, dropping from roughly 4s to 1.5s per document.
- Cost: By using the right method for each file, we cut LLM usage and reduced vector entries, saving ~40% in costs for us and our users.
We’ve nailed most cases with our vision layer, but occasionally fall back to LLMs for tricky or unusual layouts.
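To illustrate, one way such a fallback could be wired up, reusing the hypothetical Block, classify_document, and chunk_blocks from the sketch above: route on a layout-confidence signal and only hand the flattened text to an LLM-based chunker when the vision path looks unreliable. The confidence signal and the 0.7 threshold are assumptions for illustration, not our tuned values.

```python
from typing import Callable

def chunk_with_fallback(
    blocks: list[Block],
    layout_confidence: float,                 # e.g. mean detection confidence from the layout model
    llm_chunker: Callable[[str], list[str]],  # LLM-based chunker used only on the rare path
    threshold: float = 0.7,                   # illustrative cut-off
) -> list[str]:
    """Prefer the cheap vision path; fall back to an LLM chunker for unusual layouts."""
    if layout_confidence >= threshold:
        return chunk_blocks(blocks, classify_document(blocks))
    # Rare path: flatten the page and let an LLM propose chunk boundaries.
    raw_text = "\n".join(b.text for b in blocks if b.text)
    return llm_chunker(raw_text)
```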
Real-World Applications
- Ingesting brochures and product catalogues (lots of images + callouts)
- Parsing slide decks and pitch decks with mixed visuals and text
- Handling scanned documents or forms where text is embedded in images
- Interpreting dashboards, medical reports, or infographics where the visual layout carries meaning
Looking ahead:
- Integrating multi-modal models (vision + language) to better understand visuals rather than just extract text.
- On-device or edge chunking for privacy or high-volume usage.
- Smarter embedding of visual features (not just text) — e.g., image embeddings + text embeddings combined (a small sketch of this follows the list).
- Better analytics: understanding “why” certain chunks are retrieved and refining strategy based on that feedback.
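On the combined-embeddings item, a small sketch of one possible shape: normalise a text vector and an image vector, then fuse them into a single retrieval vector. The encoders producing those vectors and the weighting below are placeholders, not a finished design.

```python
import numpy as np

def combined_embedding(
    text_vec: np.ndarray,
    image_vec: np.ndarray,
    text_weight: float = 0.6,   # illustrative balance between modalities
) -> np.ndarray:
    """Fuse a text embedding and an image embedding into one retrieval vector.
    Each part is L2-normalised first so neither modality dominates by scale."""
    t = text_vec / (np.linalg.norm(text_vec) + 1e-9)
    v = image_vec / (np.linalg.norm(image_vec) + 1e-9)
    return np.concatenate([text_weight * t, (1.0 - text_weight) * v])
```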
Conclusion
RAG systems can’t treat everything as plain text. At Amarsia, we believe seeing is as important as reading. Our vision-driven chunking improves retrieval, cuts costs, and handles real-world, image-heavy data.
Guided by our mission of “most widespread use of AI around the globe”, we’ll keep making powerful tools effortless for everyone to use.
