Chunking Is Easy. Parsing Is Hard.

Souravakumarbehera — Mon, 18 May 2026 18:00:47 GMT

Why Your RAG Pipeline Is Reasoning Over Broken Data.

Section 1 — The Evolution of RAG Pipelines

A production RAG system once confidently answered a question about a financial table with completely wrong numbers. The embeddings were fine. The retrieval was fine. The problem was sitting 200 lines earlier in the pipeline, in the parser nobody had looked at.

RAG pipelines start simple. Grab a document. Split it into 512-token chunks. Embed them. Store them. Done.

The First Problem: Fixed-Size Chunking Is Blind

A fixed-size chunker has no idea what it’s cutting through. A table, an equation, a figure caption: all look the same to a token counter. It splits wherever the number says to.

The result? Chunks that look valid but are semantically broken. Your LLM reasons over half a table. Confidently. Wrongly.
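Here is a minimal sketch of that failure. It assumes whitespace-separated words as a stand-in for real tokens, and the document text and the chunk size of 8 are made up for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` whitespace tokens, ignoring structure."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


# Hypothetical document: one sentence followed by a small flattened table.
doc = (
    "Revenue grew in every region. "
    "Region Q1 Q2 "
    "North 10 12 "
    "South 7 9"
)

chunks = fixed_size_chunks(doc, size=8)
for chunk in chunks:
    print(repr(chunk))
```

The table header lands in the first chunk and the data rows land in the second, so the numbers in the second chunk arrive without their column labels.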

The Community Moved to Semantic Chunking

Split on meaning, not token count. Sentence transformers detect where one idea ends and another begins. A real improvement for prose.

But there was still a fundamental problem.

The document was still treated as a flat wall of text:

A table was just text
An equation was just text
A figure caption merged with the paragraph below it, also just text

Semantic chunking found better boundaries. It just had nothing structural to work with.
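A rough sketch of the idea, under a loud assumption: a bag-of-words cosine similarity stands in for a real sentence-transformer embedding, and the threshold of 0.2 and the sample sentences are invented for the example.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag of lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])       # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)     # same topic: extend the current chunk
    return chunks


sentences = [
    "The parser reads the PDF layout.",
    "The parser then labels each layout block.",
    "Revenue rose sharply last quarter.",
    "Revenue in the north rose fastest.",
]
chunks = semantic_chunks(sentences)
```

Notice what the function receives: a list of strings. Nothing in the input says whether a string came from a table, an equation, or a caption, so the boundaries can only ever be as good as the text itself.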

Then Came Hierarchical Chunking — This Changed Things

The insight was obvious in hindsight: documents are not flat. They have structure. A paper has sections, subsections, paragraphs, tables, figures, equations. Each plays a different role. Each needs a different retrieval granularity.

Hierarchical chunking maps this explicitly:

Parent nodes and child nodes
Element-level metadata
Retrievers that can fetch a full section for broad queries, or a single table row for precise ones
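One way to picture this is a small node type with children, sketched below. The `Node` class, its `flatten` and `find` helpers, and the sample values are all hypothetical; they illustrate the parent/child idea and are not any library's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One document element with a type, optional text, and child elements."""
    kind: str                      # e.g. "section", "table", "row", "paragraph"
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def flatten(self) -> str:
        """Concatenate this node's text with all of its descendants' text."""
        parts = [self.text] if self.text else []
        parts.extend(child.flatten() for child in self.children)
        return "\n".join(parts)

    def find(self, kind: str) -> list["Node"]:
        """Return this node and every descendant of the given kind, in order."""
        hits = [self] if self.kind == kind else []
        for child in self.children:
            hits.extend(child.find(kind))
        return hits


# Hypothetical section with one paragraph and one two-row table.
results = Node("section", "Results", [
    Node("paragraph", "Accuracy improved on both datasets."),
    Node("table", "Accuracy by dataset", [
        Node("row", "Dataset A | 0.91"),
        Node("row", "Dataset B | 0.87"),
    ]),
])

broad_context = results.flatten()   # whole section, for a broad query
precise_hits = results.find("row")  # individual rows, for a precise query
```

The same tree serves both granularities: a broad query gets the flattened section, a precise one gets a single row node.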

Hybrid chunking pushed further, combining structural boundaries with semantic similarity for chunks that are both document-aware and meaning-aware.

These Are Genuinely Better Strategies — But They Share One Silent Assumption

That the parser correctly identified what each element actually is.

Hierarchical chunking needs to know: this is a heading. This is a table. This is a code block.
Hybrid chunking needs clean semantic units.
Element-aware splitting needs elements that were actually detected as elements.

If your parser outputs a flat list of undifferentiated text strings, none of that works. You’re just cutting up the same wall of text. Slightly more cleverly.
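To make that dependency concrete, here is a small sketch of an element-aware chunker run on the same content twice: once with type labels, once without. The `element_aware_chunks` function, the dictionary shape, and the type names are assumptions made for this example.

```python
def element_aware_chunks(elements: list[dict]) -> list[list[dict]]:
    """Start a new chunk at each heading; keep every table in a chunk of its own."""
    chunks: list[list[dict]] = []
    current: list[dict] = []
    for el in elements:
        if el["type"] == "table":
            if current:                # close whatever came before the table
                chunks.append(current)
                current = []
            chunks.append([el])        # the table stays atomic
        elif el["type"] == "heading":
            if current:
                chunks.append(current)
            current = [el]             # a heading opens a new chunk
        else:
            current.append(el)
    if current:
        chunks.append(current)
    return chunks


texts = ["Results", "Accuracy improved.", "A | 0.91 B | 0.87", "Discussion", "Gains held up."]
labels = ["heading", "paragraph", "table", "heading", "paragraph"]

typed = [{"type": t, "text": s} for t, s in zip(labels, texts)]
untyped = [{"type": "text", "text": s} for s in texts]   # what a flat parser gives you

typed_chunks = element_aware_chunks(typed)
untyped_chunks = element_aware_chunks(untyped)
```

With labels, the chunker produces three chunks and isolates the table. Without them, the identical logic returns one undifferentiated chunk.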

The Dependency the RAG Community Underinvests In

Two parsers sit at this foundation more than any others: Docling and Unstructured.

Everyone debates chunking strategies. Very few people ask what the parser produced before the chunks are even made.

The parser is not a preprocessing step you configure once and forget. It is the foundation everything else rests on.

Section 2 — Under The Hood

Two parsers dominate this space: Docling and Unstructured. Both take a PDF as input. Both give you text as output. So what’s the difference?

Here’s how their pipelines actually compare:

Docling’s output is a tree. Unstructured’s output is a list.

# Docling (tree)
{
  "type": "section",
  "heading": "Results",
  "children": [
    { "type": "table", "data": [...] },
    { "type": "paragraph", "text": "..." }
  ]
}

# Unstructured (list)
{ "element_id": "abc123", "type": "Title", "text": "Results", "parent_id": null }
{ "type": "Table", "text": "...", "parent_id": "abc123" }
{ "type": "NarrativeText", "text": "...", "parent_id": "abc123" }

A tree has hierarchy. Headings, sections, tables, equations: each is a typed node with a known role and a known position in the document structure. Docling’s hierarchy is structural (built into the tree); Unstructured’s is inferential (reconstructed from metadata).
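With a tree, finding what sits under a heading is a plain recursive walk. The sketch below works on the simplified dictionary shape shown above, not on Docling’s actual document model; the `walk` helper and the table values are hypothetical.

```python
def walk(node: dict, path: tuple[str, ...] = ()):
    """Yield (heading_path, node) for every non-section node, depth first."""
    if node["type"] == "section":
        path = path + (node["heading"],)      # extend the heading trail
        for child in node.get("children", []):
            yield from walk(child, path)
    else:
        yield path, node


# Simplified tree in the illustrative shape used above (values are made up).
tree = {
    "type": "section",
    "heading": "Results",
    "children": [
        {"type": "table", "data": [["Dataset", "Accuracy"], ["A", "0.91"]]},
        {"type": "paragraph", "text": "Accuracy improved."},
    ],
}

items = list(walk(tree))
for path, node in items:
    print(" > ".join(path), "|", node["type"])
```

Every leaf comes back with the trail of headings above it, which is exactly the metadata a hierarchical chunker wants attached to each chunk.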

A list is just elements, one after another. Element types exist (Title, NarrativeText, Table), but hierarchy is not structural. It’s implicit, encoded as parent_id metadata pointers you have to follow yourself. There’s no native way to walk sections or ask what lives under a heading.
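Following those pointers yourself looks roughly like this. The sketch uses plain dictionaries in the simplified shape shown above; the `element_id` field, the ids, and both helper functions are assumptions for illustration, not Unstructured’s API.

```python
from collections import defaultdict


def children_by_parent(elements: list[dict]) -> dict:
    """Group elements under the id of their parent (None for top-level elements)."""
    grouped = defaultdict(list)
    for el in elements:
        grouped[el.get("parent_id")].append(el)
    return grouped


def under_heading(elements: list[dict], heading_text: str) -> list[dict]:
    """Return the direct children of the first Title whose text matches."""
    grouped = children_by_parent(elements)
    for el in elements:
        if el["type"] == "Title" and el["text"] == heading_text:
            return grouped.get(el["element_id"], [])
    return []


# Flat element list with hypothetical ids and text.
elements = [
    {"element_id": "abc123", "type": "Title", "text": "Results", "parent_id": None},
    {"element_id": "def456", "type": "Table", "text": "A 0.91 B 0.87", "parent_id": "abc123"},
    {"element_id": "ghi789", "type": "NarrativeText", "text": "Accuracy improved.", "parent_id": "abc123"},
]

section = under_heading(elements, "Results")
```

It works, but the hierarchy exists only after you rebuild it, and it is only as reliable as the parent_id values the parser inferred.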

Section 3 — The Evidence

Let’s look at what actually comes out of each parser.

Four element types. Four failure modes. All taken from real academic papers.

3.1 — Figures

https://medium.com/media/d4b307354c37e5d725cbe0118bc0e895/href

3.2 — Equations

A broken equation doesn’t just produce a bad chunk; it produces a confidently wrong one. The text looks like math. The LLM treats it like math. The answer is nonsense.

https://medium.com/media/7c855fdc21e40fa06c486b6466ba2243/href

Your chunk contains machine-readable LaTeX, not OCR noise.

3.3 — Algorithms

Pseudocode is structure-dense. Indentation matters. Symbols matter. Line order matters.

When a parser treats an algorithm block as plain prose, you get symbol soup in your chunk. An LLM reasoning over that produces plausible-looking but logically broken answers.

https://medium.com/media/908c541aa565d3a48b3de2916132c8eb/href

Indentation preserved. Symbols intact. A chunk your LLM can actually reason over.
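A tiny sketch shows how little it takes to lose that structure. The pseudocode snippet is invented, and whitespace collapsing is used here as a stand-in for whatever flattening a prose-oriented parser applies.

```python
# Hypothetical algorithm block as it appears in the source document.
pseudocode = (
    "for each x in items:\n"
    "    if x > best:\n"
    "        best = x\n"
    "return best"
)

# A common text-cleanup step: collapse every run of whitespace to one space.
flattened = " ".join(pseudocode.split())

# Structure-preserving alternative: keep lines and their leading indentation.
preserved = "\n".join(line.rstrip() for line in pseudocode.splitlines())

print(flattened)
print(preserved)
```

In the flattened version there is no way to tell whether `return best` sits inside the loop or after it. The preserved version keeps that answer in the chunk.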

3.4 — Tables

Tables are where chunking strategies most visibly break.

A naive chunker hitting an HTML table dump either splits mid-row or lumps the entire table into one oversized chunk. Neither works for retrieval. The root cause: no schema, no structure signal, nothing for your chunker to work with.

https://medium.com/media/f1b7a3f4b5318b228d26c75bda5ae90a/href

A typed, schema-defined table node. Your chunker knows exactly what it’s working with.
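Once the table arrives as a header plus rows, a chunker can split on row boundaries and repeat the header in each piece. This is a minimal sketch under that assumption; the `table_chunks` function, the pipe-separated rendering, and the sample figures are hypothetical.

```python
def table_chunks(header: list[str], rows: list[list[str]], rows_per_chunk: int) -> list[str]:
    """Split a table into row groups, repeating the header in every chunk."""
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [header_line] + [" | ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks


# Hypothetical table node contents.
header = ["Region", "Q1", "Q2"]
rows = [["North", "10", "12"], ["South", "7", "9"], ["East", "5", "6"]]

chunks = table_chunks(header, rows, rows_per_chunk=2)
for chunk in chunks:
    print(chunk)
    print("---")
```

No row is ever cut in half, and every chunk carries its own column labels, so a retrieved row still says what its numbers mean.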

Conclusion

Both are solid tools. But if your RAG pipeline is built on structured documents and your chunking strategy depends on knowing what each element actually is, the parser choice is clear.

https://medium.com/media/5b2fb2048117bedc136abb7d477bb1c6/href

Docling uses CodeFormulaV2 for equation recognition and TableFormer for table reconstruction. Both contribute directly to the accuracy gaps shown above.

Get the parsing right first. Everything else follows.

Acknowledgements

This article was a collaborative effort of Sourava Kumar Behera & Dhruv Bhatnagar.

[Github Link]

If this changed how you think about your pipeline, share it.
