<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Souravakumarbehera on Medium]]></title>
        <description><![CDATA[Stories by Souravakumarbehera on Medium]]></description>
        <link>https://medium.com/@souravakumarbehera03?source=rss-8c772aebe9be------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*kAFxJ3p6CMHowudf</url>
            <title>Stories by Souravakumarbehera on Medium</title>
            <link>https://medium.com/@souravakumarbehera03?source=rss-8c772aebe9be------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 22 May 2026 00:17:02 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@souravakumarbehera03/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Chunking Is Easy. Parsing Is Hard.]]></title>
            <link>https://medium.com/@souravakumarbehera03/chunking-is-easy-parsing-is-hard-0957356263cf?source=rss-8c772aebe9be------2</link>
            <guid isPermaLink="false">https://medium.com/p/0957356263cf</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Souravakumarbehera]]></dc:creator>
            <pubDate>Mon, 18 May 2026 18:00:47 GMT</pubDate>
            <atom:updated>2026-05-18T18:24:38.931Z</atom:updated>
            <content:encoded><![CDATA[<h4>Why Your RAG Pipeline Is Reasoning Over Broken Data.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ix7C3lND093Uk0S2hNcTvw.png" /></figure><h3>Section 1 — The Evolution of RAG Pipelines</h3><p>A production RAG system once confidently answered a question about a financial table with completely wrong numbers. The embeddings were fine. The retrieval was fine. The problem was sitting 200 lines earlier in the pipeline, in the parser nobody had looked at.</p><p>RAG pipelines start simple: grab a document. Split it into 512-token chunks. Embed them. Store them. Done.</p><h4>The First Problem: Fixed-Size Chunking Is Blind</h4><p>A fixed-size chunker has no idea what it’s cutting through. A table, an equation, a figure caption: all look the same to a token counter. It splits wherever the number says to.</p><p>The result? Chunks that look valid but are semantically broken. Your LLM reasons over half a table. Confidently. Wrongly.</p><h4>The Community Moved to Semantic Chunking</h4><p>Split on meaning, not token count. Sentence transformers detect where one idea ends and another begins. A real improvement for prose.</p><p>But there was still a fundamental problem.</p><p>The document was still treated as a flat wall of text:</p><ul><li>A table was just text</li><li>An equation was just text</li><li>A figure caption merged with the paragraph below it, also just text</li></ul><p>Semantic chunking found better boundaries. It just had nothing structural to work with.</p><h4>Then Came Hierarchical Chunking — This Changed Things</h4><p>The insight was obvious in hindsight: documents are not flat. They have structure. A paper has sections, subsections, paragraphs, tables, figures, equations. Each plays a different role. 
Each needs a different retrieval granularity.</p><p>Hierarchical chunking maps this explicitly:</p><ul><li><strong>Parent nodes</strong> and <strong>child nodes</strong></li><li><strong>Element-level metadata</strong></li><li>Retrievers that can fetch a full section for broad queries, or a single table row for precise ones</li></ul><p>Hybrid chunking pushed further, combining structural boundaries with semantic similarity to produce chunks that are both document-aware and meaning-aware.</p><h4>These Are Genuinely Better Strategies — But They Share One Silent Assumption</h4><p>That the parser correctly identified what each element actually is.</p><ul><li>Hierarchical chunking needs to know: <em>this is a heading. This is a table. This is a code block.</em></li><li>Hybrid chunking needs clean semantic units</li><li>Element-aware splitting needs elements that were actually detected as elements</li></ul><p>If your parser outputs a flat list of undifferentiated text strings, none of that works. You’re just cutting up the same wall of text. Slightly more cleverly.</p><h4>The Dependency the RAG Community Underinvests In</h4><p>Two parsers sit at this foundation more than any others: <a href="https://github.com/docling-project/docling"><strong>Docling</strong></a> and <a href="https://github.com/Unstructured-IO/unstructured"><strong>Unstructured</strong></a><strong>.</strong></p><p>Everyone debates chunking strategies. Very few people ask what the parser produced before the chunks are even made.</p><p>The parser is not a preprocessing step you configure once and forget. <strong>It is the foundation everything else rests on.</strong></p><h3>Section 2 — Under The Hood</h3><p>Two parsers dominate this space: <strong>Docling</strong> and <strong>Unstructured</strong>. Both take a PDF as input. Both give you text as output. 
So what’s the difference?</p><p><em>Here’s how their pipelines actually compare:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*btzqqQ065w7IbLRBQAxPng.png" /></figure><p><strong>Docling’s output is a tree. Unstructured’s output is a list.</strong></p><pre># Docling (tree)<br>{<br>  &quot;type&quot;: &quot;section&quot;,<br>  &quot;heading&quot;: &quot;Results&quot;,<br>  &quot;children&quot;: [<br>    { &quot;type&quot;: &quot;table&quot;, &quot;data&quot;: [...] },<br>    { &quot;type&quot;: &quot;paragraph&quot;, &quot;text&quot;: &quot;...&quot; }<br>  ]<br>}<br><br># Unstructured (list)<br>{ &quot;type&quot;: &quot;Title&quot;, &quot;text&quot;: &quot;Results&quot;, &quot;parent_id&quot;: null }<br>{ &quot;type&quot;: &quot;Table&quot;, &quot;text&quot;: &quot;...&quot;, &quot;parent_id&quot;: &quot;abc123&quot; }<br>{ &quot;type&quot;: &quot;NarrativeText&quot;, &quot;text&quot;: &quot;...&quot;, &quot;parent_id&quot;: &quot;abc123&quot; }</pre><p>A tree has hierarchy. Headings, sections, tables, equations: each is a typed node with a known role and a known position in the document structure. Docling’s hierarchy is <strong>structural</strong> (built into the tree); Unstructured’s is <strong>inferential</strong> (reconstructed from metadata).</p><p>A list is just elements. One after another. Element types exist (Title, NarrativeText, Table), but hierarchy is not structural. It’s implicit, encoded as parent_id metadata pointers you have to follow yourself. There&#39;s no native way to walk sections or ask what lives under a heading.</p><h3>Section 3 — The Evidence</h3><p>Let’s look at what actually comes out of each parser.</p><p>Four element types. Four failure modes. 
All taken from real academic papers.</p><h3>3.1 — Figures</h3><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d4b307354c37e5d725cbe0118bc0e895/href">https://medium.com/media/d4b307354c37e5d725cbe0118bc0e895/href</a></iframe><h3>3.2 — Equations</h3><p><em>A broken equation doesn’t just produce a bad chunk; it produces a confidently wrong one. The text looks like math. The LLM treats it like math. The answer is nonsense.</em></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7c855fdc21e40fa06c486b6466ba2243/href">https://medium.com/media/7c855fdc21e40fa06c486b6466ba2243/href</a></iframe><p><em>Your chunk contains machine-readable LaTeX, not OCR noise.</em></p><h3>3.3 — Algorithms</h3><p><em>Pseudocode is structure-dense. Indentation matters. Symbols matter. Line order matters.</em></p><p><em>When a parser treats an algorithm block as plain prose, you get symbol soup in your chunk. An LLM reasoning over that produces plausible-looking but logically broken answers.</em></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/908c541aa565d3a48b3de2916132c8eb/href">https://medium.com/media/908c541aa565d3a48b3de2916132c8eb/href</a></iframe><p><em>Indentation preserved. Symbols intact. A chunk your LLM can actually reason over.</em></p><h3>3.4 — Tables</h3><p><em>Tables are where chunking strategies most visibly break.</em></p><p><em>A naive chunker hitting an HTML table dump either splits mid-row or lumps the entire table into one oversized chunk. Neither works for retrieval. 
The root cause: no schema, no structure signal, nothing for your chunker to work with.</em></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f1b7a3f4b5318b228d26c75bda5ae90a/href">https://medium.com/media/f1b7a3f4b5318b228d26c75bda5ae90a/href</a></iframe><p><em>A typed, schema-defined table node. Your chunker knows exactly what it’s working with.</em></p><h3>Conclusion</h3><p>Both are solid tools. But if your RAG pipeline is built on structured documents and your chunking strategy depends on knowing what each element actually is, the parser choice is clear.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5b2fb2048117bedc136abb7d477bb1c6/href">https://medium.com/media/5b2fb2048117bedc136abb7d477bb1c6/href</a></iframe><p>Docling uses CodeFormulaV2 for equation recognition and TableFormer for table reconstruction; both contribute directly to the accuracy gaps shown above.</p><p>Get the parsing right first. Everything else follows.</p><p><strong>Acknowledgements</strong></p><p>This article was a collaborative effort of <a href="https://www.linkedin.com/in/souravakumarbehera/"><strong><em>Sourava Kumar Behera</em></strong></a> &amp; <a href="https://www.linkedin.com/in/dhruv-bhatnagar63/"><strong><em>Dhruv Bhatnagar</em></strong></a><strong><em>.</em></strong></p><p><a href="https://github.com/Souravakb24/ContextFlow/blob/main/Document_Parser.md">[GitHub Link]</a></p><p>If this changed how you think about your pipeline, share it.</p>]]></content:encoded>
        </item>
    </channel>
</rss>