Uncovering AI Insights from a Single PDF: Part 4 — Conclusion

To or To-Not Add a Text Splitter

Ruby Valappil
R7B7 Tech Blog


Image By Author: RAG System after Adding Text Splitter

I highly recommend going through Part 1, Part 2 and Part 3, if you haven’t already, before reading this final part.

Just to put things into perspective, let’s take a quick look at the system we started with:

Image By Author: RAG System v1

What changes will be part of version 5 (versions 3 and 4 were covered in Part 3 of this series)? Here’s the list:

  1. Text Splitter
  2. Fixed — Warning Messages from Logs
  3. Summarizer??

Text Splitter

In the previous versions, we used PDF extractors (PyPDF and pdfplumber) to read PDF files page by page. We then used each page’s content as one single chunk.

Since LLMs work with tokens and LLM providers charge by the token, passing a huge chunk of data to the LLM is a costly business. Another drawback of that approach is that it adds a lot of irrelevant content for the LLM to parse through, which affects its performance.
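To make the difference concrete, here is a minimal sketch of what a text splitter does to one page of text. It is not the splitter used in this series: the `split_text` function, the `chunk_size` and `overlap` values, and the sample `page_text` are all assumptions made up for illustration, and the sketch counts characters rather than tokens to stay dependency-free.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly.

    Hypothetical helper for illustration; real splitters usually also try
    to break on paragraph or sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Stand-in for the text extracted from a single PDF page (made-up content).
page_text = "Retrieval works best on small, focused passages. " * 40

chunks = split_text(page_text, chunk_size=200, overlap=20)
print(f"whole page as one chunk: {len(page_text)} characters")
print(f"after splitting: {len(chunks)} chunks, "
      f"longest is {max(len(c) for c in chunks)} characters")
```

With chunks like these, only the few pieces most relevant to a question need to be sent to the LLM instead of the entire page. The small overlap is there so that a sentence cut at a chunk boundary still appears intact in at least one chunk.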
