Preparing PDFs for RAGs
I created a graph storage from dozens of annual reports (with tables)
Converting PDFs to text has always been possible, but it has never been easier.
I recently created a graph data store to be used in a RAG. In other words, I built a GraphRAG.
Graph RAGs are a fantastic alternative to other RAG apps, such as the widely used vector store-backed RAGs, because they bring reasoning to the table. For example, with semantic similarity search (the technique vector stores use to retrieve information), you could ask who the CFO of XYZ, Inc. was last year, because last year’s annual report would explicitly mention its CFO. But think of a question like this: Which two directors of XYZ, Inc. studied at the same school? The retrieval process won’t be able to fetch the relevant information unless the question mentions a school name. A graph RAG could do it.
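To make the directors question concrete, here is a minimal sketch of the kind of lookup a graph makes easy. It uses plain Python rather than a real graph database, and everything in it is an assumption for illustration: the people, the schools, the relation names `DIRECTOR_OF` and `STUDIED_AT`, and the helper `directors_sharing_school` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (subject, relation, object) triples.
# None of these names come from a real annual report.
TRIPLES = [
    ("Alice Perera", "DIRECTOR_OF", "XYZ, Inc."),
    ("Bob Silva", "DIRECTOR_OF", "XYZ, Inc."),
    ("Carol Fernando", "DIRECTOR_OF", "XYZ, Inc."),
    ("Alice Perera", "STUDIED_AT", "Example University"),
    ("Bob Silva", "STUDIED_AT", "Sample Institute"),
    ("Carol Fernando", "STUDIED_AT", "Example University"),
]


def directors_sharing_school(triples, company):
    """Return (director_a, director_b, school) for directors of `company`
    who studied at the same school."""
    # Hop 1: find everyone connected to the company as a director.
    directors = {s for s, r, o in triples if r == "DIRECTOR_OF" and o == company}

    # Hop 2: group those directors by the school they are connected to.
    by_school = defaultdict(set)
    for s, r, o in triples:
        if r == "STUDIED_AT" and s in directors:
            by_school[o].add(s)

    # Any school with two or more directors yields matching pairs.
    pairs = []
    for school, people in sorted(by_school.items()):
        for a, b in combinations(sorted(people), 2):
            pairs.append((a, b, school))
    return pairs


if __name__ == "__main__":
    for a, b, school in directors_sharing_school(TRIPLES, "XYZ, Inc."):
        print(f"{a} and {b} both studied at {school}")
```

Notice that the question never names a school. The answer comes from following two relationships and comparing where they end up, which is exactly what a similarity search over text chunks has no way to do.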
However, the key issue here is how we construct the graph for retrieval. I’ve addressed this issue in a separate post recently. Thinking another step backward, how do we even prepare the annual reports in a way that makes it easier to create the graphs?
That’s the focus of this article.
The first engineering step of all our work is converting data from PDFs to text. However, annual reports are complex documents. There won’t be just text. There’ll be charts, tables, etc. Each provides a vital piece of information about the company.
So, let’s start from there.
How to convert PDFs to rich text
Most Python programmers have used a PDF reader library at some point, at least to follow along with a tutorial. The most popular one was PyPDF2.
Most of these libraries do get the job done, but the text they produce isn’t very useful.
The PyPDF2 library, which I have known about for years, extracts all your PDF content as text without any formatting. After extraction, you have no idea…