Handling Data Stored Across Multiple PDF Files With Python
If you work as a data analyst (or a similar role) in the healthcare or biotech industry, you likely deal with PDF files quite often. Reports made by Doctors after a check up are often created in PDF format, as are results generated by software for certain medical devices. Sometimes it is desirable to store these data in one database instead of keeping the data stored across multiple pdf files. This could be as simple as moving the data from PDF files to an Excel spreadsheet. But who wants to enter those data manually? Entering the data manually into a spreadsheet is time that could be better spent and is prone to human error. It would be better to find some way to automate this process, but parsing pdf files can be difficult. I’ve had to deal with this issue in my own workplace, so I wanted to share my approach in case anyone else is in a similar situation.
There are several Python libraries designed for parsing PDF files, but some are not suitable for more complex PDF files. My favorite library for parsing complex PDF files is PDFQuery. The contributors to this library have done a really great job at creating a user friendly tool for parsing PDF files. PDFQuery reads a PDF file in as an element object, similar to the way you would parse an XML file. PDFQuery accesses the desired information by its specific location inside of the PDF document. Lets jump right into an example…
I created a Github repo here so that anyone can follow along. The first step to parsing any pdf file with PDFQuery is to read in the file as an element object and load it:
pdf = pdfquery.PDFQuery('samplepdf1.pdf')
pdf.load()Lets say we want to pull the patients name from samplepdf1. I find the data I want by first writing the PDF to an XML file with PDFQuery:
pdf.tree.write('samplepdf1.xml', pretty_print=True)I like to look at the XML file with SublimeText, but any text editor would do just fine. If the file is large, I hit ‘command f’ and type in the text I am looking for:

We can see the information we are looking for inside the ‘LTTextBoxHorizontal’ tag, and we can see the metadata associated with it. One way we can access this data is by its bounding box, or ‘bbox’.
patient_name = pdf.pq('LTTextLineHorizontal:in_bbox("72.0, 543.675, 205.474, 560.795")').text()Sometimes the the data we want is not in the exact same location in every file. Luckily, PDFQuery can also query tags that contain a given string. Let’s get the patients name again using this method, but first we have to clean the text data a bit; we only want the patients name which comes after the colon:
def clean_text_data(text):
return text.split(':')[1]patient_name = clean_text_data(pdf.pq('LTTextLineHorizontal:contains("Patient Name")').text())
We can use the methods discussed above to create a pandas dataframe from several PDF files. We just need to write a function that takes in a list of PDF files and iterates through each file, loading and parsing one by one. There are several approaches, but something like this will work.
Now we have a function that will iteratively parse a list of pdf files and put the data into a pandas dataframe. From here it is really easy to move the data to any type of database. We can move the data to a SQL database, a CSV file, and many other file formats. For now, we will simply write the data to a Microsoft Excel file. Lets write a function to help us do that:
def write_to_excel(df):
writer = pd.ExcelWriter('example_patient_data.xlsx')
df.to_excel(writer,'Main',index=False)
writer.save()We are just about done. The final step is to call the functions we’ve written. But the process isn’t really ‘automated’ if we have to go into the script and modify the list of PDF files every time there are new files to parse. Let’s use argparse so that we can simply pass the PDF files from the command line:
def main(): parser = argparse.ArgumentParser(description = 'Parsing PDF tutorial')
parser.add_argument('--parse', nargs='+', required=True) args = parser.parse_args() if args.parse:
to_parse = args.parse pdata = pdf_to_df(to_parse) write_to_excel(pdata)
Now we run the script like this:
python pdfexample.py --parse 'samplepdf1.pdf' 'samplepdf2.pdf' 'samplepdf3.pdf' 'samplepdf4.pdf'Full code below:
This article barely scratches the surface of what PDFQuery can do. There obviously will be situations where the PDF files are much more complex and inconsistent than the ones provided in this example. The PDFQuery documentation walks through many more examples of complex situations, so I highly encourage everyone to check it out.
Thanks for reading! I hope this helps someone add value to their workplace!
