Use Your Data: Scraping PDFs

Part 2: Extracting the data from the PDFs, cleaning, and analysis

oranyeli samuel · Published in Analytics Vidhya · Oct 14, 2019

Hi there. This article, a continuation of Part 1, shows in simple steps how to scrape data from PDFs. If you have not read the previous article, which shows how to automate downloads of attachments from emails, you can read it here.

Before we proceed further, let’s do a check on what our goal is, what we’ve done so far, and what is left to do.

Goal: How much do I owe in back payments for child care? And how do I stop these payment discrepancies from recurring?

Steps to achieve goal:

  1. Download attachments from email that contain statement of entitlements (these PDFs contain relevant data such as number of hours spent per child at the daycare, childcare subsidy paid by Centrelink, and the balance to be paid by the parents).
  2. Extract the data from the PDFs.
  3. Clean the data.
  4. Run some analysis. Check the amounts and find out how much is outstanding.

Item 1 has been achieved. Let’s work on the remaining items. We will be using Python and some third party libraries to achieve our goal.

Extract the data from the PDF

A sample PDF has been provided here; the original document cannot be used due to its sensitive nature. Our goal specifically is to extract all the tables from the PDF. Please note that the PDF has to be text-based, not a scanned document. If you can click and drag to select text in your table in a PDF viewer, then the PDF is text-based. Below is a portion of the sample PDF.

To extract the tables from the PDF, we will use a powerful third party library called Camelot. Camelot’s syntax is simple:
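
A minimal example, matching Camelot's documented usage ('foo.pdf' is a placeholder filename, here and in the snippets that follow):

    import camelot

    # read_pdf detects and extracts the tables, returning a TableList
    tables = camelot.read_pdf('foo.pdf')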

Simple, right? Just use the read_pdf method and you have your tables. If the PDF has more than one page, the syntax becomes:
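
Something like this; pages takes a string of comma-separated page numbers:

    # extract tables from pages 1, 2 and 3 only
    tables = camelot.read_pdf('foo.pdf', pages='1,2,3')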

You could also set the pages as a range, e.g. pages='1,4-10'.

If the table cells in the PDF have demarcated lines between them, then Camelot automatically works on the PDF using lattice as the default flavor. If, however, the cells do not have demarcated lines, simply set the flavor to stream:
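
    # 'stream' infers the table structure from whitespace instead of ruled lines
    tables = camelot.read_pdf('foo.pdf', flavor='stream')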

To find out how many tables were extracted, simply run the code below:
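
    # the TableList reports how many tables were detected
    tables.n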

To view a table as a pandas dataframe, simply access the df attribute on the table:
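
    # each Table holds its data as a pandas DataFrame in the .df attribute
    tables[0].df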

There are many more options, as well as the installation procedure, on the library website here.

For our purpose, there are multiple pages in the PDF and it does not have demarcated lines, so we'll use flavor='stream'. But how do we handle not knowing the number of pages? Simple: pass the string 'all' to the pages argument.

So, let’s extract our tables:
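
A sketch of the call (the filename is a placeholder, and edge_tol=500 is an assumed value; tune it to your document):

    import camelot

    # read every page; 'stream' because the cells have no ruled lines.
    # edge_tol extends the detected text region so loose rows are not cut off
    tables = camelot.read_pdf('sample.pdf', pages='all',
                              flavor='stream', edge_tol=500)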

In the code above, we imported camelot and read in the PDF file using the read_pdf method. Note that flavor is set to stream, and there is an additional argument, edge_tol, which helps improve the detected table area. This gets us all the tables we need.

Not bad. We have achieved Item 2, which is to extract the tables from the PDF. However, our data is mangled and quite unstructured, which is to be expected sometimes when pulling tables from PDFs. The image below shows how one of the tables looks: quite messy. We have rows that are not relevant (rows 4 and 5, for example); we only need the rows that have dates. There are also some columns that are empty or have '….'; they are irrelevant and should be discarded.

This leads to Item 3 — cleaning the data. And we have the tool to achieve this in Python — Pandas. Alright, let’s break down the cleaning tasks. What do we do to make it ‘clean’?

  1. We only need the rows that have a date, the rest can be discarded.
  2. Some columns have dots (….) which represent null or empty cells. We need to clean these cells, and if there are columns that are null in all cells, we remove those columns.
  3. Not all the extracted tables have headers. Let’s use uniform column headers for the tables.
  4. Let’s merge all tables into a single table.

Alright, get the hoses! Let's clean.

  1. Discard rows that are not needed, and only keep rows that have dates.
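
A minimal sketch of such a filter (the exact code in the repo may differ; it assumes the dates sit in the dataframe's index):

    DAYS = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')

    def keep_date_rows(df):
        # keep only the rows whose index starts with a day-of-week abbreviation
        mask = df.index.astype(str).str.match('|'.join(DAYS))
        return df[mask]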

The function above filters the dataframe to keep only the rows that have 'Mon', 'Tue', … as indexes and discards the rest.

2. Clean cells with multiple dots, and get rid of columns that have all cells empty/null
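
A sketch, assuming every cell is a string (which is what Camelot produces):

    def strip_dots_and_drop_empty(df):
        # strip leading and trailing dots from every cell, column by column
        for col in df.columns:
            df[col] = df[col].str.strip('.')
        # find the columns where every cell is now an empty string, and drop them
        empty_cols = df.columns[(df == '').all()]
        return df.drop(columns=empty_cols)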

Explanation: The for loop above strips the leading and trailing dots from each cell, if they exist. The next two lines after the for loop look for columns where all the cells are empty ("") and drop them out of the dataframe.

3. Set uniform headers for all tables extracted.
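
A sketch; these column names are illustrative (based on the fields described earlier), not necessarily the ones in the repo:

    NEW_HEADERS = ['date', 'hours', 'subsidy', 'balance']

    def set_headers(df):
        # assumes the cleanup leaves exactly len(NEW_HEADERS) columns
        df.columns = NEW_HEADERS
        return df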

Explanation: This does what it says: it renames the columns of the table. This will be applied to every extracted table.

One more thing: change the data types of the dataframe so each column is in the right format (dates as date type, amounts as numeric type).
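
A sketch, using the illustrative headers from above and an assumed date format; adjust both to match your PDF:

    import pandas as pd

    def fix_dtypes(df):
        # assumed date strings like 'Mon 07/10/2019'
        df['date'] = pd.to_datetime(df['date'], format='%a %d/%m/%Y',
                                    errors='coerce')
        # strip any currency symbols, then coerce the rest to numbers
        for col in ('hours', 'subsidy', 'balance'):
            df[col] = pd.to_numeric(
                df[col].astype(str).str.replace('$', '', regex=False),
                errors='coerce')
        return df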

We have all our functions, now let us apply them to the tables and clean them. We will employ a for loop to achieve this:
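
Putting the pieces together, something like:

    cleaned_tables = []
    for table in tables:
        df = table.df.set_index(0)          # Camelot numbers the columns; column 0 holds the dates
        df = keep_date_rows(df)
        df = strip_dots_and_drop_empty(df)
        df = set_headers(df.reset_index())  # bring the dates back as a regular column
        df = fix_dtypes(df)
        cleaned_tables.append(df)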

We have our cleaned tables in a list. The last step is to lump them all into a single dataframe:
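
    # stack all the cleaned tables into one dataframe, renumbering the rows
    final = pd.concat(cleaned_tables, ignore_index=True)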

Tadaaa! Our job is done. The image below shows a part of our final output.

Now we can run our analysis and find out how much we need to pay back for child care. I won't run any analysis here since it is just a sample PDF with silly figures.

So, there you have it. We now know how to automate downloads from our email, scrape PDFs, and do some data cleaning. The entire code is in my GitHub repository; you can view it here. Break it, test it, and let me know your thoughts. Feedback is much appreciated.

References:

Camelot: https://camelot-py.readthedocs.io/en/master/

Pandas: https://pandas.pydata.org

My GitHub repository: https://github.com/samukweku/PDF_Extraction
