Use Your Data: Scraping PDFs
Part 2: Extracting the data from PDF, cleaning and Analysis
Hi there. This article is a continuation of Part 1, and shows, in simple steps, how to scrape data from PDFs. If you have not read the previous article, which shows how to automate downloading attachments from emails, you can read it here.
Before we proceed further, let’s do a check on what our goal is, what we’ve done so far, and what is left to do.
Goal: How much do I owe in back payments for child care? And how do I stop these payment discrepancies from recurring?
Steps to achieve goal:
- Download attachments from emails that contain statements of entitlement (these PDFs contain relevant data such as the number of hours each child spent at the daycare, the childcare subsidy paid by Centrelink, and the balance to be paid by the parents).
- Extract the data from the PDFs.
- Clean the data.
- Run some analysis. Check the amounts and find out how much is outstanding.
Item 1 has been achieved. Let’s work on the remaining items. We will be using Python and some third-party libraries to achieve our goal.
Extract the data from the PDF
A sample PDF has been provided here — the original document cannot be used due to its sensitive nature. Our goal specifically is to extract all the tables from the PDF. Please note that the PDF has to be text-based, not a scanned document. If you can click and drag to select text in your table in a PDF viewer, then the PDF is text-based. Below is a portion of the sample PDF.
To extract the tables from the PDF, we will use a powerful third party library called Camelot. Camelot’s syntax is simple:
import camelot

tables = camelot.read_pdf('foo.pdf')
Simple, right? Just call the read_pdf function and you have your tables. If the PDF has more than one page, the syntax becomes:
tables = camelot.read_pdf('foo.pdf', pages='1,2,3')
#or you can just set pages='all' to read in all the pages in the pdf
You could also mix single pages and page ranges in that string, as shown below.
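For instance (the page numbers here are purely illustrative):
tables = camelot.read_pdf('foo.pdf', pages='1,4-10')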
If the table cells in the PDF have demarcated lines between them, Camelot automatically works on the PDF using lattice as the default flavor. If, however, the cells do not have demarcated lines, simply set the flavor to stream:
tables = camelot.read_pdf('foo.pdf', pages='1-3', flavor='stream')
To find out how many tables were extracted, simply run the code below:
tables.n
To view a table as a pandas dataframe, simply access the df attribute of the table:
tables[0].df
There are many more options, as well as the installation procedure, on the library website here.
For our purpose, the PDF has multiple pages and its tables have no demarcated lines, so we’ll use flavor='stream'. But what if we don’t know how many pages there are? Simple: pass the string 'all' to the pages argument and Camelot reads every page.
So, let’s extract our tables:
import camelot

# file is the path to the PDF downloaded in Part 1
table = camelot.read_pdf(file,
                         pages='all',
                         flavor='stream',
                         edge_tol=500)
In the code above, we imported Camelot and read in the PDF file using the read_pdf function. Note that flavor is set to stream, and there is an additional argument, edge_tol, which helps improve the detected table area. This gets us all the tables we need.
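Before we start cleaning, it is worth checking how well the extraction went. Camelot exposes a parsing_report on each table, with rough quality metrics; a quick loop over the table object above might look like this:
for t in range(table.n):
    # accuracy and whitespace give a rough sense of extraction quality
    print(t, table[t].parsing_report)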
Not bad. We have achieved Item 2, which is to extract the tables from the PDFs. However, our data is mangled and quite unstructured, which is to be expected sometimes when pulling tables from PDFs. The image below shows how one of the tables looks: quite messy. We have rows that are not relevant (rows 4 and 5, for example); we only need the rows that have dates. Also, some columns are empty or contain ‘….’; they are irrelevant and should be discarded.
This leads to Item 3, cleaning the data. And we have the tool to achieve this in Python: pandas. Alright, let’s break down the cleaning tasks. What do we do to make the data ‘clean’?
- We only need the rows that have a date; the rest can be discarded.
- Some columns have dots (….), which represent null or empty cells. We need to clean these cells, and if a column is null in every cell, we remove it.
- Not all the extracted tables have headers. Let’s use uniform column headers for the tables.
- Let’s merge all tables into a single table.
Alright, get the hoses! Let’s clean.
1. Discard rows that are not needed; only keep rows that have dates.
def filter_index(dataframe):
    '''
    filter for only indexes that start with the day of the week
    '''
    date_filter = ('Mon','Tue','Wed','Thu','Fri','Sat','Sun')
    index_filter = dataframe.index.str.strip().str.startswith(date_filter)
    dataframe = dataframe.loc[index_filter]
    return dataframe
The function above filters the dataframe for only rows whose indexes start with ‘Mon’, ‘Tue’, and so on, and discards the rest.
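To make the behavior concrete, here is a quick sanity check on a toy dataframe (the values are made up, but mimic the PDF layout, where the first column starts with a day of the week):
import pandas as pd

toy = pd.DataFrame({'Hours': ['8', '', '7']},
                   index=['Mon 01/02/2021', 'Weekly Total', 'Tue 02/02/2021'])
print(filter_index(toy))  # keeps only the 'Mon ...' and 'Tue ...' rows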
2. Clean cells with multiple dots, and get rid of columns that have all cells empty/null
def clean_columns(dataframe):
    '''
    Get rid of the dots and drop empty columns
    '''
    for col in dataframe.columns:
        dataframe[col] = dataframe[col].str.strip('.')
    cols = [col for col in dataframe.columns
            if (dataframe[col] == "").all()]
    dataframe = dataframe.drop(cols, axis=1)
    return dataframe
Explanation: The for loop strips leading and trailing dots from each cell, if they exist. The two lines after the loop look for columns where every cell is empty (“”) and drop them from the dataframe.
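Again, a toy example makes this concrete (values invented for illustration). Note that strip('.') only removes dots at the ends of a cell, so interior decimal points survive:
import pandas as pd

toy = pd.DataFrame({'A': ['....', '...'],
                    'B': ['$10.50.', '$9.00']})
print(clean_columns(toy))  # 'A' becomes all-empty and is dropped; 'B' keeps its values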
3. Set uniform headers for all tables extracted.
def column_rename(dataframe):
    '''
    change the column headers to something more sensible
    '''
    col_names = ['Start', 'End', 'Hours',
                 'Attendance_Sign_In',
                 'Attendance_Sign_Out',
                 'Attendance_Hours',
                 'Fee_Before_Discounts',
                 'Total_Fee_For_CCS',
                 'Hourly_Fee',
                 'CCS_Hours',
                 'Fee_Reduction_to_Service_CCS',
                 'Parent_Payment',
                 'Educator_ID']
    dataframe.columns = col_names
    return dataframe
Explanation: This does what it says: it renames the columns of the table, and it will be applied to every extracted table. Note that the thirteen names must match the number of columns left after cleaning, otherwise pandas will raise an error.
One more thing: change the data types of the dataframe so everything is in the right format (date type, numeric type).
def change_dtype(dataframe):
    # here I made changes to the data types
    dataframe.index = pd.to_datetime(dataframe.index,
                                     dayfirst=True).rename('Date')
    # strip the literal dollar signs (regex=False treats '$' as plain text)
    dataframe['Fee_Before_Discounts'] = pd.to_numeric(
        dataframe['Fee_Before_Discounts'].str.replace('$', '', regex=False))
    dataframe['CCS_Hours'] = pd.to_numeric(dataframe['CCS_Hours'])
    dataframe['Total_Fee_For_CCS'] = pd.to_numeric(
        dataframe['Total_Fee_For_CCS'].str.replace('$', '', regex=False))
    dataframe['Hourly_Fee'] = pd.to_numeric(dataframe['Hourly_Fee'])
    dataframe['Fee_Reduction_to_Service_CCS'] = pd.to_numeric(
        dataframe['Fee_Reduction_to_Service_CCS'])
    dataframe['Parent_Payment'] = pd.to_numeric(dataframe['Parent_Payment'])
    # add a Day column (Mon, Tue, ...) derived from the new date index
    dataframe.insert(0, 'Day', dataframe.index.day_name())
    return dataframe
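A caveat, in case the extracted text is not perfectly clean: pd.to_numeric raises an error on values it cannot parse. Passing errors='coerce' instead turns the offenders into NaN so you can inspect them afterwards; this is an optional variation, not what the function above does:
import pandas as pd

s = pd.Series(['10.50', 'N/A', '9.00'])
print(pd.to_numeric(s, errors='coerce'))  # 'N/A' becomes NaN instead of raising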
We have all our functions; now let’s apply them to the tables and clean them. We will employ a for loop to achieve this:
import pandas as pd

data_list = []
for t in range(table.n):
    data = table[t].df
    # skip fragments that don't have the full set of columns
    if len(data.columns) < 13:
        continue
    data_fixed = (data.set_index(0)
                      .pipe(filter_index)
                      .pipe(clean_columns)
                      .pipe(column_rename)
                      .pipe(change_dtype)
                  )
    data_list.append(data_fixed)
We have our cleaned tables in a list. The last step is to lump them all into a single dataframe:
data_final = pd.concat(data_list, ignore_index=False, sort=False)
Tadaaa! Our job is done. The image below shows a part of our final output.
Now we can run our analysis and find out how much we need to pay back for child care. I won’t run any analysis here, since it is just a sample PDF with silly figures.
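That said, here is a minimal sketch of what such a check could look like on data_final, using the column names we defined above (illustrative only, not a definitive reconciliation):
# total and week-by-week parent payments from the cleaned data
total_owed = data_final['Parent_Payment'].sum()
weekly = data_final['Parent_Payment'].resample('W').sum()
print(f'Total parent payments: {total_owed:.2f}')
print(weekly)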
So, there you have it. We now know how to automate downloads from our email, scrape PDFs, and do some data cleaning. The entire code is in my GitHub repository; you can view it here. Break it, test it, and let me know your thoughts. Feedback is much appreciated.
References:
Camelot: https://camelot-py.readthedocs.io/en/master/
Pandas: https://pandas.pydata.org
My GitHub repository: https://github.com/samukweku/PDF_Extraction