Extracting and Cleaning Table Data from PDFs: A Step-By-Step Tutorial using Python, Tabula, and Jupyter Notebook
Data extraction from PDF files can be a challenging task, especially when the data is presented in tables or irregular formats. In this tutorial, we will walk through the process of extracting and cleaning data from a PDF file using Python, Tabula, and Jupyter Notebook. We will then convert the extracted data into a CSV file for further analysis. This tutorial is perfect for data enthusiasts, students, and professionals who need a simple and effective method for extracting data from PDF files.
Part 1: Installing Required Libraries and Setting Up the Jupyter Notebook
- Installing Python (if not already installed): Visit the official Python website (https://www.python.org/downloads/) to download and install the latest version of Python.
- Installing Tabula (note that tabula-py also requires a Java runtime to be installed):
!pip install tabula-py
- Installing pandas:
!pip install pandas
- Installing Jupyter Notebook (if not already installed):
!pip install jupyter
- Creating a new Jupyter Notebook:
$ jupyter notebook
Part 2: Reading the PDF File using Tabula
- Importing required libraries:
import tabula
import pandas as pd
import re
- Reading the PDF file with Tabula: The PDF file used for this tutorial was taken from the website of NABL India, the accreditation board for testing and calibration laboratories in India.
# read_pdf returns a list of DataFrames, one per table detected in the PDF
df = tabula.read_pdf('./nabl_labs.pdf', pages='all', pandas_options={'header': None})
- Understanding the structure of the extracted data:
len(df)   # number of tables Tabula extracted
df[0]     # first table, which carries the column headers
df[31]    # a later table, to compare its structure against the first
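Because `read_pdf` returns one DataFrame per detected table, it is worth scanning the shapes of all of them before cleaning, so that irregular tables stand out. A minimal sketch of that check, using small synthetic DataFrames in place of the real Tabula output:

```python
import pandas as pd

# Synthetic stand-ins for the list that tabula.read_pdf would return
df = [
    pd.DataFrame([['SNO', 'NAME'], ['1', 'Lab A']]),                   # first page: header row + data
    pd.DataFrame([['Page 2', ''], ['SNO', 'NAME'], ['2', 'Lab B']]),   # later page: extra banner row
]

# Report the shape of every extracted table to spot structural differences
shapes = [frame.shape for frame in df]
print(len(df), shapes)
```

On the real PDF, any table whose column count differs from the rest is a candidate for the column-dropping step below.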
Part 3: Cleaning and Preparing the Data
- Dropping unnecessary columns and rows:
# This table was extracted with spurious empty columns; drop them
df[31].drop(columns=[0, 2, 4, 6, 8, 10, 12, 14], inplace=True)
df[31]
# Drop the repeated header rows: every table after the first carries
# two leading rows that are not data
for i in range(len(df)):  # 413 tables in this PDF
    if i == 0:
        df[i] = df[i].drop([0])
    else:
        df[i] = df[i].drop([0, 1])
- Renaming columns and concatenating DataFrames:
# Use the first table's first row as the shared header
header = df[0].iloc[0]
data_list = []
for i in range(len(df)):
    if i == 0:
        data = df[i].iloc[1:]   # skip the header row itself
    else:
        data = df[i]
    if data.shape[1] == header.shape[0]:   # skip tables with a mismatched column count
        data.columns = header
        data_list.append(data)
df_concatenated = pd.concat(data_list)
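The loop above can be exercised end-to-end on synthetic data: the first table supplies the header, later tables are pure data, and any table whose column count disagrees with the header is skipped before concatenation.

```python
import pandas as pd

# Synthetic stand-ins: the first table carries the header row, later ones do not
tables = [
    pd.DataFrame([['SNO', 'NAME'], ['1', 'Lab A']]),
    pd.DataFrame([['2', 'Lab B'], ['3', 'Lab C']]),
]

header = tables[0].iloc[0]
parts = []
for i, frame in enumerate(tables):
    data = frame.iloc[1:] if i == 0 else frame
    if data.shape[1] == header.shape[0]:   # skip tables with mismatched columns
        data.columns = header
        parts.append(data)

combined = pd.concat(parts, ignore_index=True)
print(combined.shape)
```

Passing `ignore_index=True` gives the combined frame a clean 0..n-1 index instead of the overlapping indices of the individual tables.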
- Extracting phone numbers, pincodes, emails, and contact persons using Regular Expressions:
# Pull every run of digits out of the contact-details column
regex = r'\d+'
df_concatenated['Numbers'] = df_concatenated['NAME & CONTACT DETAILS OF THE\rLABORATORY'].apply(lambda x: ', '.join(re.findall(regex, str(x))))
phone_pattern = r'(\d{10})'  # Indian mobile numbers are ten digits

def extract_pincode(text):
    # Indian PIN codes are exactly six digits
    regex = r'\b\d{6}\b'
    pincodes = re.findall(regex, text)
    return ','.join(pincodes)

df_concatenated['Pincode'] = df_concatenated['Numbers'].apply(extract_pincode)
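The `\b\d{6}\b` pattern relies on word boundaries, so a six-digit run embedded in a longer number (such as a ten-digit phone number) is not mistaken for a PIN code. A quick check on a hypothetical sample string:

```python
import re

def extract_pincode(text):
    # Indian PIN codes are exactly six digits; \b keeps longer digit runs out
    return ','.join(re.findall(r'\b\d{6}\b', text))

# Hypothetical sample: a ten-digit phone number followed by a PIN code
sample = '9876543210, 110001'
print(extract_pincode(sample))  # only the six-digit token is returned
```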
# Remove stray newlines and carriage returns left over from the PDF extraction,
# then give the columns shorter names
df_concatenated = df_concatenated.replace('\n', '', regex=True)
df_concatenated = df_concatenated.replace('\r', ' ', regex=True)
df_concatenated.columns = ['Sno.','Name of Lab','Certificate','discipline','date of issue','date of expiry','validity extended upto','numbers','PIN Code']
# Extract the first ten-digit run as the phone number
phone = df_concatenated['numbers'].str.extract(phone_pattern, expand=False)
df_concatenated['Phone'] = phone
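`Series.str.extract` with a single capture group and `expand=False` returns a Series: the captured text where the pattern matches, and `NaN` where it does not. A small sketch on hypothetical entries of the `numbers` column:

```python
import pandas as pd

# Hypothetical samples of the comma-joined digit runs from the 'numbers' column
numbers = pd.Series(['9876543210, 110001', '110001'])

# First ten-digit run in each entry; rows without one get NaN
phone = numbers.str.extract(r'(\d{10})', expand=False)
print(phone.tolist())
```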
# Emails and contact persons are embedded in the lab-name column
email_pattern = r'Email: (\S+)'
contact_pattern = r'Contact Person: (\S+(?:\s\S+)*)'
email = df_concatenated['Name of Lab'].str.extract(email_pattern, expand=False)
contact = df_concatenated['Name of Lab'].str.extract(contact_pattern, expand=False)
df_concatenated['Email'] = email
df_concatenated['Contact'] = contact
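The two patterns work because the `Email:` value is a single non-space token, while `Contact Person:` greedily captures space-separated tokens to the end of the name. A sketch on a hypothetical lab entry (the name and addresses below are invented for illustration):

```python
import pandas as pd

# Hypothetical sample of the cleaned 'Name of Lab' text
labs = pd.Series(['Acme Labs Email: info@acme.in Contact Person: Jane Doe'])

email = labs.str.extract(r'Email: (\S+)', expand=False)
contact = labs.str.extract(r'Contact Person: (\S+(?:\s\S+)*)', expand=False)
print(email[0], '|', contact[0])
```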
Part 4: Exporting the Data to a CSV File
- Exporting the cleaned DataFrame to a CSV file:
df_concatenated.to_csv('nabl_extracted.csv',index=False)
- Verifying the exported CSV file:
df_exported = pd.read_csv('nabl_extracted.csv')
df_exported.head()
In this tutorial, we learned how to extract, clean, and export data from a PDF file using Python, Tabula, and Jupyter Notebook. With this knowledge, you can now apply these techniques to other PDF files and further enhance your data extraction and processing skills.