Extracting and Cleaning Table Data from PDFs: A Step-By-Step Tutorial using Python, Tabula, and Jupyter Notebook

Pranay Suyash
3 min read · Apr 16, 2023


Data extraction from PDF files can be a challenging task, especially when the data is presented in tables or irregular formats. In this tutorial, we will walk through the process of extracting and cleaning data from a PDF file using Python, Tabula, and Jupyter Notebook. We will then convert the extracted data into a CSV file for further analysis. This tutorial is perfect for data enthusiasts, students, and professionals who need a simple and effective method for extracting data from PDF files.

Part 1: Installing Required Libraries and Setting Up the Jupyter Notebook

  1. Installing Python (if not already installed): Visit the official Python website (https://www.python.org/downloads/) to download and install the latest version of Python.
  2. Installing Tabula: !pip install tabula-py
  3. Installing pandas: !pip install pandas
  4. Installing Jupyter Notebook (if not already installed): !pip install jupyter
  5. Creating a new Jupyter Notebook: $ jupyter notebook
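
Before launching the notebook, a quick sanity check (a minimal sketch using only the standard library) confirms the packages can be found on the current Python path; note that tabula-py is imported as `tabula`:

```python
import importlib.util

def check_installed(package_name: str) -> bool:
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None

# Check the libraries used in this tutorial without actually importing them.
for name in ["tabula", "pandas"]:
    status = "OK" if check_installed(name) else "missing -- run pip install"
    print(f"{name}: {status}")
```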

Part 2: Reading the PDF File using Tabula

  • Importing required libraries:
import tabula
import pandas as pd
import re
  • Reading the PDF file with Tabula: The PDF file used in this tutorial comes from the website of NABL India, an accreditation board for laboratories in India.
df = tabula.read_pdf('./nabl_labs.pdf', pages='all', pandas_options={'header': None})
  • Understanding the structure of the extracted data:
len(df)
df[0]
df[31]
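
With `pages='all'`, `tabula.read_pdf` returns a *list* of DataFrames, one per table it detects, which is why we index into `df` with `df[0]`, `df[31]`, and so on. The inspection step can be sketched with a stand-in list of toy DataFrames (the actual PDF content is not reproduced here):

```python
import pandas as pd

# Stand-in for the output of tabula.read_pdf(..., pages='all'):
# a list of DataFrames, one per detected table.
df = [
    pd.DataFrame([["Sno.", "Lab"], ["1", "Lab A"], ["2", "Lab B"]]),
    pd.DataFrame([["3", "Lab C"], ["4", "Lab D"]]),
]

print(len(df))       # number of tables extracted
print(df[0].shape)   # rows x columns of the first table
print(df[0].head())  # peek at the first few rows
```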

Part 3: Cleaning and Preparing the Data

  • Dropping unnecessary columns and rows:
df[31].drop(columns=[0, 2, 4, 6, 8, 10, 12, 14], inplace=True)
df[31]

for i in range(0, 413):
    if i == 0:
        df[0] = df[0].drop([0])      # first table: drop one leading row
    else:
        df[i] = df[i].drop([0, 1])   # later tables: drop two repeated header rows
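
The same dropping pattern can be tried on toy data: `drop(columns=...)` removes empty spill-over columns by position, while `drop([0])` removes a repeated header row by its index label (the column indices below are made up; the real ones depend on your PDF):

```python
import pandas as pd

# Toy table: even-numbered columns are empty artifacts, row 0 repeats the header.
table = pd.DataFrame({
    0: [None, None, None],
    1: ["Sno.", "1", "2"],
    2: [None, None, None],
    3: ["Name of Lab", "Lab A", "Lab B"],
})

table = table.drop(columns=[0, 2])  # drop the empty artifact columns
table = table.drop([0])             # drop the repeated header row (index label 0)
print(table)
```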
  • Renaming columns and concatenating DataFrames:
header = df[0].iloc[0]
data_list = []

for i in range(0, 413):
    if i == 0:
        data = df[i].iloc[1:]   # skip the header row in the first table
    else:
        data = df[i]
    if data.shape[1] == header.shape[0]:
        data.columns = header
        data_list.append(data)

df_concatenated = pd.concat(data_list)
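
The header-promotion and concatenation step above can be sketched on toy tables: the header row is taken from the first table, and only tables whose column count matches the header are kept:

```python
import pandas as pd

tables = [
    pd.DataFrame([["Sno.", "Name of Lab"], ["1", "Lab A"]]),
    pd.DataFrame([["2", "Lab B"], ["3", "Lab C"]]),
    pd.DataFrame([["oddly", "shaped", "table"]]),  # wrong width -> skipped
]

header = tables[0].iloc[0]
data_list = []
for i, t in enumerate(tables):
    data = t.iloc[1:] if i == 0 else t      # drop the header row of the first table
    if data.shape[1] == header.shape[0]:    # keep only tables of matching width
        data.columns = header
        data_list.append(data)

combined = pd.concat(data_list, ignore_index=True)
print(combined)
```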
  • Extracting phone numbers, pincodes, emails, and contact persons using regular expressions:
regex = r'\d+'
df_concatenated['Numbers'] = df_concatenated['NAME & CONTACT DETAILS OF THE\rLABORATORY'].apply(lambda x: ', '.join(re.findall(regex,str(x))))

phone_pattern = r'(\d{10})'
def extract_pincode(text):
    regex = r'\b\d{6}\b'
    pincodes = re.findall(regex, text)
    return ','.join(pincodes)

df_concatenated['Pincode'] = df_concatenated['Numbers'].apply(extract_pincode)
df_concatenated = df_concatenated.replace('\n','',regex=True)
df_concatenated = df_concatenated.replace('\r',' ',regex=True)
df_concatenated.columns = ['Sno.','Name of Lab','Certificate','discipline','date of issue','date of expiry','validity extended upto','numbers','PIN Code']

# extract phone number using regular expressions
phone = df_concatenated['numbers'].str.extract(phone_pattern, expand=False)

# add the extracted phone numbers as a new column
df_concatenated['Phone'] = phone

email_pattern = r'Email: (\S+)'
contact_pattern = r'Contact Person: (\S+(?:\s\S+)*)'

email = df_concatenated['Name of Lab'].str.extract(email_pattern, expand=False)
contact = df_concatenated['Name of Lab'].str.extract(contact_pattern, expand=False)
df_concatenated['Email'] = email
df_concatenated['Contact'] = contact
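
The regular-expression patterns above can be tested in isolation on a sample contact string using only the standard library `re` module (the string below is made up for illustration; the real text comes from the PDF's contact-details column):

```python
import re

sample = ("Acme Testing Laboratory, Plot 12, Delhi - 110001 "
          "Phone: 9876543210 Email: lab@example.com "
          "Contact Person: A. Sharma")

pincode = re.findall(r'\b\d{6}\b', sample)             # 6-digit PIN codes only
phone = re.search(r'(\d{10})', sample)                 # first 10-digit number
email = re.search(r'Email: (\S+)', sample)
contact = re.search(r'Contact Person: (\S+(?:\s\S+)*)', sample)

print(",".join(pincode))   # 110001
print(phone.group(1))      # 9876543210
print(email.group(1))      # lab@example.com
print(contact.group(1))    # A. Sharma
```

Note that the `\b` word boundaries stop the PIN-code pattern from matching a 6-digit slice of the 10-digit phone number.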

Part 4: Exporting the Data to a CSV File

  • Exporting the cleaned DataFrame to a CSV file:
df_concatenated.to_csv('nabl_extracted.csv',index=False)
  • Verifying the exported CSV file:
df_exported = pd.read_csv('nabl_extracted.csv')
df_exported.head()
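
The export-and-verify round trip can also be sketched entirely in memory with `io.StringIO`, so it runs without touching the filesystem (the column names below are illustrative):

```python
import io
import pandas as pd

df_clean = pd.DataFrame({
    "Name of Lab": ["Lab A", "Lab B"],
    "PIN Code": ["110001", "400001"],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
df_clean.to_csv(buffer, index=False)
buffer.seek(0)

# dtype=str keeps PIN codes as strings (otherwise leading zeros would be lost).
df_exported = pd.read_csv(buffer, dtype=str)
print(df_exported.head())
```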

In this tutorial, we learned how to extract, clean, and export data from a PDF file using Python, Tabula, and Jupyter Notebook. With this knowledge, you can now apply these techniques to other PDF files and further enhance your data extraction and processing skills.
