Extracting and Cleaning Table Data from PDFs: A Step-By-Step Tutorial using Python, Tabula, and Jupyter Notebook
Data extraction from PDF files can be a challenging task, especially when the data is presented in tables or irregular formats. In this tutorial, we will walk through the process of extracting and cleaning data from a PDF file using Python, Tabula, and Jupyter Notebook. We will then convert the extracted data into a CSV file for further analysis. This tutorial is perfect for data enthusiasts, students, and professionals who need a simple and effective method for extracting data from PDF files.
Part 1: Installing Required Libraries and Setting Up the Jupyter Notebook
- Installing Python (if not already installed): Visit the official Python website (https://www.python.org/downloads/) to download and install the latest version of Python.
- Installing Tabula (note that tabula-py also requires a Java runtime to be installed):
!pip install tabula-py
- Installing pandas:
!pip install pandas
- Installing Jupyter Notebook (if not already installed):
!pip install jupyter
- Creating a new Jupyter Notebook:
$ jupyter notebook
Part 2: Reading the PDF File using Tabula
- Importing required libraries:
import tabula
import pandas as pd
import re
- Reading the PDF file with Tabula: The PDF file used for this tutorial was taken from the website of NABL India, the accreditation board for testing and calibration laboratories in India.
# read_pdf returns a list of DataFrames, one per table detected in the PDF
df = tabula.read_pdf('./nabl_labs.pdf', pages='all', pandas_options={'header': None})
- Understanding the structure of the extracted data:
len(df)   # number of tables Tabula extracted
df[0]     # first table, which carries the column headers
df[31]    # a later table, to compare its structure against the first
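Because `read_pdf` returns one DataFrame per detected table, it is worth scanning the shapes of all of them before cleaning, so that irregular tables stand out. A minimal sketch of that check, using small synthetic DataFrames in place of the real Tabula output:

```python
import pandas as pd

# Synthetic stand-ins for the list that tabula.read_pdf would return
df = [
    pd.DataFrame([['SNO', 'NAME'], ['1', 'Lab A']]),                   # first page: header row + data
    pd.DataFrame([['Page 2', ''], ['SNO', 'NAME'], ['2', 'Lab B']]),   # later page: extra banner row
]

# Report the shape of every extracted table to spot structural differences
shapes = [frame.shape for frame in df]
print(len(df), shapes)
```

On the real PDF, any table whose column count differs from the rest is a candidate for the column-dropping step below.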
Part 3: Cleaning and Preparing the Data
- Dropping unnecessary columns and rows:
# This table was extracted with spurious empty columns; drop them
df[31].drop(columns=[0, 2, 4, 6, 8, 10, 12, 14], inplace=True)
df[31]
# Drop the repeated header rows: every table after the first carries
# two leading rows that are not data
for i in range(len(df)):  # 413 tables in this PDF
    if i == 0:
        df[i] = df[i].drop([0])
    else:
        df[i] = df[i].drop([0, 1])
- Renaming columns and concatenating DataFrames:
# Use the first table's first row as the shared header
header = df[0].iloc[0]
data_list = []
for i in range(len(df)):
    if i == 0:
        data = df[i].iloc[1:]   # skip the header row itself
    else:
        data = df[i]
    if data.shape[1] == header.shape[0]:   # skip tables with a mismatched column count
        data.columns = header
        data_list.append(data)
df_concatenated = pd.concat(data_list)
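The loop above can be exercised end-to-end on synthetic data: the first table supplies the header, later tables are pure data, and any table whose column count disagrees with the header is skipped before concatenation.

```python
import pandas as pd

# Synthetic stand-ins: the first table carries the header row, later ones do not
tables = [
    pd.DataFrame([['SNO', 'NAME'], ['1', 'Lab A']]),
    pd.DataFrame([['2', 'Lab B'], ['3', 'Lab C']]),
]

header = tables[0].iloc[0]
parts = []
for i, frame in enumerate(tables):
    data = frame.iloc[1:] if i == 0 else frame
    if data.shape[1] == header.shape[0]:   # skip tables with mismatched columns
        data.columns = header
        parts.append(data)

combined = pd.concat(parts, ignore_index=True)
print(combined.shape)
```

Passing `ignore_index=True` gives the combined frame a clean 0..n-1 index instead of the overlapping indices of the individual tables.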
- Extracting phone numbers, pincodes, emails, and contact persons using Regular Expressions:
# Pull every run of digits out of the contact-details column
regex = r'\d+'
df_concatenated['Numbers'] = df_concatenated['NAME & CONTACT DETAILS OF THE\rLABORATORY'].apply(lambda x: ', '.join(re.findall(regex, str(x))))
phone_pattern = r'(\d{10})'  # Indian mobile numbers are ten digits

def extract_pincode(text):
    # Indian PIN codes are exactly six digits
    regex = r'\b\d{6}\b'
    pincodes = re.findall(regex, text)
    return ','.join(pincodes)

df_concatenated['Pincode'] = df_concatenated['Numbers'].apply(extract_pincode)
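The `\b\d{6}\b` pattern relies on word boundaries, so a six-digit run embedded in a longer number (such as a ten-digit phone number) is not mistaken for a PIN code. A quick check on a hypothetical sample string:

```python
import re

def extract_pincode(text):
    # Indian PIN codes are exactly six digits; \b keeps longer digit runs out
    return ','.join(re.findall(r'\b\d{6}\b', text))

# Hypothetical sample: a ten-digit phone number followed by a PIN code
sample = '9876543210, 110001'
print(extract_pincode(sample))  # only the six-digit token is returned
```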
# Remove stray newlines and carriage returns left over from the PDF extraction,
# then give the columns shorter names
df_concatenated = df_concatenated.replace('\n', '', regex=True)
df_concatenated = df_concatenated.replace('\r', ' ', regex=True)
df_concatenated.columns = ['Sno.','Name of Lab','Certificate','discipline','date of issue','date of expiry','validity extended upto','numbers','PIN Code']
# Extract the first ten-digit run as the phone number
phone = df_concatenated['numbers'].str.extract(phone_pattern, expand=False)
df_concatenated['Phone'] = phone
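`Series.str.extract` with a single capture group and `expand=False` returns a Series: the captured text where the pattern matches, and `NaN` where it does not. A small sketch on hypothetical entries of the `numbers` column:

```python
import pandas as pd

# Hypothetical samples of the comma-joined digit runs from the 'numbers' column
numbers = pd.Series(['9876543210, 110001', '110001'])

# First ten-digit run in each entry; rows without one get NaN
phone = numbers.str.extract(r'(\d{10})', expand=False)
print(phone.tolist())
```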
# Emails and contact persons are embedded in the lab-name column
email_pattern = r'Email: (\S+)'
contact_pattern = r'Contact Person: (\S+(?:\s\S+)*)'
email = df_concatenated['Name of Lab'].str.extract(email_pattern, expand=False)
contact = df_concatenated['Name of Lab'].str.extract(contact_pattern, expand=False)
df_concatenated['Email'] = email
df_concatenated['Contact'] = contact
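The two patterns work because the `Email:` value is a single non-space token, while `Contact Person:` greedily captures space-separated tokens to the end of the name. A sketch on a hypothetical lab entry (the name and addresses below are invented for illustration):

```python
import pandas as pd

# Hypothetical sample of the cleaned 'Name of Lab' text
labs = pd.Series(['Acme Labs Email: info@acme.in Contact Person: Jane Doe'])

email = labs.str.extract(r'Email: (\S+)', expand=False)
contact = labs.str.extract(r'Contact Person: (\S+(?:\s\S+)*)', expand=False)
print(email[0], '|', contact[0])
```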
Part 4: Exporting the Data to a CSV File
- Exporting the cleaned DataFrame to a CSV file:
df_concatenated.to_csv('nabl_extracted.csv',index=False)
- Verifying the exported CSV file:
df_exported = pd.read_csv('nabl_extracted.csv')
df_exported.head()
In this tutorial, we learned how to extract, clean, and export data from a PDF file using Python, Tabula, and Jupyter Notebook. With this knowledge, you can now apply these techniques to other PDF files and further enhance your data extraction and processing skills.