Converting PDF to CSV Using Pandas🐼 with pdfplumber 👩‍🔧

Lokesh Madiga
4 min read · Oct 22, 2023


Introduction

Have you ever faced a massive 60-page PDF filled with important data? If you tried using Excel to make sense of it, you probably hit a wall. Excel is great with neatly arranged tables, but real-life data is often scattered and messy. So, how do you unlock the power of your data? By converting that cumbersome PDF into a CSV file. In this article, we'll show you how to do it using Python's most-loved data manipulation library, Pandas. Say goodbye to the PDF hassle and hello to data you can work with!

For this process we need the Pandas and pdfplumber libraries.

# Required Libraries
import pdfplumber
import pandas as pd

PDF Conversion

Reading the PDF

First, let's look at the PDF itself to get an idea of how the data is laid out.

First page of the PDF

On the first page we have 31 columns. Luckily, all the data is in tabular format, so we don't need to worry about anything else.

pdf = pdfplumber.open(r"D:\Data sets\APEAMCET-2022\APEAPCET_2022_Cutoff.pdf")
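As a quick sanity check after opening the file, pdfplumber exposes the pages as a list, so we can see how many pages it found:

# pdf.pages is a list of Page objects, one per page of the document
print(len(pdf.pages))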

Every page of the PDF has the same header, so we will use the first page's data to build the header.

page0 = pdf.pages[0]

To extract the tabular data, we will use the extract_table() method.

table = page0.extract_table()
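extract_table() gives back the table as a plain Python list of rows, where each row is itself a list of cell strings; a small sketch to confirm what came back:

print(type(table))    # <class 'list'>
print(len(table))     # number of rows on this page, including the header row
print(len(table[0]))  # number of columns (31 for this PDF)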

Data Checking

To check whether the data was extracted correctly, let's look at a few records; we'll use the first 3 rows.

table[:3]
[['SNO', 'inst_code', 'inst_name', 'type', 'INST\n_RE\nG.', 'DIST', 'PLACE', 'COED', 'AFFLIA\n.UNIV', 'ESTD', 'branch_\ncode', 'Local_Ar\nea', 'OC_BO\nYS', 'OC_GIR\nLS', 'SC_BO\nYS', 'SC_GIR\nLS', 'ST_BOY\nS', 'ST_GIR\nLS', 'BCA_B\nOYS', 'BCA_GI\nRLS', 'BCB_B\nOYS', 'BCB_GI\nRLS', 'BCC_B\nOYS', 'BCC_GI\nRLS', 'BCD_B\nOYS', 'BCD_GI\nRLS', 'BCE_B\nOYS', 'BCE_GI\nRLS', 'OC_EWS_B\nOYS', 'OC_EWS_GI\nRLS', 'COLLFEE'], ['1', 'ACEE', 'ADARSH COLLEGE OF ENGINEERING', 'PVT', 'AU', 'EG', 'GOLLAPROLU', 'COED', 'JNTUK', '2008', 'CIV', '', '143031', '143031', '170697', '170697', '143031', '143031', '143031', '143031', '143031', '143031', '143031', '143031', '143031', '143031', '143031', '143031', '158522', '', '35000'], ['2', 'ACEE', 'ADARSH COLLEGE OF ENGINEERING', 'PVT', 'AU', 'EG', 'GOLLAPROLU', 'COED', 'JNTUK', '2008', 'CSE', '', '132938', '132938', '170635', '170635', '173467', '173467', '153425', '167840', '167292', '169169', '132938', '139146', '134908', '157283', '144114', '144114', '149718', '120842', '35000']]

Making the Header

Here, table is just a nested list (a list of rows), so we can use indexes to fetch what we need. Every page has the header row, but we will add it to the DataFrame only once.

df = pd.DataFrame(columns = table[0])
df
SNO inst_code inst_name type INST\n_RE\nG. DIST PLACE COED AFFLIA\n.UNIV ESTD ... BCB_GI\nRLS BCC_B\nOYS BCC_GI\nRLS BCD_B\nOYS BCD_GI\nRLS BCE_B\nOYS BCE_GI\nRLS OC_EWS_B\nOYS OC_EWS_GI\nRLS COLLFEE
0 rows × 31 columns

Extracting the Remaining Pages

For each page, every row except the header becomes each_page_data. We pass the header columns extracted above as the DataFrame's column names, starting with the first page:

each_page_data = pd.DataFrame(table[1:], columns=df.columns)
each_page_data
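Displaying each_page_data shows the first page's records as a DataFrame. A quick shape check (the exact row count depends on how many records fit on a page):

# rows from the first page (header excluded) × 31 columns
print(each_page_data.shape)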

Adding Each Page's Data to the DataFrame

# Total number of pages in the PDF
size = len(pdf.pages)

# Traverse through each page
for i in range(size):
    page = pdf.pages[i]           # current page object
    table = page.extract_table()  # extract the tabular data from this page

    # Build a DataFrame from every row except the header row (the header is already there)
    each_page_data = pd.DataFrame(table[1:], columns=df.columns)

    # Append each page's DataFrame to the main df using concat
    df = pd.concat([df, each_page_data], ignore_index=True)
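Once all pages have been processed, the file handle can be released; pdfplumber's PDF object provides a close() method (it can also be used as a context manager in a with block):

# Release the underlying file handle now that extraction is done
pdf.close()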

Verification

print(df.shape)
(1496, 31)
df.head(5)
SNO inst_code inst_name type INST\n_RE\nG. DIST PLACE COED AFFLIA\n.UNIV ESTD ... BCB_GI\nRLS BCC_B\nOYS BCC_GI\nRLS BCD_B\nOYS BCD_GI\nRLS BCE_B\nOYS BCE_GI\nRLS OC_EWS_B\nOYS OC_EWS_GI\nRLS COLLFEE
0 1 ACEE ADARSH COLLEGE OF ENGINEERING PVT AU EG GOLLAPROLU COED JNTUK 2008 ... 143031 143031 143031 143031 143031 143031 143031 158522 35000
1 2 ACEE ADARSH COLLEGE OF ENGINEERING PVT AU EG GOLLAPROLU COED JNTUK 2008 ... 169169 132938 139146 134908 157283 144114 144114 149718 120842 35000
2 3 ACEE ADARSH COLLEGE OF ENGINEERING PVT AU EG GOLLAPROLU COED JNTUK 2008 ... 163503 114459 114459 114459 146246 114459 114459 128288 35000
3 4 ACEE ADARSH COLLEGE OF ENGINEERING PVT AU EG GOLLAPROLU COED JNTUK 2008 ... 169169 169142 169163 169163 169163 35000
4 5 ACEE ADARSH COLLEGE OF ENGINEERING PVT AU EG GOLLAPROLU COED JNTUK 2008 ... 169169 169131 169142 169142 169163 169163 35000

5 rows × 31 columns
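With the DataFrame verified, the last step is writing it out as a CSV. As an optional touch-up, the newline characters embedded in the column names (a side effect of the wrapped headers in the PDF) can be stripped first; the output file name below is just an example:

# Optional clean-up: remove the embedded newlines left over from the wrapped PDF headers
df.columns = [col.replace("\n", "") for col in df.columns]

# Write the combined data to a CSV file; index=False drops the pandas row index
df.to_csv("APEAPCET_2022_Cutoff.csv", index=False)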

Use Cases

PDF to CSV conversion has practical applications across many fields and scenarios, such as financial reports, academic research, e-commerce data management, inventory control, and data aggregation. This shows how versatile and valuable the conversion process can be in real-world situations.

Notes:

PDF to CSV conversion using Python's Pandas library is versatile, but it's worth noting that pdfplumber is particularly useful when the PDF's data is in a tabular format. Depending on the PDF's structure, there are other methods for PDF to CSV conversion:

1. Tabula-py: used in conjunction with Pandas to convert the extracted data into CSV (a sketch follows after this list).

2. pdftotext: once the text data is extracted, you can use Python to further process it and convert it to CSV.

3. PyPDF2: used to extract text from PDFs, which can then be processed and converted to CSV using Python.

4. Adobe Acrobat: Adobe Acrobat Pro DC, a commercial tool, provides features for exporting PDF data to various formats, including CSV. This is a suitable option if you already have access to Adobe Acrobat.
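As an illustration of the first alternative, tabula-py provides a read_pdf() function that returns pandas DataFrames directly. A minimal sketch, assuming Java is installed (a tabula-py requirement) and using placeholder file names:

import tabula
import pandas as pd

# Extract every table from every page; read_pdf returns a list of DataFrames
tables = tabula.read_pdf("your_file.pdf", pages="all")

# Combine the per-page tables and write them out as a single CSV
pd.concat(tables, ignore_index=True).to_csv("output.csv", index=False)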

Thanks for reading this article, it's my first one!

If you have any suggestions or feedback, I would love to hear from you.

Feel free to connect with me here!

LinkedIn, GitHub, Kaggle
