Convert PDF to CSV using Python

P&S DRAFTS.
Published in Nerd For Tech
3 min read, Jun 9, 2023

Converting a PDF to CSV using Python is simple and takes fewer than 10 steps.

First let’s go through the prerequisites:

1] Java installed on your machine (tabula-py runs tabula-java under the hood).

2] A recent version of Python 3 installed on your machine.

3] Your PDF and your notebook in the same directory, so the file can be referenced by name alone.

Now you are good to go. In case you haven't installed them yet, download Java and Python from their official sites.
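Before moving on, the first two prerequisites can be checked from Python itself. The snippet below is only a minimal sketch: the helper name check_prerequisites and the minimum version (3, 8) are assumptions made for illustration, not something tabula-py provides, so adjust the version to whatever your tabula-py release requires.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report whether Java is on PATH and Python is new enough.

    min_python is an assumed threshold, not an official requirement.
    """
    java_path = shutil.which("java")  # None when no java executable is found
    return {
        "java_found": java_path is not None,
        "java_path": java_path,
        "python_ok": sys.version_info >= min_python,
    }


print(check_prerequisites())
```

If java_found comes back False, install Java (or add it to your PATH) before calling any tabula function.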


Let’s see how to convert a PDF to a CSV file in a few lines of Python code.

Step 1: Install the required package, tabula-py[1], from the command shell (the same command also works in a Google Colab cell).

pip install tabula-py

Step 2: Read the PDF using the read_pdf() function. Pass it two parameters: the file name and the pages you want to read. This step returns a list of DataFrames, one for each table found.

tabula.read_pdf("File_name", pages="all")
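Since read_pdf() hands back a list with one DataFrame per detected table, you may want each table in its own file. Here is a rough sketch of that idea; save_tables is a hypothetical helper written for this article, and it only assumes that every item has a pandas-style to_csv(path, index=False) method.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each table to its own CSV file and return the file paths.

    tables is expected to be the list returned by tabula.read_pdf().
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        table.to_csv(path, index=False)  # leave out the DataFrame index column
        paths.append(path)
    return paths
```

For example, after tables = tabula.read_pdf("File_name", pages="all"), calling save_tables(tables, "csv_out") would write table_1.csv, table_2.csv and so on into a csv_out folder (the folder name is just a placeholder).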

Step 3: Now convert the PDF to CSV using the tabula.convert_into() function. Pass it four parameters: the PDF name, the name of the CSV file to write, the pages you want to convert, and the output format.

tabula.convert_into("pdf_name", "csv_file_name", pages="all", output_format="csv")

Tada, you have successfully converted your PDF to a CSV file.
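It is worth opening the result to confirm the tables came through as expected. One way to do that, using only the standard library, is sketched below. The function name preview_csv is made up for this example, and UTF-8 is an assumed encoding for the output file.

```python
import csv
from itertools import islice


def preview_csv(csv_path, max_rows=5):
    """Return the first max_rows rows of a CSV file as lists of strings."""
    # UTF-8 is an assumption; change it if your file uses another encoding.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(islice(csv.reader(handle), max_rows))
```

Calling preview_csv("csv_file_name") on the file written in Step 3 returns its first few rows, so you can spot shifted columns or missing headers early.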

Let’s put all the steps together now:

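# Install the package first (run this in a shell or a notebook cell):
#   pip install tabula-py

# Import the required module
import tabula

# Read the PDF file; read_pdf() returns a list of DataFrames, so [0] picks the first table
# Make sure your PDF file is in the same directory as your notebook
df = tabula.read_pdf("Travelling.pdf", pages="all")[0]

# Convert every table in the PDF into a single CSV file
tabula.convert_into("Travelling.pdf", "trav.csv", output_format="csv", pages="all")

# Show the first table that was read
print(df)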

Above, we imported the required library, defined a variable df that holds the first table read_pdf() found in the PDF, and then used the convert_into() function to convert the PDF to CSV. Finally, we printed df so the user can inspect it.
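If you plan to reuse this, the conversion can be wrapped in a small function that fails early when the PDF is missing. This is only a sketch under stated assumptions: pdf_to_csv is a hypothetical wrapper, not part of tabula-py, and the only tabula call it relies on is the convert_into() function shown above.

```python
from pathlib import Path


def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Convert the tables in a PDF to one CSV file using tabula-py."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Imported here so the path check above runs even without tabula-py.
    import tabula

    tabula.convert_into(str(pdf_path), str(csv_path),
                        output_format="csv", pages=pages)
    return Path(csv_path)
```

With the files from the example above, pdf_to_csv("Travelling.pdf", "trav.csv") does the same conversion, but reports a clear error if the PDF is not in the current directory.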

Parameters:

pdf_name: The name (or path) of the PDF file to read data from or convert to CSV.

csv_file_name: The name of the CSV file the PDF will be converted into.

output_format: The format of the output file: "csv", "tsv", or "json".

pages: The pages you want to convert. To convert every page, use pages="all"; for a range, use pages="1-5" (with a plain hyphen). If you leave it out, only page 1 is processed.
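Page ranges copied from a document often contain an en dash instead of a plain hyphen. A small helper can normalise such input and expand it into a list of page numbers, which the pages parameter also accepts. The sketch below is illustrative only; parse_pages is a hypothetical name and not a tabula-py function.

```python
def parse_pages(spec):
    """Turn a page spec such as "1-3,5" into a list of page numbers.

    "all" is passed through unchanged. En dashes are treated as hyphens.
    """
    spec = spec.strip().replace("\u2013", "-")
    if spec.lower() == "all":
        return "all"
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"Invalid page range: {part}")
            pages.extend(range(first, last + 1))  # range end is exclusive
        else:
            pages.append(int(part))
    return pages
```

The result can be passed straight on, for example pages=parse_pages("1-3,5") in a read_pdf() or convert_into() call.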

Notes:

[1] tabula-py is a simple wrapper of tabula-java that lets users extract tables from a PDF file directly into DataFrames or JSON using Python. It can also convert the tables in a PDF into a TSV, CSV, or JSON file. The extraction itself is done by tabula-java, which is written in Java: it reads the PDF document and returns the tables as JSON, which tabula-py then converts into Python DataFrames.

Written by R P PAVITRA


Hello! Welcome to P&S DRAFTS. Our names are Pavitra and Smruthi (P&S).