How to Convert Tables from PDF to CSV, TSV, and JSON Using Tabula in 3 Lines of Code

Asim Zahid
Published in Analytics Vidhya
3 min read · Jan 1, 2021

In this tutorial, I'll show you how to extract tables from a PDF and convert them to CSV, TSV, or JSON in just three lines of code.

Step 1. Set up tabula (one line of code)

Step 2. Import tabula

Step 3. Convert the PDF

Introduction

tabula-py is a tool for converting PDF tables into pandas DataFrames. It is a wrapper around tabula-java, so it requires Java on your machine. tabula-py can also convert the tables in a PDF into CSV, TSV, or JSON files.

tabula-py's extraction accuracy is the same as that of tabula-java and of the tabula app, the GUI tool for tabula. If you want to gauge how well tabula-py will handle your PDFs, I highly recommend trying the tabula app first.

tabula-py is good for:

automation with Python scripts

advanced analytics after converting tables to pandas DataFrames

casual analytics with Jupyter Notebook or Google Colaboratory
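As a minimal sketch of the DataFrame route described above: tabula-py's read_pdf returns the tables it finds as pandas DataFrames. The helper name read_tables and the file name in the usage note are made up for illustration, and the early file check is my own addition rather than something tabula-py requires.

```python
from pathlib import Path


def read_tables(pdf_path, pages="all"):
    """Return the tables found in a PDF as a list of pandas DataFrames.

    read_tables is a hypothetical helper name, not part of tabula-py.
    """
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message before the Java process is started.
        raise FileNotFoundError(f"PDF not found: {path}")

    # Imported here so the path check above runs even if tabula-py is missing.
    import tabula

    # With the default multiple_tables=True, read_pdf returns a list of DataFrames.
    return tabula.read_pdf(str(path), pages=pages)
```

Calling read_tables("report.pdf") (a placeholder file name) should give you one DataFrame per detected table, which you can then inspect with the usual pandas tools such as head() or describe().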

Step 1

tabula-py requires a Java environment, so let's first check that Java is available on your machine.

Open your terminal or CMD and enter:

java -version

After confirming the Java environment, install tabula-py with pip. Run this in the same terminal or CMD window:

pip install -q tabula-py
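If you would rather run the Java check from a Python script (handy for automation), something along these lines should work. It is only a sketch using the standard library; the function name java_version is my own, and it assumes the java executable is expected on your PATH.

```python
import shutil
import subprocess


def java_version():
    """Return the text printed by `java -version`, or None if Java is not usable.

    java_version is a hypothetical helper name used for illustration.
    """
    if shutil.which("java") is None:
        return None  # no java executable found on PATH
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # java -version writes its report to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()
```

A return value of None means Java needs to be installed (or added to PATH) before tabula-py can run.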

Step 2

Open your favorite IDE and write the program below:

import tabula

# Check your environment via tabula-py. This shows the Python version,
# Java version, and your OS environment.
tabula.environment_info()

pdf_path = "/path/to/your/pdf/file"

# Convert every table in the PDF to CSV (output_format can also be "tsv" or "json")
tabula.convert_into(pdf_path, "test.csv", pages="all", output_format="csv", stream=True)

Save the file and run it; it will convert the PDF to your required format.
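The program above only writes CSV. Here is a rough sketch of how the same convert_into call could produce all three formats in one go. The names FORMATS, output_path, and convert_all are hypothetical helpers of mine, and naming each output file after the PDF is just one possible convention.

```python
from pathlib import Path

# Output formats used here; these are the values passed to output_format.
FORMATS = ("csv", "tsv", "json")


def output_path(pdf_path, fmt, out_dir="."):
    """Build an output file name such as out_dir/report.tsv from report.pdf."""
    if fmt not in FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    return Path(out_dir) / f"{Path(pdf_path).stem}.{fmt}"


def convert_all(pdf_path, out_dir="."):
    """Write the PDF's tables once per format and return the paths written."""
    # Imported here so output_path can be used without tabula-py installed.
    import tabula

    written = []
    for fmt in FORMATS:
        target = output_path(pdf_path, fmt, out_dir)
        tabula.convert_into(
            str(pdf_path), str(target), output_format=fmt, pages="all", stream=True
        )
        written.append(target)
    return written
```

With this in place, convert_all("/path/to/your/pdf/file") would leave a .csv, a .tsv, and a .json file in the current directory, assuming the PDF contains tables tabula can detect.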

Example Notebook

Tabula-py

Custom Requests

Do you have a question about the article, or need an automated custom extractor for yourself or your company? Feel free to get in touch.

Hire Me:

Are you seeking a proficient individual for data engineering services? I am available and eager to undertake the task at hand. I look forward to hearing from you in regard to potential opportunities.

About Author:

Asim is an applied research data engineer with a passion for developing impactful products. He possesses expertise in building data platforms and has a proven track record of success as a dual Kaggle expert. Asim has held leadership positions such as Google Developer Student Club (GDSC) Lead and AWS Educate Cloud Ambassador, which have allowed him to hone his skills in driving business success.

In addition to his technical skills, Asim is a strong communicator and team player. He enjoys connecting with like-minded professionals and is always open to networking opportunities. If you appreciate his work and would like to connect, please don’t hesitate to reach out.


I can brew up algorithms with a pinch of math, an ounce of Python and piles of data to power your business applications.