How to Extract a Data Frame from a PDF File to CSV Using Python

PDA PHAM
7 min read · Jan 27, 2020

Table of contents
  • Introduction
  • Install some Packages: Tabula, Java
  • Reading the Table Data from PDF
  • Extracting PDF to Dataframe CSV
  • Exporting PDF into CSV
  • Download and Open a New CSV File

Introduction

You can run the code on Google Colab, PyCharm, or Jupyter Notebook. I recommend Google Colab or Jupyter Notebook:

  • Google Colab, available from your Google Drive
  • Jupyter Notebook

# P.S.: I don't have Jupyter Notebook installed and I don't carry a laptop, which is why I ran the code on the website (^_^)

I will show you step by step how to open and run code in Jupyter Notebook on the website. Link: https://jupyter.org

Click "Try it in your browser".

On the page that appears, choose "Try Classic Notebook".

Wait a minute for the browser to load.

The basic notebook is displayed. Then click File -> Open New Notebook -> Python 3.


-> Next, we need to do the following:

Install some Packages: Tabula, Java

First, we need to install some packages: Tabula (tabula-py) and Java. Don't forget to upload your PDF file as well. Let's do it.

1. pip install tabula-py :

pip install tabula-py
Requirement already satisfied: tabula-py in /srv/conda/envs/notebook/lib/python3.7/site-packages (2.0.4)
Requirement already satisfied: distro in /srv/conda/envs/notebook/lib/python3.7/site-packages (from tabula-py) (1.4.0)
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.7/site-packages (from tabula-py) (1.16.4)
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.7/site-packages (from tabula-py) (0.24.2)
Requirement already satisfied: python-dateutil>=2.5.0 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas->tabula-py) (2.8.0)
Requirement already satisfied: pytz>=2011k in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas->tabula-py) (2019.1)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from python-dateutil>=2.5.0->pandas->tabula-py) (1.12.0)
Note: you may need to restart the kernel to use updated packages.

2. conda install tabula-py :

Collecting package metadata: done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
current version: 4.6.14
latest version: 4.8.1

Please update conda by running

$ conda update -n base -c defaults conda



## Package Plan ##

environment location: /srv/conda/envs/notebook

added / updated specs:
- tabula-py


The following packages will be downloaded:

package | build
---------------------------|-----------------
alsa-lib-1.1.5 | h516909a_1002 544 KB conda-forge
ca-certificates-2019.11.28 | hecc5488_0 145 KB conda-forge
certifi-2019.11.28 | py37_0 148 KB conda-forge
cffi-1.13.2 | py37h8022711_0 220 KB conda-forge
chardet-3.0.4 | py37_1003 167 KB conda-forge
cryptography-2.8 | py37h72c5cf5_1 625 KB conda-forge
distro-1.4.0 | py_0 19 KB conda-forge
giflib-5.2.1 | h516909a_1 73 KB conda-forge
idna-2.8 | py37_1000 100 KB conda-forge
lcms2-2.9 | h2e4bb80_0 423 KB conda-forge
openjdk-11.0.1 | h600c080_1018 172.1 MB conda-forge
openssl-1.1.1d | h516909a_0 2.1 MB conda-forge
pycparser-2.19 | py37_1 171 KB conda-forge
pyopenssl-19.1.0 | py37_0 83 KB conda-forge
pysocks-1.7.1 | py37_0 27 KB conda-forge
requests-2.22.0 | py37_1 84 KB conda-forge
tabula-py-1.4.1 | py37_0 10.0 MB conda-forge
urllib3-1.25.7 | py37_0 159 KB conda-forge
xorg-fixesproto-5.0 | h14c3975_1002 8 KB conda-forge
xorg-inputproto-2.3.2 | h14c3975_1002 18 KB conda-forge
xorg-libxfixes-5.0.3 | h516909a_1004 17 KB conda-forge
xorg-libxi-1.7.10 | h516909a_0 45 KB conda-forge
xorg-libxtst-1.2.3 | h516909a_1002 31 KB conda-forge
xorg-recordproto-1.14.2 | h516909a_1002 7 KB conda-forge
------------------------------------------------------------
Total: 187.2 MB

The following NEW packages will be INSTALLED:

alsa-lib conda-forge/linux-64::alsa-lib-1.1.5-h516909a_1002
cffi conda-forge/linux-64::cffi-1.13.2-py37h8022711_0
chardet conda-forge/linux-64::chardet-3.0.4-py37_1003
cryptography conda-forge/linux-64::cryptography-2.8-py37h72c5cf5_1
distro conda-forge/noarch::distro-1.4.0-py_0
giflib conda-forge/linux-64::giflib-5.2.1-h516909a_1
idna conda-forge/linux-64::idna-2.8-py37_1000
lcms2 conda-forge/linux-64::lcms2-2.9-h2e4bb80_0
openjdk conda-forge/linux-64::openjdk-11.0.1-h600c080_1018
pycparser conda-forge/linux-64::pycparser-2.19-py37_1
pyopenssl conda-forge/linux-64::pyopenssl-19.1.0-py37_0
pysocks conda-forge/linux-64::pysocks-1.7.1-py37_0
requests conda-forge/linux-64::requests-2.22.0-py37_1
tabula-py conda-forge/linux-64::tabula-py-1.4.1-py37_0
urllib3 conda-forge/linux-64::urllib3-1.25.7-py37_0
xorg-fixesproto conda-forge/linux-64::xorg-fixesproto-5.0-h14c3975_1002
xorg-inputproto conda-forge/linux-64::xorg-inputproto-2.3.2-h14c3975_1002
xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-5.0.3-h516909a_1004
xorg-libxi conda-forge/linux-64::xorg-libxi-1.7.10-h516909a_0
xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.3-h516909a_1002
xorg-recordproto conda-forge/linux-64::xorg-recordproto-1.14.2-h516909a_1002

The following packages will be UPDATED:

ca-certificates 2019.6.16-hecc5488_0 --> 2019.11.28-hecc5488_0
certifi 2019.6.16-py37_0 --> 2019.11.28-py37_0
openssl 1.1.1b-h14c3975_1 --> 1.1.1d-h516909a_0



Downloading and Extracting Packages
xorg-inputproto-2.3. | 18 KB | ##################################### | 100%
ca-certificates-2019 | 145 KB | ##################################### | 100%
xorg-libxi-1.7.10 | 45 KB | ##################################### | 100%
xorg-recordproto-1.1 | 7 KB | ##################################### | 100%
openssl-1.1.1d | 2.1 MB | ##################################### | 100%
alsa-lib-1.1.5 | 544 KB | ##################################### | 100%
certifi-2019.11.28 | 148 KB | ##################################### | 100%
xorg-libxfixes-5.0.3 | 17 KB | ##################################### | 100%
xorg-libxtst-1.2.3 | 31 KB | ##################################### | 100%
chardet-3.0.4 | 167 KB | ##################################### | 100%
openjdk-11.0.1 | 172.1 MB | ##################################### | 100%
tabula-py-1.4.1 | 10.0 MB | ##################################### | 100%
cffi-1.13.2 | 220 KB | ##################################### | 100%
distro-1.4.0 | 19 KB | ##################################### | 100%
urllib3-1.25.7 | 159 KB | ##################################### | 100%
cryptography-2.8 | 625 KB | ##################################### | 100%
lcms2-2.9 | 423 KB | ##################################### | 100%
pyopenssl-19.1.0 | 83 KB | ##################################### | 100%
requests-2.22.0 | 84 KB | ##################################### | 100%
pycparser-2.19 | 171 KB | ##################################### | 100%
pysocks-1.7.1 | 27 KB | ##################################### | 100%
giflib-5.2.1 | 73 KB | ##################################### | 100%
idna-2.8 | 100 KB | ##################################### | 100%
xorg-fixesproto-5.0 | 8 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.

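3. conda install java : (this fails because these channels have no package named "java"; the openjdk package installed in step 2 already provides Java)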

Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

- java

Current channels:

- https://conda.anaconda.org/conda-forge/linux-64
- https://conda.anaconda.org/conda-forge/noarch
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/free/linux-64
- https://repo.anaconda.com/pkgs/free/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.



Note: you may need to restart the kernel to use updated packages.
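Since tabula-py needs a working Java runtime, it can help to confirm that Java is actually reachable before moving on. The following is a minimal sketch using only the Python standard library; the helper name java_version is my own and is not part of tabula-py.

```python
import shutil
import subprocess


def java_version():
    """Return the `java -version` banner, or None if Java is not usable."""
    java = shutil.which("java")  # look for the java executable on the PATH
    if java is None:
        return None
    # `java -version` usually prints its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return (result.stderr or result.stdout).strip() or None


version = java_version()
print(version if version else "Java not found; tabula-py will not be able to run")
```

If this prints a version banner, the openjdk package from the conda step is in place and the failed "conda install java" command can be ignored.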

Write the code:

Reading the Table data from PDF

from tabula import read_pdf
from tabula import convert_into

import pandas as pd

# Run a first call to check that all the packages work:

# My PDF file is named "p.pdf"

df = read_pdf("p.pdf", pages="all")

df.head()  # Show the table data (if read_pdf returns a list, as in tabula-py 2.x, use df[0].head())

# Set the encoding for the characters in the PDF.
# stream=True is the current name of the older nospreadsheet=True option:

df = read_pdf("p.pdf", encoding="latin1", pages="all", stream=True)
df.head()
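The install logs above show two different tabula-py versions (2.0.4 from pip, 1.4.1 from conda). As far as I know, the 2.x releases return a list of DataFrames from read_pdf, while 1.x returns a single DataFrame by default. Here is a small sketch of a helper that accepts either shape; first_table is a hypothetical name of my own, not a tabula-py function.

```python
def first_table(result):
    """Return the first table, whether `result` is one table or a list of tables."""
    if isinstance(result, (list, tuple)):
        if not result:
            raise ValueError("no tables were found in the PDF")
        return result[0]
    return result


# Stand-in values instead of real DataFrames, so this runs without a PDF.
print(first_table(["table 0", "table 1"]))  # list, as newer tabula-py returns
print(first_table("single table"))          # single object, as older versions return
```

With a real file you would write something like first_table(read_pdf("p.pdf", pages="all")).head().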

Extracting PDF to Dataframe CSV

# Now extract the PDF to CSV:

convert_into("p.pdf", "test_s.csv", output_format="csv", pages="all")

Note: Running this code creates a new file named test_s.csv.
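A quick way to confirm that the file was really written is to look at its first few rows. This is only a sketch with the standard library; the helper name preview_csv and the UTF-8 encoding are my assumptions.

```python
import csv
from pathlib import Path


def preview_csv(path, n=5):
    """Return up to the first n rows of a CSV file as lists of strings."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} was not created; check the convert_into call")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # zip stops after n rows without reading the rest of the file
        return [row for _, row in zip(range(n), reader)]


# Only preview the file if the conversion step has already produced it.
if Path("test_s.csv").is_file():
    for row in preview_csv("test_s.csv"):
        print(row)
```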

Exporting PDF into CSV

# OK, now read the new CSV file we just converted from the PDF:

d = pd.read_csv("test_s.csv")
d

Download and Open a New CSV File

# Save the DataFrame to a CSV file, then download and open it

df.to_csv("d_new.csv", index=True)

# P.S.: To explain:

df.to_csv : the ".to_csv" method writes the DataFrame "df" out as a CSV table

'd_new.csv' : the name of the new file; choose any name you like, ending in ".csv"

index : whether to write the row numbers 0, 1, 2, 3, ..., N as the first column
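To make the effect of index concrete, here is a small sketch on a made-up two-row DataFrame (the column names and values are illustrative, not taken from the PDF). Calling to_csv without a file name returns the CSV text instead of writing a file, which makes the difference easy to see.

```python
import pandas as pd

# Hypothetical stand-in for the table extracted from the PDF.
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

with_index = df.to_csv(index=True)      # first column holds the row numbers 0, 1, ...
without_index = df.to_csv(index=False)  # only the table's own columns

print(with_index)
print(without_index)
```

If you later read a file written with index=True back with pd.read_csv, the row numbers come back as an extra column named "Unnamed: 0"; passing index=False avoids that.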

I spent a lot of time trying to write this code for my work. Today, I have summarized all my experience, knowledge, and struggles to get the code right for anyone who wants to know.

This is a new blog post in which I try to assist anyone converting PDF files to CSV files.

If you know a newer way of converting a PDF into a CSV file, please share it so that all newbies can do their job well.

Many thanks for reading my blog… (^ _ ^)
