How to Extract a Table from a PDF File to CSV Using Python
- Introduction
- Install some Packages: Tabula, Java
- Reading the Table data from PDF
- Extracting PDF to Dataframe CSV
- Exporting PDF into CSV
- Download and open a new file CSV
Introduction
We can run the code on Google Colab, PyCharm or Jupyter Notebook. My advice is to run it on Google Colab or Jupyter Notebook:
- Google Colab (opened from your Google Drive)
- Jupyter Notebook
# P/s : I do not have Jupyter Notebook installed and do not carry a laptop, which is why I ran the code on the website (^_^)
I will show you step by step how to open and run code in Jupyter Notebook on the website. Link: https://jupyter.org
Click Try it in your browser
On the page that appears, choose Try Classic Notebook
Wait a minute for the browser to refresh
It displays a basic notebook; then click File -> Open New Notebook -> Python 3
-> We need to do the following:
Install some Packages: Tabula, Java
First, we need to install some packages: Tabula and Java. Don't forget to have your PDF file ready. Let's do it.
1. pip install tabula-py :
pip install tabula-py
Requirement already satisfied: tabula-py in /srv/conda/envs/notebook/lib/python3.7/site-packages (2.0.4)
Requirement already satisfied: distro in /srv/conda/envs/notebook/lib/python3.7/site-packages (from tabula-py) (1.4.0)
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.7/site-packages (from tabula-py) (1.16.4)
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.7/site-packages (from tabula-py) (0.24.2)
Requirement already satisfied: python-dateutil>=2.5.0 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas->tabula-py) (2.8.0)
Requirement already satisfied: pytz>=2011k in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas->tabula-py) (2019.1)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from python-dateutil>=2.5.0->pandas->tabula-py) (1.12.0)
Note: you may need to restart the kernel to use updated packages.
2. conda install tabula-py :
Collecting package metadata: done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.6.14
latest version: 4.8.1
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /srv/conda/envs/notebook
added / updated specs:
- tabula-py
The following packages will be downloaded:
package | build
---------------------------|-----------------
alsa-lib-1.1.5 | h516909a_1002 544 KB conda-forge
ca-certificates-2019.11.28 | hecc5488_0 145 KB conda-forge
certifi-2019.11.28 | py37_0 148 KB conda-forge
cffi-1.13.2 | py37h8022711_0 220 KB conda-forge
chardet-3.0.4 | py37_1003 167 KB conda-forge
cryptography-2.8 | py37h72c5cf5_1 625 KB conda-forge
distro-1.4.0 | py_0 19 KB conda-forge
giflib-5.2.1 | h516909a_1 73 KB conda-forge
idna-2.8 | py37_1000 100 KB conda-forge
lcms2-2.9 | h2e4bb80_0 423 KB conda-forge
openjdk-11.0.1 | h600c080_1018 172.1 MB conda-forge
openssl-1.1.1d | h516909a_0 2.1 MB conda-forge
pycparser-2.19 | py37_1 171 KB conda-forge
pyopenssl-19.1.0 | py37_0 83 KB conda-forge
pysocks-1.7.1 | py37_0 27 KB conda-forge
requests-2.22.0 | py37_1 84 KB conda-forge
tabula-py-1.4.1 | py37_0 10.0 MB conda-forge
urllib3-1.25.7 | py37_0 159 KB conda-forge
xorg-fixesproto-5.0 | h14c3975_1002 8 KB conda-forge
xorg-inputproto-2.3.2 | h14c3975_1002 18 KB conda-forge
xorg-libxfixes-5.0.3 | h516909a_1004 17 KB conda-forge
xorg-libxi-1.7.10 | h516909a_0 45 KB conda-forge
xorg-libxtst-1.2.3 | h516909a_1002 31 KB conda-forge
xorg-recordproto-1.14.2 | h516909a_1002 7 KB conda-forge
------------------------------------------------------------
Total: 187.2 MB
The following NEW packages will be INSTALLED:
alsa-lib conda-forge/linux-64::alsa-lib-1.1.5-h516909a_1002
cffi conda-forge/linux-64::cffi-1.13.2-py37h8022711_0
chardet conda-forge/linux-64::chardet-3.0.4-py37_1003
cryptography conda-forge/linux-64::cryptography-2.8-py37h72c5cf5_1
distro conda-forge/noarch::distro-1.4.0-py_0
giflib conda-forge/linux-64::giflib-5.2.1-h516909a_1
idna conda-forge/linux-64::idna-2.8-py37_1000
lcms2 conda-forge/linux-64::lcms2-2.9-h2e4bb80_0
openjdk conda-forge/linux-64::openjdk-11.0.1-h600c080_1018
pycparser conda-forge/linux-64::pycparser-2.19-py37_1
pyopenssl conda-forge/linux-64::pyopenssl-19.1.0-py37_0
pysocks conda-forge/linux-64::pysocks-1.7.1-py37_0
requests conda-forge/linux-64::requests-2.22.0-py37_1
tabula-py conda-forge/linux-64::tabula-py-1.4.1-py37_0
urllib3 conda-forge/linux-64::urllib3-1.25.7-py37_0
xorg-fixesproto conda-forge/linux-64::xorg-fixesproto-5.0-h14c3975_1002
xorg-inputproto conda-forge/linux-64::xorg-inputproto-2.3.2-h14c3975_1002
xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-5.0.3-h516909a_1004
xorg-libxi conda-forge/linux-64::xorg-libxi-1.7.10-h516909a_0
xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.3-h516909a_1002
xorg-recordproto conda-forge/linux-64::xorg-recordproto-1.14.2-h516909a_1002
The following packages will be UPDATED:
ca-certificates 2019.6.16-hecc5488_0 --> 2019.11.28-hecc5488_0
certifi 2019.6.16-py37_0 --> 2019.11.28-py37_0
openssl 1.1.1b-h14c3975_1 --> 1.1.1d-h516909a_0
Downloading and Extracting Packages
xorg-inputproto-2.3. | 18 KB | ##################################### | 100%
ca-certificates-2019 | 145 KB | ##################################### | 100%
xorg-libxi-1.7.10 | 45 KB | ##################################### | 100%
xorg-recordproto-1.1 | 7 KB | ##################################### | 100%
openssl-1.1.1d | 2.1 MB | ##################################### | 100%
alsa-lib-1.1.5 | 544 KB | ##################################### | 100%
certifi-2019.11.28 | 148 KB | ##################################### | 100%
xorg-libxfixes-5.0.3 | 17 KB | ##################################### | 100%
xorg-libxtst-1.2.3 | 31 KB | ##################################### | 100%
chardet-3.0.4 | 167 KB | ##################################### | 100%
openjdk-11.0.1 | 172.1 MB | ##################################### | 100%
tabula-py-1.4.1 | 10.0 MB | ##################################### | 100%
cffi-1.13.2 | 220 KB | ##################################### | 100%
distro-1.4.0 | 19 KB | ##################################### | 100%
urllib3-1.25.7 | 159 KB | ##################################### | 100%
cryptography-2.8 | 625 KB | ##################################### | 100%
lcms2-2.9 | 423 KB | ##################################### | 100%
pyopenssl-19.1.0 | 83 KB | ##################################### | 100%
requests-2.22.0 | 84 KB | ##################################### | 100%
pycparser-2.19 | 171 KB | ##################################### | 100%
pysocks-1.7.1 | 27 KB | ##################################### | 100%
giflib-5.2.1 | 73 KB | ##################################### | 100%
idna-2.8 | 100 KB | ##################################### | 100%
xorg-fixesproto-5.0 | 8 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Note: you may need to restart the kernel to use updated packages.
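3. conda install java (this fails because conda has no package named java; the openjdk package installed together with tabula-py in step 2 already provides Java) :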
Collecting package metadata: done
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
- java
Current channels:
- https://conda.anaconda.org/conda-forge/linux-64
- https://conda.anaconda.org/conda-forge/noarch
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/free/linux-64
- https://repo.anaconda.com/pkgs/free/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
Note: you may need to restart the kernel to use updated packages.
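Since tabula-py needs Java to run, it can help to check that a java command is actually reachable before writing any extraction code. Here is a minimal sketch using only the Python standard library; the function name java_version is my own and not part of tabula-py.

```python
import shutil
import subprocess


def java_version():
    """Return the `java -version` banner, or None if Java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` usually prints its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


print(java_version() or "Java not found: install a JDK/JRE first")
```

If this prints a version banner, the Java side should be ready for tabula-py.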
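Write Code:
Reading the Table data from PDF
from tabula import read_pdf
from tabula import convert_into
import pandas as pd
# Run a quick test to check that all the packages work:
# My PDF file is named "p.pdf"
df = read_pdf("p.pdf", pages="all")
# tabula-py 2.x returns a list of DataFrames (one per table); 1.x returns a single DataFrame
df = df[0] if isinstance(df, list) else df
df.head()  # Show the table data
# Read again with an explicit encoding, in case the PDF has special characters:
# (stream=True is the current name for the deprecated nospreadsheet=True option)
df = read_pdf("p.pdf", encoding="latin1", pages="all", stream=True)
df = df[0] if isinstance(df, list) else df
df.head()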
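With pages set to all, a PDF can contain more than one table, and depending on the tabula-py version read_pdf gives back either a single DataFrame or a list of DataFrames. The sketch below normalizes both cases to a list so every table can be inspected; as_table_list and read_all_tables are hypothetical helper names, not tabula-py functions.

```python
def as_table_list(result):
    """Normalize read_pdf output to a list of tables (hypothetical helper)."""
    return result if isinstance(result, list) else [result]


def read_all_tables(pdf_path):
    """Read every table on every page of the PDF and return them as a list."""
    from tabula import read_pdf  # lazy import: needs tabula-py and Java
    return as_table_list(read_pdf(pdf_path, pages="all"))


# Example usage (assumes p.pdf sits next to the notebook):
# for i, table in enumerate(read_all_tables("p.pdf")):
#     print(i, table.shape)
```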
Extracting PDF to Dataframe CSV
# Now we convert the PDF to CSV:
convert_into("p.pdf", "test_s.csv", output_format="csv", pages="all")
Note: When you run this code, a new file named test_s.csv appears.
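If you have several PDF files, typing an output name for each one gets tedious. A small sketch, assuming you want each CSV saved next to its PDF under the same name: csv_path_for and pdf_to_csv are helper names I made up, and only convert_into comes from tabula-py.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return the PDF path with its suffix replaced by .csv (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".csv")


def pdf_to_csv(pdf_path):
    """Convert every page of a PDF to a CSV next to it and return the CSV path."""
    from tabula import convert_into  # lazy import: needs tabula-py and Java
    out = csv_path_for(pdf_path)
    convert_into(str(pdf_path), str(out), output_format="csv", pages="all")
    return out


# Example usage: pdf_to_csv("p.pdf") would write p.csv
```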
Exporting PDF into CSV
# OK, now we read the new CSV file that was just converted from the PDF:
d = pd.read_csv("test_s.csv")
d
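Besides looking at the table in the notebook, a quick sanity check of the exported file is to read its header and count its rows. This is a minimal sketch with the standard csv module; summarize_csv is a hypothetical helper, and it assumes the file is UTF-8 encoded.

```python
import csv


def summarize_csv(path):
    """Return (header, number_of_data_rows) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        return [], 0
    return rows[0], len(rows) - 1


# Example usage: header, n_rows = summarize_csv("test_s.csv")
```

An empty header or a row count of 0 suggests the conversion did not find a table.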
Download and open a new file CSV
# We save the DataFrame to a CSV file, then download and open it
df.to_csv("d_new.csv", index=True)
# P/s : To explain
df.to_csv : "df" is the name of the data, and ".to_csv" converts it to a CSV table
'd_new.csv' : the name of the new file; choose any nice name ending in ".csv"
index : index=True writes the row numbers 0, 1, 2, 3, ..., N as the first column
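To see what the index argument changes, here is a small sketch on a made-up two-row DataFrame (the column names and values are only for illustration, they are not from p.pdf). When no file name is given, to_csv returns the CSV text instead of writing a file, which makes the difference easy to print.

```python
import pandas as pd

# Illustrative data only
df_demo = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

with_index = df_demo.to_csv(index=True)      # row numbers become the first column
without_index = df_demo.to_csv(index=False)  # only the data columns are written

print(with_index)
print(without_index)
```

Use index=False if you do not want the extra column of row numbers in the file.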
I spent a lot of time trying to write code for my work. Today, I am summarizing all my experience, knowledge and struggles to get the code right for anyone who wants to know.
This is a new blog post in which I try to assist anyone converting PDF files to CSV files.
If you have a newer way of converting a PDF into a CSV file, please share it so that all newbies can do their job well.
Many thanks for reading my blog… (^ _ ^)