Data Collection

How to extract multiple tables from a PDF with Python and tabula-py

A step-by-step tutorial to extract tables from a PDF document.

Angelica Lo Duca
Published in Analytics Vidhya
5 min read · Mar 28, 2020


Photo by Wesley Tingey on Unsplash

Often your data are not available as CSV or JSON but are contained in a PDF file in the form of a table. In the simplest case, the table can be copied and pasted into a text editor or spreadsheet. However, the same PDF may contain multiple tables that all share the same structure (see figure below). In this case, copying and pasting each of them separately would be very tedious.

Here, the Python library tabula-py helps you extract each of these tables separately. First, install the library by typing pip install tabula-py, or pip3 install tabula-py if you have a Mac or a Linux OS.
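Since tabula-py is a wrapper around tabula-java, it also needs a Java runtime on the machine. The following is a minimal sketch of a check you could run after installing; the function name check_environment is my own choice for illustration and is not part of tabula-py.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether tabula-py and a Java executable can be found."""
    return {
        # True if the 'tabula' module (installed by tabula-py) is importable
        "tabula-py": importlib.util.find_spec("tabula") is not None,
        # True if a 'java' executable is on the PATH
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, install it before moving on to the script.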

Now you are ready to write your script. You can download the example from my GitHub repository.
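As a rough idea of what such a script might look like, here is a minimal sketch that reads every table in a PDF and saves each one to its own CSV file. It relies on tabula.read_pdf with multiple_tables=True, which returns a list of pandas DataFrames. The helper names (table_csv_path, extract_tables, save_tables), the output naming scheme and the file name report.pdf are assumptions made for illustration; they are not taken from the example in the repository.

```python
from pathlib import Path


def table_csv_path(pdf_path, index, out_dir="."):
    """Build an output path such as report_table_1.csv for the index-th table."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_table_{index}.csv"


def extract_tables(pdf_path, pages="all"):
    """Return a list of DataFrames, one per table that tabula detects."""
    # Imported here so the path helper above works without tabula-py installed
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def save_tables(pdf_path, out_dir="."):
    """Write each extracted table to its own CSV file and return the paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    for i, table in enumerate(extract_tables(pdf_path), start=1):
        target = table_csv_path(pdf_path, i, out_dir)
        table.to_csv(target, index=False)
        saved.append(target)
    return saved


# Hypothetical usage, assuming a local file named report.pdf:
# save_tables("report.pdf", out_dir="tables")
```

Keeping one DataFrame per table makes it easy to inspect or clean each table separately before merging them, which suits the case where all tables share the same structure.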

For example, we consider the document https://github.com/alod83/dj-infouma/blob/master/DataCollection/EstrazioneDaPDF/dati/Bolletino-sorveglianza-integrata-COVID-19_17-marzo-2020_appendix.pdf and we extract…
