Data Collection

How to extract multiple tables from a PDF with Python and tabula-py

A step-by-step tutorial to extract tables from a PDF document.

Angelica Lo Duca

Published in

Analytics Vidhya

5 min read · Mar 28, 2020

Data are often not available as CSV or JSON but are embedded in a PDF file as a table. In the simplest case, you can copy the table and paste it into a text editor or spreadsheet. However, a single PDF may contain multiple tables that all share the same structure (see figure below). In this case, copying and pasting each of them separately would be very tedious.

Here the Python library tabula-py helps: it lets you extract each table separately. First, install the library by running pip install tabula-py (or pip3 install tabula-py on macOS or Linux).

Now you are ready to write your script. You can download the example from my GitHub repository.
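
As a rough idea of what such a script can look like, here is a minimal sketch. It assumes tabula-py and a Java runtime are installed, since tabula-py is a wrapper around tabula-java. The helper names extract_tables and save_tables, the output directory and the file prefix are hypothetical choices for illustration and are not taken from the repository.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="all"):
    """Return one pandas DataFrame per table that tabula-py detects."""
    # Imported lazily so that save_tables can be used without tabula-py.
    import tabula

    # multiple_tables=True asks tabula-py for a list of DataFrames,
    # one per detected table, instead of a single merged frame.
    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def save_tables(tables, out_dir, prefix="table"):
    """Write each table to its own CSV file and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table in enumerate(tables, start=1):
        # Hypothetical naming scheme: table_1.csv, table_2.csv, ...
        path = out_dir / f"{prefix}_{i}.csv"
        table.to_csv(path, index=False)
        paths.append(path)
    return paths
```

For instance, save_tables(extract_tables("document.pdf"), "output") would write one CSV file per detected table into an output folder, where document.pdf stands for a local copy of your PDF. Keeping extraction and saving in separate functions makes it easy to inspect or clean the DataFrames before writing them to disk.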

For example, we consider the document https://github.com/alod83/dj-infouma/blob/master/DataCollection/EstrazioneDaPDF/dati/Bolletino-sorveglianza-integrata-COVID-19_17-marzo-2020_appendix.pdf and we extract…
