Get / Scrape data from a PDF document using Python + Tabula + CSV into JSON format

Hanief
Remote Worker Indonesia
11 min read · Sep 6, 2021

Don’t stop what you are doing right now for more than 2 days. Or even better -if you can- don’t ever stop, because it will get pretty hard if you try to start again. Believe me. — a friendly reminder

Good day friends!

Today we will explore how to get / scrape data from PDF documents using Python and Tabula.

The case we use in this article is about getting regional data of Indonesia from a PDF document published by the Ministry of Home Affairs in 2019: Regulation of The Minister of Home Affairs of the Republic of Indonesia Number 72 of 2019 on the Amendment to Regulation of The Minister of Home Affairs Number 137 of 2017 Concerning Regional Codes and Government Administration Data.

You can get the document here (PDF - official source).

Map of Provinces in Indonesia

Before we start, one thing we must know is some basic information about regional classifications in Indonesia. There are 4 regional classification levels:

  1. Provinsi (Province),
  2. Kabupaten or Kota (Regency or City),
  3. Kecamatan (District),
  4. Kelurahan or Desa (Administrative Village or Village)

A province has many regencies or cities, a regency or city has many districts, and a district has many administrative villages or villages.

Have you taken a look at the PDF document? Let’s take a peek together!

Page 1 — The Prologue
Page 6 — The Summary
Page 8 — First page of data table of Aceh Province

As we can see, the document is 3933 pages long. The regional data starts on page 8, which shows the Aceh province data table, and runs through page 3933, which shows the summary of West Papua province data.

OK, so the first thing we need to do is list how many provinces are in the document, learn the structure of the data in the table, and map each province’s page range.

On page 6, we can see there are 34 provinces with a lot of child regions.

On page 8, we can see the data in the table is structured like this (the list number represents the column number):

  1. The region code, with these patterns: 2 digits for a province (xx), 4 digits with 1 point for a regency or city (xx.xx), 6 digits with 2 points for a district (xx.xx.xx), and 10 digits with 3 points for an administrative village or village (xx.xx.xx.xxxx)
  2. Province name / Regency name or City name
  3. Number of regencies in the province (but mostly this column is blank)
  4. Number of cities in the province (but mostly this column is blank)
  5. District name
  6. Administrative Village name
  7. Village name
  8. Region area in km²
  9. Population in region
  10. Notes or description
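The region code patterns in column 1 can be checked with a short helper. Here is a sketch (the helper name and regexes are mine, derived from the patterns above):

```python
import re

# Map each region level to its code pattern, as described in the column list:
# xx (province), xx.xx (regency/city), xx.xx.xx (district),
# xx.xx.xx.xxxx (administrative village or village).
PATTERNS = {
    "province": re.compile(r"^\d{2}$"),
    "regency_or_city": re.compile(r"^\d{2}\.\d{2}$"),
    "district": re.compile(r"^\d{2}\.\d{2}\.\d{2}$"),
    "village": re.compile(r"^\d{2}\.\d{2}\.\d{2}\.\d{4}$"),
}

def region_level(code: str):
    """Return the region level for a code, or None if the code is invalid."""
    for level, pattern in PATTERNS.items():
        if pattern.match(code):
            return level
    return None
```

This will come in handy later, because a valid code in the first column is what separates real region rows from blanks and leftovers.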

And at the end of a province’s data table, there’s a summary table for the province — we can use it to check whether our code got all the data — like the summary of Aceh province data on page 275 below.

Page 275 — Summary table of Aceh Province

After scrolling and scrolling through the document, we get the list of provinces and the pages they’re on:

  1. Aceh : page 8–274
  2. Sumatera Utara : page 287–582
  3. Sumatera Barat : page 601–653
  4. Riau : page 662–786
  5. Jambi : page 794–864
  6. Sumatera Selatan : page 872–1022
  7. Bengkulu : page 1034–1106
  8. Lampung : page 1114–1233
  9. Bangka Belitung : page 1244–1259
  10. Kepulauan Riau : page 1264–1284
  11. DKI Jakarta : page 1290–1299
  12. Jawa Barat : page 1304–1524
  13. Jawa Tengah : page 1547–1823
  14. Daerah Istimewa Yogyakarta : page 1843–1858
  15. Jawa Timur : page 1864–2144
  16. Banten : page 2166–2228
  17. Bali : page 2236–2260
  18. Nusa Tenggara Barat : page 2265–2307
  19. Nusa Tenggara Timur : page 2315–2468
  20. Kalimantan Barat : page 2481–2565
  21. Kalimantan Tengah : page 2574–2642
  22. Kalimantan Selatan : page 2650–2727
  23. Kalimantan Timur : page 2735–2777
  24. Kalimantan Utara : page 2784–2801
  25. Sulawesi Utara : page 2806–2893
  26. Sulawesi Tengah : page 2902–3004
  27. Sulawesi Selatan : page 3014–3134
  28. Sulawesi Tenggara : page 3147–3268
  29. Gorontalo : page 3280–3315
  30. Sulawesi Barat : page 3321–3348
  31. Maluku : page 3353–3418
  32. Maluku Utara : page 3426–3486
  33. Papua : page 3494–3789
  34. Papua Barat : page 3810–3924

We have found what we want to get from the document and the pattern of the data. So the next step is writing the code to extract it.

For the code, I use PyCharm (version 2020.3.5), and the first thing we need to do is create the project and set up the environment.

Create New Project — indonesia_region

Wait a minute while PyCharm sets up our project and environment. After the project is ready, let’s download the PDF file from here and put it in our project folder.

Download the PDF file

As we saw before, the data we want sits inside tables in the PDF. To read table data from a PDF document, there’s a nice package called tabula-py. Let’s add it to our environment.

Select the menu “File > Settings”; when the Settings window shows, select “Project: indonesia_region > Python Interpreter”.

Python Interpreter

Click the [+] symbol at the bottom left to add a new package, find tabula-py, then install it.

Install tabula-py package
tabula-py package successfully installed!

OK, the environment is ready. Let’s code!

The concept in my mind is: for the first step, get the data from the PDF document and put it into CSV files. The second step is parsing and cleaning the data, then turning it into JSON format.

For the first step, let’s try using the tabula package to get the data in the table on page 8.
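A minimal sketch of that first read (the local PDF filename here is my assumption, and tabula-py needs a Java runtime installed; the tabula call sits under a main guard so the constants stay importable):

```python
PDF_FILE = "permendagri-72-2019.pdf"  # assumed name of the downloaded PDF
FIRST_DATA_PAGE = 8  # first page of the Aceh data table

if __name__ == "__main__":
    import tabula  # tabula-py, installed above; requires a Java runtime

    # read_pdf returns a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf(PDF_FILE, pages=FIRST_DATA_PAGE)
    print(tables[0])
```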

Let’s run the code. The output we get is like this:

Hmm… it’s quite hard to read, isn’t it? Even if we parse it into JSON, it’s still hard to make sense of.😅

Alright, let’s try to convert it into CSV format or file.
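tabula-py also provides convert_into, which writes a page’s tables straight to a CSV file. A sketch (the filenames are my assumptions, kept consistent with the article; the helper makes the naming reusable):

```python
def page_csv_name(province_slug, page):
    """Build the per-page CSV filename used in this article."""
    return f"{province_slug}_page_{page}.csv"

if __name__ == "__main__":
    import tabula  # tabula-py; requires a Java runtime

    # convert_into writes the tables detected on the page directly to CSV.
    tabula.convert_into("permendagri-72-2019.pdf", page_csv_name("aceh", 8),
                        output_format="csv", pages=8)
```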

Run the code and let’s open the CSV file ‘aceh_page_8.csv’ in PyCharm.

Whoa, nice… it’s much neater and more readable than the JSON result.😄

Let’s move on to the next step: getting all of the Aceh province data into CSV.
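The same convert_into call accepts a page range string, so one call can cover the whole province (again, the filenames are my assumptions; the range comes from the province list above):

```python
ACEH_PAGES = "8-274"  # Aceh's page range from the province list

if __name__ == "__main__":
    import tabula  # tabula-py; requires a Java runtime

    # A page range string like "8-274" pulls every Aceh table in one call.
    tabula.convert_into("permendagri-72-2019.pdf", "11-aceh.csv",
                        output_format="csv", pages=ACEH_PAGES)
```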

Run the code, and here is the result (for a neater, prettier CSV view, you can download the CSV file and open it on this site).

I think the code is working fine. The next step is to get all of the region data into CSV files. To keep the project files organized, let’s put the CSV files into a “csv” folder.
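The loop over all provinces can be sketched like this. Note: the two-digit codes are the standard Kemendagri province codes, but only Aceh’s “11” appears explicitly in this article, so treat the rest as assumptions to verify against the document:

```python
import os

# Page ranges transcribed from the province list above.
PROVINCES = [
    ("11", "aceh", "8-274"),
    ("12", "sumatera-utara", "287-582"),
    ("13", "sumatera-barat", "601-653"),
    # ... and so on for the remaining provinces in the list above ...
    ("92", "papua-barat", "3810-3924"),
]

def csv_jobs(folder="csv"):
    """Yield (output_path, page_range) pairs, one per province."""
    for code, slug, pages in PROVINCES:
        yield os.path.join(folder, f"{code}-{slug}.csv"), pages

if __name__ == "__main__":
    import tabula  # tabula-py; requires a Java runtime

    os.makedirs("csv", exist_ok=True)
    for out_path, pages in csv_jobs():
        # One convert_into call per province; 3900+ pages takes a while.
        tabula.convert_into("permendagri-72-2019.pdf", out_path,
                            output_format="csv", pages=pages)
```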

Run the code and then relax for a while. It will take some time for the code to complete its work.

After the code is finished, this is what our project looks like.

Project File and Code

Alright, we have successfully got the data from the PDF document.

In the second step, we will read the data in the CSV files and parse it into neat JSON-format data.

For the parsing process, I think we need another Python file, because my hunch says the code will be a long one. So, let’s create the file ‘parse.py’.

Let’s start with how to get the Aceh province data from the file 11-aceh.csv. Let’s take a closer, more thorough look at the PDF document to better understand the data patterns.
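In parse.py, the first step can be as simple as loading the CSV rows as plain lists of strings (a sketch; the helper name is mine):

```python
import csv

def load_rows(path):
    """Read a tabula-generated CSV and return its rows as lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```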

Pattern of the data with color

There are patterns we can extract from the Aceh province data in the image above:

  1. For every valid region (province, regency or city, etc.), the first column must be a valid region code (such as 11, 11.01, 11.01.01, 11.01.01.2001, etc.) and not blank.
  2. A province name is placed in the second column. (Red box, Province: 11 — ACEH)
  3. A regency or city name is placed in the second column with a number in front of it, the number of its districts in the fifth column, and the number of its administrative villages or villages in the seventh column. (Orange box, Regency: 11.01 — KAB. ACEH SELATAN, which has 18 districts, 0 administrative villages, and 260 villages) — FYI, it’s a Kabupaten (Regency), not a Kota (City). Yes, they are the same level, but… if the first word of the region name is ‘KAB.’ it’s a Regency (KAB is an abbreviation of KABUPATEN), and if the first word is ‘KOTA’, it’s a City.
  4. A district name is placed in the fifth column with a number in front of it, the number of its administrative villages in the sixth column, and the number of its villages in the seventh column. (Green box, District: 11.01.01 — Bakongan, 0 administrative villages and 7 villages)
  5. Village names are placed in the seventh column with a number in front of them. (Light blue box, Villages: 11.01.01.2001 — Keude Bakongan, 11.01.01.2002 — Ujong Mangki, 11.01.01.2003 — Ujong Padang, 11.01.01.2004 — Gampong Drien)
  6. Village names without a region code are no longer registered in this district; based on the Keterangan / Notes column (blue box), their administrative data has been moved to a new district (Kec. Kota Bahagia).
  7. The village names after that are a continuation of the district’s villages. (2nd light blue box, Villages: 11.01.01.2015 — Darul Ikhsan, 11.01.01.2016 — Padang Beurahan, 11.01.01.2017 — Gampong Baro)
  8. A 2nd district name is placed in the fifth column with a number in front of it, the number of its administrative villages in the sixth column, and the number of its villages in the seventh column. (2nd green box, District: 11.01.02 — Kluet Utara, 0 administrative villages and 21 villages)
  9. The 3rd set of village names is placed in the seventh column with numbers in front of them. (3rd light blue box, Villages: 11.01.02.2001 — Fajar Harapan, 11.01.02.2002 — Krueng Batee, 11.01.02.2003 — Pasi Kuala Asahan, 11.01.02.2004 — Gunung Pulo)
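Pattern 3 above, for example, could be turned into a dict like this (a sketch; the column indices follow the page-8 layout described above, and the helper name is mine — the leading sequence numbers in names are ignored here for simplicity):

```python
def parse_regency_row(row):
    """Parse a regency/city row: code, name, district and village counts."""
    name = " ".join(row[1].split())  # also collapses stray "\n" from tabula
    return {
        "code": row[0].strip(),
        "name": name,
        # "KAB." marks a Kabupaten (regency); otherwise treat it as a city.
        "type": "regency" if name.startswith("KAB.") else "city",
        "districts": int(row[4] or 0),  # 5th column
        "villages": int(row[6] or 0),   # 7th column
    }
```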

Let’s compare the PDF document (page 8) with the CSV we got (11-aceh.csv).

Hmm… alright, most of the data pattern is similar; the difference is just the position of the number in the region name — in the CSV, the number comes after the region name (for region names other than the province).

Let’s write code based on this. Create a ‘json_result’ folder; we will save the result as a JSON file in that folder.
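Saving the parsed structure could look like this (a sketch; the folder name follows the text, while the helper and output filename are mine):

```python
import json
import os

def save_json(data, name, folder="json_result"):
    """Write parsed region data to <folder>/<name>.json and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{name}.json")
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Indonesian names readable in the file.
        json.dump(data, f, ensure_ascii=False, indent=2)
    return path
```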

Let’s review the code’s result. Because it is quite long, you can see the full result here.

Let’s look at the province result first.

Nothing wrong with it; I think the province part is done (for now, for the Aceh case). Let’s move on to the Regency and City results.

Hmmm… The region name (kab_name) field needs some ‘cleaning’; we must remove the “\n” string.

Wait a minute… why is there just one result from our code? From what we see in the PDF document, there are 23 regencies or cities in Aceh province, but the regency we got is just the one on page 8. We must find out why the regencies on the next pages are missed and fix it. Let’s take a look at the page 9 CSV first.

Whoa… it’s quite a mess, isn’t it? So… I think we must create conditionals for the pattern on page 8 (let’s just name it pattern-1) and the one on page 9 (pattern-2).

As you can see in the code, we create some new functions to parse region data based on the patterns we found, and then a function to turn the region data into neat JSON format (province, regency or city, district, and administrative village or village). On line 128, we set a trap for when the parsing process finds a new pattern or a new data structure within pattern-2.
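The “trap” idea can be sketched like this: try every known pattern parser, and raise loudly on any row none of them recognizes, so new layouts surface immediately instead of being silently skipped (helper names are mine):

```python
def parse_row(row, parsers):
    """Try each known pattern parser in order; raise if none matches."""
    for parser in parsers:
        result = parser(row)
        if result is not None:
            return result
    # The trap: an unhandled layout stops the run with the offending row.
    raise ValueError(f"Unhandled row pattern: {row!r}")
```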

OK, let’s see the result of our code. You can see it here (it’s quite long, around 2800 lines).

Hmmm… let’s start with this case.

['11.01.17.2001', '', '', '', '', '', '', '', '', '', '1', '', '', '', '']

In the PDF document, it’s on page 18. Let’s see the CSV data for page 18.

It’s a brand new pattern, a little similar to pattern-2, but with the string “KABUPATEN / KOTA” and the region name in different positions. Let’s update our code for this case (let’s name it pattern-3).

We set a trap for pattern-3 on line 142.

Run our code, and here is the result. Still quite long, around 1870 lines. OK, let’s move on to our next case (let’s complete pattern-2 first):

['11.02.08.2001', '', '', '', '', '', '', '1', '', '', '', '']

In the PDF document, it’s on page 27. This is the CSV data for page 27.

The header is similar to the pattern-2 header, so let’s group it under the pattern-2 code and add parsing code for this CSV structure.

Run our code, and here is the result. Alright, just around 130 lines left; I think pattern-2 is almost done. Our next case is:

['11.75.04.2009', '', '', '', '', '', '', '', '', '9', '', '', '', '', '']

In the PDF document, it’s on page 274. This is the CSV data for page 274.

Again, its header structure is the same as pattern-2’s. Let’s add parsing code for this CSV structure.

Run our code, and here is the result. Alright, just around 90 lines to go and pattern-2 is done. Let’s move on to the pattern-3 case.

['11.02', '', '', '', '', '', '16', '', '', '-', '', '385', '4.231,43', '225.139', '']

It’s at the bottom of page 8 in the PDF document.

Whoa, whoa, whoa… wait a minute. I think we missed this case when we made the solution for page 18. Let’s revise our code for pattern-3.

Alright, only around 45 lines are left to solve. (You can see the result here.) Let’s move on to the next pattern:

['11.06.23.2006', '', '', '', '', '', '', '', '', '6', '', '', '', '', '']

In the PDF document, it’s on page 97. This is the CSV data for page 97.

Hmmm… it’s similar to another data structure in pattern-3, but the village name position is shifted one column forward. Let’s handle it!

Run the code… Alright! No more messages asking for another parser!

You can see the full JSON result here. Let’s compare our data with the summary in the PDF document for Aceh province (page 7). To help me compare the data, I use this website: copy the JSON file content, then press the “Beautify” button. This is the comparison.

1 province, 23 regencies (or cities), 289 districts, and 6497 villages (or administrative villages).

Phew… it’s done, and with the correct totals and data. This is what our final code looks like.

For the other regions, I’ll let you explore by yourself. The parsing process shouldn’t be much different, because the patterns in the PDF document are mostly similar.

In summary, we have explored how to get data from a PDF document, turn it into CSV files, and parse it into JSON-ready data. Maybe our next step is writing code to save it into a database.

Thank you for taking the time to read. Join us again next time as we explore another interesting case! 😉
