Never run away from data in ‘ugly’ PDF files again

Hacks/Hackers Nairobi demonstrates simple tools for journalists and researchers who want to liberate data locked inside PDFs.

Soila Kenya
Hacks/Hackers Africa
5 min read · Oct 9, 2017


Data never sleeps. Approximately 2.5 quintillion bytes of data are produced every day. That’s a 25 followed by 17 zeros (2.5 × 10¹⁸ bytes), roughly equivalent to over half a billion HD movie downloads. And this number is only set to increase.

The challenge for journalists is to sift through all this information for the stories that matter.

To make things harder, governments, NGOs and other stakeholders still tend to publish important data in non-machine-readable formats, such as PDFs. Data presented this way may satisfy open-governance and transparency requirements, but it can’t easily be imported into a spreadsheet, so it is hard for journalists to verify, analyse and draw conclusions from.

Which is why Hacks/Hackers Nairobi dedicated its latest meetup to “data scraping”, the art of pulling data from different places and making it easy to examine.

da.ta scrap.ing

/ˈdadə,ˈdādə ˈskrāpiNG/

noun

A technique in which a computer program extracts data from human-readable output coming from another program.
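For readers who do write code, the idea fits in a few lines of Python. This is a minimal sketch, not a production scraper, and the URL is a placeholder for any page with a simple HTML table:

```python
# A minimal sketch of data scraping with Python's standard library.
# The URL is a placeholder: any page with a simple HTML table would do.
from html.parser import HTMLParser
from urllib.request import urlopen


class CellCollector(HTMLParser):
    """Collect the text of every table cell (<td>/<th>) on a page."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())


# Fetch the human-readable page, then pull out the structured cells.
html = urlopen("https://example.com/some-table.html").read().decode("utf-8")
parser = CellCollector()
parser.feed(html)
print(parser.cells)
```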

October #HHNBO Trainer, Andrew Kamau

Andrew Kamau, an all-round techie and co-founder of Pia, a messaging-first service delivery platform, took the #HHNBO community through a few simple steps (below) to extract tables from a PDF and load them into a spreadsheet program such as Microsoft Excel.

In this example, the data being scraped came from Kenya’s County Government Budget for 2015–2016.

1. Download and install Tabula

Tabula is an application that lets you extract tabular data from PDFs into spreadsheets. It is available for Windows and Mac operating systems here.

2. Import your data

With the application open, use the import tools to load the PDF you want into Tabula. Once it appears, click the ‘Extract’ button.

3. Select your data

Drag to select the data within the PDF that you want to use. After making the selection, click the ‘Preview & Export Extracted Data’ button.

4. Export your table

Once your table has been extracted from the PDF, you can export it in CSV format onto your computer.
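If you would rather script this workflow, tabula-py, a Python wrapper around Tabula’s extraction engine, can do the same job in a few lines. A minimal sketch, assuming a file called budget.pdf (the library needs Java installed):

```python
# The same extraction, scripted with tabula-py (pip install tabula-py),
# a Python wrapper around Tabula's engine. It needs Java installed.
# "budget.pdf" is a stand-in for your own file.
import tabula

# Read every table on every page into a list of pandas DataFrames...
tables = tabula.read_pdf("budget.pdf", pages="all")
print(tables[0].head())

# ...or convert the whole PDF straight to CSV, as in step 4 above.
tabula.convert_into("budget.pdf", "budget.csv", output_format="csv", pages="all")
```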

C’est fini!

You can now open your table in Microsoft Excel and organise, sort, filter and analyse it to your heart’s desire. However, be warned that you may have to do some cleaning on the exported data.
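What that clean-up involves depends on the PDF, but a few lines of pandas cover the usual suspects. A sketch, assuming a budget.csv export with hypothetical ‘County’ and ‘Budget’ columns:

```python
# Typical clean-up of a Tabula CSV export, using pandas.
# The column names ("County", "Budget") are hypothetical; use your own.
import pandas as pd

df = pd.read_csv("budget.csv")

# Strip stray whitespace from headers and text cells.
df.columns = df.columns.str.strip()
df["County"] = df["County"].str.strip()

# Turn "1,234,567" strings into real numbers so you can sort and sum.
df["Budget"] = pd.to_numeric(df["Budget"].str.replace(",", ""), errors="coerce")

# Drop rows where extraction produced no usable figure, then save.
df = df.dropna(subset=["Budget"])
df.to_csv("budget_clean.csv", index=False)
```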

Participants of October #HHNBO being taken through the steps

Bonus tool…

As an additional skill, Andrew also demonstrated how to use Import.io, a web-based platform for extracting data from websites without writing any code.

#HHNBO extracted data from Wikipedia’s List of countries by tea consumption per capita in a few simple steps.

1. Sign up on Import.io

You can use your Google or Facebook account to set up your account here.

2. Input your website URL

Copy and paste the URL of the website you want to extract data from into Import.io then click ‘Go’.

3. Select the data you want to extract

On the pop-up tab, clear all columns, then re-select only the ones that are relevant to you: in this case, the ‘Country’ and ‘Tea consumption’ columns. Click the first cell of the data you want and the rest of the column will be selected automatically. Then click the ‘Extract data from website’ button.

4. Choose your preferences then extract

Import.io will ask whether you want your extracted data to be refreshed whenever the data at the source URL changes. After selecting your preferences, click the ‘Save and run’ button.

C’est fini!

You should now be able to access your data in the form of a Microsoft Excel sheet on your computer.
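Coders can skip Import.io entirely for a page like this: pandas ships a read_html function that scans a page for HTML tables. A sketch, noting that the [0] index is an assumption that the tea table is the first on the page, so check and adjust:

```python
# Pulling the same Wikipedia table with pandas, which scans a page
# for <table> elements (requires the lxml parser: pip install lxml).
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_tea_consumption_per_capita"
tables = pd.read_html(url)

# [0] assumes the consumption table is the first on the page; adjust if not.
tea = tables[0]
print(tea.head())
tea.to_csv("tea_consumption.csv", index=False)
```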

Congratulations! You’re now a data scraping BOSS!

Well, not yet. All these tools are only useful if journalists use them to improve their reporting. One example of a project built with these nifty tools is the Biscuit Index by Code for Kenya, which shows Kenyans how much money each county government spends on ‘tea and biscuits’ (i.e. hospitality). It also lets you compare that spending not just with similar items the same amount would buy, but with other ways the money could be better used.

Interested in learning more about data journalism? Sign up for StoryLab Academy here!

The worlds of hackers and journalists are coming together, as reporting goes digital and internet companies become media empires.

Journalists call themselves “hacks”, a term for someone who can churn out words in any situation. Hackers use the digital equivalent of duct tape to whip out code.

Hacker-journalists try to bridge the two worlds. Hacks/Hackers Africa aims to bring all these people together: those who are working to help people make sense of our world. It’s for hackers exploring technologies to filter and visualise information, and for journalists who use technology to find and tell stories. In an age of information overload and the collapse of traditional business models for legacy media, their work has become even more crucial.

Code for Africa, the continent’s largest #OpenData and civic technology initiative, recognises this and is spearheading the establishment of a network of Hacks/Hackers chapters across Africa to help bring together pioneers for collaborative projects and new ventures.

Follow Hacks/Hackers Africa on Twitter and Facebook and join the Hacks/Hackers community group today.
