A daunting job

[update] OK, this work produced a result: a SPARQL endpoint containing my data. It's here: http://lod.openpuglia.org/
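For anyone who wants to poke at it, here's a minimal sketch of querying the endpoint from Clojure, using only the standard library. I'm assuming the usual SPARQL protocol with a /sparql query path; check the site for the actual query URL.

```clojure
;; Minimal sketch: send a SPARQL SELECT over HTTP and return the raw response.
;; The /sparql path is an assumption; the real query URL may differ.
(import '(java.net URLEncoder))

(defn sparql-select [endpoint query]
  ;; slurp resolves http(s) strings as URLs, so no extra libraries are needed
  (slurp (str endpoint "?query=" (URLEncoder/encode query "UTF-8"))))

(comment
  ;; Peek at ten arbitrary triples, in whatever format the server defaults to.
  (sparql-select "http://lod.openpuglia.org/sparql"
                 "SELECT * WHERE { ?s ?p ?o } LIMIT 10"))
```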

It's not perfect; there's still work to be done.

It still doesn't contain the data for 2014 and 2015.

I couldn't work out where some sensors were located. I managed to file a freedom-of-information request with Arpa Puglia about those. We'll see.

In the end I didn't use Grafter; I wrote the .ttl on my own. But Tommaso Di Noia chose to work on the CSV file I had provided him with.

In the future I'll have to understand exactly how he shaped the linking and try to produce a complete .ttl file myself.
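For the curious, writing the .ttl by hand from Clojure boiled down to string templating, roughly like the sketch below. The URIs and property names here are placeholders for illustration, not the vocabulary actually used on the endpoint.

```clojure
;; Placeholder URIs and properties, purely for illustration.
(defn measurement->ttl
  "Renders one parsed measurement row as a Turtle snippet."
  [{:keys [id station pollutant value datetime]}]
  (str "<http://example.org/measurement/" id ">\n"
       "    <http://example.org/prop/station> <http://example.org/station/" station "> ;\n"
       "    <http://example.org/prop/pollutant> \"" pollutant "\" ;\n"
       "    <http://example.org/prop/value> " value " ;\n"
       "    <http://example.org/prop/datetime> \"" datetime "\" .\n"))

(defn write-ttl!
  "Writes a seq of measurement maps to out-file as Turtle."
  [rows out-file]
  (spit out-file (apply str (map measurement->ttl rows))))
```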

Also, Tommaso told me that a bunch of smaller files would be preferable to a single huge file.

Last but not least, Tommaso took part in the Open Data Day 2016 that was held in Taranto and gave a speech about this endpoint and the data it contains.

See you in Jerusalem, people.

I was born here.

This is where the largest steel plant in Europe is (and a refinery and a concrete plant too).

It's a small town. About 200 thousand inhabitants. There were 300 thousand in the eighties. It's dwindling.

In the trial against the steel plant management, some experts recently stated that statistically the plant killed around 80 people a year.

With spikes among the workers and inhabitants of the neighborhood around the plant.

Cardiovascular diseases, tumors, stuff like that.

There's a governmental agency with a mandate to monitor the environmental situation: air, water, soil. It's Arpa Puglia (Puglia is the name of the region; the agency is regional).

Over the past few years, the agency has managed to be somewhat effective against dioxins.

The current issue is PAHs (polycyclic aromatic hydrocarbons). They are out of control.

A few days ago a volunteer association measured 170 ng/m3. That's a lot.

Now, the Arpa agency has measuring stations scattered around the region. At least 3 of them are in this town.

They publish the measurements on their web site. A year later. In July 2015 they published the data for 2014.

They're hourly averages. The machines take tens of measurements per minute, but they only publish the hourly average.

They also publish the daily average every evening.

I suppose they might publish the hourly average every evening too, but no. They don't.

It is what it is.

Now, the daunting task is to take the spreadsheets published by Arpa and turn them into an RDF file publishable as Linked Open Data (LOD, to friends).

An artificial intelligence professor from the nearest university is helping me. I didn't even know what this semantic web stuff was all about.

I had only incidentally run into logic programming when reading SICP.

Fascinating, by the way. That convinced me to order The Reasoned Schemer.

That and the marvelous “uncourse” run by the amazing William Byrd. I like to think it's the best legacy SICP has produced.

Back to the point: I wrote a textbook piece of Clojure code to do just that.

I screwed up at least one thing. I opened ALL the spreadsheets and exported them to CSV by hand, then committed the files to GitHub.

I even tried to do it with a Python library, but that screwed up too.

I investigated the possibility of driving LibreOffice programmatically from a script to export the sheets in batch mode. I failed.
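For the record, the route I failed at looks roughly like this: LibreOffice has a headless conversion mode that can be driven from a script. The paths below are made up, and, as far as I understand, --convert-to csv only exports one sheet per workbook, which is part of why this is awkward for multi-sheet files.

```clojure
;; Sketch: drive a headless LibreOffice from Clojure to convert spreadsheets.
;; Paths are invented; --convert-to csv seems to export only one sheet each.
(require '[clojure.java.shell :refer [sh]]
         '[clojure.java.io :as io])

(defn xls->csv!
  "Asks headless LibreOffice to convert one spreadsheet to CSV in out-dir."
  [xls-file out-dir]
  (sh "libreoffice" "--headless" "--convert-to" "csv"
      "--outdir" out-dir (str xls-file)))

(comment
  ;; Convert every .xls in a directory.
  (doseq [f (file-seq (io/file "data/arpa"))
          :when (.endsWith (.getName f) ".xls")]
    (xls->csv! f "data/csv")))
```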

I even mused about connecting a LibreOffice spreadsheet to a PostgreSQL database and pouring the data into it. I think it's possible, but I didn't know where to start.

So I did it by hand.

Only later did I discover that Excel files can be read from Clojure directly.
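The library I have in mind is Docjure (dk.ative/docjure), a Clojure wrapper around Apache POI. A sketch of what reading the workbooks directly could look like, with invented file names and column letters:

```clojure
;; Sketch: read every sheet of an .xls workbook into row maps with Docjure.
;; File name and column letters are invented for the example.
(require '[dk.ative.docjure.spreadsheet :as xls])

(defn read-measurements
  "Reads every sheet of a workbook into a seq of row maps."
  [path]
  (->> (xls/load-workbook path)
       xls/sheet-seq
       (mapcat (partial xls/select-columns
                        {:A :datetime :B :station :C :pollutant :D :value}))
       ;; drop header rows, which come out with a string in :value
       (remove #(string? (:value %)))))

(comment
  (take 3 (read-measurements "data/arpa/2013.xls")))
```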

Oh well. What's done is done.

You can only start to imagine how dirty the data were.

Of course the order of the columns was different for every sheet.

The acronyms for the pollutants were different everywhere, even for the same substance.

The same goes for the names of the places where the stations are located. In fact, there are still some stations, decommissioned or moved in the past, whose original location I couldn't work out. I was considering filing a formal freedom-of-information request in order to provide a complete dataset.
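The fix for the pollutant acronyms (and for the station names I could identify) was boring but effective: a lookup table from every spelling I met to one canonical name, and a loud failure for anything unknown. The variants below are just illustrative; the real sheets had more.

```clojure
;; Sketch of the normalization step. The alias spellings are illustrative.
(require '[clojure.string :as str])

(def pollutant-aliases
  {"PM10"               :pm10
   "PM 10"              :pm10
   "pm10"               :pm10
   "NO2"                :no2
   "Biossido di azoto"  :no2       ; Italian for nitrogen dioxide
   "BENZENE"            :benzene
   "C6H6"               :benzene
   "IPA"                :pah       ; Italian acronym for PAH
   "B(a)P"              :benzo-a-pyrene})

(defn canonical-pollutant
  "Maps a raw acronym to its canonical keyword, failing loudly on unknowns."
  [raw]
  (or (get pollutant-aliases (str/trim raw))
      (throw (ex-info "Unknown pollutant acronym" {:raw raw}))))
```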

Also, the datetimes were in 4 different formats.

Datetimes were the last step I dealt with in my script.

I found a Clojure library wrapping Joda Time.

I made the script report the row number and the file name whenever it ran into a datetime it couldn't parse.

Every time it reported something, I added a new regular expression so that the new format could be handled.

Luckily there were just 4 formats.
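In code, the idea looks roughly like the sketch below, written with clj-time formatter patterns rather than raw regular expressions; the four pattern strings are stand-ins, not necessarily the exact formats in the Arpa sheets.

```clojure
;; Sketch of "try every known format, complain loudly otherwise", using
;; clj-time (the usual Joda Time wrapper). Pattern strings are stand-ins.
(require '[clj-time.format :as tf])

(def known-formats
  (map tf/formatter ["dd/MM/yyyy HH:mm"
                     "dd/MM/yyyy HH.mm"
                     "yyyy-MM-dd HH:mm"
                     "dd-MM-yyyy HH:mm"]))

(defn parse-datetime
  "Tries every known format; on total failure reports file and row so a new
  format can be added to the list."
  [s file row]
  (or (some #(try (tf/parse % s)
                  (catch IllegalArgumentException _ nil))
            known-formats)
      (throw (ex-info "Unparsable datetime" {:file file :row row :value s}))))
```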

Next time I should investigate some unmanaged clustering solution.

In the end I produced a 1.1 GB CSV file.

Now Tommaso is trying to turn it into RDF with OpenRefine, reconciling and establishing links.

He is using a university server, giving OpenRefine 20 GB of RAM.

And even so, every operation takes too long.

He broke the file into smaller parts, and presumably he will have to repeat the operation by hand for every piece.
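For what it's worth, splitting the big CSV could also be done mechanically rather than by hand: stream it line by line and write fixed-size chunks, each repeating the header. The chunk size and file naming below are arbitrary choices for the example.

```clojure
;; Sketch: split a huge CSV into fixed-size chunks, keeping the header line.
(require '[clojure.java.io :as io])

(defn split-csv!
  "Splits big-csv into chunks of n data rows each, named out-prefix-0.csv, ..."
  [big-csv out-prefix n]
  (with-open [rdr (io/reader big-csv)]
    (let [[header & rows] (line-seq rdr)]
      (doseq [[i chunk] (map-indexed vector (partition-all n rows))]
        (with-open [w (io/writer (str out-prefix "-" i ".csv"))]
          (.write w (str header "\n"))
          (doseq [line chunk]
            (.write w (str line "\n"))))))))

(comment
  (split-csv! "arpa-all.csv" "arpa-chunk" 500000))
```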

And what about when I want to integrate the stations I couldn't immediately locate?

And what about when I want to integrate the 2014 data that were published while I was working?

Will he have to do the whole thing all over again?

Fear not! I discovered Grafter!

If I understand correctly, it should allow me to extend my batch process to produce a complete RDF file (even with links to vocabularies/ontologies? I hope so).

It even processes .xls files, as far as I understand. So I could throw away all that CSV garbage I made. If this is true, a huge part of the hand-made process can go!

And if it can work with Onyx, it could be scaled to God knows what size ;-)

Now, this started as a quick and dirty script ;-)

I did all this in perfect solitude. Would anyone feel like lending a hand?