Building useful open data
… or at least what we think about it.
“Data is the new oil”
“A coherent data infrastructure should be a baseline condition for a healthy, progressive society, and a competitive global economy,” says the ODI.
There is no doubt that a lot of value resides in the gigantic amount of data generated every day. Many companies are already making the most of it, but the use of data to build better public infrastructure is still in its infancy.
The need for efficient open data
At Ants we believe in the need for open data. We believe that good use of it will benefit most people, so we work as much as possible towards making open data a widespread reality.
To do so, it is necessary to work with public administrations. But most public administrations are not used to working with data in general, and open data as it is published today is not efficient enough.
Here are some experiences and advice we want to share to help build useful open data.
This was a simple side project we made mainly to make a point: cross-platform applications can be easily developed if you use web technologies.
The web app (available for Bordeaux and Paris) is a simple map displaying the position of public toilets, with additional information such as the toilet type (useful when you don’t want to use it standing on your feet), the distance, time and itinerary from your position.
We used the JSON file from the city of Bordeaux open data website. Its data entries look like this:
"adresse": "Place Dormoy",
"quartier": "Bordeaux Sud",
"geometrie": "POINT (-0.564321470670066 44.8267129011554)",
We didn’t have any real issue with this data, which is simple and efficient. Nonetheless, it contains a lot of irrelevant information that probably comes from a bare export of the original dataset. This raises two issues:
Some fields (PartitionKey, RowKey, entityid) are not present in the corresponding CSV file, which is supposed to represent the same data.
Why is the data inconsistent between formats?
Here, the core information, the part that matters, is the same in both formats, so no harm is done. But in a more general context, such differences between formats call the trustworthiness of the data into question. Which file should be trusted?
A data infrastructure, open or not, should be as trustworthy and transparent as possible. This can be achieved, for instance, by being explicit and by documenting the processes by which the data was generated.
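The mismatch between the two formats can be detected automatically before publishing. The following is only a minimal sketch: it assumes the JSON export is a list of objects and the CSV export has a header row, and the sample values are placeholders, not real entries from the Bordeaux dataset. The function name `field_differences` is ours.

```python
import csv
import io
import json


def field_differences(json_text: str, csv_text: str) -> dict:
    """Report which field names appear in only one of the two exports."""
    json_fields = set()
    for record in json.loads(json_text):
        json_fields.update(record.keys())
    # The first CSV row is assumed to be the header.
    csv_fields = set(next(csv.reader(io.StringIO(csv_text))))
    return {
        "only_in_json": sorted(json_fields - csv_fields),
        "only_in_csv": sorted(csv_fields - json_fields),
    }


# Placeholder data that mimics the situation described above.
json_export = json.dumps([{
    "PartitionKey": "placeholder",
    "RowKey": "placeholder",
    "entityid": "placeholder",
    "adresse": "Place Dormoy",
    "quartier": "Bordeaux Sud",
    "geometrie": "POINT (-0.564321470670066 44.8267129011554)",
}])
csv_export = 'adresse,quartier,geometrie\n"Place Dormoy","Bordeaux Sud","POINT (-0.564321470670066 44.8267129011554)"\n'

differences = field_differences(json_export, csv_export)
print(differences)
```

A check like this only compares field names. Comparing the values row by row would be the natural next step for a producer who wants to guarantee that both files carry the same data.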
Like relevant data, irrelevant data has a weight. The three fields mentioned earlier (PartitionKey, RowKey, entityid) are one example, but if you look closer, the coordinates are given with up to 15 decimals, which is about 0.1 nm, more or less the size of an atom… Such precision is clearly irrelevant, all the more so when the same data is represented twice.
For a building, 7 or 8 decimals (about 1 cm and 1 mm precision respectively) are more than enough to represent its position accurately.
Why carry so much data weight for no reason?
The latitude 44.8267129011554 would shrink to 44.8267129 and the longitude -0.564321470670066 to -0.56432147, and no relevant information would be lost.
In this example, with about a hundred data points, a few bytes are lost for nothing… naaaah. Who cares?
But scale this up to much larger datasets, with maybe several million data points, and these few bytes become megabytes or gigabytes, leading to really annoying file processing issues. For basically nothing.
The data should be kept to its minimum without any loss of information.
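To make the weight argument concrete, here is a small sketch that trims the coordinates of a point written in the same "POINT (longitude latitude)" form as the geometrie field, and measures how many characters are saved. The function name `trim_point`, the choice of 7 decimals for both coordinates, and the ten-million-point dataset are our assumptions for illustration.

```python
import re

# Matches a point such as "POINT (-0.564321470670066 44.8267129011554)".
POINT_RE = re.compile(r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")


def trim_point(wkt: str, decimals: int = 7) -> str:
    """Return the same point with both coordinates rounded to `decimals` places."""
    match = POINT_RE.fullmatch(wkt.strip())
    if match is None:
        raise ValueError(f"not a point: {wkt!r}")
    lon, lat = (float(value) for value in match.groups())
    return f"POINT ({lon:.{decimals}f} {lat:.{decimals}f})"


original = "POINT (-0.564321470670066 44.8267129011554)"
trimmed = trim_point(original, decimals=7)

# Characters (bytes, in ASCII) saved for this single point.
saved_per_point = len(original) - len(trimmed)

# Rough estimate for a hypothetical dataset of ten million points.
saved_megabytes = saved_per_point * 10_000_000 / 1_000_000

print(trimmed, saved_per_point, saved_megabytes)
```

The saving per point is tiny, which is exactly the point made above: it only becomes painful once it is multiplied by millions of entries.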
During the summer of 2014 we had some fun playing with the brand new open dataset of 3D geometries of all buildings around the Bordeaux area. The dataset is available on the Bordeaux Metropole open data website, under Modélisation Agglo 3D.
This experimental project is now called City (works better on Chrome), because it is technically capable of displaying any city in a web browser, provided you have the 3D models.
We had some issues working with the original Bordeaux data, mainly related to the accessibility and quality of the data.
The Bordeaux region is urban, which means many building models. A single file containing all the building 3D models for the entire area would be way too large to be shared, especially if people don’t necessarily need the complete dataset. Think of an architect who only needs to model a single neighborhood.
The correct way to go, as Bordeaux Metropole did, is to split the data into areas that people can access according to their needs: a single area, multiple areas at once, all areas of a particular city...
For Bordeaux Metropole, the data is provided in about 500 areas. If you try to select all areas at once, well, you can’t, because the download service is limited to 500 MB and the whole lot is about 1.8 GB. So you need to go through the process several times to actually get all the data you want. And I’m not even talking about the textured data :/ (about 25 GB).
Why is access to the data arbitrarily limited?
People should be able to access easily all the data they need.
Without this limitation, the download service would be good enough. But regardless, it would be so much easier to query a simple API: less restrictive data limits, programmatic requests if we please, and the same download service (based on this very same API) if we don’t.
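In the meantime, a consumer has to split the roughly 500 areas into several requests that each stay under the download limit mentioned above. The sketch below shows one simple way to do that. The individual area sizes are not published in this article, so the equal sizes used here are purely hypothetical, as is the function name `batch_areas`.

```python
def batch_areas(sizes_kb, limit_kb):
    """Group consecutive areas so that each batch stays within the limit."""
    batches, current, current_size = [], [], 0
    for index, size in enumerate(sizes_kb):
        if size > limit_kb:
            raise ValueError(f"area {index} alone exceeds the limit")
        # Start a new batch when adding this area would exceed the limit.
        if current and current_size + size > limit_kb:
            batches.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical sizes: 500 areas of equal size adding up to 1.8 GB.
sizes_kb = [3_600] * 500
batches = batch_areas(sizes_kb, limit_kb=500_000)
print(len(batches), [len(batch) for batch in batches])
```

Under these assumptions at least four separate downloads are needed for the untextured data alone, which is work every single user has to repeat.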
Okay, so now we have downloaded some area data.
Each area is a rectangle of 7 by 5 tiles, each tile being a 200m x 200m square.
To keep track of the position of each piece in this jigsaw puzzle, all tiles have global coordinates (X and Y) and local coordinates (x and y). It is essential that these two coordinate systems are consistent with each other, so that all geometries can be reconstructed without getting lost.
Well, here, they’re not. The Y and y axes point in opposite directions, which leads to oddly placed buildings if, like most people, you didn’t expect it. The simplicity of the patch is not the point here: every user who downloads the dataset will have to figure out this problem and fix it. That’s too much time wasted not working on the interesting data.
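The kind of patch every user ends up writing looks roughly like the sketch below. It rests on assumptions the dataset documentation would have to confirm: that the global coordinates (X, Y) are tile indices, that local coordinates are in meters measured from a tile corner, and that only the y axis is inverted. The function name and the `flip_y` parameter are ours.

```python
TILE_SIZE_M = 200.0  # each tile is a 200 m x 200 m square


def local_to_global(tile_x, tile_y, x, y, flip_y=True):
    """Convert a point from tile-local coordinates to global coordinates.

    Assumes (tile_x, tile_y) are tile indices and (x, y) are meters
    measured from the tile origin, with the local y axis pointing
    opposite to the global Y axis when flip_y is True.
    """
    origin_x = tile_x * TILE_SIZE_M
    origin_y = tile_y * TILE_SIZE_M
    return origin_x + x, origin_y + (-y if flip_y else y)


point = local_to_global(2, 3, 10.0, 20.0)
print(point)
```

The code is trivial once the inversion is known. The cost lies in discovering it, which a single documented line in the dataset would have avoided.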
But the worst was yet to come.
Once the buildings were correctly placed, we found that large ugly gaps had opened between the tiles, in all three directions. For no apparent reason, but of course, there was one.
A tile is 200m x 200m. But actual buildings never follow that rule, which means a lot of 3D models would have to be cut into pieces in order to fit inside a tile. That is not desirable, and Bordeaux Metropole reasonably chose to keep all 3D models intact, even if some points were geometrically outside the tile.
This means that the tiles have theoretical square boundaries and actual boundaries. The problem is that the origin of each tile’s local coordinate system is the center of the actual boundaries instead of the center of the theoretical boundaries. So if a tile has a building that extends way outside the square (a bridge, for instance), the corresponding tile will be way off its true position.
We had to make some computations to correct the positions of all buildings in the horizontal plane, and to request additional information we were missing in order to correct the vertical position…
And finally we enjoyed our 3D Bordeaux, but only after a long, unexpected and unnecessary preprocessing step.
Why would a dataset not be usable as is?
No preprocessing should be necessary to use the data. It should be directly usable as is.
In case of missing data or errors in your original dataset, you should publish a fixed dataset along with some documentation about the fix, not some weird external patch nobody understands.
For some projects we needed to use data from the French National Institute of Statistics and Economic Studies, the INSEE.
The INSEE recently published a handful of open datasets about IRIS, which are small geographical units you can study without violating privacy.
While all the core data we needed was present, accessible, and of good quality, actually using this data was not so easy. This is why we created Open-Moulinette, to digest all the open data from INSEE so that it can be easily handled by most.
Some INSEE data includes geographical coordinates to locate points. These coordinates are expressed using a Lambert projection, because it is well suited to representing points in France accurately on a plane. But what if you need to use a third-party program that only accepts [latitude, longitude]? Or what if, for some reason, you want to apply another projection?
Why is some data only usable in a certain context?
Map projections are really useful because the world is in 3D, but paper and screens are in 2D. But we don’t believe projected coordinates should be used in open datasets. Instead, latitudes and longitudes should be used, because they are universal. If needed, some hints (or even the projection algorithm) on how to project these coordinates accurately can be provided for those who need projected points.
In general, all data should be published in a format that is as universal as possible. Context-specific alterations should be kept aside, or provided separately with documentation.
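To show what consumers currently have to do themselves, here is a sketch of the inverse conversion from projected coordinates back to longitude and latitude. It assumes the projection is Lambert-93 (EPSG:2154) on the GRS80 ellipsoid, which is our assumption since the datasets discussed here are only described as using a Lambert projection. It applies the standard inverse formulas for a Lambert conformal conic projection with two standard parallels. In practice a maintained library such as pyproj would be a safer choice than hand-written formulas.

```python
import math

# Assumed parameters: Lambert-93 (EPSG:2154) on the GRS80 ellipsoid.
SEMI_MAJOR = 6378137.0                 # meters
INV_FLATTENING = 298.257222101
ECC = math.sqrt(2 / INV_FLATTENING - 1 / INV_FLATTENING ** 2)
PHI1, PHI2 = math.radians(44.0), math.radians(49.0)   # standard parallels
PHI0, LAM0 = math.radians(46.5), math.radians(3.0)    # projection origin
X0, Y0 = 700000.0, 6600000.0           # false easting and northing


def _m(phi):
    return math.cos(phi) / math.sqrt(1 - (ECC * math.sin(phi)) ** 2)


def _t(phi):
    es = ECC * math.sin(phi)
    return math.tan(math.pi / 4 - phi / 2) / ((1 - es) / (1 + es)) ** (ECC / 2)


# Cone constant, scaling factor and radius at the origin latitude.
CONE = (math.log(_m(PHI1)) - math.log(_m(PHI2))) / (math.log(_t(PHI1)) - math.log(_t(PHI2)))
SCALE = _m(PHI1) / (CONE * _t(PHI1) ** CONE)
R0 = SEMI_MAJOR * SCALE * _t(PHI0) ** CONE


def lambert93_to_lonlat(x, y, iterations=10):
    """Convert projected (x, y) in meters to (longitude, latitude) in degrees."""
    dx, dy = x - X0, R0 - (y - Y0)
    r = math.copysign(math.hypot(dx, dy), CONE)
    t = (r / (SEMI_MAJOR * SCALE)) ** (1 / CONE)
    lon = math.atan2(dx, dy) / CONE + LAM0
    # The latitude has no closed form: iterate from a spherical first guess.
    phi = math.pi / 2 - 2 * math.atan(t)
    for _ in range(iterations):
        es = ECC * math.sin(phi)
        phi = math.pi / 2 - 2 * math.atan(t * ((1 - es) / (1 + es)) ** (ECC / 2))
    return math.degrees(lon), math.degrees(phi)


# The false origin must map back to the projection origin (3°E, 46.5°N).
lon, lat = lambert93_to_lonlat(700000.0, 6600000.0)
print(lon, lat)
```

Forty lines of geodesy just to obtain a [latitude, longitude] pair is precisely the kind of context-specific work that should not be pushed onto every data consumer.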
We needed to work with the IRIS geographical shapes. The datasets can be found here. This is what the core dataset structure looks like once you’ve unzipped it:
There are a lot of folders, subfolders, subsubfolders with strange names, and a lot of files that are probably not needed as such.
All these folders and files come directly from INSEE’s own work processes, and therefore might have a good reason to exist.
While this is ok, this dataset structure shouldn’t be released as it stands as open data. People are not interested in relics of INSEE work processes. People are interested in pure data that they can easily work with.
Files with extensions such as .xls, .dbf, .prj, .shp, … are not easy to work with, because you need to set up routines or use third-party software to access the data you are really interested in.
Why is the data structure complicated to work with?
The first thing we did before actually working on this dataset was to parse all these files into a format we found more useful in this situation: a single JSON file containing all the structured data. CSV is interesting too, depending on the language you’re using.
This is the purpose of Open-Moulinette: a set of routines whose role is to make open datasets easy to read and work with. As a bonus, we also added a tutorial for building a dashboard to explore the INSEE datasets, because it is so much easier to understand data when you can see it graphically.
We believe that data consumers shouldn’t have to write this kind of project. Data producers should not issue their data as it stands, but have their own Open-Moulinette-like routines to run before issuing the data.
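As an illustration of what such a routine can look like, here is a minimal sketch that merges several tables into a single JSON structure keyed by an identifier. It is not the actual Open-Moulinette code: the function name `merge_tables`, the column name `iris_code`, the table names and all values are hypothetical placeholders, and the inputs are assumed to be already available as CSV text.

```python
import csv
import io
import json
from collections import defaultdict


def merge_tables(tables: dict, key: str) -> dict:
    """Merge several CSV tables into one dict indexed by the `key` column."""
    merged = defaultdict(dict)
    for table_name, csv_text in tables.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            code = row.pop(key)
            # Keep each source table under its own name to avoid collisions.
            merged[code][table_name] = row
    return dict(merged)


# Placeholder tables, not real INSEE figures.
tables = {
    "population": "iris_code,total\nA0001,100\nA0002,250\n",
    "housing": "iris_code,dwellings\nA0001,40\n",
}
merged = merge_tables(tables, key="iris_code")
print(json.dumps(merged, indent=2))
```

Values are kept as strings here. A real routine would also convert types and document the meaning of each column, which is where most of the effort goes.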
Our tips to generate efficient open data
At Ants, we want to help public services to build better open data. Our experiences have shown there are ways to improve open data’s efficiency. The main idea is that data producers should build their datasets so that data consumers can focus their work on pure data analysis.
Here are some tips for data producers that we think could benefit everybody.
- Keep the different formats coherent
- Make your data, and the processes used to generate it, transparent
- Be explicit, document as much as possible
- Keep the core information to a minimum, without losing information.
- Provide easy access to all people
- Don’t apply restrictions on quantity
- Provide filters to fit people’s needs
- Use APIs as much as possible
- Provide complete and verified datasets
- Update your dataset when discrepancies are found
- Document the updates and corrections made
- Listen to users’ comments about your data
- Use universal standards and units as much as possible
- Publish well-organized datasets understandable by all
- Make your data structure easy to work with, don’t issue your data as it stands
- Use JSON, CSV files or provide open source tools to generate such files
- Don’t assume people will use the same tools as yours