Summary (tl;dr): Publishing open government data dissipates the mist around the business of the State and its tentacles, and enriches the dialogue between the administration and the citizens. However, today, this data is usually not described with machine-readable semantics, which makes it hard to perform large-scale search and to create value by crossing data sets.
The Semantic Web technologies come to the rescue: they enable the creation of worldwide identifiers for concepts (URIs), definitions with machine-readable semantics, and the storage and querying of data as a multidimensional graph. The result is a standard, semantics-driven open data platform.
Open government data 101
Summary: Publishing open government data dissipates the mist around the administration and enables a healthy dialogue between the giant state octopus and the small fish. This is called transparency.
This article is written with the digital ink of my passion for open data, a passion that has grown exponentially since the moment I became convinced that it is a fantastic tool to bridge the gap between citizens and their leaders. I think the growing lack of trust comes from the feeling that, although most of our leaders have been directly or indirectly chosen by our votes, many decisions are taken behind our backs. And if they are taken behind our backs, it must be because they do things they don’t want us to know about, right?
This lack of trust has led 1) citizens to progressively distance themselves from political life, and 2) politicians to assume that the great majority of voters don’t care much about what’s going on at the different decision levels of the State (local, regional, national). You end up with a dialogue that is deeply flawed by the lack of information shared by all parties, and tensions rise.
Increasing the quantity and the quality of the information describing what is happening in the country and in the administration is the foundation of a healthy dialogue. The United States, the United Kingdom and France are among the countries that lead the pack.
To illustrate the problem and its solution, let’s use recent events that took place in France. On September 20th 2014, the French Ministry of Budget published the distribution of the budget allocated to MPs (en) to subsidize municipalities and associations in their district in 2013. Until then, only the global amount and the amount per region had been published. Now, not only do we know how much each MP granted, but we also know the names of the municipalities and associations that benefited from the grants.
The press and citizen associations analysed this information (here (en) or here (en)), and soon potential conflicts of interest were highlighted. I assume the MPs will be more careful now that the citizens can watch where the public money goes. On the other hand, wise subsidies are likely to boost the popularity of the MPs who grant them. Or this grant system might simply be abolished, as it can be considered a violation of the separation of powers: the MPs (legislature) technically replace the municipality, the department and the region (executive) when they fund local associations.
In any case, the publication of the data laid the foundations of a dialogue based on indisputable facts.
Open government data is like publishing the details of the subsidies granted by the MPs, but at the scale of a country, region or city: it makes the elected representatives accountable for their actions and gives the citizens the information they need to base their judgement, and potentially their vote, on facts.
The value of the data
Summary: Administrations are sometimes reluctant to spend resources on open data publications, as the return on investment is not obvious. In spite of the immaturity of the commercial usage of open data, transparency, and thus the public debate, has quickly benefited from these publications.
I think the core value of data is created when it is manipulated in a way that unveils new insights that support decisions. And the more you cross data sets, the better the chances of making unprecedented findings.
With the boom of open data, discussions arose around the following question: what is the value of open data? This question is relevant at a time when ministries and other administrations are asked to publish quality data for free. Transparency is noble, but it doesn’t pay the salary of a data specialist who will ensure the published data is consistent and free of confidential or personal information.
Time will tell, but I believe that when manipulating open data, value is created when the analysis unveils brand new insights that lead to better decisions. Although the analysis of a single data set can be valuable, the most valuable data analyses and visualizations are those that cross various data sets together. This is because, by nature, a data set has a limited scope.
The scope is the number and the depth of the facets under which a certain thing is described. For instance, we could describe an MP using the following facets:
- personal: age, gender, relatives, contact details
- activity: votes on proposed laws, questions to the ministers, participation in debates, study groups, subsidies
- political: activity in their political party, previous memberships
- career: elections, appointments
- financial: expenses and revenue related to their position
- media: appearances in the media (press, TV, Web, books, documentaries, etc.)
- legal: involvement in legal cases
Let’s go back to the grants awarded by the French MPs in 2013. The data was published by the Ministry of Budget as a Web site with basic filter functions (en). The data set is a table that lists all the grants. For each grant, the following information is included:
- Name of the municipality or association that received the grant
- Name of the MP
- Political party of the MP
- District of the MP (thus the district where the beneficiary is located)
- Reference of the budget program
- Amount of the grant
- Purpose of the grant (e.g. purchasing X, fixing Y, running costs, etc.)
Publishing a data set with a specific scope is good, because it makes its purpose easy to identify. However, crossing this data with more data about the MPs enables a deeper analysis and, above all, stands a great chance of unveiling insights that had never been discovered before, as the two data sets used to be managed and used in isolation.
The association Regards citoyens (roughly “Citizen Views” in English) published an enriched version (en) of the data set released by the Ministry of Budget. Besides the facets that were included in the original data set, they have added the following ones:
- Link to the official Web page of the MP (example, en)
- Link to the Web page that aggregates information about the MP on the Web site of Regards citoyens (example, en)
- Link to the data previously collected by the association about the MP in XML format (example)
By including links to the content they have previously collected, the association greatly enlarges the scope of the original data set. Here is a sample:
- membership in study groups
- contact details
This aggregation adds value because:
- it enables the crossing of the data about grants with the data about the commissions the MP belongs to, extending the information about the commitments of the MP
- it bridges the gap between different pieces of information that compose the whole public information about MPs, saving time for future research.
The added value is not easy to measure in financial terms, especially so soon after the publication of the data. Moreover, the consumption of data published by the government is not common practice yet.
Published by humans for humans
Summary: If data crossing favours the discovery of new insights, it should be made easier: nowadays, locating government data sets that can be crossed is cumbersome, in spite of the deployment of faceted search engines. The problem is that the current metadata about the data sets is too superficial.
The initiative of enriching the data published by the government is noble, but this aggregation of data sets could be made easier.
Technically, enriching a data set means adding extra columns of data from data set 1 that describe a subject present in data set 2. In database terms, this subject is a foreign key: an identifier that is available in two different data sets and refers to the same thing.
When a new data set is published on most open data portals, some metadata is also published, such as the date of publication, the administration that publishes the data, or its domain (health, elections, etc.). This metadata enables the creation of faceted search engines (data.gouv.fr, data.gov.uk) that help perform accurate searches among the numerous data sets.
However, no metadata indicates what columns the data set contains.
Back to the French MPs: a possible way to enrich the original data set would be to add information about the associations that received grants. We would use the association name as a foreign key to link it to a data set containing data such as their yearly budget or their postal address. To locate this data on data.gouv.fr, I tried the following:
- type “associations” in the search engine
- since we are looking for tabular data, select CSV
- select “France” as the geographic scope
- explore the numerous data sets that are returned by the search
- do the same, but with the XLS format
It’s doable, but very time-consuming. Let alone asking someone with less experience in search to find the piece of data they’re looking for. With the increasing rate and volume of data published, the current metadata will show its core limitation:
The data sets are not linked together.
We have data sets that describe the same things, but that don’t “know” about each other. There is a solution to create metadata that actually connects the data sets together: describing their content with machine-readable semantics.
Machine-readable semantics to the rescue
Summary: To improve the search of data sets, the Semantic Web technologies bring a graph data model, a worldwide identification system and semantics. In this section, I show how they could boost data set crossing with a practical example.
Machine-readable: something that a machine can interpret in order to perform the relevant actions.
Semantics: meaning, the act of defining what something is.
What we need to boost our open data search is to associate meaning with the data contained in the data sets. Meaning appears when concepts are connected together with meaningful relationships. Great news: the W3C has a standard to express semantics: the Resource Description Framework (RDF).
To create and use meaning, RDF works in 3 steps:
- Assign unique worldwide identifiers to all meaningful things: URIs (e.g. http://dbpedia.org/resource/Mexico)
- Connect the things together with properties (e.g. http://purl.org/dc/terms/creator)
- Query the resulting graph with SPARQL
For instance, the data set about MPs could be semantically tagged with the RDF graph below:
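The graph could be written in Turtle notation as the minimal sketch below. The dcat:, dcterms:, foaf: and org: terms are standard vocabulary; ex:column, ex:contains and ex:Department are hypothetical terms invented for this illustration, since DCAT does not describe the columns of a data set.

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix org:     <http://www.w3.org/ns/org#> .
@prefix ex:      <http://example.org/columns#> .   # hypothetical column vocabulary

<#reserve2013> a dcat:Dataset ;
    dcterms:title "Réserve parlementaire 2013 de l’Assemblée nationale"@fr ;
    dcterms:publisher [ foaf:name "Minister of the Economy, Finances and Industry (France)"@en ] ;
    # one node per column, linked to the type of thing the column contains
    ex:column [ ex:contains foaf:Person ] ,
              [ ex:contains ex:Department ] ,
              [ ex:contains org:FormalOrganization ] .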
This graph means that:
- The data set is entitled “Réserve parlementaire 2013 de l’Assemblée nationale” in French (the @fr)
- The data set has at least three columns: one that contains persons, one that contains departments and one that contains formal organizations (I have left the labels off the graph to save space)
- The data set was published by something called “Minister of the Economy, Finances and Industry (France)” in English (the @en)
If the data sets stored on data.gouv.fr were all tagged semantically and I wanted to add more information about the associations listed in the MPs grant data set, I would look for all the data sets that have a column that lists formal organizations.
The search interface would translate my search request in a SPARQL query that would look like this:
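A possible shape for that query, reusing the hypothetical ex: terms from the sketch above:

PREFIX dcat:    <http://www.w3.org/ns/dcat#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX org:     <http://www.w3.org/ns/org#>
PREFIX ex:      <http://example.org/columns#>   # hypothetical column vocabulary

SELECT ?dataset ?title
WHERE {
  ?dataset a dcat:Dataset ;
           dcterms:title ?title ;
           ex:column ?column .
  # keep only the data sets that have a column listing formal organizations
  ?column ex:contains org:FormalOrganization .
}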
The search results would tell me that besides the data set I already knew about, two more data sets have a column listing formal organizations, and are consequently good candidates for data enrichment.
You might think:
“So what? That can be done in xSQL/NoSQL/XQuery!”
Yes, but unless you also use URIs to identify things, the scope of your identifiers will be no bigger than your database. With RDF, the scope is the World Wide Web. With a bit of tweaking, if other open data portals used the same vocabularies, I could run this query on their repositories in a single shot, as sketched below. xSQL and NoSQL perform well, but they lack a common standard, and SQL remains bound to a tabular (two-dimensional) data model.
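For instance, assuming another portal exposed a SPARQL endpoint (the URL below is made up), the SERVICE keyword of SPARQL 1.1 would send the same pattern to the remote repository:

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX org:  <http://www.w3.org/ns/org#>
PREFIX ex:   <http://example.org/columns#>   # hypothetical column vocabulary

SELECT ?dataset
WHERE {
  # hypothetical SPARQL endpoint of another open data portal
  SERVICE <http://data.gov.uk/sparql> {
    ?dataset a dcat:Dataset ;
             ex:column [ ex:contains org:FormalOrganization ] .
  }
}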
When high availability is required, combining an RDF/SPARQL stack for expressiveness with a NoSQL stack for performance brings the best of both worlds.
Standard vocabularies already exist in RDF to describe data sets:
- The Data Catalog Vocabulary (DCAT)
- Dublin Core
- The Vocabulary of Interlinked Datasets (VoID) (relations with other data sets)
- The PROV Ontology (PROV-O) (provenance of the data: creation process, data source, involved agents, etc.)
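As a sketch of how these vocabularies fit together, here is how the enriched data set from Regards citoyens could declare its links and provenance; the URIs are illustrative:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# PROV-O: the enriched data set derives from the original publication
<#enrichedReserve2013> a dcat:Dataset, void:Dataset ;
    prov:wasDerivedFrom <#reserve2013> ;
    prov:wasAttributedTo [ a prov:Agent ; foaf:name "Regards citoyens" ] .

# VoID: a link set declaring that the two data sets are interlinked
<#links> a void:Linkset ;
    void:target <#enrichedReserve2013> , <#reserve2013> .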
A national data dictionary
If we want to tag the columns, we need to build a reference dictionary of the things that are described in the data at the scale of the country. The purpose is similar to the creation of a controlled vocabulary.
1. Extracting the column headers
As we need to start somewhere, we might as well take care of the data that we already have. We consequently extract the column headers of all the tabular data formats that we can parse. For each column header, we record the title of the data set, the URL of the data set, the name of the publisher, and some sample data. Example with the second CSV file published here:
- header: Parlementaire attributaire
- dataset_title: Réserve parlementaire 2013 publiée par Bercy
- dataset_url: https://www.data.gouv.fr/fr/datasets/reserve-parlementaire-2013-publiee-par-bercy/
- dataset_publisher: https://www.data.gouv.fr/fr/users/regards-citoyens/
- sample_data_1: Bernard Accoyer
- sample_data_2: Gilles Carrez
- sample_data_3: Gestion collective des sénateur SOC de la commission des finances
2. Defining what sort of data the column contains
The column above is interesting because it illustrates a common situation: it lists things that are not strictly homogeneous in nature. Here, in the column “Parlementaire attributaire” (en: granting MP), we find both MPs and collective funds where MPs can deposit a part of their budget.
The column consequently contains things that are part of the French Parliament and have the power to issue grants. These things can either be MPs or collective funds to which MPs contribute.
To tag the column semantically, we would link it to two types of things, “MP” and “Parliamentary collective fund”. This would result in the following RDF graph:
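In Turtle, the graph could be sketched as follows; ex:contains is the hypothetical term used earlier, and ex:entryProperty is another made-up property stating which characteristic of the thing each cell contains:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/columns#> .   # hypothetical column vocabulary

# anchor the two new classes in the FOAF vocabulary
ex:MP                          rdfs:subClassOf foaf:Person .
ex:ParliamentaryCollectiveFund rdfs:subClassOf foaf:Organization .

# the column contains instances of both classes,
# and each cell holds the foaf:name of its instance
<#columnA> ex:contains ex:MP , ex:ParliamentaryCollectiveFund ;
           ex:entryProperty foaf:name .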
The column object contains two types of things: MP and ParliamentaryCollectiveFund. To give them more semantics, I have declared them sub-classes of the popular classes Person and Organization from the FOAF vocabulary, respectively.
I have also stated that the instances of the classes MP and ParliamentaryCollectiveFund are related to each entry of the column through the property name (foaf:name). Otherwise, looking at the RDF, we wouldn’t know what the column entries say about their subject: their birth date? their Web site? their role?
Now that we have created types of things, we could write the semantics of the first column entry as RDF in Turtle notation, even if it goes beyond what we are trying to achieve:
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/columns#> .
<#columnAline1> rdf:type ex:MP ;
    rdf:type foaf:Person ;
    foaf:name "Bernard Accoyer" .
More importantly, with the right search interface, visitors could find this data set when looking for data sets that mention the names of MPs or parliamentary collective funds. It would also work for Person and Organization, but that would probably return too many results to be useful.
3. Creating a dictionary
So far, we have created objects that represent the columns and linked them to their data set and to one or more types and properties. However, we are missing the fuel required for actual semantics: the definitions of the types that we have created. For a decent share of them, a definition can hopefully be found either:
- in a glossary maintained by the publisher of the data (ideal)
- in a dictionary (careful with copyrights)
- in an existing ontology (possible need to translate it)
- in Wikipedia
For the types that miss a definition (usually the most specific ones), the best approach is to ask the publisher of the data to provide it.
4. Tagging the new data sets and maintaining the dictionary
When that’s done, you can enable a semi-automatic recognition of the columns for the newly published tabular data sets:
1. the column header is searched against the column labels already tagged (see the query sketch after this list)
2. if there is a match, the definitions of the types that are linked to the column are suggested to the publisher, who picks the one that matches (“What type of thing does the column describe?”). If none matches, they either select “I don’t know” or type a definition.
3. if the publisher picked a definition, they are also prompted to pick the characteristic of the type of thing that matches the content of the column (“What characteristic of the thing does the column contain?”).
4. They repeat for each column, and publish the data set.
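Step 1 could be implemented with a query like the sketch below, assuming the columns carry an rdfs:label and reusing the hypothetical ex: vocabulary:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/columns#>   # hypothetical column vocabulary

# find the types (and their definitions) already linked to columns
# whose label matches the header of the new column
SELECT DISTINCT ?type ?definition
WHERE {
  ?column rdfs:label ?label ;
          ex:contains ?type .
  OPTIONAL { ?type rdfs:comment ?definition }
  FILTER (lcase(str(?label)) = "parlementaire attributaire")
}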
The procedure can look a little intimidating, so the publisher should be informed of the benefits of this tagging for the reuse of their data set.
The columns that are not bound to a type upon publication are taken care of in a dedicated process.
The nirvana of data crossing
Why would we need to search for data sets that could be crossed in the first place? Why not have a side panel on each data set page listing the best candidate data sets for enrichment?
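Such a panel could be fed by a query along the lines of the sketch below, which ranks the other data sets by the number of column types they share with the one being viewed (again with the hypothetical ex: terms):

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX ex:   <http://example.org/columns#>   # hypothetical column vocabulary

# rank the other data sets by the number of column types
# they share with the data set being viewed
SELECT ?candidate (COUNT(DISTINCT ?type) AS ?sharedTypes)
WHERE {
  <#reserve2013> ex:column/ex:contains ?type .
  ?candidate a dcat:Dataset ;
             ex:column/ex:contains ?type .
  FILTER (?candidate != <#reserve2013>)
}
GROUP BY ?candidate
ORDER BY DESC(?sharedTypes)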
To infinity and beyond
In this article, I have only highlighted the benefits of the Semantic Web technologies in stimulating data crossing and creating more insightful data sets. The next step would be to introduce the local open data initiatives to these principles and create a national data set directory.
Then, the data itself would be converted to RDF in order to enable querying across multiple sources and types of data. We could answer questions such as the ones below (a sketch of the second one follows the list):
- Which ministers have never been elected and were born in a department that exported more wine than meat in 2012?
- What was the abstention rate for the last presidential elections of the cities that had an unemployment rate greater than 10%?
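As a purely illustrative sketch of the second question, assuming hypothetical elec: and stat: vocabularies for election results and employment statistics:

PREFIX elec: <http://example.org/elections#>   # hypothetical vocabulary
PREFIX stat: <http://example.org/statistics#>  # hypothetical vocabulary

SELECT ?city ?abstentionRate
WHERE {
  ?result elec:election <#lastPresidentialElection> ;
          elec:place ?city ;
          elec:abstentionRate ?abstentionRate .
  ?city stat:unemploymentRate ?unemployment .
  # rates assumed to be stored as decimals, e.g. 0.12 for 12%
  FILTER (?unemployment > 0.10)
}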