Data standards: What are they and why do they matter?

Krzysztof Madejski
TransparenCEE network
8 min read · Apr 14, 2016
Comic available at xkcd.com/1179/ under the Creative Commons Attribution-NonCommercial 2.5 License.

Data helps us better understand the reality we function in, informs fact-based policies and, when well analyzed, it allows us to see patterns, irregularities and intersections we would never think of. — Anna Kuliberda

That’s why most tech for transparency projects begin with data: we want to be better informed, so first of all we need the data. You’ve probably heard someone say “we will open these datasets” many times. This phrase hides all the complexity of a rather tricky process:

  1. data gathering (from published resources or freedom of information requests)
  2. data extraction (e.g. extracting text from PDFs or scanned documents)
  3. cleaning (e.g. Y, y, yes and “+” all stand for an affirmative answer; see the sketch after this list)
  4. structuring (in the simplest scenario, sorting data into columns).
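As an illustration of the cleaning step, here is a minimal sketch in Python, assuming a hypothetical “answer” column that mixes the spellings above:

# Normalize the many spellings of an affirmative answer to a boolean.
AFFIRMATIVE = {"y", "yes", "+"}

def clean_answer(raw: str) -> bool:
    return raw.strip().lower() in AFFIRMATIVE

rows = [{"answer": "Y"}, {"answer": "yes"}, {"answer": "+"}, {"answer": "no"}]
cleaned = [{**row, "answer": clean_answer(row["answer"])} for row in rows]
print(cleaned)  # the first three rows become True, the last one False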

In the end, we get open data that can be analyzed further to reach some conclusions or published for others to work on. And all of that is great! Well… that is until you want to cooperate with others working on similar datasets.

At TransparenCEE we’re building a common knowledge base, which we hope you’ll find useful and expand with your experiences.

It’s important for civil society…

Let’s say that, working on public procurement, you’ve created clear visualizations and somebody else has created a tool to flag suspicious tenders. You want to import your data into their solution and vice versa. However, if your datasets were opened independently, they probably have different formats and collaboration is not that simple. Your solutions don’t “speak” the same language. The problem grows bigger in international collaborations, where social and legal contexts differ.

Data standards are the answer to this problem. By supporting and using them, you ensure that your work will be easier for others to reuse. It’s like creating a piece that fits into a bigger puzzle. It may slightly raise costs during data opening, but it significantly brings down the cost of integrating with other solutions that ‘speak’ the same standard. In many cases, you can instantly use other solutions on your dataset.

In our series of analyses about data standards we will recommend the best ones for a given topic. Imagine that you want to analyse procurement data in your country or municipality. We will say: “Well, if you open data in this standard, you will then be able to easily deploy these open source tools to analyse it.”

It’s important for donors…

If you are a donor representative, then specifying standards in the grants you give will bring more value for your money and allow for better integration with other initiatives in the field.

It’s important for IT experts too!

And if you’re not an expert in the field (like public procurement) but an IT development leader, our research will list existing tools for reference, highlight challenges in data opening, explain how the data model maps to the real world in several existing deployments, put you in contact with people who have previously worked on a given standard and its tools, and, of course, provide you with links to specifications.

What is a data standard?

Think of the Master’s thesis that you have to write at university. It consists of a title, an abstract, the thesis itself and a bibliography. Oh, and it’s written as part of your curriculum and verified by other people, so you should specify the university, the faculty, the supervisor and the reviewer. As well as some audit logs: the date of creation, the last modification date, the date of acceptance by the reviewer (and his/her opinion) and the date of acceptance by the supervisor. Probably also a bit of translation for the global community: a title and an abstract in English if the thesis is in some other language.

Now let’s say that you want to create a tool to browse theses on any given subject. You need to gather a substantial number of theses and feed them into a computer. For that to happen you need to transform each thesis into a single data record.

Planning what this data record will look like is called modelling. We model real-life examples into data records. Modelling drops details deemed (somewhat arbitrarily) unimportant, like whether you published your thesis in paperback or hardcover, or the color of your supervisor’s hair (unless you want to analyze this factor). Apart from that, it’s mostly about specifying requirements by making decisions like “should the supervisor, reviewer and author be specified by giving a given name, family name, academic title and the university represented?”, “should the title and abstract be obligatory fields?” or “should dates include just the year, month and day, or is an hour necessary as well?” Stakeholder collaboration is essential in this phase, as different contexts need to be grasped. In one country an independent review may not be necessary, while in another you may need two reviewers.
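To make this concrete, here is a minimal sketch of such a model in Python, with hypothetical field names; agreeing on exactly these choices is what a standard does:

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Person:
    given_name: str
    family_name: str
    academic_title: Optional[str] = None  # decision: is the title obligatory?

@dataclass
class Thesis:
    title: str                         # decision: an obligatory field
    abstract: str
    author: Person
    supervisor: Person
    university: str
    date_of_final_accept: date         # decision: day precision, no hour
    reviewer: Optional[Person] = None  # decision: some contexts need no reviewer
    title_en: Optional[str] = None     # translation for the global community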

From modelling to representation and interoperability

Creating data standards is all about interoperability: the ability to exchange standardized data between systems owned by different parties. For that to happen one more step is required: representation, i.e. deciding which file formats to use, how to format dates (look at the last picture again), how to store images, etc. In the end, you can end up with the same information represented (or “serialized”) in several different file formats. The resulting files carry the same information; choosing one over another is mostly a matter of preference, and if you have the resources, all of them can be used in parallel.

Here are a few examples of the same content represented in some popular formats:

JSON (preferred by scripted solutions)

{
  "author": {"given_name": "Krzysztof", "family_name": "Madejski"},
  "title": "Data standards: What are they and why they matter",
  "date_of_final_accept": "2016-01-29"
}

CSV (anyone can view it in a spreadsheet, but embedding objects, e.g. the author inside the thesis, is not possible)

author_given_name, author_family_name, title, date_of_final_accept
Krzysztof, Madejski, Data standards: What are they and why they matter, 2016-01-29

XML (preferred by bigger institutions)

<thesis>
  <author>
    <given_name>Krzysztof</given_name>
    <family_name>Madejski</family_name>
  </author>
  <title>Data standards: What are they and why they matter</title>
  <date_of_final_accept>2016-01-29</date_of_final_accept>
</thesis>

Or any other format for which a so-called “serialization” is defined. These files can then be processed by computers.
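If it helps to see it in code, here is a minimal sketch in Python producing the JSON and CSV representations above from a single in-memory record (the field names follow the examples):

import csv, io, json

record = {
    "author": {"given_name": "Krzysztof", "family_name": "Madejski"},
    "title": "Data standards: What are they and why they matter",
    "date_of_final_accept": "2016-01-29",
}

print(json.dumps(record, indent=2))  # the JSON representation

# CSV cannot nest objects, so the author is flattened into two columns.
flat = {
    "author_given_name": record["author"]["given_name"],
    "author_family_name": record["author"]["family_name"],
    "title": record["title"],
    "date_of_final_accept": record["date_of_final_accept"],
}
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=flat.keys())
writer.writeheader()
writer.writerow(flat)
print(buffer.getvalue())  # the CSV representation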

Standardizing a Standard

Now, what if I create and announce such a standard as “Madejski Thesis Standard 1.0”? Well… most likely no one would care.

The power of the standard comes from the power of all the stakeholders using it. If it’s not really common then it isn’t really a standard.

The comic is published on xkcd.com/927/ under the Creative Commons Attribution-NonCommercial 2.5 License.

There is also one key element to standards: their openness. However, there is no single standard for what constitutes an open standard:

There are a number of definitions of open standards which emphasize different aspects of openness, including the openness of the resulting specification (is it published online? do you have to pay to get it?), the openness of the drafting process (who can propose changes? who decides?), and the ownership of rights to the standard. link

Coming from the internet community, we suggest using the World Wide Web Consortium’s definition, which stresses an open process of standards creation, transparency, relevance and royalty-free usage (you don’t have to pay to use it):

[…] we define the following set of requirements that a provider of technical specification must follow to qualify for the adjective Open Standard:

  • transparency (due process is public, and all technical discussions, meeting minutes, are archived and referenceable in decision making)
  • relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)
  • openness (anybody can participate, and everybody does: industry, individual, public, government bodies, academia, on a worldwide scale)
  • impartiality and consensus (guaranteed fairness by the process and the neutral hosting of the W3C organization, with equal weight for each participant)
  • availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Internet/Web technologies)
  • maintenance (ongoing process for testing, errata, revision, permanent access)

Investing in standards by civil society (mainly by using them, but also by participating in their development) should go in parallel with ensuring that the community has a voice on these standards. Please go through the definition above as a checklist before using any standard. When working on data that is not yet standardized, we propose that you involve other international stakeholders and create a W3C community group devoted to data standardization in the given field.

How to serve the data? The “API” keyword

Data opening is quite a costly process. When we’re doing it for a social cause, let’s make sure that apart from creating a tool operating on the data, we also publish the data itself. How do we do it?

One option is data export. You export all the data and publish it online as a file, and anyone can access it just by downloading it. When the data changes, you should set up automatic periodic exports (each month or day, depending on the data source).

That option works quite well when data is small in size and doesn’t change very often.
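A minimal sketch of such an export job in Python, where the hypothetical fetch_all_theses() stands in for a query against your own data source; you would run it periodically, e.g. from cron:

import json
from datetime import date

def fetch_all_theses():
    # Placeholder: in a real project this would query your database.
    return [{"title": "Data standards", "date_of_final_accept": "2016-01-29"}]

def export():
    # Write a dated snapshot file that can be published for download.
    filename = f"theses-{date.today().isoformat()}.json"
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(fetch_all_theses(), f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    export()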

The second common option is to serve data through a so-called API. An API is a piece of software with a detailed specification which acts as a socket through which data can be pulled by other computer programs. Think of it as a power socket: if your hair dryer has a matching plug, you just plug it in and electricity flows to the hair dryer, much like data flows to your data-driven computer program or website.

An API is a slightly more complicated option than data export, though if you’ve built your website on any modern web framework, you probably have an API with sufficient basic functionality built in. When you have a vast amount of data that changes quite frequently, it’s more efficient to publish an API rather than do data exports.
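A minimal sketch of such an endpoint, here using the Flask framework (an assumption; any web framework will do) and a hypothetical in-memory list in place of a database:

from flask import Flask, jsonify

app = Flask(__name__)

THESES = [
    {"title": "Data standards", "date_of_final_accept": "2016-01-29"},
]

@app.route("/api/theses")
def list_theses():
    # Other programs "plug in" here and pull the data as JSON.
    return jsonify(THESES)

if __name__ == "__main__":
    app.run()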

P.S. APIs should be standardized as well as the data they serve. Think of the power adapters you need to take with you when going to the UK from continental Europe. With software, these adapters cost even more than the plastic ones at Heathrow!

The UK power plug adapter by Adafruit Industries published under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.

What aspects of standards do we analyse?

As part of the TransparenCEE project, we will analyse and recommend data standards to be used in the tech for transparency field. For each standard we will mention:

  • open source tools that work with these standards, so you know what you can deploy to process your data, or what other projects can make use of your data;
  • coverage of the standard: who uses it, where and how. The bigger the coverage, the more established the standard;
  • contact details of people responsible for existing deployments, so you can consult them;
  • challenges in data modelling in existing projects (e.g. is this parliamentary body closer to a committee, a commission or a board? how was it modeled in countries with a similar parliamentary system?);
  • finally, what kind of data is covered (data types, data classes) and links to specifications.

Originally published at transparencee.org.
