Your Open Data is not open enough


Many governments have made concessions in opening access to their data under public pressure. These data stores hold data from different government branches: procurement documents, statistics, registers, and so on. This article is not about government transparency or Open Data censorship. It is about the quality and usefulness of public data in real applications, and my personal vision of what an Open Data future would look like.

The most comprehensive and mature interpretation of the Open Data concept is Open Knowledge’s definition, which specifies properties such as open licenses and formats. But does meeting it mean the data is high-quality and reliable enough to build your real-time applications and services on?

Definitely not.

To evaluate the current situation, I compared two Open Data portals: Ukraine’s and Berlin’s. These examples help define the vector of Open Data development and identify what truly “useful” Open Data should look like.

In 2015, Ukraine rapidly began to make up for lost years in government informatization, online presence, and transparency. As a result, a DKAN-based Open Data portal appeared. The portal is very illustrative of young Open Data initiatives and has the following flaws:

  1. Unstructured data
    Most of the data is distributed in formats such as DOC, PDF, and XLS. These formats are difficult to process programmatically, and there is a high probability that the next file from the same provider, for the same purpose, will have an entirely different structure. Moreover, the law even permits Adobe Flash SWF and FLV files, which are almost impossible to process.
  2. Out-of-date data
    Any data that is not collected from the source system at the time of the request is potentially stale. The vast majority of data on the site was uploaded manually, an approach that can never guarantee freshness. At no point can you be confident the data is up to date.
  3. Data redundancy
    The Ukrainian Open Data portal is a sort of governmental “Dropbox” for implementing the Open Data law. Almost every file there exists in at least two copies: one on the storage server and one in the place where it was generated (e.g., an internal application’s database) or used.
  4. Lack of data scheme
    Even structured data formats need their structure defined. It is not enough to generate an XML file: without a schema, the integrity of the data cannot be verified, and it is hard to react to changes in it.
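The missing-schema problem is concrete: without a published XSD, every integrity rule has to be re-implemented in each consumer’s code. A minimal sketch of what that looks like (the dataset and all field names are hypothetical, not taken from the actual portal):

```python
# Manual integrity checks a consumer must hand-write when an XML dataset
# ships without any machine-readable schema. Fields are hypothetical.
import xml.etree.ElementTree as ET

payload = """
<tenders>
  <tender id="UA-2016-001">
    <title>Road repair</title>
    <amount currency="UAH">150000</amount>
  </tender>
</tenders>
"""

root = ET.fromstring(payload)
for tender in root.findall("tender"):
    # With an XSD, a validator would enforce all of this automatically:
    assert tender.get("id"), "tender is missing an id"
    assert tender.find("title") is not None, "tender is missing a title"
    amount = tender.find("amount")
    assert amount is not None and amount.text.isdigit(), "amount must be numeric"
print("all checks passed")
```

Every such rule silently breaks when the provider changes the file layout, which is exactly why a schema needs to travel with the data.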

It is challenging to use Ukrainian Open Data for data mining, since it is heavily unstructured. The lack of any freshness guarantee makes it completely unsuitable for real-time applications.

Germany, Berlin
Berlin’s CKAN-based Open Data platform appeared in 2011. As of this writing, the vast majority of its data is presented through data interfaces (REST endpoints) that provide structured JSON and XML in a documented message format.

The Berlin portal is much more mature and free of flaws 1–3 of the Ukrainian portal. However, even though the response message format has a constant, documented structure, the response body can contain arbitrary content, and there is no way to validate this data.

Nevertheless, it is a good example of how Open Data should be handled and delivered. Your applications can rely on this data with confidence that it is reliable, structured, and up to date.

Open Data is a great step toward ensuring openness and transparency in government. But it is useless if civic hackers cannot work with it. Do not concentrate only on making the data open; concentrate on how you deliver it as well.

Coming back to the vector of Open Data development, its end point is clearly visible, and I call it the concept of Open Data Streams:

Open Data Streams properties

Open Data Streams are decentralized, machine-readable Open Data characterized by the following properties:

  1. Decentralized data
    We need to move away from the concept of “the provider gives Open Data” to the concept of “the provider is open”. Each provider must open the application programming interfaces of its information systems, and it is the provider’s responsibility to ensure data integrity and freshness. Consequently, there is no need for a central data warehouse; instead, the portal becomes a catalog of information sources.
  2. RESTful
    It is fast, efficient, and straightforward. REST was created for exactly this: accessing information resources.
  3. Real-time data
    The data source must be an interface to the internal application that generates the data. The system must provide data that is up to date at the time of the request.
  4. Structured
    Truly open data is easy to handle. Therefore, data must be distributed exclusively in structured formats such as JSON, XML, or YAML.
  5. Documented
    The data structure must be documented. Structured form alone does not make the data self-explanatory. Offer your consumers documentation of the data structure and its individual properties.
  6. Validatable
    The data you provide must be accompanied by a machine-readable structure specification that allows consumers to verify the integrity and accuracy of published data. An XSD schema for XML or a JSON Schema for JSON is enough. This can be combined with #5, as many schema languages let you not only describe the structure itself but also document its properties.
  7. Versioned
    A good API is versioned. When you must change the data format, make sure you do not break the contract between data consumers and the old format. Breaking it could have a destructive effect on your consumers’ applications.
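Properties 4–6 can be sketched together as a consumer-side check. In a real stream the provider would publish a proper JSON Schema and consumers would use an off-the-shelf validator, so the hand-rolled spec, the endpoint path, and all field names below are hypothetical:

```python
# A hand-rolled stand-in for a machine-readable spec (property 6); a real
# provider would publish a JSON Schema document instead. Fields are invented.
import json

SPEC_V1 = {"station": str, "pm10": float, "measured_at": str}

def validate(record: dict, spec: dict) -> list:
    """Return a list of integrity problems; an empty list means the record is valid."""
    problems = [f"missing field: {k}" for k in spec if k not in record]
    problems += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in spec.items()
        if k in record and not isinstance(record[k], t)
    ]
    return problems

# Versioning (property 7) lives in the endpoint path, e.g. /v1/air-quality,
# so existing consumers keep working when a /v2/ format appears.
response_body = '{"station": "Berlin-Mitte", "pm10": 21.5, "measured_at": "2016-05-01T12:00:00Z"}'
record = json.loads(response_body)
assert validate(record, SPEC_V1) == []
```

The point of the machine-readable spec is that this entire `validate` function disappears from consumer code: one published schema replaces ad-hoc checks in every application.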

Open Data Streams are easily accessible and processable data, an integral part of the Government as a Service concept.

Open Data is the fuel for innovation in e-government. However, the data should not only be open; it must also be reliable, real-time, validatable, and easy to handle. That will let developers build on top of it and give it a new role in the development of modern society.