Open Data Toronto

Telling stories with data. For more content, check out the Knowledge Centre at open.toronto.ca

Towards a Data Quality Score in open data (part 1)

Why Open Data Toronto created a score to assess data quality and what it measures

7 min read · Jan 15, 2020


As proud nerds, we have been excited about data quality at Open Data Toronto for a while; so much so, in fact, that we created and recently released a Data Quality Score (DQS) to measure the quality of our catalogue.

In this post, I share why we want to measure quality and what the DQS is. Next month I will follow up with an article detailing how we created the DQS — including code and papers — for open data programs or teams facing similar challenges.

Data quality scores are now displayed on some datasets

Why quality > quantity (for us)

Traditionally, Open Data Toronto program performance has been tied to the number of datasets in the catalogue. Mayor Tory and City Council have used this metric to focus the program on growth since early 2014, resulting in the team increasing the catalogue size by over 190% in the last 5 years.

Today, however, catalogue size is less relevant primarily because it fails to measure progress towards the program’s vision of enabling anyone, anywhere, to improve life in Toronto with open data.

A more relevant metric helps measure user value, effectively shifting conversations around the open data catalogue from “how many?” to “how good?”. A focus on value has been a core tenet of the Open Data Master Plan (ODMP), which was co-developed with the community and received unanimous Council approval; in fact, it established “emphasizing quality over quantity” as a key strategic action for the program (Strategic Action 1c for those keeping score).

Catalogue size does not capture what matters

Although dataset count is easy to measure and understand, several issues with that metric have become increasingly salient as the program has matured:

  1. Users don’t view number of datasets as particularly useful. We heard this repeatedly from the community and in the ODMP public consultations.
  2. More datasets do not mean more value. Adding more data that users don’t want, can’t use, or that doesn’t help solve civic issues can actually diminish value.
  3. It skews publisher incentives. When performance is tied to publishing more data, efforts to improve data quality go unrecognized.

Quality is a better indicator of value and potential for impact

Measuring quality would provide a better indicator of how useful datasets are and their potential to be used for improving life in Toronto. This would also enable conversations and enhance accountability around aspects of the catalogue more relevant to the people who use it.

To understand why this is the case, imagine a catalogue with millions of datasets stored as files in proprietary formats, out of date, without metadata, empty, or with obscure attribute names… Hard for experts to use, let alone most people! Not exactly helpful in improving lives.

Now, picture the opposite: a catalogue with datasets that make you think of “good” data, e.g. timely, well described, complete, in various formats, accessible via APIs, etc. The potential of these datasets to be used for improving people’s lives is much greater.

High-quality data enables high-quality impact

The data quality frameworks we identified in our research did not assess quality in the way we were looking for. For example, the 5-star Open Data framework focuses primarily on linked data, and the ODI Open Data Certificate focuses on the context around the data (e.g. policies, access) rather than the data itself.

So we decided to create our own measure — the Data Quality Score, or DQS.

What the Data Quality Score is (and isn’t)

With the DQS, users get an idea of how good a dataset is without opening it, and publishers get pointers for improving their data.

Is created with feedback from other teams and the community

We assembled a Data Quality Working Group with a diverse membership (e.g. different types of users and technical abilities, including members from outside the organization) to provide advice and guidance while creating the score.

Is calculated from 5 dimensions and 8 metrics

The DQS is the result of 5 dimensions, i.e. factors that influence quality, which are scored individually and carry different weights in the overall score. We defined them with the working group after a review of academic and industry papers, then set their weights via a rank weighting method.

We then selected metrics that could be automated and made operational fairly quickly to score the dimensions, to focus on delivery and prevent “analysis paralysis”. This is by no means an exhaustive list, and these metrics paint an incomplete picture of their respective dimensions, but issues with them indicate that larger quality problems are more likely.

The dimensions, their weights, and underlying metrics are listed below; a short sketch after the list illustrates how a few of these metrics might be computed:

1. Usability (38%): how easy is it to work with the data? Measured by 3 metrics:

  • Proportion of columns with meaningful names
  • Proportion of columns with a constant value
  • Proportion of valid features (for geospatial datasets)

2. Metadata (25%): is the data well described? Measured by percent of metadata fields that have been filled out by the publisher.

3. Freshness (18%): how close to creation is publication? Measured by the gap between the published refresh rate and the actual one (e.g. expected daily but last refreshed a week ago), and by the gap between the last refresh and today.

4. Completeness (12%): how much data is missing? Measured by proportion of empty cells in the dataset.

5. Accessibility (7%): is the data easy to access? Measured by whether the data can be accessed via the DataStore API — a freebie for the MVP, as it contains data from the DataStore only.
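
To make a few of these metrics concrete, here is a minimal sketch in Python, assuming a tabular dataset loaded into a pandas DataFrame; the helper names, column names, and exact formulas are illustrative assumptions, not the production implementation.

import pandas as pd

def constant_column_proportion(df: pd.DataFrame) -> float:
    # Usability metric: proportion of columns holding a single constant value (lower is better)
    if df.shape[1] == 0:
        return 0.0
    constant = sum(df[col].nunique(dropna=False) <= 1 for col in df.columns)
    return constant / df.shape[1]

def empty_cell_proportion(df: pd.DataFrame) -> float:
    # Completeness metric: proportion of empty cells in the dataset (lower is better)
    return df.isna().sum().sum() / df.size if df.size else 0.0

def days_since_refresh(last_refreshed: str) -> int:
    # Freshness metric: days elapsed between the last refresh and today
    return (pd.Timestamp.today() - pd.Timestamp(last_refreshed)).days

# Toy example
df = pd.DataFrame({
    "ward": ["Etobicoke", "Scarborough", None],
    "year": [2019, 2019, 2019],  # a constant column
    "count": [10, 12, 8],
})
print(constant_column_proportion(df))  # 1 of 3 columns is constant -> 0.33
print(empty_cell_proportion(df))       # 1 of 9 cells is empty -> 0.11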

In our tests, datasets generally scored about where we expected, with better datasets scoring higher and questionable ones scoring lower.

Is only for “Table” or “Map” datasets (stored in our database)

We store data in 2 ways: as individual files and in a database. In the portal, the former are tagged “Document” and the latter “Table” or “Map” (as a sidenote, I don’t think these are quite the right terms and will revisit them in the future, but they are what we have for now).

The score applies only to datasets in the database, as this allowed us to rely on features (such as APIs) and on a level of standardization that makes automation far easier and faster than working with individual files. Given how different files can be from one another (Excel, PDF, Shapefile, etc.), we had to leave them out of scope.
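
For DataStore-backed (“Table”/“Map”) datasets, access goes through CKAN’s standard datastore_search action. A minimal sketch, assuming the CKAN base URL below (worth verifying on open.toronto.ca) and a placeholder resource id:

import requests

# Base URL of the portal's CKAN instance (an assumption; confirm on open.toronto.ca)
CKAN_BASE = "https://ckan0.cf.opendata.inter.prod-toronto.ca"

def datastore_search(resource_id: str, limit: int = 5) -> dict:
    # Fetch a few records from a DataStore-backed resource via CKAN's datastore_search action
    response = requests.get(
        f"{CKAN_BASE}/api/3/action/datastore_search",
        params={"resource_id": resource_id, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["result"]

# "your-resource-id" is a placeholder; real resource ids are listed on each dataset page
records = datastore_search("your-resource-id")["records"]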

Is reported via Bronze/Silver/Gold medals

Instead of reporting the score as a percentage we opted for medals because, at this stage, getting the overall concept right is more important than the specific number. We want to start conversations at a high-level and then move into the details — once we ensure “Gold” is really better than “Bronze” we can look into whether a dataset should score 87% or 83%, for example.
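
Purely to illustrate the idea of reporting bands instead of percentages, here is a tiny sketch; the cutoffs are hypothetical, since the post does not publish the actual medal thresholds.

def medal(score: float) -> str:
    # Map a 0-1 DQS to a medal; these cutoffs are hypothetical, for illustration only
    if score >= 0.8:
        return "Gold"
    if score >= 0.6:
        return "Silver"
    return "Bronze"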

Is designed for tracking scores and models over time

We created the Catalogue quality scores dataset so we can track improvements to the catalogue over time.

For every dataset, it records when the scoring happened, the version of the model used, the DQS (both badge and score), and the underlying dimension scores.

The dataset also contains the scoring model (including method, weights, and model version) in a JSON file to track how the model evolves:

{
  "v0.1.0": {
    "aggregation_methods": {
      "metrics_to_dimension": "avg",
      "dimensions_to_score": "sum_and_reciprocal"
    },
    "dimensions": [
      { "name": "usability", "rank": 1, "weights": 0.37854889589905355 },
      { "name": "metadata", "rank": 2, "weights": 0.24605678233438483 },
      { "name": "freshness", "rank": 3, "weights": 0.17665615141955834 },
      { "name": "completeness", "rank": 4, "weights": 0.12302839116719241 },
      { "name": "accessibility", "rank": 5, "weights": 0.07570977917981071 }
    ]
  }
}
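
The weights above follow directly from the ranks. One reading of the “sum_and_reciprocal” aggregation that reproduces the v0.1.0 weights exactly is to give each rank a normalized rank-sum term plus a reciprocal-rank term and then normalize; the sketch below is my reconstruction of that method, together with an assumed weighted-sum roll-up, not the team’s actual code.

def rank_weights(n: int) -> list[float]:
    # Reconstructed "sum_and_reciprocal" weighting: each rank gets
    # (n + 1 - rank) / n  +  1 / rank, and the results are normalized to sum to 1
    raw = [(n + 1 - rank) / n + 1 / rank for rank in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def overall_score(dimension_scores: list[float], weights: list[float]) -> float:
    # Assumed roll-up: weighted sum of dimension scores (each in 0-1) into the DQS
    return sum(w * s for w, s in zip(weights, dimension_scores))

print(rank_weights(5))
# -> approximately [0.3785, 0.2461, 0.1767, 0.1230, 0.0757],
#    matching the v0.1.0 weights in the JSON above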

Is not a measure for how accurate data is — that’s for publishers

As important as correctness is to quality, the open data team is not equipped to assess the actual values, since we are not the experts in the data. Accuracy and similar dimensions (e.g. precision, redundancy) are best left in the hands of publishers, with their domain and data expertise.

Is not final and will (probably) never be

Analytical models are not “set it and forget it”: the world is always changing, and they need to be maintained to keep modelling it correctly. For example, dimensions or metrics that are significant today can change, or new ones can arise.

Where we go from here

There’s a long list of metrics to integrate and other ideas for improving the score; first, however, it has to be “out in the wild” so we can learn what works.

Additionally, now that we are collecting the data on quality of the catalogue, I expect us and others to begin to visualize and share what we learn.

I will share the details of our approach, including papers, methodologies, and code, in a follow-up story for others to leverage or (ideally!) build on. Stay tuned for that if your organization is facing similar challenges.

I would particularly love to hear how the DQS is working out for you (or not): do the expectations of a dataset set by the DQS match the reality after opening it?

Hope you found my first Medium post informative, enjoyable, and/or useful. I welcome feedback as there is no other way to get better.

You can reach me via the comments or by messaging me on Twitter. Also, don’t forget to follow Open Data Toronto on Twitter and subscribe to our monthly newsletter, the Open Data Update, in our portal.

Written by Carlos Hernandez