Towards a Data Quality Score in open data (part 2)

How the DQS was created: a walkthrough for organizations facing similar challenges

Carlos Hernandez
Published in Open Data Toronto · 13 min read · Feb 11, 2020


In my first story on Open Data Toronto’s Data Quality Score (DQS), I shared why data quality matters to us and what the DQS is at a high level. In this story, I walk through exactly how we created it, so it is more detailed and a little more technical. Read on if that’s your jam.

1. Established goals and scope for the DQS

I find iterative product development is as key in data as in other fields, given that many data science tasks (e.g. model tuning or feature engineering) can be an endless pursuit: that extra 1% of accuracy is not always worth the effort, and defining product increments helps establish the “good enough” point.

With that in mind, we set a Minimum Viable Product (MVP) as our delivery target: a version of the product with just enough features to start learning from users with minimum effort.

There were 5 goals for the MVP:

A. Integrate perspectives from outside the open data team

B. Provide a reasonable indication of quality without having to view the data

C. Create an automatic scoring and delivery mechanism

D. Enable us to begin capturing catalogue data quality trends over time

E. Share our DQS scoring method with the public on an ongoing basis

At the end of this, 4 goals were met. Goal C, automatic scoring, is halfway there: the scoring is done automatically via a script but, for the beta, it has to be run manually. This is temporary while we learn what works, what doesn’t, and make changes. Once it is stable, the script will go through the rigorous promotion process into production and be scheduled to run daily.

2. Assembled a diverse group to offer feedback and perspective

We brought together a Data Quality Working Group to help us navigate the ambiguous meaning of quality and validate our approach. It had a diverse membership: multiple teams, users with a range of comfort around data, both data producers and consumers, and an array of wide-ranging perspectives, including from outside our organization altogether.

We sought diversity, and the divergent thinking that comes with it, for two reasons. First, users are different from us (and it’s for the better that others have interests beyond data and technology!) and we needed exposure to different perspectives around data and quality. Second, to mitigate the dangers of groupthink and the biases more likely in a group of like-minded individuals.

They acted as our sounding board and proved invaluable by providing feedback and perspective throughout the process — and it enabled us to meet Goal A.

3. Researched what makes “good” or “bad” data

Dimensions used to measure data quality across industry (left) and academic (right) papers. Actual terms in the papers may differ; these were abstracted for comparison.

I was particularly interested in how to determine what makes data “high” or “low” quality. To this end, I reviewed various academic articles and industry white papers and identified 15 dimensions for measuring quality, including accuracy, completeness, metadata, and timeliness… as expected, basically, but it still provided a mental model and rigour moving forward.

Interpretations for each quality dimension identified in the research

It quickly became evident that quality depends on the degree to which data fits its intended purpose — hence why dimensions were different across papers. Metadata is important in open data, for instance, because most users are neither domain nor data experts; on the other hand, to data scientists tuning a real-time recommendation system in a production environment metadata is much less important than timeliness, or how quickly input data is received for processing.

4. Selected quality dimensions that fit the DQS

We then selected which dimensions were in scope, because Open Data could not assess all quality dimensions in a reliable fashion.

The score could be composed of 8 dimensions, because 2 were not applicable and 5 fall to publishers

As shown above, several dimensions were taken out of scope:

  • Accuracy, Coherence, Precision, Reliability, and Non-Redundancy: publishers have the domain expertise and knowledge to assess these dimensions, since they revolve around the actual data values. Taking these out was a difficult decision, but the Open Data team recognized our focus is opening and activating data; we are simply ill-equipped to assess these dimensions.
  • Credibility: we assumed all datasets are equally credible.
  • Relevance: this dimension is inherently subjective because it depends on who uses the data and why. However, not all datasets are equally important; in fact, the team had already captured five of the most relevant problems facing Toronto as “civic issues” to tag datasets and prioritize data requests.

Our ideal dimensions to score and aggregate to an overall DQS were then: Accessibility, Comparability, Machine Readability, Completeness, Granularity, Interpretability, Metadata and Timeliness.

With this many dimensions, I thought the DQS was at risk of becoming difficult to understand and sustain. That aside, they clearly could not all be measured so reality would help with pruning… as often happens in data projects.

5. Defined metrics to measure each dimension

To determine how to assess these dimensions, I identified performance factors for each dimension and, from there, defined feasible metric(s) to measure those factors. Metrics aggregate into dimension scores which, in turn, have individual weights toward the overall DQS.

This resulted in 7 dimensions (2 were merged), 18 factors, and 12 metrics. There are fewer metrics than factors because we could not conjure feasible metrics for every factor.

The most straightforward dimensions to measure were Accessibility, i.e. whether a dataset is available via API; Completeness, i.e. the proportion of missing values; and Freshness (formerly Timeliness), i.e. time durations such as collection to publication, target versus actual refresh, and time elapsed since the last refresh.
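
To make that concrete, here is a minimal sketch of how metrics like these could be computed; the column names, grace period, and linear decay are illustrative assumptions, not the production script.

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Proportion of non-missing cells, between 0 and 1."""
    return 1 - df.isna().to_numpy().mean()

def freshness(last_refreshed: pd.Timestamp, grace_days: int = 30) -> float:
    """Score time since the last refresh: 1 within the grace period,
    decaying linearly to 0 over the following year (illustrative cutoffs)."""
    days_stale = (pd.Timestamp.now() - last_refreshed).days
    return 1.0 if days_stale <= grace_days else max(0.0, 1 - (days_stale - grace_days) / 365)

parks = pd.DataFrame({"park": ["High Park", None], "ward": [4, 13]})
print(completeness(parks))                    # 0.75 (1 of 4 cells missing)
print(freshness(pd.Timestamp("2020-01-15")))  # depends on today's date
```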

Usability replaced Comparability and Machine Readability (it captures overall ease of use) and was challenging due to its inherent subjectivity. The most noteworthy factors:

  • Meaningful column names: obscure column names are one of the most frustrating issues when facing a new dataset; they often occur due to source system limitations or because publishers are so accustomed to them that they do not immediately see how confusing they can be to new users. As a metric, we thought of the portion of English words contained in the names (a rough sketch of this idea follows the list).
  • Ability to join with other datasets: data is most valuable when combined (e.g. parks data may tell you “where” you could go, and enhanced with weather data it could also tell you “when” you should go). In an ideal world, datasets are easy to join. Yet, due to the variety of source systems, data types, aggregation levels, field names, publishers, etc., a feasible metric eluded us for now.
  • Data shape: long data, for example a single “year” column instead of one per year, makes visualization easier. We had ideas for metrics but nothing truly feasible, and the same goes for JSON depth and nested fields.
  • Geospatial validity and slivers: geospatial data-specific factors and metrics
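
As promised above, a rough sketch of the “meaningful column names” metric; the tiny vocabulary is a stand-in for a proper English word list and the tokenization is deliberately naive.

```python
import re

# Stand-in vocabulary; a real implementation could use a full English word list.
ENGLISH_WORDS = {"park", "name", "ward", "date", "location", "address", "type"}

def name_usability(columns: list[str]) -> float:
    """Share of tokens in column names that are recognizable English words."""
    tokens = [t.lower() for col in columns for t in re.split(r"[^A-Za-z]+", col) if t]
    return sum(t in ENGLISH_WORDS for t in tokens) / len(tokens) if tokens else 0.0

print(name_usability(["park_name", "ward", "loc_cd"]))  # 3 of 5 tokens -> 0.6
```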

Granularity, despite being a relatively simple concept, proved difficult to assess programmatically because determining how “atomic” data is entails some subjectivity (e.g. the most atomic level may contain private information).

Metadata covers whether metadata fields are filled out (e.g. description, contact name), which is easy to measure as True/False, and the quality of the content in those fields, which is harder to measure and would need some degree of natural language processing as well as a training set.
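
The True/False part is simple enough to sketch; the required field names below are illustrative, not the exact metadata schema.

```python
# Illustrative required fields; the actual metadata schema may differ.
REQUIRED_FIELDS = ["description", "contact_name", "refresh_rate", "topics"]

def metadata_score(package: dict) -> float:
    """Share of required metadata fields that are filled in."""
    filled = [bool(package.get(field)) for field in REQUIRED_FIELDS]
    return sum(filled) / len(filled)

print(metadata_score({"description": "Park locations", "topics": "Parks"}))  # 0.5
```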

Interpretability is impacted by multiple factors including columns with a constant value, columns with only default values, and consistency in the data.
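
For instance, the constant-value factor can be approximated in a few lines; a sketch, assuming the data arrives as a pandas DataFrame.

```python
import pandas as pd

def constant_column_share(df: pd.DataFrame) -> float:
    """Share of columns holding a single constant value (lower is better)."""
    return sum(df[col].nunique(dropna=False) <= 1 for col in df.columns) / len(df.columns)

permits = pd.DataFrame({"city": ["Toronto", "Toronto"], "ward": [4, 13]})
print(constant_column_share(permits))  # 0.5
```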

6. Ranked dimensions for determining weights

Since dimensions would be weighted toward the overall DQS, those weights had to be assigned somehow. In no particular order, 3 principles guided how the weights were determined:

  1. Weighting methodology had to be easy to implement and explainable to a wide audience, so the approach could be shared transparently.
  2. Weights had to be determined with input from outside the Open Data team (indeed, this was a key contribution of the working group).
  3. We wanted to avoid asking for user preferences directly (as that often results in “it’s all important”). In essence, I wanted to force a choice to reflect the reality that there are tradeoffs and that dimensions are not equally significant.

We opted for a rank weighting method, where criteria are ranked by importance in descending order (1 being most important) and weights are based on the ranks, with higher ranks weighting more than lower ones.

Working group and Open Data team members ranked dimensions by importance. This survey is open to the public, for anyone who wants to take it as a starting point for their own.

After surveying the working group and our internal team, the dimension ranking result was:

  1. Interpretability
  2. Usability
  3. Metadata
  4. Freshness
  5. Granularity
  6. Completeness
  7. Accessibility

Finally, I decided to use equal weighting for how much metrics contribute to their specific dimension score, and to refine it after listening to users. The aim was to avoid getting caught in “analysis paralysis”, going back and forth on this endlessly without delivering. A starting point is better than not starting.

7. Selected metrics in scope for the MVP

Of the 12 metrics defined, 8 were selected for the MVP because they could be calculated programmatically fairly quickly. This meant removing Granularity, which had no metrics, and Interpretability from the MVP.

Interpretability had one metric, the percent of columns with a constant value; however, as the highest-ranked dimension it carried the largest weight in the DQS. That metric alone would have accounted for 32% of the score until other metrics were captured, which is an unreasonable weight. Instead of removing the metric, which is still a good indicator of overall quality, it was moved under Usability.

Important caveat on Freshness

The date used to calculate Freshness is “Last Refreshed”, which is when Open Data pulled from the source system, not when the data itself was last updated (often referred to as “Currency”). Take this dataset:

Last refreshed: 31/12/2010
Currency: 01/01/2020

The source data was updated as recently as January 2020, yet the portal last pulled it at the end of 2010, so the two dates tell very different stories about how fresh the dataset really is.

Ideally, we would use “Currency” to determine Freshness; however, it would have to be standardized to make scalability and automation of the pipeline feasible.

Alas, this is not possible because the open data catalogue contains such widely different datasets that they may or may not have a currency timestamp and, if they do, it will probably be under a different name… Plus, often the meaning of currency itself is muddled because time is a tricky concept.

For these reasons, we decided to use the Last Refreshed date as the closest proxy to currency. Although not exactly where we want to be, it is a step in that general direction.

8. Set dimension weights with Sum and Reciprocal rank weighting

Many algorithms can calculate weights from ranks, differing in how they distribute the weights: some favour higher-ranking criteria while others spread weights more evenly.

If interested, the algorithms tested are outlined in a paper comparing trade-offs in rank weighting methods: rank sum, rank reciprocal, sum and reciprocal, rank exponent, and rank order centroid.

After trying several rank weighting methods, we settled on Sum and Reciprocal, as it produced relatively balanced weights across criteria. Dimension weights were (finally!) set:

  • Usability: 38%
  • Metadata: 25%
  • Freshness: 18%
  • Completeness: 12%
  • Accessibility: 7%
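
For reference, here is a small sketch of the sum-and-reciprocal calculation as described in the Danielson and Ekenberg paper; applied to the five MVP dimensions it reproduces the weights above to within rounding (Accessibility comes out closer to 8% before rounding the set to sum to 100%).

```python
def sum_and_reciprocal_weights(ranked: list[str]) -> dict[str, float]:
    """Sum-and-reciprocal (SR) rank weighting: each criterion scores
    1/rank + (n + 1 - rank)/n, then scores are normalized to sum to 1."""
    n = len(ranked)
    raw = {name: 1 / rank + (n + 1 - rank) / n for rank, name in enumerate(ranked, start=1)}
    total = sum(raw.values())
    return {name: score / total for name, score in raw.items()}

ranking = ["Usability", "Metadata", "Freshness", "Completeness", "Accessibility"]
for name, weight in sum_and_reciprocal_weights(ranking).items():
    print(f"{name}: {weight:.0%}")  # 38%, 25%, 18%, 12%, 8%
```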

9. Finalized the DQS calculation

With all the pieces in place, the last item was calculating the DQS:

  1. Measure metric values. They will be between 0 and 1.
  2. Get raw dimension scores, which are the mean of their metrics
  3. Calculate dimension scores by multiplying their weight by their raw score
  4. Get raw DQ scores by summing dimension scores
  5. Scale raw DQ scores via min-max scaling to get the DQS
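
As an illustration of these five steps, here is a condensed sketch with made-up metric values; the metric names, the mapping to dimensions, and the subset of weights are assumptions for the example, not the actual catalogue.

```python
import pandas as pd

# Step 1: illustrative metric values (0-1) per dataset.
metrics = pd.DataFrame(
    {"usable_names": [0.9, 0.4, 0.7], "constant_cols": [1.0, 0.5, 0.8],
     "metadata_filled": [0.8, 0.3, 1.0], "refresh_lag": [1.0, 0.2, 0.6]},
    index=["parks", "permits", "zoning"])
dimension_of = {"usable_names": "Usability", "constant_cols": "Usability",
                "metadata_filled": "Metadata", "refresh_lag": "Freshness"}
weights = pd.Series({"Usability": 0.38, "Metadata": 0.25, "Freshness": 0.18})

# Step 2: raw dimension scores are the (equal-weighted) mean of their metrics.
dimensions = metrics.T.groupby(dimension_of).mean().T
# Steps 3-4: weight each dimension and sum into a raw DQ score per dataset.
raw_dqs = (dimensions * weights).sum(axis=1)
# Step 5: min-max scale across the catalogue to get the final DQS.
dqs = (raw_dqs - raw_dqs.min()) / (raw_dqs.max() - raw_dqs.min())
print(dqs)
```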

For detailed metric calculations, check out the script on GitHub, which automatically scores the datasets in scope: those in our database (data files are not rated). Automation is much easier for data in the database because it is standardized (whereas files can be quite different even when in the same format) and because, once in the database, the data can be accessed automatically through the CKAN Datastore API.
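
To give an idea of that last point, pulling datastore rows boils down to one call against CKAN’s standard action API; the base URL below is a placeholder rather than the actual portal endpoint, and the resource id is hypothetical.

```python
import requests

BASE_URL = "https://your-ckan-portal.example/api/3/action"  # placeholder, not the real endpoint

def fetch_records(resource_id: str, limit: int = 100) -> list[dict]:
    """Pull rows for a datastore-backed resource via CKAN's datastore_search action."""
    response = requests.get(f"{BASE_URL}/datastore_search",
                            params={"resource_id": resource_id, "limit": limit})
    response.raise_for_status()
    return response.json()["result"]["records"]

rows = fetch_records("your-resource-id")  # hypothetical resource id
```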

As mentioned in the first story, I wanted to stay away from reporting a number as the score because I felt doing so would distract from the main purpose of this MVP: to provide a reasonable indicator of quality (Goal B).

Reporting a number would make it easy to focus on the meaning of each percentage point, e.g. whether a dataset deserves a 75% or an 80%. That type of conversation would not have been very valuable: the score is far from perfect, but it didn’t have to be exact, just provide a general idea.

10. Created bronze, silver, and gold tiers

We binned the scores to avoid reporting a specific number. This makes communication easier and helps anchor conversations around broader data quality concepts instead of the meaning of percentage points; essentially, it allows one to see the forest for the trees.

We began with letter grades (A to F), since scores were roughly normally distributed, but that resulted in too many bins that were difficult to differentiate, and the traditional cutoffs were too punitive.

Reducing the number of bins allowed for better differentiation between them, with the added benefit that the Gold/Silver/Bronze model is broadly familiar. To set the boundaries, i.e. when a Bronze score becomes Silver or a Silver score becomes Gold, we followed a pragmatic approach:

  1. Experimenting with boundaries
  2. Confirming datasets fit in their bins, especially if they are at the edge
  3. Ensuring the distribution makes sense. Logic dictates Bronze would be largest in terms of numbers, Silver smaller, and so on
  4. Rinse and repeat until it makes (enough) sense

There’s a lot of art in data science. For instance, it is interpretation and imagination by data scientists and analysts that often give meaning to bins.

At the end of this, the score distribution and the dataset-to-tier assignments fit well overall, finally meeting Goal B: a general idea of the quality of a dataset without having to see it first! The catalogue was scored, and the final boundaries and counts were:

  1. Bronze (normalized score less than 60%): 34 datasets
  2. Silver (normalized score 60%–80%): 23 datasets
  3. Gold (normalized score over 80%): 11 datasets
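
In code, the binning itself is essentially a one-liner; a sketch with made-up normalized scores, glossing over how exact boundary values (a score of exactly 60% or 80%) are handled.

```python
import pandas as pd

scores = pd.Series({"parks": 0.91, "permits": 0.55, "zoning": 0.72})  # made-up normalized DQS
tiers = pd.cut(scores, bins=[0, 0.6, 0.8, 1.0],
               labels=["Bronze", "Silver", "Gold"], include_lowest=True)
print(tiers)  # parks -> Gold, permits -> Bronze, zoning -> Silver
```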

11. Captured results in a new dataset for full transparency

Our own dataset, as of the time of writing, scored Silver! This is because its refresh was way overdue (remember, for now the script has to be run manually… this won’t be an issue once it is scheduled). It pains me a bit and it’s a little embarrassing to admit, but I am just being open.

To meet our final goals, the ability to view catalogue data quality trends over time and to share the methodology behind the DQS, we released the Catalogue quality scores dataset, which contains two “resources” (CKAN terms… they just look like files in the portal):

  1. Scoring models (JSON) contains the version and parameters of the DQS model (e.g. rank weighting method, dimension ranks and weights, bins) to track how the model changes over time.
  2. Catalogue scorecard (CSV) contains the scores from every run of the DQS; new runs are simply appended to the end of the file. Interesting fact: if a dataset page loads and there is a score in the sidebar, it actually comes from here, because this data is in the datastore.

What’s next

There is still a lot of room for improvement in the DQS. It isn’t “finished” or “perfect”, and it probably never will be; it is a starting point for exploring the automated DQS concept in the open data space.

From our side, we are listening to feedback from users and the community at large to learn what works and to continue tweaking. On a more ambitious note, it would be splendid to come up with ideas for including the publisher dimensions, such as accuracy and precision, in the DQS, but that will not happen any time soon.

There’s actually a spin-off from this project! A tool to help the average person validate their data quickly. Although it has been start-and-stop, early versions showed a lot of promise for increasing efficiencies around some internal tasks. I will share it here, as well, once I can make the time for it.

It would be great to continue refining this in collaboration with other programs or teams. If you are interested or have questions/comments, reach me via the comments, DM me on Twitter, or email the team at opendata@toronto.ca.

If you like this article, don’t forget to give it a few claps; after all, it costs nothing.

Special thanks because I was not alone

To Ryan Garnett (Open Data manager at the time and data quality aficionado forever) for your championing, bringing together the working group, and valuable input.

To Yizhao Tan (data Swiss Army knife and web dev by necessity), for the endless sound-boarding, brainstorming, and helping level up the initial code (the POC code was for testing that metrics were measurable, not for readability!).

And of course the Open Data team at large for your support.

P.S. Don’t forget to follow Open Data Toronto on Twitter and subscribe to our monthly newsletter, the Open Data Update, in our portal.

References

Step #3: Researched what makes “good” or “bad” data

Industry sources

Academic sources

  • Bai, Lu, Rob Meredith, and Frada Burstein. “A data quality framework, method and tools for managing data quality in a health care setting: an action case study.” Journal of Decision Systems 27.sup1 (2018): 144–154.
  • Wang, Richard Y., and Diane M. Strong. “Beyond accuracy: What data quality means to data consumers.” Journal of management information systems 12.4 (1996): 5–33.

Step #8: Set dimension weights with Sum and Reciprocal rank weighting

  • Danielson, Mats, and Love Ekenberg. “Trade-offs for ordinal ranking methods in multi-criteria decisions.” International Conference on Group Decision and Negotiation. Springer, Cham, 2016.
