Facing the Flood: Assessing Metadata Quality on Washington’s Open Data Portal

Andrew Mckenna-Foster
Published in Open Data Literacy
Aug 25, 2019 · 6 min read

Washington State’s open data portal (data.wa.gov) was launched around 2009 and is now one of the largest and most well-established state portals in the country. It currently contains over 800 datasets covering 14 categories. Any of the state’s 197 agencies can publish to the portal, giving it a broad scope and making it a valuable resource for the state. However, there is no central quality control, and while this laissez-faire system is likely a driver of the portal’s success, it has also led to a proliferation of poor-quality metadata. That makes high-quality data harder to find and reduces the reusability of datasets on the portal in general. The focus of my Open Data Literacy internship this summer was understanding the extent of the problem.

This dataset has some of the available metadata filled out.

Consider this scenario. If you lived in Washington State and wanted to see some data on water quality tests, you might decide to bypass a Google search and head straight to data.wa.gov. A search for “water tests”, however, returns over 500 results, many of which are PDF documents concerning lead tests in school water systems. Those documents, while very important, barely meet the first of the five stars of open data. Being savvy, you filter the results to just datasets and get four results, one of which shows lead test data for all schools in the state. Not exactly what you were looking for, but still interesting. Closer inspection shows that the metadata for that dataset is partially complete, with a short description, the name of the owner, and the date updated. Those metadata are enough to convey what the data are about, but key pieces of information are still missing, such as how frequently the data are updated, what license the data are published under, and what each column in the dataset means (the data dictionary). These gaps might mean you still need to contact the dataset owner to get more information before using the data. As it turns out, despite missing those metadata elements, this dataset is one of the higher-quality datasets available on data.wa.gov.

Data.wa.gov is not alone in its metadata problems; quality issues plague open government data portals around the world. In 2016, Socrata, the company whose software powers data.wa.gov, surveyed developers on their perceptions of open government data (OGD) and found that over half thought metadata is inconsistent between datasets, data is not kept up to date, and data is not clean or accurate.

To address metadata quality issues on data.wa.gov, the WA Office of the Chief Information Officer (OCIO), which manages data.wa.gov, approached the Washington State Library (WSL) and the Open Data Literacy program for help. My internship this summer was the first step in a potential partnership between the OCIO and the WSL to curate data.wa.gov to make it more usable.

This project consisted of two components:

  1. Interview state agencies to understand data publishing behavior
  2. Assess the current state of metadata quality on the portal

Interviews

I interviewed eight agencies and one organization that uses data from the portal. Most agencies held positive views of data.wa.gov, and it became clear that every agency uses the portal to meet its own data needs. In fact, publishing behavior is only generalizable to the extent that it is unique to every agency. Agencies published data for:

  • Specific users — Example: Healthcare Provider Credential data for hospitals
  • Transparency — Example: State Art Collection locations
  • Internal or interagency use — Example: Salmon Recovery data

Rather than individual citizens, the primary users are other agencies, local governments, federal agencies, and third parties such as businesses, nonprofits, and the media. Some agencies know exactly who their main users are, while others can only guess. Seven of the eight interviewed publishers expect to continue or increase their publishing on the portal.

Publishing behavior is only generalizable to the extent that it is unique to every agency

What does this mean for a potential curator? The varying needs and publishing behaviors of the agencies suggest a curator will need to work closely with publishers to encourage better metadata practices. Any sweeping attempts to increase metadata quality on the portal will affect agencies in different ways and may produce unintended consequences.

Metadata Assessment

There are many ways to assess metadata quality (e.g., Kubler et al. 2018). I assessed quality by examining completeness and understandability, using a combination of variables common across several other studies.

While the OCIO provides excellent guidelines on open data best practices for data publishers, agencies are only required to fill out a dataset’s title on data.wa.gov. Perhaps for that reason, over half of the published datasets are missing 50% or more of the 11 available metadata elements, and about 19% of datasets have only a title. The least-often completed elements include license, posting frequency (how often the dataset is updated), period of time, originator, and metadata language (typically English).

[Figure: four charts of summary statistics of metadata existence and understandability on data.wa.gov]
Metadata existence and understandability on data.wa.gov for a sample of datasets from August 5, 2019. Over half of datasets fill out less than half of the available metadata elements; license and posting frequency are among the least filled out. About 40% of dataset titles are difficult to understand.
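
As a rough illustration of how this kind of completeness check can be automated, the sketch below pulls catalog records for data.wa.gov from Socrata’s public Discovery API and counts how many of a small set of metadata fields are non-empty for each dataset. The endpoint is real, but the field names and the short element list are my assumptions about the response layout and a simplification of the 11 elements assessed in this project, so treat it as a minimal sketch rather than the actual assessment script.

```python
import requests

# Socrata Discovery API: public catalog metadata for a given portal domain.
CATALOG_API = "https://api.us.socrata.com/api/catalog/v1"

def fetch_catalog(domain="data.wa.gov", limit=100):
    """Fetch up to `limit` catalog records for one portal domain."""
    resp = requests.get(CATALOG_API, params={"domains": domain, "limit": limit})
    resp.raise_for_status()
    return resp.json().get("results", [])

def completeness(record):
    """Fraction of a simplified element list that is filled out for one record."""
    resource = record.get("resource", {})
    classification = record.get("classification", {})
    elements = {
        "title": resource.get("name"),
        "description": resource.get("description"),
        "attribution": resource.get("attribution"),
        "category": classification.get("domain_category"),
        "license": record.get("metadata", {}).get("license"),
    }
    filled = sum(1 for value in elements.values() if value)
    return filled / len(elements)

if __name__ == "__main__":
    records = fetch_catalog()
    scores = [completeness(r) for r in records]
    sparse = sum(1 for s in scores if s < 0.5)
    print(f"{sparse} of {len(scores)} sampled records fill out fewer than half of the checked elements")
```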

It is clear that agencies fill out only a fraction of the metadata, so it would be helpful if the elements they do complete were the ones that provide the most useful information. I selected five elements that, if filled out properly, would allow a dataset to pass the CRAAP test: attribution (publishing agency), description, category, posting frequency, and license. The first three are filled out in some combination in about 60% of datasets, but 69% of datasets are missing posting frequency and 67% are missing license. A full 21% do not include any of these five elements.
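
Extending the sketch above, a check for these five core elements might look like the following. The posting-frequency lookup assumes the element is exposed as a custom domain metadata field; the key name used here is a placeholder, not necessarily the portal’s actual field name.

```python
CORE_ELEMENTS = ("attribution", "description", "category", "posting_frequency", "license")

def missing_core_elements(record):
    """Return which of the five core elements are absent from one catalog record."""
    resource = record.get("resource", {})
    classification = record.get("classification", {})
    # Custom portal fields (e.g., posting frequency) come through as key/value
    # pairs; "Update_Frequency" is a placeholder key, not a confirmed field name.
    custom = {f.get("key"): f.get("value") for f in classification.get("domain_metadata", [])}
    present = {
        "attribution": bool(resource.get("attribution")),
        "description": bool(resource.get("description")),
        "category": bool(classification.get("domain_category")),
        "posting_frequency": bool(custom.get("Update_Frequency")),
        "license": bool(record.get("metadata", {}).get("license")),
    }
    return [name for name in CORE_ELEMENTS if not present[name]]

records = fetch_catalog()
missing_counts = [len(missing_core_elements(r)) for r in records]
print(f"{sum(1 for n in missing_counts if n == 5)} records are missing all five core elements")
```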

Filling in some metadata is a start, but if the information an agency includes is incomprehensible, that is just as bad as not filling it out in the first place. I looked at a sample of 112 datasets to assess the understandability of their metadata, scoring information that was confusing or incorrect as difficult to understand. I found that 40% of dataset titles are enigmatic. Half of the sampled datasets had confusing temporal or spatial information, often both. Only 25% of datasets include a data dictionary (column descriptions) that helps users understand the data itself.

Curation efforts should focus on areas that would most efficiently improve the overall quality of data on data.wa.gov

The search experience and usability of data.wa.gov would be much improved if every dataset had, at a minimum, all five core elements completed with understandable information. At the moment, only 7% of datasets meet all of these criteria, while 48% are missing information, or have enigmatic information, in just one or two core elements. A curator could efficiently address a significant portion of the portal’s metadata needs by focusing on this 48%, as the sketch below illustrates.
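
Building on the earlier sketches, a curator could generate a first-pass version of that priority list directly from the catalog records. This only checks whether the core elements are present, not whether they are understandable, so it is a starting point rather than a substitute for manual review; the permalink field is assumed to be part of each catalog record.

```python
# Hypothetical curation worklist: datasets missing one or two core elements.
records = fetch_catalog(limit=1000)
worklist = []
for record in records:
    missing = missing_core_elements(record)
    if 1 <= len(missing) <= 2:
        worklist.append((record.get("permalink"), missing))

# Put the datasets closest to complete first, then print a short sample.
worklist.sort(key=lambda item: len(item[1]))
for link, missing in worklist[:10]:
    print(link, "missing:", ", ".join(missing))
```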

I also noted other curation needs beyond metadata quality. At least 13% of sampled datasets contained the same data as one or more other datasets, just from a different time period; datasets in a time series should be combined into one dataset that is regularly updated. About 8% of datasets are test or dummy datasets that should be unpublished or removed, and 3% do not meet the OCIO’s definition of data and are candidates for removal.

Recommendations

The takeaway from this work is that data.wa.gov is an important resource for state agencies, but more than half of its datasets need metadata improvements or other curation work. It is tempting to compare the portal to a centrally managed library collection of books in need of development and weeding. However, the comparison does not quite hold, because a curator of the portal cannot control what gets published or removed. Instead, a curator will have to work closely with state agencies to collaboratively fix and update metadata.

Here are specific recommendations to help guide a future curator:

Focus work on the five core metadata elements

  • Include controlled vocabularies for Data Provided By and Posting Frequency
  • Provide metadata explanations to help publishers fill out metadata elements
  • Work with publishers to fix existing metadata

Remove datasets when necessary

  • Implement a policy to identify datasets for removal
  • Create a procedure that removes datasets in a transparent way

Evaluate curation efforts

  • Run the assessment script available on the ODL GitHub repository several times a year to evaluate how curation efforts are affecting metadata quality

Next Steps

This project was a massive learning experience for me and an introduction to new challenges facing open data portals. There is much more behind this analysis, and it is all available on the ODL GitHub repository. I hope to continue working with WSL to study dataset removal policies in OGD portals, and I am looking forward to learning what WSL’s curation efforts teach us about improving metadata quality.
