The Complicated Problem of Dataset Removal

Published in Open Data Literacy

5 min read · Jul 20, 2019

Open government data portals can be a terrific resource for diverse public sector information in a structured format: water quality test results, licensing data for healthcare providers, lobbyist spending, et cetera. However, these datasets often vary widely in quality and utility. This is especially true where agencies publish directly to the portal without a central entity curating each dataset (i.e., unmediated deposits), which leads to lots of data that may or may not be what people are looking for. Although a decentralized approach to running an open data portal likely lowers costs and other barriers to data publishing, it also means that the curator of the portal needs to periodically remove datasets in order to maintain a more accessible collection for the public and other users. How does a curator document the removal process in a transparent way, and what factors need to be considered?

As an intern for the Open Data Literacy program with the University of Washington iSchool, I have been working with the Washington State Library to assess Washington’s open data portal and offer recommendations for curating this important resource. After seven years of growth, the Washington State open data portal is in need of curation, and one of the main tasks may be removing datasets. The Washington State Library, with its history of curating government documents and state-related information, is the perfect institution for the task.

Open data portals are flooded with low quality data and metadata

Washington’s open data portal is emblematic of the open government data movement as a whole. While hosting excellent data resources, open data portals also risk drowning in oceans of poor-quality data and metadata. For example, 50% of the tabular datasets on Washington’s open data portal have fewer than half of the available metadata fields filled in, and 40% have not been updated in at least four years. Portals around the world face similar quality issues. Two goals of the open data movement, setting appropriately ambitious timelines and meeting an ‘open by default’ paradigm, may be among the reasons for this. As the open data movement matures, it is time to consider how curators can help increase the quality of these resources.
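To make the two figures above concrete, here is a minimal sketch of how a curator might compute them from a metadata export. The field names in `METADATA_FIELDS`, the record layout, and the sample records are all assumptions for illustration; they are not the Washington portal's actual schema or data.

```python
from datetime import date

# Hypothetical metadata fields; a real portal defines its own schema.
METADATA_FIELDS = ["title", "description", "category", "tags",
                   "license", "update_frequency"]


def metadata_completeness(record):
    """Fraction of METADATA_FIELDS that hold a non-empty value."""
    filled = sum(1 for field in METADATA_FIELDS if record.get(field))
    return filled / len(METADATA_FIELDS)


def years_since_update(record, today):
    """Approximate number of years since the record was last updated."""
    return (today - record["last_updated"]).days / 365.25


def quality_summary(records, today, stale_years=4):
    """Share of records with sparse metadata and share that are stale."""
    total = len(records)
    sparse = sum(1 for r in records if metadata_completeness(r) < 0.5)
    stale = sum(1 for r in records
                if years_since_update(r, today) >= stale_years)
    return {"sparse_share": sparse / total, "stale_share": stale / total}


# Made-up sample records, for illustration only.
sample = [
    {"title": "Job Sustainment", "last_updated": date(2014, 6, 1)},
    {"title": "Water Quality Tests", "description": "Monthly results",
     "category": "Environment", "tags": ["water"],
     "license": "Public Domain", "update_frequency": "Monthly",
     "last_updated": date(2019, 7, 1)},
]
summary = quality_summary(sample, today=date(2019, 7, 20))
```

A summary like this only flags candidates for review. Whether a sparse or stale dataset should actually be removed is still a curatorial judgment.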

Varying data and metadata quality is a problem because it is difficult to find useful data amid hundreds of datasets with confusing titles, incomplete metadata, and untidy data structures. Removing low-quality datasets would mitigate this problem. Weeding datasets may not only strengthen the reputation of a data repository but also make it easier for users of any type to find data that is current and relevant to their needs. However, a clear and broadly applicable policy on the removal of datasets, as well as a system to track removals, has yet to emerge. This is not all that surprising, since one of the 10 Principles of Open Data is that data should be published with permanence in mind.

Possible Unintended Consequences

A well-thought-out removal policy is essential for navigating the possible unintended consequences of data removal. For example, in removing datasets, curators run the risk of cutting off sources of information that users have grown accustomed to using. There is also a concern that removing data from an open government portal could be seen as a step away from transparency, no matter how rational the decision to remove it.

Here is an example: a very small dataset of jobs and expenditure totals, uploaded five years ago with an opaque title, a confusing description, and no indication of when it might be updated, might seem like an obvious candidate for removal (Fig. 1). However, this dataset may populate some figures in a rarely used but important digital government document. Removing the dataset would break links in that document and could also be seen as an effort to obscure information. A curator planning to remove this dataset needs a policy that outlines a transparent and reversible process.

[Image: screenshot of the metadata for a dataset titled Job Sustainment, an old dataset with poor-quality metadata.]

*Figure 1. Removal of this dataset would improve overall data quality for the portal but may have unintended consequences.*

Curation Options

New York City may have one of the clearest, and one of the only publicly viewable, removal policies available. Their process works like this:

1. Data is identified for removal either by the dataset owner or portal curators. Candidates include data that do not meet the definition of data or are not updated and not analytically useful.
2. After confirmatory discussion with the dataset owner, the dataset is unpublished for three months and then deleted from the portal.
3. A small portion of that dataset’s metadata and the reason for removal is added to a dataset available on the portal.
4. The dataset may be archived or completely deleted.
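The tracking part of a process like this can be sketched as a small removal log. The example below is an illustration only, not New York City's implementation. The function names, the log fields, and the dataset values are all hypothetical. It records a few metadata fields plus the reason for removal, and computes the date after which a dataset unpublished today could be deleted.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` later, clamped to the month's last day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def plan_removal(metadata, reason, unpublished_on, hold_months=3):
    """Build a removal-log entry from a small subset of the metadata."""
    return {
        "dataset_id": metadata["id"],
        "title": metadata["title"],
        "owner": metadata["owner"],
        "reason": reason,
        "unpublished_on": unpublished_on.isoformat(),
        # Earliest date on which the dataset may be deleted.
        "delete_after": add_months(unpublished_on, hold_months).isoformat(),
    }


# Hypothetical dataset metadata, for illustration only.
entry = plan_removal(
    {"id": "abcd-1234", "title": "Job Sustainment",
     "owner": "Example Agency", "row_count": 12},
    reason="Not updated and not analytically useful",
    unpublished_on=date(2019, 11, 30),
)
```

Publishing entries like `entry` as their own dataset on the portal would keep the removal record publicly visible, in the spirit of step three.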

This is a thorough solution, but it requires a lot of human labor and only works when a portal is curated by a central department. Additionally, depending on the deletion process, deleted datasets may be very difficult or impossible to recover if they are ever needed again. Socrata and CKAN, two popular open government data portal platforms, both allow the deletion and recovery of datasets but neither provides a user-friendly tracking process.

An ideal solution for a decentralized portal would be an automated system that updates the metadata with a reason for deletion and compresses and archives each dataset upon deletion by the dataset owner. Text on the portal’s search results page could prompt users to rerun their search and include records of deleted datasets if a user felt that their original search did not provide expected results. A ‘removed’ dataset would still be findable and usable, but it would be clear that it was a legacy version.
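A rough sketch of what such a system might do is shown below, under several assumptions: datasets are single files on disk, metadata is a plain dictionary, and the `status`, `removal_reason`, and `archive_path` fields are invented for this example. It does not use the Socrata or CKAN APIs, and it leaves the original file in place rather than deleting it.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def archive_and_mark_removed(record, data_path, archive_dir, reason, removed_on):
    """Compress the data file into archive_dir and return updated metadata."""
    data_path = Path(data_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / (data_path.name + ".gz")
    # Stream the file through gzip so large datasets are not read into memory.
    with open(data_path, "rb") as src, gzip.open(archive_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    updated = dict(record)  # copy, so the caller's record is not mutated
    updated.update({
        "status": "removed",
        "removal_reason": reason,
        "removed_on": removed_on.isoformat(),
        "archive_path": str(archive_path),
    })
    return updated


def search(records, term, include_removed=False):
    """Title search that hides removed datasets unless asked to include them."""
    term = term.lower()
    return [
        r for r in records
        if term in r["title"].lower()
        and (include_removed or r.get("status") != "removed")
    ]
```

A search page built on something like `search` could rerun a query with `include_removed=True` when a user is not satisfied with the first set of results, and label any removed datasets as legacy versions.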

Until an automated, user-focused solution appears, portal curators will need to find ways to track removals in a transparent way. Developing a removal policy and removal procedures will likely be one of the core recommendations I make for how the State Library can help curate the Washington State open data portal.

Written by Andrew Mckenna-Foster