Enhancing the City of Austin’s Open Data With Digital Citations

Thomas Montgomery
Open Austin
Jan 22, 2019


With the continued public interest in open data and open government, challenges are becoming apparent for both publishers and users. People interested in open data range from curious citizens to community activists, academic researchers, skilled developers, and everyone in between. Often, the connection between community members and their data is complicated by several factors:

  • we don’t know that the resource exists.
  • we know that the resource exists but don’t know where to find it.
  • we know where to find it but can’t navigate the portal or interface.

Additionally, as the publishers and maintainers of open data, we have a real need to measure the value of the datasets being delivered to consumers.

These problems are not unique to open data, of course. Academia has also struggled to make research data more accessible to researchers, students, and the public alike while attempting to measure its utility. What is the value of a mountain of research and findings if it is isolated in disparate university databases or hidden away in dusty libraries? Towards the end of the 20th century, while the internet was still young, this problem of data accessibility was addressed in an ingenious and forward-thinking manner: a solution evolved called the Digital Object Identifier (DOI).

Digital Object Identifiers

The DOI system was developed to act as an index of data, publications, reports, and professional information across academia and government. By indexing individual assets on a global scale, researchers can more easily share, access, and cite important information. The index serves as a bridge between platforms, connecting each DOI to its data and describing the underlying information with metadata. By using the internet to build a universal public index of data, the system gives people an easy way to both search for and view relevant information.

Facing similar challenges here at the City of Austin, we are seeking to integrate the DOI information index model of data accessibility with our well-established open data portal. To see some examples of our initial DOIs, click here.

Example citation using a Digital Object Identifier:

Emergency Medical Services Department. (2014). EMS — Incidents by Month [Data set]. City Of Austin Texas Open Data Portal. https://doi.org/10.26000/001.000001

Altmetrics

City of Austin open data publishers, usually individual department liaisons, are often asked to publish more datasets while also improving the quality of their department's existing open data. However, publishers rarely receive feedback from open data consumers on the utility and usefulness of their department's published data, or on which unpublished data should be prioritized.

A recent City of Austin research study was launched to identify ways to measure the value of our open data. One of the top recommendations from the project team was to collect and track qualitative data in the form of use cases by asking data consumers to cite their use of the city's open data. Because we are moving forward with DOIs for citation and sharing, and because each citation contains a URL to the DOI, we can use the altmetrics framework to track usage. Altmetrics are a popular way to track research value in academia by indexing web pages and monitoring traffic and occurrences of links across the web. Here is an example of the results of an altmetrics implementation on a dataset:

[Screenshot: specific data about where and how the dataset was used]

Luckily for us, open data and citations can be integrated. DataCite, a non-profit organization that provides a framework for creating, maintaining, and finding DOIs online, and Socrata, the company that provides our open data hosting platform, both offer robust APIs begging to be linked together. Since the DOI system acts as an index, all that is needed is to prime it with an initial load of data and to refresh it periodically as changes occur.

In light of this information, the City of Austin’s open data program started a project last year seeking to investigate altmetrics powered by DOIs. By developing automation to integrate the DataCite/Socrata web APIs and adopting altmetrics, we hope to offer the public an additional, easily accessible platform for finding and using open data while also supplying useful information about use cases.

Gettin’ Technical: Connecting Socrata and DataCite

When presented with the chance to integrate the DOI citation model into the city's open data program, we quickly recognized a great opportunity and, at the same time, a new technical challenge.

The City of Austin open data portal has thousands of assets - the term Socrata uses for datasets, maps, and visualizations of public data. The DataCite management software allows for the manual creation of individual DOIs and provides a useful web API, but for logistical reasons does not offer a user-friendly bulk creation method. Because the DOI model was designed to be agnostic to the flavor of data being cited (public, private, academic, and beyond), there is no single pattern of input to the system.

As most of us know all too well, manually managing thousands of individual things can turn into a fiasco very quickly. (╯°□°)╯︵ ┻━┻ Compounded with the fact that we are a large and sometimes bureaucratic organization, this outcome could be decidedly sub-optimal.

We realized that to meaningfully use the DOI citation system at the scale of the City of Austin, we would require a programmatic integration of the two APIs. For its simplicity, and because it matched our existing skill set, we decided to use Python for this integration. We would also need the ability to create DOIs manually, so as not to rely entirely on automation, which has been shown to sometimes cause unintended consequences.

So, we began to explore the integration of the two services. The web requests were rather straightforward. First, we needed to query Socrata for the metadata required to create DOIs on DataCite. This was done by building a list with the requests Python library and our publicly available asset list from the Socrata API:
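A minimal sketch of that query, assuming Socrata's public Discovery API endpoint and illustrative field names:

```python
import requests

# Socrata's public Discovery API; the endpoint and response field names
# used here are assumptions for illustration.
CATALOG_URL = "https://api.us.socrata.com/api/catalog/v1"

def parse_assets(catalog_json):
    """Extract the metadata fields needed to build DOI records."""
    return [
        {
            "fourfour": r["resource"]["id"],        # Socrata's unique 4x4 id
            "name": r["resource"]["name"],
            "description": r["resource"].get("description", ""),
        }
        for r in catalog_json.get("results", [])
    ]

def fetch_socrata_assets(domain="data.austintexas.gov", limit=100):
    """Query the catalog for public assets on the given domain."""
    resp = requests.get(CATALOG_URL, params={"domains": domain, "limit": limit})
    resp.raise_for_status()
    return parse_assets(resp.json())
```

Separating the parsing from the request keeps the transformation easy to test without hitting the network.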

Since our data structures are relatively small, they can easily be saved to a directory by the application as three simple JSON files:

  • doi_assets.json — List of published DOI assets in DataCite including 4x4 and DOI values.
  • socrata_assets.json — List of all existing datasets on Socrata with metadata.
  • departments.json — List of City of Austin departments publishing to Socrata and their assigned DOI value blocks.

Socrata offers a useful key value for all its assets, referred to as the four-by-four (4x4). This gives each asset a unique value that can be used during the integration to determine relationships. For example, the Austin Police Department Hate Crimes 2017 dataset uses the 4x4 identifier 79qh-wdpx. The schema for our JSON tables contains the 4x4 for easy relationship handling between the two interfaces.
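To make the relationships concrete, here is a sketch of what records in those JSON files could look like. The field names are illustrative, and the 4x4 shown alongside the EMS DOI is hypothetical:

```python
import json

# Illustrative record shapes; field names and the 4x4 "abcd-1234" are
# assumptions for the example. The DOI and department block come from
# the citation example above.
doi_asset = {
    "fourfour": "abcd-1234",          # hypothetical 4x4 of the EMS dataset
    "doi": "10.26000/001.000001",     # DOI assigned via DataCite
}
department = {
    "name": "Emergency Medical Services",
    "doi_block": "001",               # department's assigned DOI block
}
# Each file on disk is simply a JSON list of such records.
print(json.dumps([doi_asset], indent=2))
```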

For publishing to the DataCite REST API, we would need to construct a payload from Socrata's asset metadata consisting of a base64-encoded XML document following the metadata schema defined by DataCite. We also needed to include a static URL for the Socrata asset to resolve clicks, and the DOI number:

[Screenshot: DataCite payload]
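A sketch of how such a payload might be assembled, assuming DataCite's REST API conventions, where the base64-encoded metadata XML travels in the payload alongside the resolver URL and the DOI:

```python
import base64

def build_datacite_payload(doi, xml_metadata, asset_url):
    """Assemble a DataCite payload: the metadata XML is base64-encoded,
    and the URL tells DataCite where clicks on the DOI should resolve."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "url": asset_url,   # static URL of the Socrata asset
                "xml": base64.b64encode(
                    xml_metadata.encode("utf-8")).decode("ascii"),
            },
        }
    }

# The resulting dict would then be POSTed to DataCite's REST API with the
# repository account's credentials.
```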

To manage an integration of this scale, we would also need to track a significant number of records: essentially twice the number of assets found on the open data portal.

We first explored using PostgreSQL as the data management mechanism. PostgreSQL offered a well-documented, open source, and robust database for tracking data between the two APIs. At the same time, a full-scale database came with the overhead costs of server deployment, maintenance, and expertise. We also did not foresee the need to manage more than 6,000 or 7,000 records.

We decided to use the open source pandas library for the integration's data management for several reasons:

  • the relatively small scale of records to be tracked (less than 10,000)
  • no need to deploy a server, lightweight framework
  • well documented, open source, and simple to implement

An example of the simplicity of pandas can be seen in the commit diff below. Loading a temporary table for comparison can be narrowed down to a couple of lines of code in pandas, while in PostgreSQL it takes several cursor executions and queries:

[Screenshot: PostgreSQL on the left, pandas on the right]
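In pandas, that comparison really can be a single merge. A sketch, assuming the JSON files described earlier share a fourfour column (function and column names are ours):

```python
import pandas as pd

def find_new_assets(socrata_assets, doi_assets):
    """Return Socrata assets that don't yet have a DOI. A left merge on
    the 4x4 with indicator=True flags rows that exist only on the
    Socrata side ("left_only")."""
    socrata_df = pd.DataFrame(socrata_assets)
    doi_df = pd.DataFrame(doi_assets)
    merged = socrata_df.merge(doi_df, on="fourfour", how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```

The same logic in SQL would need a staging table, an outer join, and a NULL check across several cursor executions.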

As Socrata assets are periodically updated by department liaisons, names, descriptions and other metadata values can change. So we also needed a framework to schedule periodic updates to DataCite. This was done by using the same payload structure as above, but specifying an already-existing DOI value and new metadata:

[Screenshot: updating an existing DOI]
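A sketch of the update step, again assuming DataCite's REST API conventions (the helper names and endpoint are assumptions):

```python
import base64
import requests

DATACITE_API = "https://api.datacite.org/dois"  # assumed REST endpoint

def build_update_payload(doi, xml_metadata, asset_url):
    """Same shape as the creation payload: new metadata, existing DOI."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "url": asset_url,
                "xml": base64.b64encode(
                    xml_metadata.encode("utf-8")).decode("ascii"),
            },
        }
    }

def update_doi(doi, xml_metadata, asset_url, auth):
    """PUT the refreshed metadata to the DOI's own endpoint.
    auth is the repository account's (username, password) pair."""
    resp = requests.put(f"{DATACITE_API}/{doi}",
                        json=build_update_payload(doi, xml_metadata, asset_url),
                        auth=auth)
    resp.raise_for_status()
    return resp.json()
```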

Because the city's open data program is organized by department, we agreed that the best way to manage individual DOI numbers would be to group them by department. The City of Austin was also assigned the DOI prefix value 10.26000, so all city DOIs contain this value.

For example, Austin Resource Recovery would own all DOIs using the prefix 10.26000/002, Emergency Medical Services would have 10.26000/001, and so on. From there, the DOI suffix values would iterate upwards from .000001. So the first EMS DOI would look like 10.26000/001.000001, the second 10.26000/001.000002, etc. So, we needed code to calculate the DOI value by reading from the department and doi_assets JSONs:

[Screenshot: finding existing and creating new DOIs]
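A sketch of that calculation, assuming doi_assets records carry a doi field in the prefix/block.suffix form described above:

```python
def next_doi(department_block, doi_assets, prefix="10.26000"):
    """Compute the next DOI in a department's block: find the highest
    existing suffix for that block and increment it. Suffixes are
    zero-padded to six digits and start at .000001."""
    suffixes = [
        int(asset["doi"].rsplit(".", 1)[1])
        for asset in doi_assets
        if asset["doi"].startswith(f"{prefix}/{department_block}.")
    ]
    next_suffix = (max(suffixes) + 1) if suffixes else 1
    return f"{prefix}/{department_block}.{next_suffix:06d}"
```

For example, if EMS (block 001) already owns 10.26000/001.000001 and .000002, the next EMS DOI comes out as 10.26000/001.000003.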

Finally, we also needed to update the custom metadata field we created on Socrata so that people can easily find the DOI from the open data portal. This was done using a PATCH request on the Socrata metadata API:

[Screenshot: updating custom metadata in Socrata]
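A sketch of that request. The metadata API path follows Socrata's conventions, and the custom fieldset and field names are hypothetical; a real portal would use whatever custom field it defines:

```python
import requests

def build_doi_patch(doi, fieldset="Citation", field="DOI"):
    """Body for the metadata PATCH; the fieldset/field names here are
    hypothetical placeholders for the portal's custom field."""
    return {"customFields": {fieldset: {field: doi}}}

def set_doi_field(domain, fourfour, doi, auth):
    """PATCH the DOI into the asset's custom metadata so it shows on the portal."""
    url = f"https://{domain}/api/views/metadata/v1/{fourfour}"
    resp = requests.patch(url, json=build_doi_patch(doi), auth=auth)
    resp.raise_for_status()
    return resp.json()
```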

Get Involved

Despite making considerable headway in automation, research, and governance, this project is still in the development phase. We hope to add more features and improvements to the code base, get input from the open data community on what you think is important, and expand our ability to track use cases of data using altmetrics. Head over to the public GitHub repository if you are interested in how the backend is being developed or would like to get involved.

You can also email me at thomas.montgomery@austintexas.gov if you have ideas, feedback, or would like more information.
