Opening Rural Library Data: Applying Coursework to the Field
Asotin County Library (ACL) has introduced an open data initiative that will provide public access to civic data collected by local government agencies. Library staff members are well suited to provide open data support to their communities by working as civic infomediaries. Despite the natural relationship between open data and public libraries, it remains uncommon for libraries to openly publish their own data. The scarcity of open library data can be attributed to demands on staff time, a perceived lack of data usage by external groups, or an overestimation of the skills required to process and publish data. This post discusses the methods, tools, and standards used to prepare and openly publish library data.
Libraries collect and use data for a variety of purposes. Circulation, holdings, and bibliographic data describe resources — physical and subject characteristics, location, authorship, and use history. Libraries also collect data around programming and other services, tracking desired audience, program themes, and attendance. Operational data include financial reports, salaries, funding, and facilities information. These domains inform internal activities like budget planning, collection development, staffing, and service offerings. However, each of these domains also has clear applications for external user groups.
I was able to apply the methods and tools covered in the data curation curriculum at the University of Washington iSchool during a summer internship with ACL and Open Data Literacy (ODL). Two significant goals of the internship were to publish Asotin County Library data to Washington State’s open data portal, data.wa.gov, and to create documentation that will streamline future open publishing in Asotin County, both by the library and by other agency partners.
A terminal project in the data curation curriculum at the iSchool asked students to design a curation protocol for a data repository. We practiced weighing, defining, and assigning metadata elements from several well-used schemas, which gave us insight into how proper metadata use promotes data sharing. Understanding why repositories require data providers to complete certain elements, in turn, guided the development of a metadata template.
To make the data preparation and publication process easier for future contributions, it made sense to create a metadata template designed broadly for use with Socrata repository software. A principal objective of the template is that library staff can use it to prepare and publish data from other civic agencies as well.
In addition to record-level description and dataset metadata, the template captures context and preserves the primary data. When filled out prior to data ingest, it supplies the data partner with complete and accurate documentation that facilitates discovery and reuse. It also serves as a record for the dataset owner of what data have been uploaded to the portal and what it takes to replicate the process.
While thorough metadata is essential for data sharing, financial concerns can limit how much staff time is devoted to documentation. Economic sustainability was taken into account when designing the template. A cost analysis is included in the requested information for each dataset. The value of a dataset can be appraised by considering its impact and likelihood of reuse. Administrators can leverage this value and cost estimate to inform their organization’s open data plan.
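As a concrete sketch, the template’s contents can be represented as a simple structure. The field names and values below are my own illustrations; they approximate common Socrata portal fields plus the context and appraisal elements described above, not the library’s actual template.

```python
# Illustrative only: field names and values approximate common Socrata
# portal fields plus the context and appraisal elements described above.
dataset_metadata = {
    # Dataset-level description
    "title": "Asotin County Library Circulation (sample)",
    "description": "Monthly checkout counts by branch and material type.",
    "category": "Culture and Recreation",
    "tags": ["library", "circulation", "Asotin County"],
    "license": "Public Domain",
    "contact_email": "data@example.org",  # placeholder address
    # Context and provenance, preserved alongside record-level description
    "source_system": "ILS export",
    "collection_method": "automated monthly report",
    "update_frequency": "annual",
    # Appraisal fields supporting the cost analysis
    "preparation_hours_estimate": 4,
    "expected_reuse": "moderate",
}
```

Keeping the cost and reuse fields next to the descriptive metadata means a completed template doubles as the input to the appraisal an administrator would perform.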
The next goal, publishing datasets to the state data portal, required some preliminary steps including the selection of datasets to publish. Data privacy and other ethical considerations — significant aspects of Fundamentals of Data Curation — were instrumental in shaping the kind of data that was ultimately published. The question of data demand and the likelihood of reuse also factored into data selection.
To predict potential data reuse, I drafted comprehensive use cases relating to multiple stakeholders. A use case is a question answered or a problem solved through access to the dataset at hand. Drafting these use cases, informed by case studies and personal experience, was among the most advantageous steps in the data preparation process; the practice was emphasized in the cumulative final project of Advanced Data Curation. Use cases not only justify open publishing by illustrating how the data could be used by external stakeholders, but also guide data collection and retrieval.
Once the data were selected, based on use cases and open library data precedent, they needed to be transformed from the format in which they were exported from the library’s database into one that conforms to tidy data expectations. Version control was a vital component at this stage of the project: the original spreadsheet, exactly as exported from the database, is preserved in the template as well as in separate local files and in a dedicated web-based repository for the project. OpenRefine, an open source tool for working with tabular data, was integral in transforming the data while simultaneously logging those changes in a manner that will aid future automation.
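OpenRefine keeps its own exportable operation history, but the same preserve-then-transform pattern can be sketched in Python. Everything here is hypothetical: the file paths, column names, and transformations are placeholders, not the actual steps applied to ACL’s data.

```python
import hashlib
import json
from pathlib import Path

import pandas as pd

RAW = Path("exports/circulation_raw.csv")     # spreadsheet as exported from the ILS
CLEAN = Path("publish/circulation_tidy.csv")  # transformed copy bound for the portal
LOG = Path("publish/transform_log.json")      # record of every change applied
CLEAN.parent.mkdir(exist_ok=True)

# Preserve the original: a checksum lets the raw export be verified later.
checksum = hashlib.sha256(RAW.read_bytes()).hexdigest()

df = pd.read_csv(RAW)
steps = []

# Transformation 1: normalize column names to lowercase snake_case.
renames = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
df = df.rename(columns=renames)
steps.append({"op": "rename_columns", "mapping": renames})

# Transformation 2: drop rows missing a barcode (hypothetical column).
before = len(df)
df = df.dropna(subset=["barcode"])
steps.append({"op": "drop_missing_barcode", "rows_removed": before - len(df)})

df.to_csv(CLEAN, index=False)
LOG.write_text(json.dumps({"source_sha256": checksum, "steps": steps}, indent=2))
```

Logging each step alongside a checksum of the source file means the published dataset can always be traced back to, and regenerated from, the original export.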
Tidy data concepts were covered broadly in Advanced Data Curation. Clearly formatted data can be shared and analyzed most effectively, so tidy data standards were applied to the public library datasets chosen for publication. This project allowed me to implement proper data formatting standards, and it also gave me cause to investigate the Resource Description and Access (RDA) and Library of Congress rules that govern how bibliographic data are stored.
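To illustrate the tidy principle of one observation per row, consider a hypothetical circulation export with one column per month; the branch names and counts below are invented.

```python
import pandas as pd

# Hypothetical export: one row per branch, one column per month (wide format).
wide = pd.DataFrame({
    "branch": ["Branch A", "Branch B"],
    "2019-06": [1432, 310],
    "2019-07": [1501, 298],
})

# Tidy form: each row pairs one branch with one month and one count.
tidy = wide.melt(id_vars="branch", var_name="month", value_name="checkouts")
print(tidy)
```

The long, tidy shape is what portal visualization tools and most analysis software expect to receive.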
Through this project, I learned that, aside from knowledge of data best practices, the skills needed to prepare and publish open data are ones library staff already widely possess: we work with databases and information organization daily. What the courses I completed at the iSchool added was instruction in the importance of complete documentation, tidy data standards, version control, and preservation activities.
Moving forward, a few challenges remain. Further automation in the form of adaptable Python scripts will allow datasets to be formatted with little manual effort, and the API provided by the library’s catalog software can be used to routinely retrieve data. These steps toward automation will require knowledge of systems and coding, but once the protocols are in place, demands on staff time will decrease.
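As a rough sketch of what that retrieval step might look like: the endpoint, parameters, and authentication scheme below are entirely hypothetical, since each catalog vendor’s API differs.

```python
import csv
from datetime import date
from pathlib import Path

import requests

# Hypothetical endpoint and credentials; the real catalog API will differ.
API_URL = "https://catalog.example.org/api/reports/circulation"
API_KEY = "REPLACE_ME"

resp = requests.get(
    API_URL,
    params={"from": "2019-01-01", "to": date.today().isoformat()},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
records = resp.json()  # assume the API returns a list of flat JSON records

# Save the retrieved records as a raw CSV, ready for the tidying steps above.
Path("exports").mkdir(exist_ok=True)
with open("exports/circulation_raw.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```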
Despite increasingly effective automation, it’s clear from my experience with ODL and coursework that human intervention remains the most integral part of successful data curation and sharing.