Laying the Groundwork for Transportation Research Data Management
The process of data curation is about providing high-quality data that are reusable, accessible, and ready to be archived for long-term preservation. Active and careful human intervention is key throughout this process, from organizing, describing, cleaning, and contextualizing to preserving, weeding, and removing. For this summer’s Open Data Literacy internship at the Washington State Department of Transportation (WSDOT) via the University of Washington Information School (UW iSchool), I tackled the “problem” of data management planning for research data at WSDOT Research & Library Services. This experience has reinforced for me the idea that data management and curation are largely a social problem and less a technology problem. I believe this is why library and information professionals make valuable contributions in this area. We are trained to examine information needs and information-seeking behaviors and to solve problems with human-centered approaches.
The problem statement I received at the start of my project says that “WSDOT lacks a defined process for creating data management plans” and that while data resources have been developed and managed according to the needs of individual business units, the organization “[does] not have an accurate picture of the nature of these data and information resources (suggesting a high likelihood of duplicative or even conflicting data), nor how they are being managed.”
Research data, in particular, occupy a unique area of data resources at WSDOT. They are not captured by the WSDOT Data Catalog (Data or Term Search, DOTS), Data Warehouse, or Data Library. Currently, the tangible output of WSDOT research exists in the form of final narrative reports, and these WA-RD reports are published on the agency’s website. Once cataloged and published on wsdot.wa.gov, the research is considered complete. Research data, on the other hand, are not managed by WSDOT as renewable assets and are not integrated into the research procedures. They are not discoverable or accessible to internal WSDOT personnel, contract researchers, or the public.
To investigate the problem statement, I parsed it into more granular problems to tackle with a socially minded approach:
Problem 1: We need to improve upon the ways in which researchers independently describe and manage their datasets, and we need to fill in gaps where datasets are missing to tell the whole story of a research project. Why datasets are collected and managed the way they are now — inconsistent from one research project to another — is a social question.
Problem 2: We need to further define what kinds of data we are looking to publish. What’s the general understanding among the research teams when we mention publishing complete datasets that support a discrete research project? What version(s) of datasets can best facilitate reuse? What kinds of datasets can we feasibly collect from researchers and package into sets of digital objects? In this problem, user needs, perceptions, and viability make up the core questions.
Problem 3: How do we ensure that the datasets associated with research projects are suitable for open publication, and according to whom? Considerations of sensitive data, data ethics, and liability issues at a government agency like WSDOT can very much be a social problem centered around human concerns.
Problem 4: Do decision makers recognize the value of research data and the work to care for them before and after publication? Individuals who have responsibilities in the various stages of research data management and open data publication need to have adequate resources and institutional support in order to succeed and sustain the work.
These are all messy inquiries that take time and larger cultural shifts to answer. For the Open Data Literacy project, I aimed to produce tangible tools, such as a data management plan (DMP) template, data inventory starter file, and a division of responsibilities worksheet, that will help guide WSDOT Research & Library Services in the next steps in setting and meeting its short- and long-term goals.
Align with national guidelines and peer organizations
To do so, I focused on seeking alignment and avoiding reinventing the wheel. There are existing guidelines to help address some of the questions. For example, with reference to problem 2, the U.S. Department of Transportation (USDOT) Public Access Plan requires that final research data be made publicly accessible. With reference to problem 3, the WSDOT Open Data Committee has been carefully evaluating ethical and liability issues surrounding open data publication at the agency, and it recently completed a final draft of a detailed and comprehensive Open Data Risk Register. In addition, with regard to sensitive data, Washington State OCIO Policy 141.10, Section 4, classifies data into four categories based on publishing restrictions and how the data must be handled. These are existing policies and guidelines that should be consulted and followed. The purpose of a project-level DMP, then, is to give PIs and Research Managers both a document and an opportunity to determine, based on the OCIO classification, whether certain datasets need to be excluded from open publication, and to record these decisions and sensitivity levels in the DMP.
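As a rough illustration of how a DMP or data inventory could record these decisions, here is a minimal Python sketch. The category labels, the dataset titles, and the rule that only the lowest category is openly published by default are assumptions made for illustration; they are not quoted from OCIO policy or from any WSDOT document and should be checked against both.

```python
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    """Four sensitivity levels. Labels are assumed; confirm against OCIO 141.10."""
    PUBLIC = 1
    SENSITIVE = 2
    CONFIDENTIAL = 3
    CONFIDENTIAL_SPECIAL_HANDLING = 4


@dataclass
class DatasetEntry:
    """One row of a hypothetical project-level data inventory."""
    title: str
    category: Category
    note: str = ""  # rationale recorded by the PI or Research Manager


def split_for_publication(entries):
    """Return (publishable, excluded).

    Assumption: only Category 1 datasets are published openly by default;
    everything else is held back pending review.
    """
    publishable = [e for e in entries if e.category == Category.PUBLIC]
    excluded = [e for e in entries if e.category != Category.PUBLIC]
    return publishable, excluded


# Hypothetical example datasets, not real WSDOT research data.
entries = [
    DatasetEntry("Pavement core test results", Category.PUBLIC),
    DatasetEntry("Driver survey responses", Category.CONFIDENTIAL,
                 "contains personal information"),
]
publishable, excluded = split_for_publication(entries)
```

Recording the category and a short note next to each dataset keeps the exclusion decision visible to whoever later prepares the data for deposit.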
The issue of managing research data assets is receiving increased attention at other state DOTs, and these peer organizations are confronting similar issues in parallel with WSDOT. Beyond national guidelines, it is helpful to compare practices at other state DOTs and look to them for resources.
Among state DOTs, I found that Wyoming DOT has the most comprehensive data management planning documents, accompanied by an agency metadata schema, a data dictionary, and a master contract for research projects containing language addressing research data management. These documents are included as part of WyDOT’s Research Center Guideline (last revised in June 2018). Missouri is another state whose DOT has followed WyDOT’s model and created its own research DMP template (finalized August 2018) and a suite of resources for MODOT researchers.
Tailor to WSDOT-specific needs
Creating a template for WSDOT research DMPs, I focused on ensuring standardization. The result is a template that closely resembles that of other state DOTs and even research institutions. Once I had the necessary structure, I thought about what kinds of customization this DMP template, to be used specifically for research projects, would need in order to ensure consistency across research projects. What are the decisions that can be made at the organizational level, so that it does not fall on researchers and PIs to come up with their own solutions, which would likely end up introducing redundancies or unwanted deviations?
One area to customize with more specific instructions, I believe, is the type of research datasets that should be submitted (problem 2). As mentioned above, the USDOT Public Access Plan specifically addresses final research data. “Final research data” appears to be the standard language in the DMP examples from WyDOT and MODOT, as well as the Pacific Transportation Consortium. WyDOT defines final research data as “recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” It goes on to give examples of what to exclude, such as “preliminary analyses,” “laboratory notebooks,” “communications with colleagues,” and “partial datasets.”
Given the national standard, it may be beneficial for WSDOT to refine the specific meaning of final research data within the agency and decide whether that is the desirable scope that would best serve its audience and data users. Would final datasets, as opposed to raw, less manipulated datasets (resulting, for example, from experiments and simulations), be more reusable to the research community? To determine which version is the most reusable and useful for supporting research findings, WSDOT may need to build use cases to understand how the agency’s open research data may be used and by whom. Such an investigation would also strengthen the argument for treating research data as valuable assets and help secure long-term support (problem 4).
One way to approach this is to systematically survey researchers working on WSDOT research in the current or recent biennium and gain a better understanding of their data management practices and their behaviors in accessing and depositing research data. This activity would lead us back to problem 1, that is, to user behaviors and management practices. This summer, I conducted a preliminary assessment of six research projects, in which I discovered a range of practices and levels of familiarity with DMPs and open data publishing. I asked each project’s PI or PhD student researcher to fill out a preliminary assessment about their data management practices. In our one-on-one, loosely structured conversations, I asked about data characteristics (e.g., attributes, timeline for collecting, expectations for storage). When possible, they gave me feedback on a draft DMP template. Going forward, a comprehensive assessment across WSDOT research projects may help inform the implementation of DMP requirements and detect gaps.
Laying the groundwork
At the conclusion of this summer’s Open Data Literacy project, three important open data documents are being finalized concurrently in August 2019 by the WSDOT Open Data Committee: the Risk Register IT Open Data Risks, the executive order on open data publication, and the Open Data Publication Manual. This development shows promising, emerging traction in open data publishing at the state DOT level. As next steps, my recommendations for WSDOT Research & Library Services include implementing a research DMP template, integrating data management into research procedures and project management, strategically prioritizing research projects based on funding programs, and evaluating repository options for depositing data to be openly published.
WSDOT’s research data serve two important communities that benefit from open data. One comprises users of government information and civic data; the other is the scientific community, as WSDOT research produces engineering data, environmental and land use data, materials science data, and more. The effective management, open publication, and long-term care of these data would benefit a range of stakeholders, including the public (by promoting democracy, accountability, and innovation) and other transportation and academic researchers doing similar work. Finally, implementing strategies for managing research data would be not only a proactive step to anticipate future requirements, but also an ideal way to ensure that assets generated by the agency are built to be renewable and effective information resources.