Check out a recent post, “New initiative to help with discovery of dataset use in scholarly work” by Christian Zimmermann on the RePEc blog.
TL;DR: RePEc now asks authors to confirm dataset annotations that machine learning models have inferred from their research papers. That’s a human-in-the-loop approach for enhancing metadata used in scholarly infrastructure. It’s part of a larger AI research effort called Rich Context.
RePEc is a decentralized bibliographic database for research in Economics and related fields, which provides search for working papers, journal articles, books, and software components. The collected metadata and services are all maintained by volunteers, across more than 100 countries. RePEc has a reputation for incorporating a broad range of works — providing RePEc handle unique identifiers even in cases where those works do not yet have a DOI assigned. RePEc fosters a close relationship with many authors and works to gather feedback directly from them to correct and enrich the collected metadata.
NYU Coleridge Initiative
The partnership mentioned in the article is a collaboration with the NYU Coleridge Initiative. In particular, the Rich Context project by Coleridge Initiative focuses on knowledge graph representation and inference for metadata about datasets from federal agencies. These datasets get used in social science research for evidence-based policymaking. The knowledge graph work incorporates metadata about: datasets, research publications, authors/researchers, research projects, data stewards, data providers, subject headings (e.g., MeSH), and so on. See figure 2 below for a visualization of entities represented in the graph.
Overall, the Rich Context project is about working with linked data. Consider how social science researchers often work with sensitive data (called micro-data) about people: their incomes, debts, medical history, home addresses, relatives, prison records, etc. Typically a lengthy process is required to obtain authorization to access any of those datasets. While data mining and other “traditional” techniques from information retrieval, knowledge discovery, etc., would not be appropriate given the ethics and privacy concerns and compliance requirements, knowledge graph representation of metadata is a viable option.
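To make the idea of a knowledge graph over metadata more concrete, here is a minimal sketch in Python. It is not the Rich Context implementation: the identifiers (`pub:001`, `dataset:survey-a`, and so on) and the predicate names (`usesDataset`, `providedBy`) are hypothetical, chosen only to illustrate the entity types listed above. Note that the graph holds only metadata about who used which dataset, never the sensitive records themselves.

```python
from collections import defaultdict

# Hypothetical identifiers and predicate names, for illustration only.
# Each triple is (subject, predicate, object) and describes metadata,
# not the underlying micro-data.
TRIPLES = [
    ("pub:001", "type", "Publication"),
    ("pub:001", "hasAuthor", "author:jdoe"),
    ("pub:001", "usesDataset", "dataset:survey-a"),
    ("pub:002", "type", "Publication"),
    ("pub:002", "usesDataset", "dataset:survey-a"),
    ("dataset:survey-a", "type", "Dataset"),
    ("dataset:survey-a", "providedBy", "provider:agency-x"),
]


def build_index(triples):
    """Index subjects by (predicate, object) for simple lookups."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].add(subject)
    return index


def publications_using(index, dataset_id):
    """Return the sorted publication ids linked to a dataset."""
    return sorted(index.get(("usesDataset", dataset_id), set()))


index = build_index(TRIPLES)
print(publications_using(index, "dataset:survey-a"))
# → ['pub:001', 'pub:002']
```

A production system would more likely use a standard linked data format and a graph store, but the query shape stays the same: follow a typed link from a dataset back to the publications that used it.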
Published works which result from this kind of research in turn drive policy at federal, state, and local levels of government. Unfortunately those published works rarely have computer-readable citations for the datasets they use — at least not much other than a brief text mention within some PDF stored behind a paywall. Don’t count on using a search engine to discover those kinds of details. Also keep in mind that many of those datasets are developed and maintained by federal agencies. On the one hand, agencies need to monitor how their datasets get used, by whom, when, where, etc. On the other hand, there’s a substantial learning curve (read: cost to taxpayers) for any new analysts and researchers to make expert use of that data, especially for work across multiple agencies, which is often the case for important policymaking.
To address those issues, the Coleridge Initiative at NYU produces the Administrative Data Research Facility (ADRF), which runs on Amazon GovCloud with FedRAMP “moderate” compliance. ADRF was cited as the first federal example of secure access to confidential data in the final report of the Commission on Evidence-Based Policymaking, and it is now used by more than 15 agencies. For further details, see: “Evidence-based decision making: What DOE, USDA and others are learning” by Wyatt Kash in FedScoop (2019-06-28).
In addition to supporting cross-agency data stewardship and social science research on sensitive data, the ADRF platform collects metadata about dataset usage. The NYU team is developing a metadata explorer and recommendation service for its users, and current work by Project Jupyter helps support this. See the upcoming JupyterLab Metadata Service scheduled for an initial release in October 2019. There’s an emerging category of open source projects for knowledge graphs of metadata about dataset usage, from Lyft, Uber, Airbnb, LinkedIn, WeWork, Stitch Fix, etc. These platforms address issues related to data governance: risk, cost, compliance, etc. They also extend beyond compliance and risk to create value through AutoML and MLOps initiatives. Watch this space carefully if you want to track some of the most important AI trends in industry.
Collecting metadata and making recommendations is where Rich Context comes into the picture … For more details about Rich Context, see its white paper and also a related article “Themes and Conferences per Pacoid, Episode 12” for an overview of where social science research plays vital roles in current industry challenges faced in data science.
Coleridge and partners leverage machine learning to identify dataset mentions in research publications. That work uses ML to infer the implied (read: missing) links, supplying the linked data that’s needed for knowledge graph representation. This fits into the general category of entity linking, where state-of-the-art work in natural language has made substantial leaps forward since 2018. NYU hosted a machine learning competition last year to help kickstart this research and a team from Allen AI took first place. For more details, see “Where’s Waldo?” by Julia Lane at AKBC 2019, and also the upcoming book Rich Search and Discovery for Research Datasets.
ML Competition based on GitHub
As a next iteration, a leaderboard competition hosted on GitHub is just now launching. See https://github.com/Coleridge-Initiative/rclc for details. That follow-up to the first competition includes a public corpus for Rich Context, based on the knowledge graph work at ADRF plus partnering with USDA, Bundesbank, Digital Science, SAGE Pub, GESIS, ResearchGate, and others. So far the knowledge graph work makes use of Dimensions API, OpenAIRE, and RePEc API. Subsequent steps will address author reconciliation, plus recommendations for social science researchers and potentially workflow meta-learning (AutoML) support via JupyterLab and other open source.
For the current research focus in Rich Context, dataset links inferred by machine learning models — in other words, the results from the ongoing ML competition on GitHub — are then confirmed by authors using a feedback loop via RePEc. See figure 4 below.
Human in the loop for Scholarly Infrastructure
Of course one of the biggest advances in machine learning over the past decade has been deep learning, and that’s used by embedding models for entity linking. Deep learning depends on having carefully labeled data for training. One popular approach in industry to label data has been to leverage semi-supervised learning, in a form often called active learning or human-in-the-loop (HITL). In other words, when machine learning models have high confidence in their predicted results, use those. Otherwise, refer the decision to a human expert to provide a label, then use that expert feedback to train better models. In this case, the research publication authors are the human experts. RePEc sends the inferred datasets per publication to its community of authors, who then confirm or reject the linked data. That feedback completes its round trip back to Rich Context, to augment the knowledge graph.
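The routing logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by RePEc or Rich Context: the `InferredLink` class, the `route` and `apply_feedback` helpers, the 0.9 threshold, and all identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferredLink:
    """A model-inferred publication-to-dataset link (hypothetical schema)."""
    publication: str
    dataset: str
    confidence: float


def route(links, threshold=0.9):
    """Split links into auto-accepted ones and ones needing author review.

    The threshold value is an assumption made for this sketch.
    """
    accepted, needs_review = [], []
    for link in links:
        if link.confidence >= threshold:
            accepted.append(link)
        else:
            needs_review.append(link)
    return accepted, needs_review


def apply_feedback(needs_review, answers):
    """Apply author answers, keyed by (publication, dataset) -> bool.

    Links without an answer yet are left out of both result lists.
    """
    confirmed, rejected = [], []
    for link in needs_review:
        key = (link.publication, link.dataset)
        if key not in answers:
            continue
        (confirmed if answers[key] else rejected).append(link)
    return confirmed, rejected


links = [
    InferredLink("pub:001", "dataset:a", 0.97),
    InferredLink("pub:002", "dataset:a", 0.55),
    InferredLink("pub:003", "dataset:b", 0.40),
]
accepted, needs_review = route(links)

# Simulated author responses; in practice these would come from the authors.
answers = {("pub:002", "dataset:a"): True, ("pub:003", "dataset:b"): False}
confirmed, rejected = apply_feedback(needs_review, answers)

graph_links = [(l.publication, l.dataset) for l in accepted + confirmed]
print(graph_links)
# → [('pub:001', 'dataset:a'), ('pub:002', 'dataset:a')]
```

Both the confirmed and the rejected links are useful: confirmed links augment the knowledge graph, while the full set of author answers can serve as labeled examples when retraining the models.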
As Christian Zimmermann described in the RePEc blog article:
We hope eventually to automate the search and discovery of datasets and highlight their value as a scholarly contribution in the same way we collect information about publications and citations. The results should help inform government agencies about the value of data that they produce and work with, empirical researchers to find and discover valuable datasets and data experts in their scientific fields, and policy makers realize the value of supporting investments in data.
Another way to visualize this: consider how the metadata confirmed directly by the authors provides feedback into the knowledge graph. A related term used to describe that approach is social knowledge collection (see figure 5):
As far as we know, this is the first instance of using semi-supervised learning for improving dataset attribution in social science research. It bodes well for human-in-the-loop AI approaches for augmenting scholarly infrastructure in particular, and for support of the US federal data strategy in general.
Call To Action
Of course, so much work with machine learning is in fact compensating for a lack of more direct metadata annotation “upstream” … Clearly, if researchers had more formal ways to make citations about the datasets they’ve used, and if they provided that metadata as an integral part of how papers get published, then we could simply cut out the ML middleman. The many efforts to promote reproducible research are heading exactly in that direction. However, until that becomes accepted practice (read: required by funding sources and publishers), there are other ways to provide better metadata. For example, the Coleridge Initiative has a form at https://coleridgeinitiative.org/pubsubmission (shown in figure 6):
Meanwhile we’re working to get this kind of social knowledge collection integrated into publisher workflows, author profiles, etc.