Human-in-the-loop AI for scholarly infrastructure

Check out a recent post, “New initiative to help with discovery of dataset use in scholarly work” by Christian Zimmermann on the RePEc blog.

TLDR: RePEc now takes dataset annotations that machine learning models infer from research papers, then asks the authors to confirm them. That’s a human-in-the-loop approach for enhancing metadata used in scholarly infrastructure. It’s part of a larger AI research effort called Rich Context.


Figure 1: Example article lookup in RePEc

NYU Coleridge Initiative

Figure 2: Knowledge graph representation of metadata about datasets

Overall, the Rich Context project is about working with linked data. Consider how social science researchers often work with sensitive data (called micro-data) about people: their incomes, debts, medical history, home addresses, relatives, prison records, etc. Typically a lengthy process is required to obtain authorization to access any of those datasets. While data mining and other “traditional” techniques from information retrieval, knowledge discovery, etc., would not be appropriate given the ethics and privacy concerns and compliance requirements, knowledge graph representation of metadata is a viable option.
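To make the idea concrete, here is a minimal sketch of a knowledge graph that holds only metadata, never the micro-data itself. It stores subject–predicate–object triples in plain Python. The identifiers (`pub:001`, `dataset:survey-a`, `agency:x`) and predicate names (`usesDataset`, `providedBy`) are hypothetical placeholders, not the vocabulary used by Rich Context.

```python
from collections import defaultdict

# Hypothetical metadata triples: (subject, predicate, object).
# Nothing here is sensitive micro-data; these are only links between records.
TRIPLES = [
    ("pub:001", "type", "Publication"),
    ("pub:001", "usesDataset", "dataset:survey-a"),
    ("pub:002", "type", "Publication"),
    ("pub:002", "usesDataset", "dataset:survey-a"),
    ("dataset:survey-a", "type", "Dataset"),
    ("dataset:survey-a", "providedBy", "agency:x"),
]


def build_index(triples):
    """Index subjects by (predicate, object) so reverse lookups are cheap."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].add(subject)
    return index


def publications_using(index, dataset_id):
    """Return the sorted ids of publications linked to a dataset."""
    return sorted(index.get(("usesDataset", dataset_id), set()))


if __name__ == "__main__":
    idx = build_index(TRIPLES)
    print(publications_using(idx, "dataset:survey-a"))
```

A query such as "which publications used this dataset?" then becomes a lookup over links, with no access to the underlying records required.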

Published works which result from this kind of research in turn drive policy at federal, state, and local levels of government. Unfortunately those published works rarely have computer-readable citations for the datasets they use — at least not much other than a brief text mention within some PDF stored behind a paywall. Don’t count on using a search engine to discover those kinds of details. Also keep in mind that many of those datasets are developed and maintained by federal agencies. On the one hand, agencies need to monitor how their datasets get used, by whom, when, where, etc. On the other hand, there’s a substantial learning curve (read: cost to taxpayers) for any new analysts and researchers to make expert use of that data, especially for work across multiple agencies, which is often the case for important policymaking.

ADRF Platform

In addition to supporting cross-agency data stewardship and social science research on sensitive data, the ADRF platform collects metadata about dataset usage. The NYU team is developing a metadata explorer and recommendation service for its users, and current work by Project Jupyter helps support this. See the upcoming JupyterLab Metadata Service, scheduled for an initial release in October 2019.

There’s an emerging category of open source projects that build knowledge graphs of metadata about dataset usage, from Lyft, Uber, Airbnb, LinkedIn, WeWork, Stitch Fix, etc. These platforms address issues related to data governance: risk, cost, compliance, etc. They also extend beyond compliance and risk to create value through AutoML and MLOps initiatives. Watch this space carefully if you want to track some of the most important AI trends in industry.

Figure 3: Data Governance meets MLOps meets AutoML meets Knowledge Graph about Metadata

Rich Context

Coleridge and partners leverage machine learning to identify dataset mentions in research publications. That work uses ML to infer the implied (read: missing) links, supplying the linked data that’s needed for knowledge graph representation. This fits into the general category of entity linking, where state-of-the-art work in natural language has made substantial leaps forward since 2018. NYU hosted a machine learning competition last year to help kickstart this research, and a team from Allen AI took first place. For more details, see “Where’s Waldo?” by Julia Lane at AKBC 2019, and also the upcoming book Rich Search and Discovery for Research Datasets.
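To illustrate what the task takes in and what it produces, here is a deliberately naive baseline: a case-insensitive lookup of known dataset names in a passage of text. This is not the approach used by the competition entries, which rely on machine learning. The alias table and the `dataset:` ids are hypothetical.

```python
import re

# Hypothetical alias table: surface forms mapped to a canonical dataset id.
ALIASES = {
    "survey of consumer finances": "dataset:scf",
    "scf": "dataset:scf",
    "current population survey": "dataset:cps",
}


def find_dataset_mentions(text, aliases=ALIASES):
    """Return (matched_text, dataset_id, start_offset) for each alias found."""
    mentions = []
    for alias, dataset_id in aliases.items():
        # Word boundaries keep a short alias from matching inside longer words.
        pattern = r"\b" + re.escape(alias) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((match.group(0), dataset_id, match.start()))
    # Report mentions in the order they appear in the text.
    return sorted(mentions, key=lambda m: m[2])


if __name__ == "__main__":
    sample = ("We use the Survey of Consumer Finances (SCF) "
              "and the Current Population Survey.")
    for mention in find_dataset_mentions(sample):
        print(mention)
```

A lookup like this only finds names it already knows and cannot resolve ambiguous abbreviations, which is one reason the harder cases call for learned models.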

ML Competition based on GitHub

For the current research focus in Rich Context, dataset links inferred by machine learning models — in other words, the results from the ongoing ML competition on GitHub — are then confirmed by authors using a feedback loop via RePEc. See figure 4 below.

Figure 4: Semi-supervised learning via RePEc, etc.
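The feedback loop can be sketched in a few lines. The sketch assumes each inferred link carries a model score and that an author's answer is a simple yes or no. The class and function names are hypothetical and do not reflect the RePEc or Rich Context implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateLink:
    publication_id: str
    dataset_id: str
    score: float                      # model confidence, assumed to lie in [0, 1]
    confirmed: Optional[bool] = None  # stays None until an author responds


def record_author_response(candidates, publication_id, dataset_id, is_correct):
    """Attach an author's yes/no answer to the matching candidate link."""
    for link in candidates:
        if (link.publication_id, link.dataset_id) == (publication_id, dataset_id):
            link.confirmed = is_correct
            return link
    raise KeyError((publication_id, dataset_id))


def split_by_feedback(candidates):
    """Separate confirmed links, rejected links, and links still pending."""
    confirmed = [c for c in candidates if c.confirmed is True]
    rejected = [c for c in candidates if c.confirmed is False]
    pending = [c for c in candidates if c.confirmed is None]
    return confirmed, rejected, pending
```

In this sketch, confirmed links would become edges in the knowledge graph, while confirmed and rejected links together could serve as labeled examples for the next round of model training.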

Human in the loop for Scholarly Infrastructure

As Christian Zimmermann described in the RePEc blog article:

We hope eventually to automate the search and discovery of datasets and highlight their value as a scholarly contribution in the same way we collect information about publications and citations. The results should help inform government agencies about the value of data that they produce and work with, empirical researchers to find and discover valuable datasets and data experts in their scientific fields, and policy makers realize the value of supporting investments in data.

Another way to visualize this: consider how the metadata confirmed directly by the authors feeds back into the knowledge graph. A related term for that approach is social knowledge collection (see figure 5).

Figure 5: Using a human-in-the-loop feedback loop to engage the author community

As far as we know, this is the first instance of using semi-supervised learning for improving dataset attribution in social science research. It bodes well for human-in-the-loop AI approaches for augmenting scholarly infrastructure in particular, and for support of the US federal data strategy in general.

Call To Action

Figure 6: Submitting metadata about datasets used in research publications

Meanwhile, we’re working to get this kind of social knowledge collection integrated into publisher workflows, author profiles, etc.

Many thanks to Sloan, Schmidt Futures, and Overdeck for their funding of this work.
