Metaspeak Meetup, Dec 14 2020

Paco Xander Nathan
Knowledge Technologies
Dec 1, 2020


Registration is open for the Metaspeak Meetup (a free online event) on December 14, 2020, hosted by LinkedIn: https://metadataday2020.splashthat.com/

This is the public portion of the Metadata Day 2020 workshop, which convenes leaders from several open source projects for graph-based dataset metadata management, along with thought leaders in metadata management and dataset governance from Google, IBM, UC Berkeley, etc.

Igor Perisic, CDO @ LinkedIn, will present our welcome address, followed by two panels of expert practitioners who will summarize outcomes from our workshop earlier in the day — one focused on use cases, the other on production at scale — plus Q&A from the audience.

A little over a year ago I presented “Overview of Data Governance” at Big Things Conference in Madrid. One of my slides referenced “Ground: A Data Context Service” by Joe Hellerstein, et al., at UC Berkeley, followed by a list of open source projects which focused on graph-based discovery for metadata about datasets. I cited work from LinkedIn, Lyft, Uber, Netflix, etc.

That DG talk had begun as a request from my colleague and co-author Ben Lorica — we’d been conducting a series of industry surveys, beginning in 2018, about “ABC” (AI, Big Data, Cloud) adoption trends.

2018 was a banner year for data governance: the launch of GDPR, the published findings about the Facebook/Cambridge Analytica scandal, and — although not in the headlines as much — a year in which hundreds of millions of people had their private info exposed (i.e., sold) through several large industry failures. Data governance was becoming top-of-mind for many IT leaders as a consequence — in contrast to what I’d seen in four decades of computing, during which DG had largely been swept under the rug.

Earlier in the talk I referenced “The Case for Open Metadata” by Mandy Chessell at IBM, plus her team’s work on the open standard ODPi Egeria and its reference implementation Apache Atlas. I’d also described related US government initiatives, such as the US Federal Data Strategy, H.R. 4174 evidence-based policymaking, etc. Plus, I’d mentioned the JupyterLab Metadata Explorer project that I’d helped work on with Brian Granger, et al., at Project Jupyter.

Speaking of those open source projects… I stuck that slide into the deck with a “Watch this space” caveat in the voice-over. It seemed worth tracking carefully, and that’s proven out over time.

For a particular cohort of organizations, the natural response to GDPR was simple:

  1. Collect metadata about dataset usage across your organization.
  2. Build a graph-based platform for metadata representation.
  3. Build out UX features for dataset discovery and reporting (aka, compliance).
  4. Recognize savings across your organization since your data scientists finally have a definitive source for collected knowledge about datasets.
  5. Push the thing public as an open source project and advocate for adoption by other organizations.
  6. Recognize business upside beyond compliance efforts, since this new lens into dataset usage allows insights about possible new business lines which were previously unavailable.
  7. Elevate reporting of these results up to executive levels.
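The first three steps above reduce to a simple idea: model datasets and their usage as a graph, then query it for discovery and compliance. Here is a minimal sketch in plain Python — all dataset names, owners, and tags are hypothetical, and real projects in this space use far richer data models:

```python
from collections import defaultdict

class MetadataGraph:
    """Toy metadata graph: datasets as nodes, usage lineage as directed edges."""

    def __init__(self):
        self.nodes = {}                 # dataset name -> attributes
        self.edges = defaultdict(set)   # upstream dataset -> downstream datasets

    def add_dataset(self, name, owner, tags=()):
        self.nodes[name] = {"owner": owner, "tags": set(tags)}

    def add_usage(self, upstream, downstream):
        # e.g. a pipeline reads `upstream` to produce `downstream`
        self.edges[upstream].add(downstream)

    def discover(self, tag):
        """Discovery: which datasets carry a given tag?"""
        return sorted(n for n, attrs in self.nodes.items() if tag in attrs["tags"])

    def downstream(self, name, seen=None):
        """Compliance reporting: everything transitively derived from `name`."""
        seen = seen if seen is not None else set()
        for nxt in self.edges[name]:
            if nxt not in seen:
                seen.add(nxt)
                self.downstream(nxt, seen)
        return seen

# Hypothetical catalog entries, for illustration only
g = MetadataGraph()
g.add_dataset("raw_events", owner="data-eng", tags={"pii"})
g.add_dataset("sessions", owner="analytics", tags={"derived"})
g.add_dataset("churn_features", owner="ml-team", tags={"derived"})
g.add_usage("raw_events", "sessions")
g.add_usage("sessions", "churn_features")

print(g.discover("pii"))                    # which datasets hold PII?
print(sorted(g.downstream("raw_events")))   # what would a GDPR deletion touch?
```

The same two queries — tag-based discovery and transitive lineage — are the workhorses behind compliance reporting in production-grade systems such as Amundsen and DataHub, which add search indexing, UI, and far more metadata dimensions on top.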

While GDPR compliance was largely satisfied by Step 3, these organizations kept pushing the envelope. They found both ROI and expanded markets from an activity that began as risk mitigation.

From my vantage point, the next logical step in business would be for these projects to begin to marry AI methods along with their graph-based metadata discovery practices, leading inevitably into something much larger. That’s on deck now.

You may have read the excellent, recent (2020–10) article “Almost Everything You Need To Know on Data Discovery Platforms” by Eugene Yan at Amazon. If not, please do. The article is filled with brilliant analysis and insights. Eugene curates a list of open source projects within this emerging category: https://github.com/eugeneyan/applied-ml#data-discovery

Eugene’s curated list of these data discovery projects continues to grow.

From the perspective of an equity partner at a Bay Area VC firm, “Something is definitely up!”

This much activity emerging from several different tech firms indicates both needs and capabilities, i.e., a new market category emerging. Indeed, two of these projects have recently spun out VC-funded tech start-ups.

Part of the pitch is simple: if you, as an enterprise firm, want to be competitive, you’ll need to leverage AI applications … because your competitors are already doing so. One consistent signal that Ben Lorica and I have found throughout our industry surveys is that when firms begin to recognize ROI from machine learning investments, they double down on those investments. Consequently, the gap between the “haves” and “have-nots” in the AI space is widening rapidly. One major hurdle among the “have-nots” who aren’t competitive: they must get their data in order. Like, yesterday. Data governance. Data strategy. Effective data engineering at scale. Visibility into what data they have. How their data assets can be leveraged. Taking advantage of their metadata to gain even more leverage.

Approximately half of enterprise firms fall into the “have-nots” category currently, and they need help. This emerging category of tech start-ups focused on graph-based solutions is key.

If you haven’t been involved much in graph-based data management, nor its applications which are rapidly spilling over into AI, let me tell you that “Something is definitely up!”

I’ve been assisting organizers at knowledge graph conferences such as Knowledge Connexions and The Knowledge Graph Conference. There’s also the fantastic workshop AKBC, albeit more about research. The dirty little secret is that KG practices are widespread throughout enterprise — with the caveat that historically it’s been difficult to get those teams to talk about their work in an industry-first forum, outside of academic conferences. The list of related topics goes well beyond graph DBs and knowledge graphs, into more specific technology areas such as interactive graph visualization, graph algorithms, embedding and graph AI (deep learning), statistical relational learning, probabilistic inference on networks, causal reasoning, persistent homology, and so on.

Full disclosure: I’ve been teaching industry courses about these topics for several years; I’m an advisor for related AI start-ups including Recognai and Primer; I’m a committer for a nascent little open source project called kglab which attempts to blend these application areas into a simple Python-based abstraction layer with “infinite laptop” scale-out provided by Ray from UC Berkeley and Anyscale; plus I’m super excited about this general area of knowledge graphs and math-meets-code-meets-use-cases. But I digress…

There was another twist. For those of you who haven’t been working much in the federal sector and spending time in DC, let me just say that mention of metadata management or data governance tends to get people excited. There’s work to be done, there’s recent policy to support that work, and the priorities (and consequences) for the public are crystal clear: the 2019 US federal data strategy in response to the 2017 bipartisan H.R. 4174 bill sets the stage for dramatic changes in how approximately 30% of the US economy leverages data in the Age of AI.

Compliance, privacy, fairness/bias, security, managing pandemic response, etc., are all key concerns, although the larger issue is about potentially missing out on opportunity. One problem: the US federal government — aside from perhaps the Intelligence Community — has not been renowned as an “early adopter” of technology.

Late last year I was co-chair for the “Rich Context” workshop in DC, which brought together expert practitioners from government, expert practitioners from Silicon Valley, plus top researchers from academia and corporate research labs. Ian Mulvany, also a co-chair there, helped wrangle metadata experts in the room into a “Foo Camp” styled unconference, including Natasha Noy from Google, Deborah McGuinness from RPI, Daniella Lowenberg from Dryad (University of California library system), and more. I had invited several colleagues, including Ed Kearns and his team handling dataset management at NOAA, plus my good friend Mark Grover, product manager for Amundsen at Lyft.

During that workshop I noticed something rare: people who had 30+ years working in the data-intensive parts of the US federal government got really excited about talking with leaders from data-intensive projects in Silicon Valley. One side thoroughly understood policy, compliance, security, accountability, social impact. The other side thoroughly understood how to deliver technology solutions that scale, based on open source, while leveraging data with machine learning. Ed Kearns in one corner of the room talking fast with Mark Grover — that told me most of what I needed to know!

To wit, even though decades of data governance products have resulted in a balkanized field of point solutions, the stars appear to be aligning for the next stage of evolution. Indications include:

  • Top research institutions (e.g., RISElab at UC Berkeley) are focusing on how to define the hard problems — “Something is definitely up!”
  • A flurry of recent open source projects from sophisticated tech firms in response to critical data governance needs, are earning ROI and gaining enterprise adoption — “Something is definitely up!”
  • VC firms are jumping into this space, defining a new category of tech start-ups — “Something is definitely up!”
  • Federal government recognizes the need and is reaching out to partner with Silicon Valley on solutions at scale — “Something is definitely up!”

That last item is especially important. The more data-intensive US federal agencies have budgets that would make most Silicon Valley execs scream, or cry, or both. While we may talk about trillion-dollar market valuations for unicorns on NASDAQ, try comparing that with a decade’s worth of the federal budget. Moreover, given the world’s need for better handling of data (AI ethics, anyone? Urgent pandemic responses?) this is a case of need meeting opportunity. These stars are aligning.

Again, projects such as Amundsen began in response to crucial needs for GDPR compliance, but then led to business upside. Think about that at a much, much larger scale. These stars are aligning. Your work, given the ubiquity of data applications, will probably depend on these innovations, soon.

Metadata Day is a workshop of leading practitioners in this field. We’ll hold a closed session in the morning and collaborate to reduce the outcomes into two presentations: one about use cases, and the other about production practices at scale.

Several people I mentioned above will be among the expert practitioners participating in this workshop: Joe Hellerstein, Mandy Chessell, Deborah McGuinness, Natasha Noy, Daniella Lowenberg, Ian Mulvany, Mark Grover — and more. Plus we’ll have leads from several of the open source projects mentioned. Shirshanka Das (DataHub), Nadiya Hayes, and Kapil Surlaker from LinkedIn are hosting the event — and I’m helping them.

Please join us!

Register (free) for the online event at https://metadataday2020.splashthat.com/

Social media hashtag: #metaspeak2020
