Your next (successful) data project will share data

Dean Allemang
5 min read · Jan 3, 2023

We’ve heard a lot in the past few years about how “Data is king!” This awareness of the importance of data in business, science and public policy has resulted, to a large extent, from the success of machine learning algorithms in providing insightful predictions about complex systems that we barely understand, as long as we have enough data to describe them.

Since data is so valuable, you could imagine that these developments would result in a tendency to hoard data; if I keep my data to myself, I get to reap that value and not have to share it. To some extent, this happens; many organizations, and even individuals within organizations, work very hard to keep their data secret and safe.

But there are many cases in which data becomes more valuable when it is combined with other data. This principle has been well known in the scientific community for ages; scientific discoveries happen when we find a correlation between one line of research and another. Scientists and other academics have always been rewarded for publishing their results (“publish or perish”), but in recent years, more and more funding agencies have required scientists to publish their data as well.

Science isn’t the only field of endeavor where data publication is important. Public policy is difficult in the best of times. Good policy decisions are basically predictions of what outcomes will result from which actions. When these decisions are based on data, they have a better chance of being accurate. We saw this in 2020, when there was much controversy over which policies we should adopt to control the spread of COVID-19. Laboratories, government agencies and NGOs collected and disseminated data on infections, mortality, symptom progression, and other information about the spread of the disease. The Associated Press assembled and curated data, some of which it made available to its subscribers, and some of which it published to the public. This data was instrumental in informing public policy. You can see the AP data portal at https://data.world/associatedpress, which includes data about elections and government spending, in addition to a variety of data about COVID infection rates and policy.

It isn’t hard to think of public policy decisions that are impacted by data: election strategies are adjusted based on polls and election outcomes, traffic patterns in cities inform policy about urban planning, and demographic data informs zoning and urban development. In fact, it is difficult to think of any aspect of public policy that won’t be improved with good data sharing practices.

Private enterprises are data-driven, too: product sales, service success rates, supply chain and distribution are all aspects of a business that rely on data to inform effective decision making.

Given the importance of data sharing, it is a bit surprising that data sharing is just an afterthought in most enterprises and in the world at large. It has been over a decade since Tim Berners-Lee exhorted us to demand “Raw Data Now!” in his TED talk, but it is still difficult to get raw data about issues of great importance.

Enterprises are at least aware that they want or need to share data inside the enterprise; there is now a whole product category called “Data Catalog”, and it is now commonplace for large enterprises to have a Chief Data Officer. In its 2004 review (section 13.3) of the events leading up to the 2001 attacks, the 9/11 Commission recommended a change of attitude for intelligence data from the previous “need to know” default to a “responsibility to provide”. Despite this awareness, the state of practice of data sharing within and between organizations is still mostly haphazard and piecemeal.

Whether we are talking about data sharing within an enterprise, or among scientific collaborators around the globe; whether we are interested in data about national security or public policy, whether we are advancing scientific knowledge or developing product and market strategy, the basic endeavor is the same. There are three basic things we need to do when sharing data: make our data available to others, find data that others have made available, and merge the data we have found to provide further insights.

Make our data available to others. There are a lot of ways to make data available. We have been sharing web pages on the World Wide Web for decades. But sharing data presents different challenges than sharing web pages, including scale (datasets are typically much larger than web pages), formatting (HTML was a key enabling technology for web pages), access control (who can see it and what are they allowed to do), and many more.

Find data that others have made available. The flip side of publishing data is subscribing to it. How do we find data that is of interest to us? How do we even know what a dataset is, so that we know it is useful to us? How do we make that data actionable? The challenges for consuming data are as plentiful as the challenges for publishing it.

Merging data to provide synergy. Datasets are more valuable when they can be linked to other datasets. But in a world where we are finding data that is published by a variety of providers, the representation of data is necessarily distributed. Some of the better-known issues in data distribution have to do with identity (how can I tell when one dataset refers to the same entity as another?), data alignment (does one dataset say the same thing about the entities as another?), and reference (what does a data value refer to?).
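To make these three issues a little more concrete, here is a minimal Python sketch of merging two tiny datasets from different publishers. Everything in it is hypothetical: the region names, the codes, the numbers, the `same_as` mapping and the `merge_datasets` function are all invented for illustration, and a real data sharing situation would need a far more careful approach to each issue.

```python
# Publisher A (hypothetical) identifies regions by name.
cases_by_name = [
    {"region": "North District", "cases": 120},
    {"region": "South District", "cases": 80},
]

# Publisher B (hypothetical) identifies the same regions by code,
# and reports population in thousands rather than in persons.
population_by_code = [
    {"code": "ND", "population_thousands": 40},
    {"code": "SD", "population_thousands": 20},
]

# Identity: an explicit statement of which name and which code
# refer to the same real-world entity.
same_as = {"North District": "ND", "South District": "SD"}


def merge_datasets(cases, population, identity):
    """Join the two datasets on the entities that `identity` links together."""
    # Reference/alignment: make explicit what each value refers to by
    # converting "thousands of persons" into "persons" before combining.
    persons_by_code = {
        row["code"]: row["population_thousands"] * 1000 for row in population
    }
    merged = []
    for row in cases:
        code = identity.get(row["region"])
        if code is None or code not in persons_by_code:
            # No known link between the datasets: skip rather than guess.
            continue
        persons = persons_by_code[code]
        merged.append(
            {
                "code": code,
                "region": row["region"],
                "cases": row["cases"],
                "population": persons,
                # The new insight that neither dataset offers on its own.
                "cases_per_1000": row["cases"] * 1000 / persons,
            }
        )
    return merged


merged = merge_datasets(cases_by_name, population_by_code, same_as)
for row in merged:
    print(row["region"], row["cases_per_1000"])
```

Note that the merge only works because someone wrote down the `same_as` mapping and knew the units of the population figure. In this toy example I supplied both by hand; at scale, that knowledge has to come from somewhere.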

Data sharing situations differ in many ways, but each of them will require some approach to each of these activities. These approaches might involve technology, standardization efforts, or even workflow management. Not all of these approaches will be relevant in every situation, but one thing is sure: data sharing is too important to leave these things to chance.

Subsequent entries in this blog will focus on data sharing, and in particular, on the aspects outlined here of publishing, finding and merging data in a distributed setting. This will take us through a variety of related topics, including graph databases, data catalogs, distributed data modeling, ontologies, web semantics, and much more. I plan to release a new entry on one of these topics each week through 2023; I look forward to your comments and discussions.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.