Mapping the Next Frontier of Open Data: Corporate Data Sharing

This essay first appeared in the Internet Monitor project’s second annual report, Internet Monitor 2014: Reflections on the Digital World. The report, published by the Berkman Center for Internet & Society, is a collection of roughly three dozen short contributions that highlight and discuss some of the most compelling events and trends in the digitally networked environment over the past year.

Stefaan G. Verhulst, with David Sangokoya (@dsango)

When it comes to data, we are living in the Cambrian Age. About ninety percent of the data that exists today has been generated within the last two years. We create 2.5 quintillion bytes of data on a daily basis — equivalent to a “new Google every four days.”[i]

Among the staggering statistics illustrating today’s rapid generation and volume of data, the number of mobile phone subscriptions is expected to reach 7 billion by the end of 2014, nearly equal to the world’s population.[ii] Terabyte after terabyte of data and metadata from these 7 billion mobile subscriptions is collected and stored by corporations. Given the amount of mobile, social and digital data available, corporations have access to a wealth of consumer data on their servers that can be aggregated and analyzed to track preferences, provide more targeted consumer experiences, and derive value towards the corporate bottom line.

As we witness the rapid intensification of “datafication,” access to data is becoming increasingly critical to addressing many of our most important social, economic and political challenges. While the rise of the Open Data movement has opened up over a million datasets, largely from government agencies and departments, data held by corporations has been harder to access. Most companies are unwilling to share the data they are collecting due to concerns over the legal ramifications of privacy and security breaches, as well as trade secrets and proprietary interests.

At the same time, we witness several early attempts by corporations to open up their datasets for analysis by researchers, public interest organizations and third parties to inform decision-making. By combining original datasets from corporate data providers with diverse, geo-spatial datasets (such as open government data and open science data), users can uncover greater insights and correlations across a range of societal trends.

Corporate data sharing refers to the emerging trend whereby companies are sharing anonymized and aggregated data for third-party users to mine for patterns and trends that can inform better policies and lead to greater public good. The trend was originally termed “corporate data philanthropy” at the World Economic Forum meeting in Davos in 2011 and has gained wider currency through Global Pulse, a United Nations data project that has popularized the notion of a global “data commons.”[iii]
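To make “anonymized and aggregated” concrete, the following is a minimal sketch in Python of one common pattern: drop individual identifiers, publish only per-region counts, suppress groups that are too small, and then combine the result with an open dataset. Everything in it is an assumption made for illustration: the sample records, the region names, the population figures, the threshold of three, and the function names are all hypothetical and are not drawn from any of the initiatives described in this essay. Small-group suppression on its own does not guarantee anonymity; it only shows the general shape of the practice.

```python
from collections import defaultdict

# Hypothetical threshold: regions with fewer distinct users are suppressed.
MIN_GROUP_SIZE = 3


def aggregate_by_region(records, min_group_size=MIN_GROUP_SIZE):
    """Count distinct users per region, dropping identifiers and small groups."""
    users_by_region = defaultdict(set)
    for user_id, region in records:
        users_by_region[region].add(user_id)
    # Only the counts leave this function; the user identifiers do not.
    return {
        region: len(users)
        for region, users in users_by_region.items()
        if len(users) >= min_group_size
    }


def join_with_open_data(shared_counts, open_data):
    """Pair each shared regional count with a public figure for the same region."""
    return {
        region: {"users": count, "population": open_data[region]}
        for region, count in shared_counts.items()
        if region in open_data
    }


# Invented sample rows: (user identifier, region name).
raw_records = [
    ("u1", "north"), ("u2", "north"), ("u3", "north"), ("u1", "north"),
    ("u4", "south"), ("u5", "south"),
    ("u6", "east"), ("u7", "east"), ("u8", "east"), ("u9", "east"),
]
# Invented stand-in for an open government dataset.
open_population = {"north": 1200, "south": 800, "east": 1500}

shared = aggregate_by_region(raw_records)
combined = join_with_open_data(shared, open_population)
print(shared)
print(combined)
```

In this toy run the “south” region has only two distinct users, so it is withheld from the shared output. A real data provider would choose its own threshold and would likely layer further safeguards on top.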

In what follows, we share early findings of our efforts to map this new frontier of open data, along with a set of research questions that must be addressed to better understand the value of corporate data sharing. Illustrating the practice and assessing the importance of opening corporate data will be necessary to accelerate access to the societally valuable data held by business today.

1569 Mercator map of the world

Taxonomy of current corporate data sharing practices

For all the growing attention corporate data sharing has recently been receiving, it remains very much a fledgling field. Much remains to be defined and understood. There has been little rigorous analysis of different ways of sharing, though our initial mapping of the landscape resulted in identifying six main categories of activity to date:

1. Research partnerships, in which corporations share data with universities and other research organizations. Through partnerships with corporate data providers, several research organizations are conducting experiments using anonymized and aggregated samples of consumer datasets and other sources of data to analyze social trends. For instance:

  • Yelp shares its data on neighborhood businesses with 30 universities for researchers to build tools and discover meaningful value in the data. Using shared data on Yelp businesses in the San Francisco Bay Area, an academic research team from U.C. Berkeley used a probabilistic model for natural language processing to detect subtopics across a dataset of over 200,000 Yelp business reviews. Their research uncovered correlations between positive ratings and service quality, giving business owners evidence for improving their services.[iv]
  • Collecting over 20 terabytes of data per month through satellite imagery, Intel is partnering with researchers at the University of California at Santa Barbara to map snow patterns in the Sierra Nevada mountains and understand California’s remaining water resources.[v]
  • Safaricom, one of Kenya’s leading mobile companies, shared a year of anonymized phone data with Harvard researchers to map how migration patterns contributed to the spread of malaria in Kenya. By combining Safaricom’s data on call locations with national infectious disease data, researchers were able to estimate and map routes that contributed to the spread of the disease.[vi]
  • Just recently, online communities like Imgur and Reddit have joined forces with a select group of academic institutions as part of the Digital Ecologies Research Partnership (DERP) in order to provide data and support research on Internet social behavior.[vii]

2. Prizes and challenges, in which companies make data available to qualified applicants who compete to develop new apps or discover innovative uses for the data. Companies typically host these contests in an effort to incentivize a wide range of civic hackers, pro-bono data scientists and other expert users to find innovative solutions with the available data. For instance:

  • In Ivory Coast and Senegal, Orange Telecom hosted a global challenge that allowed researchers to use anonymized, aggregated data to help solve various development problems, including those related to transportation, health, and agriculture.[viii]
  • In its 2014 Dataset Challenge, Yelp is making its data on restaurants in cities like Phoenix, Madison, and Edinburgh available to academic researchers to build models and provide research on urban trends and behavior (such as whether Yelp data can help predict environmental conditions of restaurants).[ix]
  • Last year, Spain’s regional bank BBVA hosted a contest inviting developers to create applications, services and content based on anonymous card transaction data. First prize went to an application called Qkly, which helps users plan their time by estimating when a given place will be most crowded so that they can avoid lines.[x]
  • In its “Big Data Challenge,” Telecom Italia pooled their data with partners from various Italian industries (local news, automobile, energy and weather) into one aggregated, geo-referenced dataset for participants to use for the competition. The data was available in batches and through an API, and contained millions of call data records, energy consumption records, tweets, and weather data points.[xi]

3. Trusted intermediaries, where companies share data with a limited number of known partners. Companies generally share data with these entities for data analysis and modeling, as well as other value chain activities. For instance:

  • South Africa-based telecom MTN makes anonymized call records available to researchers through a trusted intermediary, Real Impact Analytics, a data analytics firm that provides guided and predictive analytics solutions.[xii]
  • Twitter recently acquired the social media aggregator Gnip in order to provide its data products to clients. Gnip allows Twitter to provide streams of its dataset to its clients in addition to streams from available social media data.[xiii]

4. Application programming interfaces (APIs), which allow developers and others to access data for testing, product development, and data analytics. Developers who accept a company’s terms of service agreement gain access to streams of its data in order to build applications. For instance:

  • Through its metadata and click tracking functionality, Bitly estimates social trends and allows users to build tools from real-time data.[xiv]
  • Building on top of its transportation data, Uber recently shared its API with companies such as Hyatt, United Airlines and Smart Calendar in order to integrate its services across related industries and improve overall customer experience.[xv]

5. Intelligence products, where companies share (often aggregated) data that provides general insight into market conditions, customer demographic information, or other broad trends. For instance:

  • Google shares search query-based data in conjunction with data from the US Centers for Disease Control in order to estimate levels of influenza activity over time.[xvi]
  • Facebook Open Graph Search allows consumers and companies to mine social graphs for search query-based data, such as demographic and location data, “likes,” and multimedia. Companies such as Slate and Upworthy have used available data from Open Graph Search to optimize their headlines and increase readership.[xvii]

6. Corporate data cooperatives or pooling, in which corporations (and other important data holders such as government agencies) group together to create “collaborative databases” with shared data resources. These collaborations typically require an organizing partner as well as technical and legal frameworks surrounding the use and distribution of the data. For instance:

  • The White House recently announced the development and launch of a public-private data partnership, which will involve making existing climate data, tools and products more accessible to decision-makers.[xviii]
  • Through its Accelerating Medicines Partnership, the US National Institutes of Health (NIH) is helping organize data pooling among the world’s largest biopharmaceutical companies in order to identify promising drug and diagnostic targets for Alzheimer’s disease.[xix]

Mapping the Next Frontier

Beyond such broad taxonomies, there exists almost no systematic analysis of corporate data sharing. Much research remains to be done on the value proposition for corporations doing the sharing (or, indeed, for end-users), and on ways to maximize the potential and, importantly, minimize the potential harms of shared data.

A more comprehensive mapping of the field of corporate data sharing would draw on a wide range of case studies and examples to identify opportunities and gaps, and to inspire more corporations to allow access to their data (consider, for instance, the GovLab’s Open Data 500 mapping for open government data). From a research perspective, the following questions would be important to ask:

  • What types of data sharing have proven most successful, and which ones least?
  • Who are the users of shared corporate data, and for what purposes?
  • What conditions encourage companies to share, and what are the concerns that prevent sharing?
  • What incentives can be created (economic, regulatory, etc.) to encourage corporate data sharing?
  • What differences (if any) exist between shared government data and shared corporate data?
  • What steps need to be taken to minimize potential harms (e.g., to privacy and security) when sharing data?
  • What’s the value created from using shared corporate data?

Additional Reading

Pawelke, Andreas, and Anoush Tatevossian. 2014. ‘Data Philanthropy: Where Are We Now?’. Blog. UN Global Pulse Blog.

Stempeck, Matt. 2014. ‘Sharing Data Is A Form Of Corporate Philanthropy’. Harvard Business Review.

Verhulst, Stefaan. 2014. ‘Mapping The Next Frontier Of Open Data: Corporate Data Sharing’. Blog. The GovLab Blog.

References

[i] IBM, and Paul Zikopoulos. 2012. Understanding Big Data. 1st ed. New York: McGraw-Hill.

[ii] ITU. 2014. ‘World Telecommunication/ICT Indicators Database’. http://www.itu.int/en/ITU-D/Statistics/Pages/publications/wtid.aspx.

[iii] Kirkpatrick, Robert. 2013. ‘A New Type Of Philanthropy: Donating Data’. Harvard Business Review. Accessed October 7 2014. http://blogs.hbr.org/2013/03/a-new-type-of-philanthropy-don/.

[iv] UC Berkeley School of Information. 2014. ‘Students’ Data Analysis Uncovers Hidden Trends In Yelp Reviews’. http://www.ischool.berkeley.edu/newsandevents/news/20131004yelpdatasetchallenge.

[v] Gilpin, Lyndsey. 2014. ‘How Intel Is Using Iot And Big Data To Improve Food And Water Security’. Techrepublic. http://www.techrepublic.com/article/how-intel-is-using-iot-and-big-data-to-improve-food-and-water-security/.

[vi] Wesolowski, Amy, Nathan Eagle, Andrew J. Tatem, David L. Smith, Abdisalan M. Noor, Robert W. Snow, and Caroline O. Buckee. 2012. ‘Quantifying The Impact Of Human Mobility On Malaria’. Science, October 12.

[vii] Hern, Alex. 2014. ‘Reddit, Imgur And Twitch Team Up As ‘Derp’ For Social Data Research’. The Guardian. http://www.theguardian.com/technology/2014/aug/18/reddit-imgur-twitch-derp-social-data.

[viii] Blondel, Vincent D., et al. 2012. ‘Data For Development: The D4D Challenge On Mobile Phone Data’. arXiv preprint arXiv:1210.0137.

[ix] Yelp.com. 2014. ‘Yelp Dataset Challenge | Yelp’. http://www.yelp.com/dataset_challenge.

[x] Centrodeinnovacionbbva.com. 2014. ‘BBVA On The Trail Of Its Own Applications Ecosystem’. http://www.centrodeinnovacionbbva.com/en/blogs/entrepreneurs/post/bbva-trail-its-own-applications-ecosystem.

[xi] Telecom Italia Corporate. 2014. ‘HOME | Bigdata Challenge’. http://www.telecomitalia.com/tit/it/bigdatachallenge.html.

[xii] Realimpactanalytics.com. 2014. ‘Real Impact Analytics :: Churn Prediction With Social Network Analysis’. http://www.realimpactanalytics.com/blog/churn-prediction-social-network-analysis/.

[xiii] Blog.twitter.com. 2014. ‘Twitter Welcomes Gnip To The Flock | Twitter Blogs’. Accessed October 10 2014. https://blog.twitter.com/2014/twitter-welcomes-gnip-to-the-flock.

[xiv] ‘Announcing The Bitly Social Data Apis’. 2014. Accessed October 10 2014. http://blog.bitly.com/post/40026085295/announcing-the-bitly-social-data-apis.

[xv] Uber Blog. 2014. ‘Introducing The Uber API’. http://blog.uber.com/api.

[xvi] Google.org. 2014. ‘Google Flu Trends | United States’. http://www.google.org/flutrends/us/#US.

[xvii] Facebook Developers. 2014. ‘Overview’. https://developers.facebook.com/docs/opengraph/overview.

[xviii] Whitehouse.gov. 2014. ‘FACT SHEET: President Obama Announces New Actions To Strengthen Global Resilience To Climate Change And Launches Partnerships To Cut Carbon Pollution | The White House’. http://www.whitehouse.gov/the-press-office/2014/09/23/fact-sheet-president-obama-announces-new-actions-strengthen-global-resil.

[xix] Reardon, Sara. 2014. ‘Pharma Firms Join NIH On Drug Development’. Nature. doi:10.1038/nature.2014.14672.