Open Data for Businesses and Greater Good

Published in Kamu Data · 5 min read · May 30, 2019

What is Open Data?

From Wikipedia: Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open-source data movement are similar to those of other “open(-source)” movements.

The biggest sources of open data (with some examples) are:

Governments and municipalities

Census data and national statistics:

Data published under open government initiatives:

NGOs and non-profit organizations

Global development, immigration:

Geographic data:

News and mass media

GDELT — an amazing database of world news sentiment analysis
Civil Unrest Events

Science & Research

Health sciences provide a wealth of well-structured data.

The number of machine learning datasets is growing.

Historical datasets:

For-profit organizations

Sometimes share their data:

Or allow their data to be scraped for research purposes:

Sports

The popularity of sports betting is a strong incentive to produce very detailed datasets for improving forecasting techniques:

Why Should You Care?

Open Data for the Good of Society

If you’re not frustrated with the current political situation, no matter which country you call home, you’re probably one of the people who stopped reading or watching the news altogether. Enjoy your bliss. But if you are frustrated and willing to do something about it, here’s an option to consider:

The open data and open government movements are at the top of my list of ways to deal with the present situation, in which politicians and public figures manipulate people’s emotions and use highly divisive topics to push their agendas. I see these movements as a way to move society away from political arguments (aka “whoever screams loudest wins”) towards arguments grounded in factual data. This will not make the arguments disappear altogether, but I would much prefer arguments about different ways of interpreting the data to the current popularity contests and focus-group-polished speeches.

On the government side the hopeful trends are:

We have already seen some governments publishing their legislative documents on GitHub, as well as new offerings like GitHub for Governments. Publishing and maintaining data is a much more difficult problem that remains to be solved.

Some municipalities go a few steps further and publish their data in the open:

These include some amazing datasets, such as records of all emergency service dispatches, crime data, GIS data on city districts, property lots, and major transit lines, and even residential property assessment data.

Visualizing Vancouver’s city blocks in Mapbox

A great illustration of the potential of open data is the movement of data-driven journalism, where data is used either to corroborate stories, or even to discover them:

International Consortium of Investigative Journalists, which investigated tax evasion schemes in the multi-terabyte Panama Papers and Paradise Papers leaks
Bellingcat
Data Driven Journalism
FlowingData

Open Data for Businesses

If fixing governments is not at the top of your priority list, consider that many successful businesses have been built with open data.

Its uses can be categorized as:

Business optimisation in areas such as market analysis, targeted marketing, customer acquisition, and retention. For example, using census data to identify the geographical areas and demographics most receptive to the company’s products.

Enhancing existing products. For example, Google Maps uses GTFS data for transit schedules.

Business models centered around providing extra functionality on top of [what is or should be] open data. Examples:

Yelp uses a database of businesses and municipal health inspections and augments it with search, ranking, and social features
WalkScore uses location data of shops, schools, and transit to compute a convenience rating for rental apartments
Mapbox uses open data to provide high-quality mapping solutions
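
To make the WalkScore-style example more concrete, here is a minimal sketch of how a convenience rating could be computed from open location data. It is not WalkScore’s actual algorithm: the `Amenity` type, the category weights, the 1.5 km cutoff, and the linear distance decay are all assumptions chosen purely for illustration.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0

# Hypothetical weights per amenity category (assumed values, summing to 1.0).
CATEGORY_WEIGHTS = {"shop": 0.4, "school": 0.3, "transit": 0.3}

# Amenities at or beyond this distance contribute nothing (assumed cutoff).
MAX_DISTANCE_KM = 1.5


@dataclass(frozen=True)
class Amenity:
    category: str  # one of the keys in CATEGORY_WEIGHTS
    lat: float     # latitude in degrees
    lon: float     # longitude in degrees


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def convenience_score(lat, lon, amenities):
    """Return a 0-100 score based on the nearest amenity in each category."""
    score = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        distances = [
            haversine_km(lat, lon, a.lat, a.lon)
            for a in amenities
            if a.category == category
        ]
        if not distances:
            continue  # no amenity of this category: no credit
        nearest = min(distances)
        # Linear decay: full credit at 0 km, none at MAX_DISTANCE_KM or beyond.
        closeness = max(0.0, 1.0 - nearest / MAX_DISTANCE_KM)
        score += weight * closeness
    return round(100 * score, 1)
```

Calling `convenience_score(lat, lon, amenities)` with a list of `Amenity` records loaded from an open dataset gives one number per apartment. The product value then comes from what is layered on top of the data: search, ranking, and presentation.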

Open Data for Machine Learning

With the big data and machine learning fields at the peak of the hype cycle, expo floors at data conferences are filled with startups tackling big data and machine learning problems for enterprises. Very little of this hype has touched open data. All these companies are currently directing their efforts at the data produced internally within businesses.

Occurrence of topics in searches. Source: Google Trends

However, if a business needs a model built but doesn’t produce the data needed to build it internally (e.g. not all businesses that need a behavioral model of users are in the psychology field), this is when attention turns to open data. In this case, you’ll be looking at a very scarce collection of datasets, usually built and open-sourced by universities, for example:

Most open datasets are very old, well known, and used by almost every scientist in the respective field, simply because there aren’t many alternatives.

Building a dataset is a very labour-intensive task, which is typically done by universities. Unfortunately, they are not eager to open up the datasets they’ve built to the public. Even access to the data for research purposes is often gated to the point of near undiscoverability.

Conclusion

The supply of and demand for data are currently far out of proportion, and with demand growing so rapidly, the gap continues to widen. I believe that open data will play a significant role in closing this gap.

The open data space is still in its infancy. The volume is constantly growing, which is great, but even data that is already out there remains frustratingly underutilized and undervalued. It seems to be stuck in a vicious cycle of low investment and low returns, and needs just a little nudge in the right direction to gain momentum. I believe it has huge potential for society, and there are many creative ways to use open data waiting to be discovered by entrepreneurs.

The following is a list of problems that I believe need to be solved to unlock open data’s true potential:

Discoverability
Lack of cohesion
Accessibility
Collaboration
Poor handling of the time dimension
Monopolization of data
Lineage & Provenance

I will dedicate my next few posts to these issues, their causes, and long-term perspectives.

Interested?

Subscribe to our blog and check out kamu.dev, where we are working on re-imagining the future of data.