What is Open Data?
From Wikipedia: Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open-source data movement are similar to those of other “open(-source)” movements.
The biggest sources of open data (with some examples) are:
Governments and municipalities
Census data and national statistics:
Data published under open government initiatives:
- Open Government Portal (CA)
- DigitalGov (US)
- Eurostat (EU)
- Open Data Monitor (EU)
- New York Open Data
- Vancouver Open Data
NGOs and non-profit organizations
Global development, immigration:
News and mass media
Science & Research
Health sciences provide a wealth of well-structured data.
The number of machine learning datasets is growing.
- Aerial bombings of WW1, WW2, Korean and Vietnam wars
- Trans-Atlantic Slave Trade Database
- Scottish Witchcraft Trials
Sometimes share their data:
Or allow their data to be scraped for research purposes:
The popularity of sports betting is a strong incentive to produce very detailed datasets for improving forecasting techniques:
Why Should You Care?
Open Data for the Good of Society
If you’re not frustrated with the current political situation, no matter which country you call home, you’re probably one of the people who stopped reading and watching the news altogether. Enjoy your bliss. But if you are frustrated, and are willing to do something about it, here’s an option to consider:
The open data and open government movements are at the top of my list of ways to deal with the present situation, where politicians and public figures manipulate people’s emotions and use highly divisive topics to push their agendas. I see them as a way to move society away from political arguments (aka “whoever screams loudest wins”) towards arguments grounded in factual data. This will not make the arguments disappear altogether, but I would very much prefer arguments about different ways of interpreting the data to the current popularity contests and focus-group-polished speeches.
On the government side the hopeful trends are:
We have already seen some governments publishing their legislative documents on GitHub, as well as new offerings like GitHub for Governments. Publishing and maintaining data is a much more difficult problem that remains to be solved.
Some municipalities go a few steps further and publish their data in the open:
These include some amazing datasets, such as records of all emergency service dispatches, crime data, GIS data on city districts, property lots, major transit lines, and even residential property assessment data.
A great illustration of the potential of open data is the data-driven journalism movement, where data is used either to corroborate stories or even to discover them:
- The International Consortium of Investigative Journalists, which investigated tax evasion schemes in the multi-terabyte leaks of the Panama and Paradise Papers
- Data Driven Journalism
Open Data for Businesses
If fixing governments is not at the top of your priority list, there are also many successful businesses built with open data.
Its uses can be categorized as:
Business optimisation in areas such as market analysis, targeted marketing, customer acquisition, and retention. For example, using census data to identify the geographical areas and target demographics most receptive to the company’s products.
Enhancing existing products. For example, Google Maps uses GTFS data for transit schedules.
Business models centered around providing extra functionality on top of [what is or should be] open data. Examples:
- Yelp uses a database of businesses and municipal health inspections and augments it with search, ranking, and social features
- WalkScore uses location data of shops, schools, and transit to compute a convenience rating for rental apartments
- Mapbox uses open data to provide high-quality mapping solutions
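To make the GTFS example mentioned above a little more concrete, here is a minimal sketch in Python of how a product might read transit schedules from a GTFS feed. It assumes the `stop_times.txt` file with the `trip_id`, `departure_time`, and `stop_id` columns defined by the GTFS specification. The function names, the sample rows, and the stop and trip IDs are hypothetical and for illustration only. A real application would also need files such as `trips.txt` and `calendar.txt` to know which trips run on a given day.

```python
import csv
import io


def to_seconds(hhmmss: str) -> int:
    """Convert a GTFS time such as '25:10:00' to seconds since the start of the service day.

    GTFS allows hours above 23 for trips that run past midnight.
    """
    hours, minutes, seconds = (int(part) for part in hhmmss.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def next_departures(stop_times_file, stop_id: str, after: str, limit: int = 3):
    """Return up to `limit` (departure_time, trip_id) pairs at `stop_id` at or after `after`."""
    threshold = to_seconds(after)
    matches = []
    for row in csv.DictReader(stop_times_file):
        if row["stop_id"] != stop_id:
            continue
        # Times may be left empty for stops that are not timepoints; skip those rows.
        if not row["departure_time"].strip():
            continue
        departure = to_seconds(row["departure_time"])
        if departure >= threshold:
            matches.append((departure, row["departure_time"], row["trip_id"]))
    matches.sort()  # earliest departure first
    return [(time, trip) for _, time, trip in matches[:limit]]


# Hypothetical sample data standing in for a real stop_times.txt file.
SAMPLE_STOP_TIMES = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,S1,1
T1,08:10:00,08:11:00,S2,2
T2,08:30:00,08:30:00,S1,1
T2,08:40:00,08:41:00,S2,2
T3,24:05:00,24:05:00,S1,1
"""

if __name__ == "__main__":
    departures = next_departures(io.StringIO(SAMPLE_STOP_TIMES), "S1", after="08:15:00")
    for time, trip in departures:
        print(time, trip)
```

Because GTFS is a plain CSV-based format, the standard library is enough for a sketch like this; with a real feed you would pass an open file handle for `stop_times.txt` instead of the in-memory sample.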
Open Data for Machine Learning
With the big data and machine learning fields at the peak of the hype cycle, expo floors at data conferences are filled with startups tackling big data and machine learning problems for enterprises. Very little of this hype has touched open data: these companies are currently directing their efforts at data produced internally within businesses.
However, if a business needs a model built but doesn’t produce the data needed to build it internally (e.g. not all businesses that need a behavioral model of users are in the psychology field), this is when attention turns to open data. In this case, you’ll be looking at a very scarce collection of datasets, usually built and open-sourced by universities, for example:
- Cohn-Kanade face expressions
- RAVDESS for audio emotions
- NRC Word-Emotion Association Lexicon
- ImageNet for image labelling
Most open datasets are very old, well known, and used by almost every scientist in the respective field, simply because there aren’t many alternatives.
Building a dataset is a very labour-intensive task, which is typically done by universities. Unfortunately, they are not eager to open up the datasets they’ve built to the public. Even getting access to the data for research purposes is often gated to the point of near undiscoverability.
The supply of and demand for data are currently completely out of proportion, and with demand growing so rapidly, the gap continues to widen. I believe that open data will play a significant role in closing this gap.
The open data space is still in its infancy. The volume is constantly growing, which is great, but even the data that is already out there remains frustratingly underutilized and undervalued. It seems to be stuck in a vicious cycle of low investment and low returns, and needs just a little nudge in the right direction to gain momentum. I believe it has huge potential for society, and there are many creative ways to use open data waiting to be discovered by entrepreneurs.
The following is a list of problems that I believe need to be solved to unlock open data’s true potential:
- Lack of cohesion
- Poor handling of time dimension
- Monopolization of data
- Lineage & Provenance
I will dedicate my next few posts to these issues, their causes, and long-term perspectives.
Subscribe to our blog and check out kamu.dev where we are working on re-imagining the future of data.