# The Power Law of Data Opening

Power Law distributions are among the most underrated statistical distributions (you can think of a distribution as the way in which something is shared out among a group or spread over an area). A Power Law is present when the distribution is top-heavy with a long tail. In other words, it describes a system where large is rare and small is common.

In his book Zero To One, Peter Thiel explains that the investment thesis of his Founders Fund relies on the distribution of success among tech startups. It’s almost binary: a few successful companies become huge (so huge that it’s almost impossible to grasp their scale), while almost every other startup fails.

The more we try to understand the world, the more these Power Laws appear. We are used to seeing everything as a Normal Distribution (Gaussian bell curves), but it turns out that there are a lot of phenomena where the distribution follows, in fact, a Power Law. This fact has some really interesting consequences.

“Gaussian and Paretian distributions differ radically. The main feature of the Gaussian distribution . . . can be entirely characterized by its mean and variance . . . A Paretian distribution does not show a well-behaved mean or variance. A power law, therefore, has no average that can be assumed to represent the typical features of the distribution and no finite standard deviations upon which to base confidence intervals . . .” McKelvey
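To make the quote a little more concrete, here is a short worked derivation under a simplifying assumption: a continuous power law density with exponent $\alpha$ above some minimum value $x_{\min}$. The symbols $C$, $\alpha$, $x_{\min}$ and $m$ are notation introduced for this sketch, not McKelvey's.

$$
p(x) = C\,x^{-\alpha}, \qquad x \ge x_{\min} > 0
$$

$$
\mathbb{E}[x^m] = \int_{x_{\min}}^{\infty} C\,x^{m-\alpha}\,dx = \frac{C\,x_{\min}^{\,m+1-\alpha}}{\alpha - m - 1} \quad \text{if } \alpha > m + 1, \text{ and diverges otherwise.}
$$

Under this assumption, the mean ($m = 1$) is finite only when $\alpha > 2$ and the variance ($m = 2$) only when $\alpha > 3$, which is one way to read the claim that such a distribution may have no well-behaved average or standard deviation.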

When we discuss Open Data with people, one question pops up all the time: do we know what type of data an organization should open up first? Intuition and experience naturally suggest mobility data, real-time data, and geo-referenced data (geo-shapes). However, we don’t have a real fundamental understanding of the distribution of popularity among open datasets... yet.

So I’ve analyzed the popularity of datasets on several Open Data portals. I started off by looking at four portals: data.opendatasoft.com, the portal where you can find every dataset opened by OpenDataSoft customers; data.gov, based on CKAN; nycopendata.socrata.com, operated by Socrata; and data.gouv.fr, the French national portal built on its own technology. Depending on the portal, popularity is measured by the number of views, the download count or the number of API calls. This is not really an issue, as we are only interested in the shape of the distribution, and each of those metrics is only an approximation of popularity anyway.

Here are the results, limited to the first few hundred datasets (note that you can explore the data via the link under each graph):

This is pretty impressive: a Power Law always emerges. The popularity and usefulness of Open Data thus seem to follow a Power Law! This needs more work and better comparisons, but I think it’s a good hypothesis for a better understanding of the dynamics of Open Data.
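For readers who want to run a similar check on their own portal, here is a minimal sketch of one quick diagnostic: rank the datasets by popularity and fit a straight line to log(popularity) against log(rank). This is not necessarily the method used for the graphs above. The function name `loglog_slope` is hypothetical, and the input is synthetic data generated for illustration, not figures from any of the four portals.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two positive counts. A roughly constant negative
    slope is what a power law looks like on a log-log rank plot.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic popularity counts (an assumption for illustration only):
# 300 datasets whose popularity decays as rank^-1.2.
synthetic = [10000 * rank ** -1.2 for rank in range(1, 301)]
slope = loglog_slope(synthetic)
print(round(slope, 3))
```

A straight line on a log-log plot is only a hint. Other heavy-tailed distributions, such as the log-normal, can look similar over a few hundred points, so a result like this supports the hypothesis rather than proving it.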

We can note that the distributions all look the same, even though the number of datasets clearly differs from one portal to another. Scale invariance is one of the main properties of this kind of distribution. It implies, if I’m right, that the dynamics of data opening apply to every Open Data portal.
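Scale invariance can be stated in one line. Assuming a pure power law density with exponent $\alpha$ (the symbols $C$, $\alpha$ and $k$ are notation introduced here for illustration), rescaling the variable by a factor $k$ only multiplies the density by a constant:

$$
p(x) = C\,x^{-\alpha} \quad\Longrightarrow\quad p(kx) = C\,(kx)^{-\alpha} = k^{-\alpha}\,p(x)
$$

Taking logarithms gives $\log p(x) = \log C - \alpha \log x$, a straight line whose slope does not depend on the scale. Under this assumption, a portal with a few hundred datasets and one with many thousands would show the same shape on a log-log plot.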

In addition, starting to understand what people want from Open Data and how these things evolve will allow the Open Data ecosystem to improve its strategies. By better serving all the different actors (see my previous article for more details), the ecosystem will grow and convince more and more organizations and people to open datasets.

“Our institutions (not just businesses, but also educational and governmental) are largely designed for a Gaussian world where averages and forecasts are meaningful.” John Hagel

A lot of businesses have been designed with the Power Law distribution of their market in mind.

1. The whole venture capital business is based on the idea that you need to find the teams that will build the ~top 100 companies worldwide. You can’t spend time on projects that might return some money but don’t have the potential to pay back the entire fund if they succeed. Peter Thiel’s Founders Fund seeks fewer than ten projects per year, in the hope that each of them will become the leader in its market.
2. Amazon’s big business model shift, when the firm opened its marketplace and logistics platform, aimed at efficiently targeting the long tail of products. By allowing new vendors to sell their ‘long tail products’ on its platform, and letting them manage the stock, Amazon seized the opportunity to master the whole market, from the big hits to the niche products.
3. Wikipedia understood very quickly what is often referred to as the 1% rule: the distribution of roles between `creators`, `editors` and `browsers`. Knowing that, they developed the tools and strategy to better serve and recognize the creators and to empower the editors. It led to exponential growth in the availability of quality knowledge used by an ever-increasing number of people.
4. Tons of other businesses can be analyzed through the Power Law distribution: Uber, for now, makes ~90% of its money in only 5 cities around the world while expanding its service. Google beat Yahoo! and many others because it gave easy access to the long tail of (at the time) newly developed websites. Twitter, which has a similar distribution of creators-editors-browsers to Wikipedia, just announced that it will no longer display ads to its high-value influencers.

If the distribution of the popularity of Open Datasets follows a Power Law, we have to think more broadly about how we address the different types of datasets.

“The rewards for achieving a better understanding of the Paretian world are enormous. Small moves, smartly made, can lead to exponential improvements in wealth creation provided they leverage the deep structures that define Paretian distributions” John Hagel

• How can we identify the most valuable datasets, help their owners release them, and create value through their opening?
• Should we treat those datasets the same way we treat less popular ones?
• How can the most popular datasets attract re-users, and how can we then make the most of the less popular ones?
• What should the business model be for opening the long-tail datasets?
• If only a few datasets make the Open Data Movement count, maybe we should focus mostly on helping organizations release these datasets.

Usually, a Power Law distribution arises when feedback influences future events or decisions. It’s going to be really interesting to investigate the fundamental mechanism of Open Data popularity. However, popularity is not necessarily wealth creation, and we should also analyze whether or not the popular datasets are the ones that drive returns to data publishers. And if they aren’t, what is the distribution of valuable open datasets?
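The feedback idea can be illustrated with a toy "rich get richer" simulation. This is only a sketch of one well-known mechanism (a Simon-style preferential attachment process), not a claim about how Open Data portals actually behave. The function name `simulate_views` and the parameters `steps`, `p_new` and `seed` are assumptions made for the example. At each step, either a new dataset is published with one view, or an existing dataset gets one more view with probability proportional to the views it already has.

```python
import random
from collections import Counter

def simulate_views(steps=20000, p_new=0.1, seed=0):
    """Toy preferential-attachment model of dataset popularity.

    Returns the list of view counts per dataset, most viewed first.
    """
    rng = random.Random(seed)
    # One entry per view, holding the id of the dataset that received it.
    # Picking a uniform random entry therefore picks a dataset with
    # probability proportional to its current number of views.
    tokens = [0]
    n_datasets = 1
    for _ in range(steps):
        if rng.random() < p_new:
            # A new dataset is published and gets its first view.
            tokens.append(n_datasets)
            n_datasets += 1
        else:
            # Feedback: already popular datasets attract the next view.
            tokens.append(rng.choice(tokens))
    return sorted(Counter(tokens).values(), reverse=True)

views = simulate_views()
top_1_percent = views[: max(1, len(views) // 100)]
print(len(views), "datasets;",
      "share of views held by the top 1%:",
      round(sum(top_1_percent) / sum(views), 2))
```

In this toy model a handful of early datasets end up with most of the views while the typical dataset has only one or two, which is the top-heavy, long-tailed shape discussed throughout this article.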

We are starting to have a sufficiently long historical perspective on Open Data, and it is now time to go further and start understanding the whole dynamics behind it…
