The economics of data — Part I
We have all read the headline “data is the new oil” a few times. It sounds catchy, but it is a fairly inaccurate description that doesn’t really reflect how technologists and modern businesses should see data. There are certainly some similarities: both are “raw material” that can be further refined or combined to produce other products. However, that is almost where the similarities end. Data is not a single commodity with a quotable spot price. Data comes in different shapes and sizes; some data is scarce while other data is abundant. Some data is recurrent. Some data is very sensitive and requires additional layers of protection. I have never heard of a barrel of oil requiring an encryption layer.
“I never heard of a barrel of oil that required encryption”
Data as an economic good
There is still room for economic research and theory around digital goods and the digital economy. Before the digital age, goods were physical and carried a fair amount of cost to produce. There were, and still are, services, which are not physical objects but typically require the labor of a specialist. Economists also categorize these economic objects using terms like “excludability” and “rivalry”, which describe whether access to a good or service can be restricted and whether one person’s consumption leaves less for others. In the digital world, both are hard to achieve, which is probably one of the primary reasons most digital services are either “free” or very low cost, especially to consumers.
Like the physical world, the digital world generates data about the transactions flowing through it. The biggest difference is that this data is much easier to capture and process when it consists of bits transmitted over a wire and stored on a hard drive. The other big use case of technology has been communication. It is quite amazing to me how many messages, shares, likes and so on are produced. It appears that humans have always wanted to talk to each other; it just used to cost too much.
“It appears that humans have always wanted to talk to each other; it just used to cost too much.”
More human activities are becoming virtual, and more socioeconomic activities are becoming digital. In the digital world, it is easy to capture data about both the activities and the actors. This data can then be processed to extract valuable information that furthers our understanding of human societies, of individuals’ choices and preferences, and of human-environment interactions. A whole range of applications can be built on this data. Data is fuel for a cognitive engine that can produce new knowledge. Data is food for thought, advertisement algorithms, route planners and drug development efforts.
“Data is food for thought, advertisement algorithms, route planners and drug development efforts.”
Having mentioned fuel, we can now refine the “new oil” analogy. Oil is one kind of fuel; it has a certain value based on its characteristics. It is not, however, the only kind of fuel. There are many kinds of fuel, each with distinct characteristics and uses. I originally thought of data as a basket of commodities; now I think of it more specifically as a basket of energy commodities. Data has the characteristics needed to be an intangible good: it has value. The easiest proof that data has value is that we are willing to pay for it. We are willing to lay fiber lines, launch satellites and build server farms to transmit and store it.
And now, having seen relative success and experienced the benefits of digitization, we are moving to digitize even more of our world: sensors to capture physical data, more connectivity to transmit it, faster and cheaper storage to record it. We have seen what digital did to commerce and communication, and now we are pushing to do the same to the environment, and eventually to humans themselves.
Pricing data
The price of a good is a function of supply and demand. Demand is a function of utility and total cost of ownership. In order to estimate this price with any reasonable precision, we should first understand what data we are talking about. Not all data is created equal. Data comes in different shapes, sizes and frequencies. We can categorize data based on different attributes. I will list a few examples here, followed by a small sketch of how these attributes might be represented:
- Scarcity: How unique and hard to find or gather the data is.
- Velocity: How fast the data moves.
- Essentiality: How important this type of data is for the next steps in the reaction chain.
- Minimum Volume: How much data is needed before the engine can start the chain reaction.
- Reproducibility: Is the data reproducible (i.e., can it be synthesized) or does it naturally recur?
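As a rough illustration, these attributes can be thought of as a small data structure. The following is a minimal sketch; the class name, fields and the 0-to-1 normalization are my own assumptions for illustration, not something defined in this article.

```python
from dataclasses import dataclass


@dataclass
class DataAssetProfile:
    """Hypothetical profile of a data batch; each attribute is
    normalized to the range [0, 1] purely for illustration."""
    scarcity: float         # how unique / hard to gather the data is
    velocity: float         # how fast the data moves (heartbeats vs. capital cities)
    essentiality: float     # how critical it is for the next step in the chain
    reproducibility: float  # close to 1.0 if it can be synthesized or will recur
    minimum_volume: int     # records needed before the chain reaction can start


# Example: a stream of personal activity data collected across many domains.
personal_activity = DataAssetProfile(
    scarcity=0.9, velocity=0.8, essentiality=0.7,
    reproducibility=0.1, minimum_volume=100_000,
)
```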
Other features of the data, like volume or compliance requirements, can also be considered. Based on the list above, we can say that personal data is scarce because individuals are unique. What counts as personal? The scarce parts: a genetic code or a home address, for example. Individual actions in themselves are not scarce, as I am surely not the only one who bought that book or went to that movie. But a collection of my activities across different domains and temporal dimensions is something I would consider scarce and personal.
Data moves: some data moves slowly, some moves fast, and some doesn’t move at all. Your heartbeat, the number of cars driving the 101 south, or the number of tweets per second are all examples of fast-moving data. Slow-moving data includes things like the members of a parliament that only holds an election every two years, the number of highways connecting Berlin and Hamburg, or the capital of Nepal. They can change, but they don’t change frequently. Other data is static: the GPS coordinates of the Great Pyramid of Giza, for example.
Essentiality covers the importance of a piece or a batch of data for enabling the next step in the chain reaction. This is fairly domain-specific: if the chain ends with a food recommendation, knowing any allergies the person may have is essential; if the chain ends with a segmentation of customers, knowing their income may be essential.
Reproducibility answers the question of whether we could synthesize the data ourselves or would have to wait for it to recur. If, for example, we are observing a complex and rare natural phenomenon that we have no way of simulating, that would be a non-reproducible data set. On the other hand, it is easy to generate simple sentences given a large corpus of text to sample from. Generally speaking, the more temporal significance a data set has, the less reproducible it will be, as we have yet to master the manipulation of time.
The significance of the minimum volume will become clearer later, and in Part II of this article. For some information chains, the reaction can be kicked off with a low volume of data. Other chains depend on massive volumes to ignite the machine.
It is reasonable to suggest that the price of a data batch or set is directly proportional to scarcity, velocity and essentiality: the higher these attributes, the more expensive the data. The price of data is inversely proportional to reproducibility. The relation to reproducibility is hopefully intuitive: if we can reproduce the data ourselves, or it will inevitably recur naturally, the price should be low.
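To make the direction of these relationships concrete, here is a minimal toy sketch, assuming each attribute could be normalized to a value between 0 and 1. The function name, the multiplicative form and the example numbers are hypothetical; this only illustrates which way each attribute pushes the price, not an actual pricing model.

```python
def toy_price_score(scarcity: float,
                    velocity: float,
                    essentiality: float,
                    reproducibility: float,
                    base_price: float = 1.0) -> float:
    """Illustrative only: the score grows with scarcity, velocity and
    essentiality, and shrinks as reproducibility approaches 1.
    All attributes are assumed to be normalized to [0, 1]."""
    upward_pressure = scarcity * velocity * essentiality
    # A fully reproducible data set (reproducibility == 1) is heavily discounted.
    discount = 1.0 - reproducibility
    return base_price * upward_pressure * discount


# Scarce, fast-moving, essential, hard-to-reproduce data scores far higher
# than an abundant, static, easily reproduced set.
print(toy_price_score(0.9, 0.8, 0.9, 0.1))  # ~0.58
print(toy_price_score(0.2, 0.1, 0.3, 0.9))  # ~0.0006
```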
The impact of the required minimum volume (RMV from now on) on price is a bit trickier. There are game-theoretic approaches we can employ here, and we will in Part II, to understand why. Intuitively, we might think that if more of something is needed, its price will be higher. However, we should not confuse RMV with demand. Demand is a genuine force based on a need to acquire something. RMV is a restriction: it means that in order to get anything useful out of the chain reaction, a certain amount of data is required. If there are no guarantees that the required amount is available, there is a risk in acquiring subsets of that data individually. So the relationship between RMV and the price of a data asset depends to a great degree on the other factors. If the data is scarce, moves fast and has a high RMV, then it is irrational to pay a high price unless the transaction guarantees that the RMV is met.
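One way to picture this, under the same toy assumptions as above, is to discount what a buyer is willing to pay by the probability that the RMV will actually be reached. The function and the probabilities below are hypothetical illustrations of that reasoning, not part of the original argument.

```python
def rmv_adjusted_offer(standalone_score: float, prob_rmv_met: float) -> float:
    """Illustrative only: a rational buyer discounts what they are willing
    to pay for a partial data set by the probability that the required
    minimum volume (RMV) will eventually be met."""
    return standalone_score * prob_rmv_met


# A score for scarce, fast-moving, essential data (e.g. the ~0.58 case from
# the sketch above), offered with and without a guarantee of meeting the RMV.
score = 0.58
print(rmv_adjusted_offer(score, prob_rmv_met=0.2))  # ~0.116: low offer despite valuable data
print(rmv_adjusted_offer(score, prob_rmv_met=1.0))  # 0.58: a guaranteed RMV restores full value
```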
Okay, so now we can move on to supply and demand. Who or what will produce and sell the data? Who or what will consume or buy it? Before we get to that, we have to ask the multi-billion-dollar questions: Why would anyone want to buy data? What value do they expect out of it? Why now?
“Why would anyone want to buy data? What value do they expect out of it? Why now?”
The answer, of course, is: Artificial Intelligence.
To be continued