Data is the new IP
Data is the new intellectual property.
Michael Lewis’ 2003 book Moneyball was an eye-opener regarding the power of data. Lewis chronicles how Billy Beane, general manager of baseball’s Oakland A’s, assembled a winning team with the league’s smallest payroll. He used statistical data to determine which players to add and which to let go in order to, in effect, build the most efficient team. In the process, he challenged widely believed myths about what makes a winning team. Fortunately for Beane, a statistician named Bill James had previously collected and organized baseball performance data over decades and published an annual statistics book, The Bill James Baseball Abstract. Would Beane have been able to do what he did if James had not already spent decades poring over baseball games? On the other hand, was it inevitable that someone, sooner or later, would have done what Beane did with James’ data?
Over the last ten years, data has been increasingly driving business decisions. Google and Facebook have used personal data to make online ads more targeted for digital denizens. Amazon, eBay and Yelp analyze personal recommendations and preferences to help buyers make better decisions about products and services. Netflix and Amazon use data to recommend content and products. In the next ten years, most if not all decisions will be aided — if not primarily driven — by data.
Using data to help make certain decisions is not new. Capital markets have used historical data to craft trading and investment strategies for decades. Bloomberg and Thomson Reuters have become large, successful companies by serving the financial services industry's hunger for constantly updated data. Until recently, though, the data used by the financial services industry was rudimentary, relating primarily to trading and the fundamental performance of companies, countries and various other entities. Over the last few years, hedge funds have started to pay for other types of seemingly unrelated data, such as satellite images, port activity and insurance claims.
Today data is beginning to be used to improve the performance of health care, education, logistics and a host of other industries. Collecting, cleansing, organizing and distributing constantly changing data sets is becoming an attractive investment opportunity.
In parallel, smartphones and other network-connected sensors and devices are generating vast amounts of data. We are likely to see a proliferation of data sources in the near future. This data needs to be cleansed, reviewed for relevance and organized. Delivered to the right applications at the right time, new data sources can generate significant economic and social value.
Companies in all sectors have organized their internal data since the computer revolution and used it for various business functions. What we are seeing now is increasing interest in data that is not generated inside an organization but that could help that organization and others make better decisions.
Technologies are being developed to process vast amounts of seemingly unrelated data quickly and effectively, and to derive valuable insights from it. Several factors are making that possible. First is the wider adoption of Hadoop and the associated MapReduce model, which provides a broad framework for working with massive amounts of data, popularly called "Big Data". New technologies that store and manage massive amounts of data are driving down the cost and improving the performance of data warehouses. Other technologies, often building on UC Berkeley's open-source project Spark, are making it possible for engineers to quickly turn new ideas and approaches into sophisticated algorithms that can derive meaningful insights from large stores of constantly changing data. And the final layer of data analysis, visualization, is changing, too.
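To make the map/reduce pattern concrete, here is a minimal word-count sketch using PySpark's RDD API; this is the canonical first example of the model rather than anything specific to the companies above, it assumes a local Spark installation, and the input file name is hypothetical.

```python
# Minimal map/reduce sketch in PySpark. Assumes Spark is installed locally;
# "events.log" is a hypothetical input file with one record per line.
from pyspark import SparkContext

sc = SparkContext("local", "wordcount-sketch")

counts = (
    sc.textFile("events.log")              # read raw, unstructured input
      .flatMap(lambda line: line.split())  # map: split each line into words
      .map(lambda word: (word, 1))         # map: pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)     # reduce: sum the counts per word
)

print(counts.take(10))  # a sample of (word, count) pairs
sc.stop()
```

The same pattern scales from a laptop to a cluster simply by changing the master setting, which is what made this framework attractive for "Big Data" workloads.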
With these new technologies, we are beginning to see the emergence of a data-driven economy. In this data-driven economy, what is more important: the data or the algorithms that derive insights from that data? In other words, Bill James or Billy Beane? Of course, you need both, but I would argue that data itself is the durable and proprietary asset. While one can always hire smart data scientists to develop new algorithms or improve existing ones, it is often impossible to acquire past data at a later time once you realize you need it.
The economic value of data itself is fundamentally different from software or algorithms. A new company can outdo existing players with better algorithms and software. However, collecting data takes time and often requires partnerships with other players. Past data — if it even exists — is often in isolated silos and owned by individual entities. Data often needs to be captured when it is available. For example, some companies have collected data on the efficacy of brand campaigns over the last several years. Anyone considering a new brand campaign might greatly benefit from this data. Relevant data would be difficult, if not impossible, to compile years later when it might be needed to feed an algorithm or train an Artificial Intelligence engine. Owning that data is the new IP in the era of data-driven decision-making.
In addition, studies have shown that as you collect more and more relevant data, simpler algorithms can produce reasonably good results; Halevy, Norvig and Pereira made a version of this argument in their 2009 essay "The Unreasonable Effectiveness of Data". Sophisticated algorithms are more important for producing good results when data is sparse.
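As a toy illustration of that trade-off (my sketch on synthetic data, not one of the studies cited above), the snippet below trains the same simple scikit-learn classifier on progressively larger slices of a dataset; accuracy climbs with data volume alone, with no added algorithmic sophistication.

```python
# Toy illustration: a simple model improves with more data alone.
# The dataset is synthetic and the sample sizes are arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for n in (100, 1_000, 10_000, 40_000):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:n], y_train[:n])  # same simple model, more data
    print(f"{n:>6} samples -> test accuracy {model.score(X_test, y_test):.3f}")
```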
To support data-driven decision-making, a number of startups have focused on data collection, cleansing, organization and distribution. The earlier startups collected data that enabled better marketing and targeting approaches. Now, all kinds of data are being collected and used by startups. Examples include satellite images that show the number of cars in retailers’ parking lots, weather data that might indicate the amount of irrigable land, email receipts that might point to trends in consumer spending, cancer treatment outcomes that might reveal improved treatment protocols, or traffic patterns that might help logistics companies improve deliveries.
Many observers have noted that deep learning and statistical methods can be misused to reach arbitrary conclusions. "Lies, damned lies and statistics" is an oft-quoted phrase. Done right, data science can produce huge benefits for society, but one must be aware of the dangers of misuse. Analysts have a tendency to select data they think is valuable and relevant and discard data they consider poor quality or irrelevant. This was amply demonstrated when most experts, using survey data, incorrectly predicted the outcomes of the Brexit vote and the 2016 US presidential election. It is not that their algorithms and methodologies were at fault; it is the data they used to make those predictions, which shows just how important it is to get the right data at the right time, and to avoid personal bias in selecting data sources.
As the economy comes to rely more and more on vast quantities of data, and startups as well as traditional companies find new ways to generate and use it, data has never been more valuable. Just as Billy Beane could not have built his team without Bill James' statistics, companies in this data-driven economy won't be able to take off without access to comprehensive, high-quality, up-to-date data. That is why owning the right data is becoming core intellectual property.