The real promise of big data
Note: A version of this piece was published on VentureBeat
Current “big data” and “API-ification” trends can trace their roots to a distinction Kant introduced in the 18th century. In his Critique of Pure Reason, Kant separated analytic truths from synthetic truths.
An analytic truth was one that could be derived from a logical argument, given an underlying model or axiomatization of the objects the statement referred to. Given the rules of arithmetic we can say “2+2=4” without putting two of something next to two of something else and counting a total of four.
A synthetic truth, on the other hand, was a statement whose correctness could not be determined without access to empirical evidence or external data. Without empirical data, I can’t reason that adding five inbound links to my webpage will increase the number of unique visitors by 32%.
In this vein, the rise of big data and the proliferation of programmatic interfaces to new fields and industries have shifted the manner in which we solve problems. Fundamentally, we’ve shifted from creating novel analytic models and deducing new findings, to creating the infrastructure and capabilities to solve the same problems through synthetic means.
Until recently, we used analytical reasoning to drive scientific and technological advancements. Our emphasis was either 1) to create new axioms and models, or 2) to use pre-existing models to derive new statements and outcomes.
In mathematics, our greatest achievements were made when mathematicians had “aha!” moments that led to new axioms or new proofs derived from preexisting rules. In physics, we focused on finding new laws, from which we derived new knowledge and know-how. In the computational sciences, we developed new models of computation from which we were able to derive new statements about the very nature of what was computable.
The relatively recent development of computer systems and networks has induced a shift from analytic to synthetic innovation.
For instance, how we seek to understand the “physics” of the web is very different from how we seek to understand the physics of quarks or strings. In web ranking, scientists don’t attempt to discover axioms on the connectivity of links and pages from which to then derive theorems for better search. Rather, they take a synthetic approach, collecting and synthesizing previous click streams and link data to predict what future users will want to see.
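As a minimal sketch of what “synthetic” ranking can look like, the snippet below orders pages for a query purely by how often past users clicked them, with no model of link structure at all. Everything in it is an assumption made for illustration: the log format, the example URLs, the function name `rank_by_click_rate`, and the simple additive smoothing. It is not how any real search engine ranks results.

```python
from collections import defaultdict


def rank_by_click_rate(click_log, prior_clicks=1.0, prior_views=2.0):
    """Rank pages per query by smoothed historical click-through rate.

    click_log is an iterable of (query, page, clicked) tuples.
    The priors are illustrative smoothing constants so that pages
    with very few impressions are not ranked on noise alone.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    for query, page, clicked in click_log:
        views[(query, page)] += 1
        if clicked:
            clicks[(query, page)] += 1

    # Score each (query, page) pair by its smoothed click-through rate.
    scores = defaultdict(dict)
    for (query, page), n_views in views.items():
        n_clicks = clicks[(query, page)]
        scores[query][page] = (n_clicks + prior_clicks) / (n_views + prior_views)

    # Highest observed rate first: past behavior predicts future interest.
    return {
        query: sorted(pages, key=pages.get, reverse=True)
        for query, pages in scores.items()
    }


# Hypothetical log: (query, page shown, whether the user clicked).
log = [
    ("kant", "a.example/critique", True),
    ("kant", "a.example/critique", True),
    ("kant", "b.example/bio", False),
    ("kant", "b.example/bio", True),
    ("kant", "c.example/shop", False),
]
ranking = rank_by_click_rate(log)
print(ranking["kant"])
```

Nothing here is derived from axioms about the web. The ranking changes only when the data changes, which is the sense in which the approach is synthetic rather than analytic.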
Likewise at Amazon, there are no “Laws of e-commerce” governing who buys what and how consumers act. Instead, we remove ourselves from the burden of fundamentally unearthing and understanding a structure (or even positing the existence of such a structure) and use data from previous events to optimize for future events.
Google and Amazon serve as early examples of the shift from analytic to synthetic problem solving because their products are built on data that exists in a digital medium. Everything from the creation of data, to the storage of data, and finally to the interfaces that scientists use to interact with data is digitized and automated.
Early pioneers in data science and infrastructure developed high-throughput, low-latency architectures to distance themselves from hard-to-time, “step function”-driven analytic insights and instead produce gradual but predictable synthetic innovation and insight.
Before we can apply synthetic methodologies to new fields, two infrastructural steps must occur:
1. the underlying data must exist in digital form
2. the stack from the data to the scientist and back to the data must be automated
That is, we must automate both the input and output processes.
Concerning the first, we’re currently seeing an aggressive push to digitize new datasets. An Innovation Endeavors company, Estimote, exemplifies this trend. Using Bluetooth 4.0, Estimote is now collecting user-specific physical data in well-defined microenvironments. Applying this to commerce, they’re building Amazon-esque data for brick-and-mortar retailers.
Tangibly, we’re not far from a day when our smartphones automatically direct us, in store, to items we previously viewed online.
Similarly, every team in the NBA has adopted SportVU cameras to track the location of each player (and the ball) 25 times per second. With this we’re already seeing the collapse of previous analytic models. A friend, Muthu Alagappan, recently received press coverage when he questioned and deconstructed the assumption that basketball has five distinct position types. What data did we have to back up our assumption that basketball was inherently structured with five player types? Where did these assumptions come from? How correct were they? Similarly, the Houston Rockets have put traditional ball control ideology to rest by successfully launching record numbers of three-point attempts.
Finally, in economics, we’re no longer relying on flawed traditional microeconomic axioms to deduce macroeconomic theories and predictions. Instead we’re seeing econometrics play an ever-increasing role in the practice and study of economics.
Tangentially, the recent surge in digital currencies can be seen as a corollary to this trend. In effect, Bitcoin might represent the early innings of an entirely digitized financial system where the base financial nuggets that we interact with exist fundamentally in digital form.
We’re seeing great emphasis not only on collecting new data, but also on storing it and automating the ability to act on it. In the Valley we joke about how the term “big data” is loosely thrown around. It may make more sense to view “big data” not in terms of data size or database type, but rather as a necessary infrastructural evolution as we shift from analytic to synthetic problem solving.
Big data isn’t meaningful alone; rather it’s a byproduct and a means to an end as we change how we solve problems.
The re-emergence of BioTech, or BioTech 2.0, is a great example of innovation in automating procedures on top of newly procured datasets. Companies like Transcriptic are building robotic, fully automated wet labs, while TeselaGen and Genome Compiler are providing CAD and CAM tools for biologists. We aren’t far from a day when biologists are fully removed from pipettes and traditional lab work. The next generation of biologists may well use programmatic interfaces and abstracted models as computational biology envelops the entirety of biology, turning what has traditionally been an analytic, truth-seeking expedition into a high-throughput, low-latency synthetic data science.
Fundamentally, we’re seeing a shift in how we approach problems. By removing ourselves from the intellectual and perhaps philosophical burden of positing structures and axioms, we no longer rely on step-function-driven analytical insights. Rather, we’re seeing widespread infrastructural innovation to accelerate the adoption of synthetic problem solving.
Traditionally these techniques were constrained to sub-domains of computer science — artificial intelligence and information retrieval come to mind as tangible examples — but as we digitize new data sets and build necessary automation on top of them, we can employ synthetic applications in entirely new fields.
In his 2011 essay, Marc Andreessen famously argued that “software is eating the world.” However, as we dig deeper and better understand the nature of software, APIs, and big data, it becomes clear that software alone will not eat the world. Rather, it is software combined with digital data sets and automated input and output mechanisms that will do so, as data science, automation, and software join forces to transform our problem solving capabilities from analytic to synthetic.