Big Data Strikes Back: The Rise of Image, Sensor, and Genome Data Powering Deep Vertical Startups

Every two years we create ten times more data than all of preceding human history — 90% of the world’s data is less than two years old. If anything, this pace is going to increase. And, the unique confluence of algorithm innovation in machine learning and low-cost data acquisition can provide material advantage in both compute and the businesses that depend on it.

As Alon Halevy, Peter Norvig, and Fernando Pereira from Google point out in “The Unreasonable Effectiveness of Data”, data alone is enough to get great performance out of many kinds of simple machine learning models. By way of example, Tesla and Google and Uber aren’t racing to “learn” every mile of US roads to win some sort of jolly Jules Verne bet — whoever gets there first will have a massive advantage in self-driving cars.

Aside from transportation, Image/Video, IoT, and Genome are three of the most interesting classes of fast-growing data for disrupting large industries.

Annual global IP traffic will pass the zettabyte threshold by the end of 2016 and pass 2 zettabytes by 2020. By 2020 80% of IP traffic will be video. Youtube is the largest source of new online data in the world, generating over 100 petabytes of new data every year. Offline image/video data may be roughly 5x the size of online image/video data. When you add image/video data from space platforms (like DCVC company Planet Labs), cessnas, drones (run by coordination platforms like DCVC company Dronedeploy), robotic high speed medical imaging and pathology systems (like DCVC company 3Scan), video from autonomous vehicles, increasingly ubiquitous security systems, and VR/AR, it is easy to see why computer vision is the hottest field in AI right now.

Image data from our portfolio company, © 2014 Planet Labs Inc. Landsat, July 2013

By 2020, 10% of the world’s data will be IoT data — sensor data generated by machines. Machine learning applied to IoT data will be used to optimize manufacturing lines and processes (as DCVC companies SigOpt, Citrine, and Rescale do today), help tune supply chains in real time (as DCVC company Tradeshift is already doing), create smart cities with less traffic/pollution and more reliable services, reduce energy consumption (as Dr. Yoky Matsuoka’s algorithms enabled NEST do) and provide greater security in homes, power autonomous vehicles in both industrial (e.g., mining, construction, trucking) and consumer segments, and make medical processes safer, less expensive, and more reliable (as several of our stealth companies are doing today).

By 2025, DNA and related “*omics” (e.g., RNA, proteins) sequencing will generate between 2 and 40 exabytes a year, and will have surpassed Youtube as the largest source of new of data in the world. It’s a reasonable guess that if the hottest field in AI today is computer vision, then in 2025 it will be bioinformatics, and it’s an equally reasonable guess that 2025 is too conservative an estimate and that this epiphany occurs in 2020. Already, DCVC companies that combine a unique machine learning and sequencing data advantage are changing drug discovery (Atomwise, Capella Bio), diagnostics (Karius, Molecular Stethoscope, Cofactor Genomics), and even the nature of sequencing itself (Omniome)