Vertical Beats Horizontal in Machine Learning

The best products in the world are made by vertically integrated businesses: Apple’s hardware to software; Amazon’s warehouses to websites; and Carnegie’s mines to mills [1].

Zetta is completely focused on investing in data and machine learning startups. We see lots of horizontal platforms and APIs that anyone can use to add some machine learning models to their application. However, machine learning has advanced to the point where customers expect better-than-commodity performance. We like to see startups vertically integrating their technical skills with the skills of domain experts and unique data acquisition to build applications with the level of accuracy required in commercial and industrial settings.

This article will describe the state of machine learning, focusing on the importance of domain expertise in feature development and labeling data when building high-accuracy models. We will then explore ways in which startups can get the requisite domain expertise and labeled data to build such models. Finally, we will consider some of the challenges in working with customers to develop software based on such models. We won’t teach ML practitioners anything new (they may want to skip the next section), but we do hope you benefit from the perspective we’ve gained by seeing thousands of startups and talking to hundreds of customers, and come away understanding a little more about how we think.

Background on Supervised Machine Learning

Machine learning (ML) makes computers intelligent; it enables computers to do many things humans can, but at scale, and even many things humans cannot. Developing ML models requires a lot of up-front human effort. Much of this effort goes into figuring out and programming the features an ML model can use to identify things. In ML (and statistics), a ‘feature’ is a distinct, quantifiable property of the thing you’re trying to predict.

Sometimes one can just figure out some useful features without much help. For example, we all know that the pupil is a feature of an eye and that it is a black circle. So, when writing an ML model to identify eyes, we would include a feature that is activated when a geometrically circular, black (hexadecimal value #000000) group of pixels is in a given image.
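A hand-coded feature like this can be sketched in a few lines. The example below is purely illustrative (the function name, the synthetic image, and the darkness threshold are all invented for this sketch): the ‘feature’ fires in proportion to how many near-black pixels fall inside a circular region.

```python
import numpy as np

def pupil_feature(image, center, radius, darkness_threshold=30):
    """Toy hand-engineered feature: the fraction of near-black pixels
    inside a circular region of a grayscale image (values 0-255).
    A value near 1.0 suggests a dark circle, such as a pupil."""
    h, w = image.shape
    ys, xs = np.ogrid[:h, :w]
    mask = (ys - center[0]) ** 2 + (xs - center[1]) ** 2 <= radius ** 2
    return float(np.mean(image[mask] < darkness_threshold))

# A synthetic 20x20 image: white background with a dark disc in the middle.
img = np.full((20, 20), 255, dtype=np.uint8)
ys, xs = np.ogrid[:20, :20]
img[(ys - 10) ** 2 + (xs - 10) ** 2 <= 25] = 0

print(pupil_feature(img, center=(10, 10), radius=5))  # 1.0: a dark circle is there
print(pupil_feature(img, center=(0, 0), radius=3))    # 0.0: no dark circle here
```

Real pupil detectors are far more robust than this, of course; the point is only that a human decided, up-front, what the machine should look for.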

Other times, feature engineering will require the involvement of people with expertise in a particular domain. Everyone can identify a pupil, but not everyone can identify the fender of a 2001 Chevy Silverado, for example. To extend the example, figuring out whether that fender is damaged beyond repair or can be banged back into place requires knowing how to fix cars.

Other times again, one can’t distinguish the features of an object up-front, even if one can generally identify the object. Or, one can’t articulate the features in a way that a machine can generally understand when applying the model to noisy data. For example, I know a zebra when I see it but I couldn’t tell you the typical distance between black and white stripes that means it’s a zebra rather than another thing [2]. We need to know that distance if we want a machine to do lots of zebra recognizing. So, we give a machine a bunch of images and let it learn that typical distance.

Humans can supervise a machine’s feature learning by giving it a set of data that is labeled with inputs and outputs. For example, image x is a zebra and image y is not a zebra. The machine will then find a set of features that it notices in all the x images and not in the y images, in this case the typical distance between black and white stripes in the x images.
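Supervised learning can be shown in miniature. The sketch below (all numbers and names are invented for illustration, and real models learn many features at once, not one threshold) takes labeled examples of a single made-up ‘stripe spacing’ feature and learns the decision boundary that best separates zebras from non-zebras:

```python
def learn_threshold(examples):
    """From labeled (feature_value, label) pairs, pick the threshold that
    best separates positive from negative examples -- supervised learning
    reduced to its simplest possible form."""
    best_t, best_acc = None, -1.0
    for t in sorted(x for x, _ in examples):
        # Accuracy if we predict 'zebra' whenever the spacing is <= t.
        acc = sum((x <= t) == label for x, label in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Hypothetical labeled data: (stripe spacing in cm, is_zebra)
labeled = [(4.0, True), (5.0, True), (4.5, True),
           (9.0, False), (12.0, False), (10.5, False)]

t = learn_threshold(labeled)
print(f"learned spacing threshold: {t}")  # 5.0 separates the two groups
print(4.2 <= t)  # a new image with 4.2 cm spacing classifies as zebra-like
```

Nobody told the machine that 5.0 was the boundary; it delineated that from the labels alone.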

Machines can also just work their way through unlabeled data sets, completely unsupervised or semi-supervised. The machine will use various methods (programmed by people) to group the data, then conclude that certain groups are significant features of something. Many people around the world are making fantastic progress in unsupervised ML and it is undoubtedly a key way the field will move forward. However, we won’t discuss it here because neither the accuracy nor the tractability of unsupervised learning-based models is yet high enough for mission-critical commercial and industrial applications.

In summary, when writing ML models we need to define the features of the things we want to identify, and:

  1. sometimes those features are straightforward for anyone to identify;
  2. if not straightforward, we might bring in a domain expert to define some features; then
  3. if still not straightforward, we might write an algorithm to figure out some features by studying labeled things.

More salient, predictive features make an ML model more accurate, so one will often use all three of the steps above when developing a model. Given that number 1 is generic, we’ll now focus on numbers 2 and 3. With this background in place, we can move quickly through the rest of the article.

The Importance of Domain Experts

Despite the brilliant ways in which ML engineers solve problems every day, it’s fair to assume that any given ML engineer doesn’t know everything about everything. Thus, we often need domain experts to define features so that ML engineers can test the predictive value of those features.

Alec Ross’s new book, “The Industries of the Future”, includes a nice example of utilizing domain expertise to make software for cattle farmers: Pasture Meter uses grass measurements to recommend where farmers should send cows for feeding. Today, the recommendations are based on sensor readings of grass height. In the future, you can imagine the app predicting where one should herd cattle based on how long the cattle stayed on the paddock last time (i.e. how much grass they munched), obviating the need for expensive and unreliable sensors. The key piece of expert knowledge required to build such a predictive model is how much grass one head of cattle munches per day. One could derive that from first principles of bovine nutrition, but it would be more efficient and productive to get this information from someone who raises cattle for a living.
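The kind of predictive rule we’re describing might look like the sketch below. Every constant is an assumption invented for illustration, not real agronomy; the point is that the expert-supplied constant (grass eaten per head per day) is what makes the sensor-free prediction possible at all.

```python
# Assumed expert knowledge -- illustrative values only, not real agronomy.
GRASS_EATEN_PER_HEAD_PER_DAY_KG = 10.0
GRASS_REGROWTH_PER_HECTARE_PER_DAY_KG = 40.0

def estimated_grass_kg(initial_kg, hectares, herd_size, days_grazed, days_rested):
    """Predict available grass without sensors: subtract what the herd ate
    while it grazed, add back regrowth while the paddock rested."""
    eaten = herd_size * days_grazed * GRASS_EATEN_PER_HEAD_PER_DAY_KG
    regrown = hectares * days_rested * GRASS_REGROWTH_PER_HECTARE_PER_DAY_KG
    return max(0.0, initial_kg - eaten + regrown)

# 50 head grazed a 10-hectare paddock for 3 days; it then rested 14 days.
print(estimated_grass_kg(5000, 10, 50, 3, 14))  # 9100.0 kg estimated available
```

Without the per-head consumption figure, the whole model collapses; that single number is the domain expertise.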

Here’s a counterexample from Marvin Minsky in 1956 that shows you can’t develop a model of something without having some idea of the features that make up the model [3]:

“My friend Nat Rochester, of I.B.M., had been programming a neural-network model…on the I.B.M. 701 computer. His model had several hundred neurons, all connected to one another in some terrible way. I think it was his hope that if you gave the network some simultaneous stimuli it would develop some neurons that were sensitive to this coincidence. I don’t think he had anything specific in mind but was trying to discover correlations — something that could have been of profound importance. Nat would run the machine for a long time and then print out pages of data showing the state of the neural net. When he came to Dartmouth, he brought with him a cubic foot of these printouts. He said, ‘I am trying to see if anything is happening, but I can’t see anything.’ But if one didn’t know what to look for one might miss any evidence of self-organization of these nets, even if it did take place.”

The strength of supervised representation-learning algorithms is that you don’t need any domain knowledge up-front, so long as that knowledge is contained in the labels. We’ll talk about this in the next section.

The Need for Labeled Data

Where we don’t have domain experts to figure out the features of a thing up-front, or it’s just too hard to articulate those features in a way that a machine can generally understand, we can try getting the machine to learn from a labeled set of examples. That way, the machine can delineate the features that differ between examples of what you want to identify and examples of other, random things.

Labeling often requires engineers to use other data processing techniques before applying the labels. For example, it is very hard for a machine to figure out anything from a set of 1M customer service emails without any labels related to the products about which customers are complaining, categories of complaints, etc. In this case, an engineer will use something like Natural Language Processing to find the segments of these emails where customers talk about products, then clustering to build up categories, and then label the data with this information.
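A drastically simplified stand-in for that pipeline is sketched below: crude keyword matching in place of real NLP, and grouping by matched term in place of real clustering. The product lexicon, category map, and emails are all invented for illustration.

```python
from collections import defaultdict

# Assumed, hand-built lexicons -- in practice these would themselves be
# produced by NLP and clustering over the email corpus.
PRODUCTS = {"fender", "bumper", "windshield"}
CATEGORIES = {"cracked": "damage", "late": "shipping", "wrong": "fulfillment"}

def auto_label(email):
    """Derive a (product, complaint category) label for one raw email."""
    words = set(email.lower().split())
    product = next(iter(words & PRODUCTS), None)
    category = next((cat for w, cat in CATEGORIES.items() if w in words), None)
    return product, category

emails = [
    "My fender arrived cracked",
    "The bumper shipment is late",
    "You sent the wrong windshield",
]
labels = defaultdict(list)
for e in emails:
    product, category = auto_label(e)
    labels[category].append(product)
print(dict(labels))
# {'damage': ['fender'], 'shipping': ['bumper'], 'fulfillment': ['windshield']}
```

The output of a step like this, applied at scale and with real NLP, is what turns a pile of raw emails into a training set.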

Machines generally need lots of examples to delineate features. Getting lots of labeled examples for specific domains is hard. For example, where would you find 100,000 images of 2001 Chevy Silverado fenders and 100,000 images of fenders that are not from a 2001 Chevy Silverado? Leo Polovets nicely categorized some ways in which startups can build datasets [4]. However, crowdsourcing, running surveys or building a tool for car mechanics are not likely to yield 100,000 images of 2001 Chevy Silverado fenders in short order. One probably has to procure such data from a company that has been photographing cars for years, like a car manufacturer, chain of body shops or insurance company.

Incidentally, capturing this data can be the start of a competitive ‘moat’ around your business. We’ve seen that accessing and owning clean data to feed models can be the single hardest problem in starting a vertical, ML-based business.

There are some promising transfer and semi-supervised learning techniques that, we gather, may provide alternatives to gathering these domain-specific corpora of data, especially for generic domains such as image, video and language understanding. However, the state of the art doesn’t seem to offer enough just yet, and particularly not for specific domains.

Building a Business on Machine Learning

We started this article with a strong statement: the best products in the world are made by vertically integrated businesses. Now that we appreciate the importance of domain expertise and large, labeled data sets in building ML models, we can hopefully see that any startup based on such ML models needs to integrate its business with a cadre of domain experts and troves of labeled data. Ideally, that startup would have exclusive access to these things to prevent other companies from building ML models of similar accuracy.

Vertically integrating domain experts and data sources by paying salaries and licensing fees, respectively, is often too expensive. Rather, we’d suggest a sort of ‘virtual integration’. This was a term coined by Michael Dell to describe, “stitch[ing] together a business with partners that are treated as if they’re inside the company.” [5]

Bringing on Domain Experts

Ideally, the founders of a startup would have relevant domain expertise. However, it’s not always realistic to expect a 2–3 person founding team to incorporate domain experts in addition to the ML, software engineering and business talent required of any startup in this space. Startups can access domain expertise by:

  • working very closely with customers. Just listening to customers through a good customer development process will yield valuable insights from these domain experts on drivers of predictive models;
  • bringing on advisors with many years of experience in a particular domain;
  • hiring consultants; and
  • many other ways, including just reading about the first principles relevant to their domain.

Incidentally, domain experts can also:

  1. help to contextualize algorithms and analytics to the business case and domain;
  2. help to empathize with customers; and
  3. prevent you from acting on the inevitable nonsensical or spurious outputs of your model.

Acquiring Labeled Data

Ideally, a startup would have a trove of great data so that it can just get to feature engineering on Day Zero. However, data sets come in different degrees of volume and veracity. Often, the dataset one uses to build a proof of concept won’t be big enough or of a high enough quality to build a model with the level of accuracy required in commercial and industrial settings. Startups can access big, labeled datasets by:

  • working very closely with customers to find data they already have;
  • scouring the real world and the Web (including the Dark Web) for both free and paid data sets;
  • setting up a network of sensors that deliver some basic analytics to customers, collecting the data along the way;
  • building a domain-specific app that collects data; etc.

We’ve found that it’s nearly impossible to find data that is labeled adequately for training an ML algorithm, so there will also be a significant amount of labeling effort beyond the above. Tractable, a company we work with, is building some interesting technology to help with this labeling.

Challenges

We can see that a very likely source of both domain experts and labeled data is your potential customers. The startup has the ML, software engineering and product expertise but the customers have everything else. Some customers may thus ask for an economic interest in your company if they’re significantly and positively contributing to the development of your core technology. They will almost definitely ask if you’re planning to sell to their competitors. The cost/benefit of these asks is an important discussion to have with your team and investors because entering into such agreements can significantly affect the potential revenue and thus value of your company.

Customers may argue that your models only exist because of their data, so they should get a royalty from any use of your models. Often, negotiating away such a royalty requires educating the customer about the difficulty of building the models, infrastructure to run them, and other aspects of ML. Hopefully, they will see that data is an important but ultimately small contributor to the value of your startup. Matt Turck has some good options for startups that find themselves negotiating this with customers [6]:

a. Negotiating upfront, and in full disclosure, that while the data will strictly remain the property of the customer, the data learnings will be owned by the vendor;
b. A contributory model where the customer needs to join the “customer learning network” to benefit from what the product learned from all other customers;
c. Tiered pricing where the customer pays more if they decide not to join the “customer learning network”; and
d. Early in the life of the startup, targeting startups as prospective customers, as they tend to have a more progressive attitude towards those questions.

Finding a Market Through Focus

Startups find markets by specifically addressing customers’ needs. Startups can only specifically address customers’ needs with the power of ML if they build models that make accurate predictions. Building accurate models requires accurate input data and programming of features that are predictive in your customers’ domain. Thus, any company hoping to derive their competitive advantage from ML technology should figure out how to incorporate domain experts and form a data acquisition strategy on Day Zero.

Thanks

Thanks to Andrew Tulloch, Benjamin Hamner and Alexandre Dalyac for their comments on this post.

Footnotes

  1. A substantive discussion of vertical integration, starting with Coase’s Nature of the Firm is beyond the scope of this article but something about which we’re always interested in talking.
  2. Interestingly, some have theorized that zebras developed their alternating black and white stripes so that the heat difference between the adjacent spaces generates turbulence on the surface, making it hard for the disease-carrying flies to land and bite them. Source: http://www.nature.com/ncomms/2014/140401/ncomms4535/full/ncomms4535.html
  3. http://www.newyorker.com/magazine/1981/12/14/a-i
  4. http://codingvc.com/the-value-of-data-part-2-building-valuable-datasets
  5. https://hbr.org/1998/03/the-power-of-virtual-integration-an-interview-with-dell-computers-michael-dell
  6. http://mattturck.com/2016/01/04/the-power-of-data-network-effects/