Where Does Data Come From?

My favorite type of startup founder is a hacker. I don’t mean the nefarious, law-breaking kind of hacker; I mean the curious, experimental, builder type of hacker — literally one who makes rough cuts. Most of the ethos around startups speaks to this mentality: prototyping, reverse-engineering, testing, and iterating quickly. Hackers view any scientific or engineering advance as something that they might be able to remix to solve a new problem.

Applying machine learning and data science is one area where we could use more experimentalists and hackers. Research-driven advances have launched something of a Cambrian Explosion in machine learning, which accelerated in earnest in 2012. Since that time, progress has been made at a remarkable rate on almost every dimension of ML, from game-playing to image classification. While these breakthroughs have pushed the field forward extremely quickly, there is a growing gap between what has been done in research and what is being shipped in products. This is where hungry, experimentalist-minded startup founders have a chance to make a dent in the universe.

ImageNet visual classification challenge results — Source: Quartz, David Yanovsky

Machine learning tools are quickly getting us to the point where even newcomers to the field can build working models that do useful things. This post covers how to start collecting data and the first several steps towards making what you’ve collected useful/valuable, with the assumption that the reader is still forming their data strategy.

As with anything new, building a data pipeline is not going to be clean, perfect, or right the first time. But the process of experimenting with it (even before you think you’re ready) is one of the prerequisites of getting it right in the future, and might even teach you something about your product in the process.

Collecting raw data

Since data is the main driver for business intelligence, and most startups begin without data, collecting raw data is the logical place to start. Raw data is the same type of information your ML algorithm will receive in the wild. Once it's smart enough, your algorithm will be able to do something useful with it. While there are several sources for raw data, here are five of the most common in my experience. With an experimentalist mindset, don't be afraid to prototype by casting a wide net and trying several different sources of data.

1 — Public data

One of the largest sources of data is publicly available on the internet. It's the information age, and information is being collected and put online at an exponentially increasing rate. One example of using publicly available data to draw interesting insights is a project that uses Google Street View imagery to classify the types of cars on a street and predict demographic information and voting habits in different neighborhoods and districts. This approach can potentially save hundreds of millions of dollars over door-to-door demographic surveys by using "free" public street-view data. Other projects have leveraged public data by scraping Twitter for real-time sentiment analysis, at a cost almost negligible compared to legacy techniques.

Pros — Public data is already accessible, and if you're able to automate scraping it, a very small team can aggregate a large amount of data.

Cons — Everyone has access to the same public data, which potentially makes your model easier to replicate; that said, public data can still be a viable starting point.
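To make the Twitter example above concrete, here is a minimal sketch of the kind of near-zero-cost sentiment scoring that can run over scraped public text. The scraping step is assumed to have already happened, and the word lists are illustrative stand-ins for a real sentiment lexicon:

```python
# Tiny illustrative sentiment lexicon (a real one would be far larger).
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"terrible", "hate", "awful", "sad", "bad"}

def sentiment_score(text: str) -> int:
    """Count positive minus negative words in a piece of text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Score a batch of (already scraped) public posts.
tweets = [
    "I love this product it is great",
    "terrible experience awful support",
]
scores = [sentiment_score(t) for t in tweets]
```

Even a crude scorer like this can be aggregated over millions of posts for trend signals, which is exactly the cost advantage over legacy survey techniques.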

2 — Data from an existing product

A previous Lever blog post discussed an example in which Google Forms transitioned from being static (the user types a question and then manually selects the question type) to being dynamic (able to predict which question type the user wants based on the inputted question text). The use of the static forms product generated the data to build the "recommender" system.

Another example is a car with ADAS (advanced driver assistance system) being used to collect data to (eventually) train self-driving models.

Pros — This is an ideal form of data acquisition, as the existing distribution/customer base serves to kickstart the data flywheel.

Cons — This type of data acquisition strategy is less relevant for a startup pre-launch, though it is a high-leverage tool for integrating ML into an existing product.

3 — Human-in-the-loop

One of the most successful ways startups begin their data acquisition process is by manually supervising/controlling their pseudo-intelligent system in the real world. Take an autonomous sidewalk delivery robot, for example. The long-term goal may be for the robot to operate 99% autonomously, but before the robot has any data, it may only be 10% autonomous and spend 90% of the time with a human in the loop (walking behind the robot or controlling it through telepresence). In this case, the percentage of automation can be increased over time as the robot constantly collects real-world data.

The Wizard of Oz supervising his pseudo autonomous system

Pros — The startup is both collecting proprietary data and (ideally) exercising/prototyping the other parts of their business (like integrating with the restaurant/grocery store, interfacing with the delivery recipient, etc.)

Cons — This is an expensive and potentially slow way to build up a dataset.
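The ramp from 10% to 99% autonomy described above boils down to a confidence-gated fallback. This is a toy sketch, not any particular robot's control stack; the `predict` interface returning an (action, confidence) pair is an assumption for illustration:

```python
from typing import Callable, Tuple

# Escalated cases become labeled (observation, human_action) training pairs.
training_log: list = []

def run_step(predict: Callable[[dict], Tuple[str, float]],
             ask_human: Callable[[dict], str],
             observation: dict,
             threshold: float = 0.9) -> Tuple[str, bool]:
    """Act autonomously when confident; otherwise escalate and log the label."""
    action, confidence = predict(observation)
    if confidence >= threshold:
        return action, True
    # Below threshold: fall back to the human operator and keep the
    # decision as fresh training data.
    human_action = ask_human(observation)
    training_log.append((observation, human_action))
    return human_action, False
```

Every escalation both keeps the system safe today and becomes a labeled example for tomorrow's retraining, which is what lets the autonomous fraction climb over time.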

4 — Brute force

The fourth way of acquiring raw data is called "brute force," which is similar to the "human-in-the-loop" method except that the sole purpose of the activity is collecting data. One example of this is survey vehicles driving millions of miles to collect data which can then be used to create HD maps, and utilized for training later. Another example is most autonomous vehicle programs, which have long development times and an extensive need for data collection before being able to generate revenue.

Pros — This form of data collection is generally the hardest for a competitor to replicate. It also affords closer control over the data collection process and parameters (resolution, time series, etc.).

Cons — Time and money, neither of which most startups have much of. Another consideration is that, as technology compounds over time, painstakingly acquired data may become easier to collect in the future.

5 — Buying the data

The final way of starting raw data acquisition is to simply buy the data. Numerous platforms are available which broker data (from satellite imagery to legal data). While this may be a reasonable short-term solution, the main risks of relying on purchased data are having a single point of failure if the data stops being available and tailoring the model-building to a specific type of data collection (covered more later in the post).

Pros — Fast, can be relatively complete as a starting point.

Cons — Price, lack of control over supply, and lack of control over important variables (e.g. camera resolution/sensor type) which may affect the model downstream/longer term.

Preparing Raw Data

Raw data itself (e.g. lidar point clouds, images, and audio snippets of spoken language) isn’t, on its own, particularly helpful for making an intelligent system. The step between raw data collection and model building is data preparation. While there are some nuances associated with data preparation, I’ll cover three high-level components: filtering impurities, merging data from different sources, and data labeling/annotation.

The first step in data preparation is filtering impurities (e.g. lens flare, poor lighting conditions, or corrupted data). Regardless of the source, the data that you have will contain mistakes and outliers. The impurities can be sorted out manually, or by using automated tools to identify and separate them from the main training set.

These impurities shouldn’t be discarded since they will contain valuable information about how your model might react to these outlier events in the real world (e.g. lens flare on an autonomous vehicle’s camera), but it is good practice to keep them separate from the rest of the training data at first.
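A minimal sketch of this practice, separating rather than discarding impure samples. The brightness heuristic below is a made-up stand-in; real impurity filters are task-specific:

```python
def split_impurities(samples, is_impure):
    """Partition samples into (clean, impure) without throwing anything away."""
    clean, impure = [], []
    for s in samples:
        (impure if is_impure(s) else clean).append(s)
    return clean, impure

# Example: flag frames whose mean brightness suggests flare or corruption.
frames = [{"id": 1, "mean_brightness": 120},
          {"id": 2, "mean_brightness": 254},   # washed out: likely lens flare
          {"id": 3, "mean_brightness": 2}]     # nearly black: likely corrupt
clean, flagged = split_impurities(
    frames, lambda f: not (10 <= f["mean_brightness"] <= 245))
```

The `flagged` set stays available for later stress-testing of the model against exactly these outlier events.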

Another challenging component of data refinement is merging data from different sources. As an example, if you were building a self-driving car and stumbled upon a windfall of new image data, there may be challenges in merging these images with the existing corpus of data that you have. These new images may have a different resolution, different contrast, or any number of other differences which could create difficulties for model-building. These differences can often be mitigated through software/post-processing.
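As a sketch of what such post-processing might look like, the snippet below rescales simple per-image statistics so that two differently exposed sources land on a common scale. Real pipelines would also handle resolution, color space, and more; the target values here are arbitrary:

```python
def normalize(pixels, target_mean=128.0, target_std=50.0):
    """Rescale a flat list of pixel values to a target mean and std."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = var ** 0.5 or 1.0  # guard against a flat (zero-variance) image
    return [(p - mean) / std * target_std + target_mean for p in pixels]

# Two "sources" with very different exposure merge onto a common scale.
source_a = normalize([10, 20, 30])      # underexposed corpus
source_b = normalize([200, 220, 240])   # overexposed corpus
```

After normalization, both sources share the same mean and spread, removing one systematic difference before the merged corpus is used for training.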

Labeling (or annotating) is usually the most time consuming step in data preparation. Labeling allows you to ascribe meaning/categories to the data you’ve collected that your model can understand. If your data is a set of flashcards for your model to use to get smarter, raw data (audio snippets, sensor data) goes on the front of the flashcard and labels (what was said, the relevant state of the system) go on the back of the flashcards. In other words, the label is what you want your algorithm to be able to predict in the wild. You can think of labeling as being done in layers of value, which correspond both to how useful the labels are for building the model, and how costly it is to apply the labels.

As an example, let’s look back at the case of labeling data for an autonomous car which operates predominantly on camera data. The relevant (raw) data might be images as seen from the cameras mounted on the car. The “layers of value” or labels that can be added to these images have a variety of purposes in helping to build the policies that govern the self driving car’s actions. Bounding boxes drawn around cars in the frame can be used to build a model which identifies other cars on the road. This is a relatively simple “layer of value” to be added to the raw image file (it might only take a few seconds for an experienced annotator to draw bounding boxes around the cars).

Bounding box annotations around vehicles on the road — Source: Mighty.AI

A second, more intensive form of labeling might be to do a pixel-wise or semantic segmentation of the different objects in the image. What this means in practice is labeling every individual pixel as something like “pedestrian”, “parked car” or “road”. This form of labeling might take an experienced annotator several minutes (cost more), but the data itself (image + object label on every pixel) also has more value, since higher-fidelity data generally makes it easier to train a model to do various tasks.

Semantic segmentation of an image of a road — Source: Mighty.AI
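The flashcard framing and the bounding-box "layer of value" above can be captured in a tiny data model. The field names are illustrative, not any standard annotation format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoundingBox:
    label: str                       # e.g. "car"
    xyxy: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class LabeledImage:
    image_path: str  # front of the flashcard: the raw data
    boxes: List[BoundingBox] = field(default_factory=list)  # the back: labels

example = LabeledImage("frames/000123.jpg",
                       [BoundingBox("car", (40, 60, 180, 140))])
```

A pixel-wise segmentation would simply be a richer (and costlier to produce) structure on the back of the same flashcard.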

Annotating Data

Now that we’ve covered a brief overview of “what” annotating is, let’s cover the “how”. When it comes to labeling relevant features and removing impurities, there isn’t a single best solution for all projects. There are several options available, each with their own costs and benefits.

1 — External annotation service providers

There are a wide variety of service providers who have either dedicated annotation teams, or enterprise tools for annotation and labeling. Some of the most popular providers of these types of services are Mechanical Turk, Figure Eight, Mighty AI, and Playment.

Pros — Ability to scale up or down as annotation needs change; generally cost-effective when the dataset is small.

Cons — It can be hard to teach external groups the nuances of your data (is that a French bulldog or a pug?). Sensitive data (e.g. personal health information) might also limit your options for external data processing.

2 — Internal annotation team, or (very sparingly) engineering team

This is the in-house version of using an external supplier. If a startup chooses to build an internal team, the training, tools, and workflows are usually built by the engineering team to support the annotation efforts. A good rule of thumb for startups is to do every process manually until it becomes too tedious, then either stop doing it or automate it.

Pros — Precise control of the annotation process, and the ability to make real-time adjustments or tweak filtering parameters. Allows teams to build up automation tools to make annotation more efficient in the future.

Cons — Expensive, and harder to scale (you need to bring in and educate more annotators). Engineers are very expensive annotators, and building tools for an internal annotation team also carries engineering overhead.

3 — Getting users to generate the labels

There is a subset of data labeling tasks which users can do themselves, though it shouldn’t be at the expense of user experience (your customers are not your annotation team). A good example of getting users to generate labels is an autocomplete function, since use of the feature generates the training data to make the feature more robust. There are still quality control steps that should be taken, but the heavy lifting of building the annotated data set is built (in a frictionless way) into the product itself.

Pros — This is the lowest-overhead option for a resource-constrained company to deploy, since the majority of the heavy lifting of annotation is done by the users. The startup only needs enough overhead to maintain quality control.

Cons — It can add friction or frustration to the user experience (a voice assistant asking "did you mean…" and then offering something you didn't mean), offers less control over the resulting labels, and is potentially exposed to adversarial actors who may want to force your model to take undesirable actions.
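To illustrate the autocomplete example above: when a user accepts a suggestion, the (input, label) pair can be logged as a free training example. The event shape here is an assumption about how a product might log usage:

```python
def record_autocomplete_event(events, prefix, suggestion, accepted):
    """Only accepted suggestions become (input, label) training pairs."""
    if accepted:
        events.append({"input": prefix, "label": suggestion})
    return events

training_pairs: list = []
# Accepted suggestion -> frictionless label; rejected one -> discarded.
record_autocomplete_event(training_pairs, "san fr", "san francisco", True)
record_autocomplete_event(training_pairs, "new y", "new jersey", False)
```

In practice you would still run quality-control checks over the logged pairs before training, per the caveat above.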

4 — Tools to speed up annotation

There are several tools available to speed up the annotation process:

  • Software tools — A GUI with an efficient UX for importing and annotating new raw data can increase the efficiency of the team, and make it easier to train additional annotators in the future.
  • Supervised prediction — This technique relies on building a model to automate the labeling process. An example is an ML model that predicts bounding boxes on images of cars based on lots of examples of the same task done by humans. This type of system recommends/predicts the annotation (draws the bounding box), and prompts the annotator to correct any mistakes.
  • Unsupervised grouping — Several technologies exist which might be able to automatically cluster similar conditions together, which may make the process faster or easier for the annotation team.
  • Transfer learning — If you’re fortunate enough to have an existing product line with existing data, there may be opportunities to leverage some of the models/tools that were used to generate those other datasets or models. Public data is also a potential source for transfer learning.
  • Active Learning — Building a model that predicts which is the highest-value data to be annotated, so that these data can be prioritized.
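As a concrete sketch of the active learning bullet, the snippet below ranks unlabeled items by the classic least-confidence score so annotators see the most informative examples first. The probabilities are stand-ins for a real model's outputs:

```python
def least_confidence(probs):
    """Uncertainty = 1 - (max class probability)."""
    return 1.0 - max(probs)

def prioritize(unlabeled):
    """Sort (item, probs) pairs, most uncertain first."""
    return sorted(unlabeled, key=lambda x: least_confidence(x[1]), reverse=True)

queue = prioritize([
    ("img_a", [0.98, 0.02]),   # model is confident: low priority
    ("img_b", [0.55, 0.45]),   # model is unsure: label this first
    ("img_c", [0.80, 0.20]),
])
```

Annotator time then flows to the examples the model can learn the most from, rather than being spent uniformly across the dataset.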

Start Experimenting

While this piece is by no means comprehensive (we haven’t even touched on building the actual model), I tried to make building a data pipeline feel a bit more like playing in a sandbox rather than reading a research paper. When building a startup, a bias towards action is almost always a good thing, and this case is no different. The process of getting your hands dirty with the data will not only help you iterate faster on your data pipeline, it will probably also help you better understand the real value you are providing to your customers.

As you set off on this journey, keep an experimentalist’s mindset and always be curious, testing assumptions, and thinking through other ways of solving the problem at hand. There is a strong tailwind behind you of smart people making the tools even more accessible than they are today, but you can only take advantage of these tools if you take the first step.


Adam Kell is a former founder and recovering VC. He is consulting on a number of applied ML projects, including The Lever. He is passionate about go-to-market strategy for early stage technology companies. He is also a Google Developers Launchpad mentor.
