Venture Scenes | Take 6

Matt Castellini
Published in Venture Scenes · Mar 31, 2021

AI Data Annotation, A New Podcast, and Another Round Review

AI Investment Thesis: Data Annotation

Artificial Intelligence is an umbrella term that encompasses multiple areas of research and development. Within AI, the area of Machine Learning is driving a great deal of innovation today. Some researchers believe ML is the primary driver of all current innovations in the field of AI. Machine Learning is the subfield of AI that studies the process of teaching machines how to learn, automate, and ultimately perform tasks.


The “machine” referenced in Machine Learning is essentially a prediction algorithm that ingests large swaths of data. The machine must be taught what to do with this data, or what the “rules” of the prediction algorithm should be. Numerous related AI technologies utilize ML, including Computer Vision, Natural Language Processing, Driverless Cars, and Robotics. There are two core fields of Machine Learning: Supervised and Unsupervised Learning.


We will focus solely on Supervised Learning, largely because experts in the field such as Chicago Booth Professors Sendhil Mullainathan and Jens Ludwig have described supervised learning as the “workhorse of AI.” In the eyes of Mullainathan, Ludwig, and many other AI experts, supervised learning represents the ideal field for near-term innovations in AI. With Supervised Learning, we communicate with the algorithm through data, meticulously and repetitively teaching the algorithm what we want. The practice requires rigorous attention to detail surrounding the data and methods used to train algorithms. In the end, Supervised Learning algorithms will only be as effective as the human beings developing them.

The Four Parts of Supervised Learning Algorithms

Supervised learning algorithms have four essential parts according to Professors Mullainathan and Ludwig:

  1. A problem with an input and the desired output (characterized by X’s & Y’s)*
  2. A dataset that teaches the supervised learning algorithm the output we desire, characterized by Y (training data)*
  3. A training algorithm that learns on that data and creates…
  4. …a predictor algorithm (or deployment algorithm) that will then be deployed on a dataset the algorithm has not seen before

(*The X & Y Nomenclature used above is sourced directly from Professors Mullainathan and Ludwig. There are discrepancies between the nomenclature used at Booth and elsewhere. For the sake of our analysis, we will use the Booth nomenclature).

The goal of a supervised learning algorithm is to provide an accurate prediction of output for any given input. An elementary example is the Not Hotdog app, which is both an unforgettable plotline in Silicon Valley and a real app. The app uses computer vision powered by a supervised learning algorithm that was developed using training data that consisted of photos (inputs) of hot dogs and other objects. The developers trained the algorithm to recognize a hot dog in the image by labeling each image as a hot dog or not hot dog. The supervised learning algorithm then produces a prediction (not hot dog or hot dog) when it is fed an image it has never seen before.
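To make the four parts above concrete, here is a minimal, hypothetical sketch in Python. The feature vectors and labels are random placeholders standing in for real hot dog images and annotations, so the point is the structure of the pipeline, not the accuracy of the model.

```python
# A hedged sketch of the four parts of a supervised learning pipeline,
# using the "hot dog / not hot dog" task. All data is a random stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 1. The problem: inputs X (image features) and a desired output y
#    (1 = "hot dog", 0 = "not hot dog").
# 2. Training data: labeled examples the algorithm learns from. The numbers
#    here are random placeholders for real image features and labels.
n_train, n_features = 1000, 64
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.integers(0, 2, size=n_train)

# 3. A training algorithm that learns on that data...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. ...and a predictor algorithm deployed on images it has never seen before.
X_new = rng.normal(size=(5, n_features))
print(model.predict(X_new))   # e.g. [0 1 1 0 1]
```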

This is a simplistic algorithm, but nowadays impressive and computationally robust AI algorithms are table stakes, as noted by Andreessen Horowitz:

In the AI world, technical differentiation is harder to achieve. New model architectures are being developed mostly in open, academic settings. Reference implementations (pre-trained models) are available from open-source libraries, and model parameters can be optimized automatically.

The real skill in the field of AI development now revolves around “dataification” (a term coined by Booth Professor Sendhil Mullainathan). Dataification refers to the ability to translate the problem you are looking to solve into actionable training data that a supervised learning algorithm can utilize successfully.

The strength of any supervised learning algorithm lies in the data used to train it. Algorithms must be fed the right data, and the right amount of it. AI algorithms cannot learn what you do not teach. Developers must provide the algorithm with a robust set of Y’s (the outputs) to go along with the X’s (the inputs). Today, sourcing high-quality data represents one of the significant challenges in AI development.

What is Training Data?

Training data represents all observations that contain the problem that you are trying to solve. The dataset should characterize the inputs (through X’s) and the outputs (through Y’s). For example, to create an AI algorithm that can successfully use computer vision to predict whether a cup placed in front of a liquid dispenser should be filled with hot or cold liquid, you would need to train the algorithm on a robust dataset of pictures. These pictures should include cups of various types (coffee cup, water cup, mug) and materials (plastic, paper, glass, ceramic). Once you have uploaded the photos, you must then begin the process of labeling each image. This is a critical step for a supervised learning algorithm. The algorithm cannot intuit what the arrangement of pixels you show it represents. You must tell the algorithm what type of cup requires “hot” or “cold” water.

The overarching goal for any supervised learning architect is to sufficiently train the algorithm on a data set that includes both X (images containing the characteristics of cups) and Y (whether the cup calls for a hot or cold liquid). From there, the deployment algorithm can be deployed on a data set it has not seen and produce a prediction regarding what type of liquid should be dispensed given the type of cup it encounters. Once again, this example illustrates the importance of training data and the steps one must take to build a successful AI outcome.
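As an illustration of what the finished annotation work might look like, here is a hedged sketch of a labeling manifest for the cup example. The file name, image paths, and fields are invented; real annotation tools export richer formats, but the core idea of pairing each input (X) with a human-supplied label (Y) is the same.

```python
# A hypothetical labeling manifest: every image path is paired with the
# human-supplied label. All paths and values are illustrative only.
import csv
from pathlib import Path

annotations = [
    {"image": "cups/ceramic_mug_001.jpg",   "label": "hot"},
    {"image": "cups/paper_coffee_014.jpg",  "label": "hot"},
    {"image": "cups/plastic_water_203.jpg", "label": "cold"},
    {"image": "cups/glass_tumbler_077.jpg", "label": "cold"},
]

# Persist the labels in the kind of manifest an annotation tool might export.
with Path("cup_labels.csv").open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image", "label"])
    writer.writeheader()
    writer.writerows(annotations)

# Training code would then read the manifest, load each image (X), and pair
# it with its label (Y) before fitting a model.
with Path("cup_labels.csv").open() as f:
    for row in csv.DictReader(f):
        print(row["image"], "->", row["label"])
```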

Another example lies within the sport of basketball. Imagine you wanted to create a supervised learning algorithm that could predict, from a shooter’s mechanics (input), the quality of the shooter’s form (output). The training data for this problem is a large Excel spreadsheet that the algorithm will use to create a model that predicts the quality of the shooting form (good or poor). Additionally, the algorithm can be trained on video footage of shooters (utilizing computer vision). The combination of computer vision applied to the videos and quantitative and qualitative training data will likely produce the most robust algorithm. This algorithm would then be deployed on a dataset to produce a prediction of Y (Y-hat). The algorithm will attempt to predict the quality of a shooter’s form given the information it has on the distance between feet, time to release, elbow angle, shoulder direction, and video footage.

Adding more rows (observations) or columns (X’s) can be beneficial to the model’s development
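A rough sketch of the basketball example under the same caveats: every value below is invented, and a real model would need far more observations, but it shows how rows (observations), feature columns (X’s), and the labeled Y fit together.

```python
# A hedged, made-up sketch of the basketball example: each row is one shot,
# each column is one input feature (an X), and "form_quality" is the Y.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

shots = pd.DataFrame({
    "feet_distance_in":   [10, 14, 8, 16, 11, 9],
    "time_to_release_s":  [0.4, 0.7, 0.5, 0.9, 0.45, 0.5],
    "elbow_angle_deg":    [88, 72, 90, 65, 85, 91],
    "shoulder_direction": [1, 0, 1, 0, 1, 1],   # 1 = square to the basket
    "form_quality":       ["good", "poor", "good", "poor", "good", "good"],
})

X = shots.drop(columns="form_quality")   # inputs
y = shots["form_quality"]                # desired output

model = RandomForestClassifier(random_state=0).fit(X, y)

# Adding more rows (shots) or columns (e.g., video-derived features) would
# give the model more to learn from. Deploy on a shot it has never seen:
new_shot = pd.DataFrame([{"feet_distance_in": 12, "time_to_release_s": 0.6,
                          "elbow_angle_deg": 80, "shoulder_direction": 1}])
print(model.predict(new_shot))   # e.g. ['good']
```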

The great Catch-22 of Supervised Learning

The utility of a supervised learning algorithm lies in its ability to predict an outcome or output given a series of inputs. For the algorithm to have any worth at predicting, you first must feed it scores of data that contain both the input and the desired output you are hoping it can predict on a different data set.

This is the great catch-22 of supervised learning. A user wishes to utilize a supervised learning algorithm to predict some outcome. But first, the user must train the algorithm using data that already contains that outcome. This can represent a challenge to any AI developer, and understanding the specific inputs and outputs is essential. The algorithm does not have a mind of its own; the user must train it from the ground up.

To successfully locate the inputs and outputs for your supervised learning application, it is important to distinguish the type of problem you are attempting to solve utilizing artificial intelligence. As Booth Professor Sendhil Mullainathan notes, two core problems are typically addressed using supervised learning — “automation” and “prediction” problems.

Automation vs. Prediction Problems

In supervised learning, Prediction problems revolve around a dataset that arrives fully-formed with both the X’s and Y’s. The data set naturally contains the desired Y from a given set of X’s. Let us return to the basketball example that we previously discussed. To address the example as a prediction problem, you could train the algorithm solely on the video footage of a wide variety of basketball players making and missing shots. The Y (shot made) is naturally present in the dataset along with the X (the characteristics and form of the shooter). While the algorithm may not understand the significance of making a basketball shot, it theoretically will have everything it needs to form a prediction algorithm that can be deployed on a set of basketball shots it has never seen before.

Automation problems require human beings to provide the algorithm with the desired Y for every input (X) that the algorithm ingests. The algorithm is receiving a label for every observation, thereby learning how humans judge that observation. Over time, the algorithm will learn how to automate human judgment and produce a prediction.

AI developers looking to solve automation problems must place an extraordinary emphasis on the quality of the labels provided to the algorithm. The algorithm will only be effective if it is given high-quality data with corresponding labels of Y.

Think back to our “Hot Dog Not Hot Dog” example. The task created for the AI algorithm was to determine whether a collection of pixels (X) was a hot dog or not a hot dog. The developers of that algorithm had to train it by providing a large and diverse quantity of training data consisting of pictures of hot dogs from all angles. Then, the developers labeled each photo of a hot dog as a “hot dog.” Conversely, they had to label every image that did not contain the corresponding pixels comprising a hot dog as “not hot dog.” This is an elementary but instructive example of how the labeling process works for automation problems. The algorithm cannot perform its task of predicting whether a collection of pixels represents a hot dog unless it is adequately taught to recognize hot dogs.

A human has to scrape the web to (1) find data (images) of hot dogs and (2) label that data

As you can hopefully see, finding and accurately labeling data is one of the most important components of the feasibility of any supervised learning algorithm. The hot dog example was purposefully simple, but in practice, the application still required millions of hot dog images and labels (poor Jian Yang…). The sourcing and labeling of data is a monumental hurdle in Artificial Intelligence.

The Challenge of Procuring Labeled Data

The process of labeling Y’s in a data set is known as Data Labeling or Data Annotation. Arguably the most significant challenge that AI developers face is to “dataify” the problem they are attempting to solve. According to Professors Ludwig and Mullainathan, the following represent the most significant challenges to procuring the necessary labeled data one needs for AI automation problems:

  1. The problem you are attempting to solve cannot be dataified. In some automation problems, it is impossible to find a ground-truth upon which to base the algorithm’s training. You can train the algorithm to effectively deploy the judgment of the collective human team that taught it, but that is the limit. This can occur when data labelers must use their subjective judgments regarding polarizing topics. For example, if you attempted to train an algorithm to detect offensive speech in a comedian’s standup routine, you would get the human being’s subjective opinion of “offensive” and “not offensive.” This is not a ground truth; it is merely the opinion of whoever is labeling the data.
  2. Datasets exist but are stuck in large, slow-moving organizations. This is a common phenomenon in the healthcare space or any industry where regulations create red tape.
  3. Certain datasets consist of only X or only Y. In this situation, the AI developers must find a way to merge the two data sets. For instance, if you wanted to create an AI algorithm that could predict from car crash images the injuries inflicted on victims, you would need to merge car crash data with healthcare data. As problem #2 notes, this will likely be a slow-moving process.
  4. The data has not been collected or it has not been labeled. Both situations represent an opportunity, which has given rise to an entire billion-dollar industry and numerous interesting early-stage startups.

The Data Annotation Industry

The burgeoning data annotation industry has arisen as a direct response to the dataification conundrum. Over the past 5 years, Artificial Intelligence researchers, startups, and enterprises have accepted that the quality of a supervised learning algorithm rests largely on the data it is trained on. In fact, some reports suggest that 80% of the time spent on building Machine Learning models is dedicated to data management and labeling. Andreessen Horowitz counts the manual labeling of large datasets as one of the largest barriers to the widespread adoption of Artificial Intelligence. The ubiquity of the need for data annotation tools has led to stellar growth in the still-nascent data annotation industry. The data annotation market was valued at over $700mn in 2020, a number that is expected to rise to $5.5bn by 2026.

The old adage of “Garbage In, Garbage Out” rings true here. Data labeling tools are required for AI models in nearly every industry of the economy, and the use cases are varied and plentiful.

Startups to Watch

Tech giants such as IBM and Amazon are extremely active and invested in the data labeling market. A number of early-stage startups have also entered the data annotation market in the past few years. Below is a non-exhaustive list of examples of startups in the field today:

  • Scale AI: Viewed by some as the “heavy hitter” in the space, and backed by Peter Thiel, Scale provides labels for text, images, audio, and video. The company’s customers use Scale’s API to send the company raw data, which Scale then labels. Scale’s customers then utilize that training data for Machine Learning models. The company has a marquee list of customers including Waymo, OpenAI, Airbnb, and Lyft. The company currently employs 100 people but largely relies on 30,000 contractors to aid its labeling practice. The company and its 22-year-old CEO Alexandr Wang recently generated headlines by raising $100mn in Series C financing led by Founders Fund.
  • Kili Technology: Based in Paris, Kili allows enterprise developers of AI to annotate drone images, videos, emails, and contracts through a collaborative platform. The company believes that 29,000 GB of unstructured data is published every second, most of it useless for supervised learning algorithms looking to solve automation problems. The company most recently raised a $6.85mn seed round.
  • DataLoop: The startup’s value prop is its ability to specialize in “high volumes, high variance, and complex data, helping a range of companies create AI development and production pipelines.” The company has seen a pandemic-driven surge of demand for its services from the autonomous vehicle and healthcare sectors. The company has taken the approach of automatic annotations, which effectively uses AI to automate the data labeling process. While the UX and platform seem incredibly interactive and aesthetically pleasing, I do not have high hopes for this strategy. In the end, the human beings on the back-end who are solely there to “check the work” of the algorithms will likely dedicate an inordinate amount of time to plugging holes and debugging. In December 2020 the company raised $11mn of Series A financing.
  • SuperAnnotate: The Sunnyvale, California-based startup has been busy in recent months. In a 4-month stretch, the startup hired 3,000 data scientists and onboarded 100 customers including Starsky Robotics, Percepto, Acmie AI, and Clayair. The company most recently received $3mn in venture capital in June 2020.
  • Alegion: Based in Austin, Texas, Alegion utilizes a mix of human and machine labelers. The company claims to “offer data labeling services powered by a mix of automated systems and human workers, tailored for tasks such as AI model training, testing, and exception handling in domains like computer vision, natural language processing, and entity resolution.” To me, it sounds like the company is attempting a layered-technology approach by having machines train machines. If it can pull this off, it could obviously be a vastly better approach. However, mistakes and model inaccuracies can be severely costly in this scenario. Additionally, the company claims that its data sets are 99% accurate compared to an industry average of 40–60%. I am actually more interested in the speed it achieves with respect to the labeling process. The company most recently raised a $12mn Series A round in August 2019.
  • Labelbox: San Francisco-based Labelbox claims to help enterprises “get to production AI faster.” The company offers a Google Drive-like tool for admins and data labelers to collaborate. Similar to Scale, the company utilizes an API that data science teams can tap into and customize tools “to support specific use cases, including instances, custom attributes, and more, and label directly on photos, text strings, conversations, paragraphs, documents, and videos.” The company also employs a strategy similar to Alegion’s through “pre-labeling”: unlabeled data is initially assigned a label by an ML predictor algorithm (a rough sketch of this idea follows this list). Finally, the company claims it allows labelers to browse and filter through previously trained data to find common instances of errors and mistakes. Last month, Labelbox raised $40mn.
  • CloudFactory: The startup is led by serial entrepreneur Mark Sears. CloudFactory has focused primarily on the human component of data labeling. The company reportedly employs a “small army” of data labelers around the world who “tag and annotate images, audio clips, and videos on which semantic, syntax, and context detection algorithms can be trained, pursuant to customers’ needs and wants.” The company offers a host of other services, including document tagging and processing tools for sentiment analysis. 130 clients currently use CloudFactory’s services, including autonomous vehicle startups and Microsoft. CloudFactory raised $65mn in growth equity in 2019.
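As referenced in the Labelbox bullet above, here is a rough, hypothetical sketch of how pre-labeling (or model-assisted labeling) can work in principle: a model trained on whatever labeled data already exists proposes labels for new items, and only low-confidence items are routed to human annotators. The data, model, and threshold below are stand-ins, not any vendor’s actual workflow.

```python
# A hedged sketch of pre-labeling: auto-accept confident model labels,
# send uncertain items to human annotators. All data is a placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# A small model trained on existing labeled data (placeholder values).
X_labeled = rng.normal(size=(200, 8))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_labeled, y_labeled)

# New, unlabeled items arrive; the model proposes labels with confidences.
X_unlabeled = rng.normal(size=(20, 8))
confidence = model.predict_proba(X_unlabeled).max(axis=1)
proposed = model.predict(X_unlabeled)

CONFIDENCE_THRESHOLD = 0.9   # illustrative cutoff, not an industry standard
for i, (label, conf) in enumerate(zip(proposed, confidence)):
    if conf >= CONFIDENCE_THRESHOLD:
        print(f"item {i}: auto-label {label} (confidence {conf:.2f})")
    else:
        print(f"item {i}: send to human annotator (confidence {conf:.2f})")
```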

Synthetic Data: The Next Frontier

As previously demonstrated, a number of well-capitalized startups have flooded the Data Annotation market, with Scale AI largely leading the way. For early-stage investors, the AI Data Annotation space has become increasingly crowded. But the space is also facing critical issues. Data for AI algorithms must be robust, expansive, unbiased, and compliant with the growing set of data privacy laws.

To confront these challenges, Synthetic Data has emerged. Synthetic data is computer-generated images and datasets that can substitute for real data. These images and datasets must have approximately the same statistical and mathematical properties as the real-world data they mirror. Machine Learning algorithms can be trained using this virtual data, which comes fully formed with both X’s and Y’s. The data can also be heavily customized to fit the end client’s needs. The end client then uses that data to train its supervised learning algorithms for applications in retail, finance, transportation, agriculture, and healthcare (to name a few).
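To illustrate the principle, here is a simplified, hypothetical sketch that captures only the first-order statistics (means and covariances) of a small invented dataset and samples synthetic records that share them. Production synthetic-data tools rely on far more sophisticated generative models; this is only meant to show the idea of mirroring statistical properties without copying any real row.

```python
# A hedged sketch of synthetic data: fit simple statistics of "real" data,
# then sample artificial records with the same shape. All values invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Pretend this is sensitive real customer data.
real = pd.DataFrame({
    "age":             rng.normal(45, 12, size=500),
    "annual_spend":    rng.normal(3200, 900, size=500),
    "visits_per_year": rng.normal(18, 6, size=500),
})

# Capture the statistical shape of the real data...
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# ...and generate synthetic customers that mirror it without copying any row.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=500), columns=real.columns
)

print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```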

While there are some restrictions to what Synthetic Data can actually do in the place of real data, this is an emerging opportunity in the field of AI development and early-stage investing. Below is a non-exhaustive list of examples of early-stage startups in the field today (descriptions via Pitchbook):

  • AiFi creates synthetic data that is utilized to simulate shopper behavior within retail stores and to deliver auto-checkout operations for retailers. Based in Santa Clara, CA, the company most recently raised a $16.55mn Series A led by Qualcomm Ventures on December 8, 2020. The company’s pre-money valuation was pegged at $183mn at the time.
  • Gretel.ai is creating synthetic datasets for developers that are completely anonymized. Recently raised a $12mn Series A led by Greylock on October 21, 2020.
  • Mostly AI’s technology helps generate an unlimited number of realistic and representative synthetic customers, matching the patterns and behaviors of actual customers. The company is targeting the insurance and finance sectors. It raised a $5mn Series A led by Earlybird Venture Capital on March 2, 2020.
  • Anyverse has developed synthetic data generation software to create datasets for the autonomous driving industry. It has raised a $3.27mn Seed from Bullnet Capital.
  • Hazy is a synthetic data platform for financial institutions that want to conduct sophisticated data analysis without compromising safety or speed. Targeting fraud and money laundering detection as its initial use cases. Raised $6.74mn Seed from Notion in January 2020.
  • Lexset is a platform for creating synthetic data for ML. It uses 3D simulation to deliver on-demand data to its users for supervised learning models. The technology is industry agnostic, and Lexset can create synthetic datasets for a wide variety of use cases, including indoor/outdoor residential scenes, aerial imagery, and autonomous vehicles. The company is raising a Seed round in 2021.

Conclusion

The Data Annotation space is set to meaningfully expand in the coming years. The near-term future of supervised learning is likely predicated on the ability of architects to properly train algorithms on contextually rich data sets. My speculation is that the startups utilizing ML to label the training data that will ultimately train ML algorithms have the highest potential upside in terms of the cost-savings they can provide enterprises and AI developers. But I would be fascinated to learn how that process actually performs. For now, I think Scale AI has chosen a less-risky path to growth within the Data Annotation space and it is the startup I am most eager to follow in the coming years. I also believe that there is a vast amount of whitespace for startups to emerge and succeed in the Synthetic Data space, given the ubiquitous need for training data across every vertical. I am excited to continue to look for attractive investment opportunities at the early stage in Synthetic Data.

Special shout-out to Professor Sendhil Mullainathan and his class on Artificial Intelligence, which taught me all I know on the topic. The terms “Dataify”, “Dataification”, “Prediction”, and “Automation” were coined by Professors Sendhil Mullainathan and Jens Ludwig. Additionally, thank you to Dan Knight for his help with this post.

Mic Check

Introducing The Chicago Capital Podcast

Fulfilled a longtime goal of mine to launch a podcast!

Chicago is home to one of the most diversified economies in the country. By some measures, it is one of the best cities for early-stage VC and angel investor returns. However, most Venture Capital podcasts are focused elsewhere.

In the wake of COVID-19, startup founders, graduates, and VC investors are looking outside of the traditional coastal hubs for their next adventure. Many of them may set a path for Chicago. As a startup community, the city has hit an inflection point.

My goal with Chicago Capital is to provide listeners with a comprehensive overview of the Chicago VC and Startup ecosystem. I hope to interview the most prominent Chicago-focused investors and founders to learn more about their career journeys, their most insightful pieces of advice, and what makes Chicago unique.

Episodes will be released weekly on Wednesdays. If you know of any founders or VCs who would be interested, please reach out!

Movie Rec

Another Round

Four high school teachers consume alcohol on a daily basis to see how it affects their social and professional lives.

This movie has stuck with me for two months. For me, that is typically a sign that a movie will land in my top 10 list in the future. It does not happen very often. I find myself reflecting on the themes of this movie, the message it is conveying, and that ending (THAT ENDING).

Without divulging too much of the plot, the movie takes up the mantle of mid-life crisis tropes quite subtly. The evolution of the characters from start to finish is tenderly handled and at every step feels earned. Mads Mikkelsen is the star, but each of the four main characters feels alive. You understand their individual plights, and why this turn towards borderline alcoholism is both enlivening and heartbreaking.

“You must accept yourself as fallible in order to love others and life.”

But in my opinion, this is ultimately a life-affirming movie. It is a challenging watch at times, but the cast handles every beat and drunken stupor with incredible care. It can also be an extremely funny movie. The drunken escapades play out in such a way that you are not entirely sure whether you should be laughing hysterically or horrified.

The end product is something truly unique. This is a movie about a man who desperately wishes to feel something again. Instead of using alcohol to numb his pain, he uses it to unburden himself. The story takes the concept of alcohol as a “social lubricant” to the very extreme. It’s a movie that will challenge you and delight you all at the same time. I think it was the best movie of 2020 and if I had a vote to give at the ole’ Oscars, this would be my vote.

The film was based on a play Vinterberg had written while working at the Burgtheater in Vienna. Additional inspiration came from Vinterberg’s own daughter, Ida, who had told stories of the drinking culture among Danish youth. Ida had originally pressed Vinterberg to adapt the play into a movie and was slated to play the daughter of Martin (Mads Mikkelsen). The story was originally “a celebration of alcohol based on the thesis that world history would have been different without alcohol.” However, four days into filming, Ida was killed in a car accident. Following the tragedy, the script was reworked to become more life-affirming: “It should not just be about drinking. It was about being awakened to life,” stated Vinterberg. Tobias Lindholm served as director in the week following the accident. The film was dedicated to her and was partially filmed in her classroom with her classmates.

It is a touching and tragic backstory. I could not be happier for Thomas Vinterberg, who was nominated for best friggin’ director at the Academy Awards. He was somehow able to channel what was the worst moment of his life into his art. It is truly incredible that he was able to continue with the movie, and that the movie turned out the way it did.

That’s all, folks. Thank you to everyone who has subscribed or taken the time to read Venture Scenes. As always, if there is a startup or piece of VC news that you find interesting, comment below, or message me on LinkedIn or Twitter!
