The Future of VC: Augmenting Humans with AI

Andre Retterath

Published in Earlybird's view · 11 min read · Dec 1, 2020

tl;dr:

Data-driven sourcing and intelligent screening will allow VCs to identify the most promising entrepreneurs more efficiently and more effectively, at the right point in time, and wherever they are. Geography and “warm intros” will eventually become irrelevant.
In the mid-term, sourcing will become less of a differentiator as investors gradually leverage the same identification sources.
Creativity with respect to new data points in the enrichment stage, the subsequent feature engineering and the (semi-)automated scoring (either ML-based or deterministic) in the screening stage will become core.
VC brand (firm brand + personal brand of the individual investor) and fund size (who is able to pay higher valuations without sacrificing portfolio size and shareholdings) will gain importance as access becomes the ultimate differentiator, besides personal fit (which will remain key).

Let’s start with a simple observation: competition among VCs is increasing. Why? Splitting the market into supply (startups) and demand (capital to be invested) explains it. While several sources provide evidence that the number of startups has stayed approximately constant, independent of economic conditions (source), the capital to be deployed has exploded. For example, the total value of VC funds raised grew by 4.2x from $3.5bn in 2009 to $14.8bn in 2019 (source), which can be attributed to more funds being raised but also to larger funds (source). As a result, an increasing number of VC firms need to invest an increasing amount of capital into a limited number of assets. Unsurprisingly, this imbalance has led to higher startup valuations but, more interestingly, also to VCs becoming more creative in their investment process.

Looking into the VC investment process and the characteristics of the respective stages as displayed below, it becomes clear that little has changed since the inception of the asset class about 80 years ago. As of today, the process is still highly manual, subjective, time-consuming and hard to scale.

In short, it’s about finding the right opportunities, getting access/investing and helping them grow. According to several studies (source), VCs generate about 60% of their overall value in the sourcing and screening stages. As a result of increasing competition and the fact that sourcing and screening are the major levers for value creation, VCs have started to innovate on the respective processes to identify the most promising opportunities before their competitors do. Similar to most industries, data and intelligent algorithms turned out to be the right tools for this task.

As a result of more than 100 VC interviews, 2.5 years of PhD research on the topic of ML in VC and my own journey of building a data-driven sourcing and screening engine at Earlybird, I came up with the following structure, including some (more or less obvious) examples:

🛒 SOURCING

In my mental model, I split the sourcing into two distinct stages: identification and enrichment.

1) Identification (the rows in your sheet): Find every company as early as possible. The guiding question is “where might a new company show up first?” It may be new registrations or financing rounds recorded in public registers like Handelsregister or Companies House, trending repositories on GitHub, products launched on Product Hunt or a founder changing her LinkedIn description to “stealth” or “Founder in XYZ”. If I’ve missed the incorporation of the company, I might still catch it once an angel investor or pre-seed/seed fund has invested, by regularly crawling their portfolio websites. It’s all about finding as many companies as early as possible, i.e., increasing the number of observations (or rows in your sheet).

Side note to pre-seed/seed investors: this is why a proper portfolio website might eventually make sense ;) — the next-stage investors will be forever thankful 🙏🏻.
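To make the "rows in your sheet" idea concrete, here is a minimal sketch of how sightings from several identification sources could be collapsed into one row per company. It is an illustration only, not the tool we run at Earlybird: the `Sighting` type, the source labels and the choice of the website domain as the matching key are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class Sighting:
    """One appearance of a company in one identification source (hypothetical schema)."""
    name: str
    url: str
    source: str      # illustrative label, e.g. "handelsregister" or "github"
    seen_on: date


def normalize_domain(url: str) -> str:
    """Reduce a URL to a bare lowercase host so the same company matches across sources."""
    if "//" not in url:
        url = "//" + url  # lets urlparse treat a bare host as the network location
    host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
    return host[4:] if host.startswith("www.") else host


def earliest_sightings(sightings):
    """Return one row per company, keyed by domain, keeping the earliest sighting."""
    rows = {}
    for s in sightings:
        key = normalize_domain(s.url)
        if key not in rows or s.seen_on < rows[key].seen_on:
            rows[key] = s
    return rows
```

Keying on the domain is a simplification: stealth companies often have no website yet, so a real pipeline would need additional matching keys such as register IDs or founder profiles.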

2) Enrichment (the columns in your sheet): Once a company is identified, I seek to collect as much information as possible to create a comprehensive company profile, i.e., increasing the number of features (or columns in your sheet). I take the name and URL as a unique identifier and feed them into the APIs of different commercial database providers to see if they have already collected some info on the company. If yes, awesome! I collect all static info such as company description, industry classification or headquarters from the database which shows the highest coverage and quality in the respective dimensions. For a detailed comparison of data coverage and quality across AngelList, Crunchbase, CB Insights, Dealroom, PitchBook, Preqin, Tracxn and VentureSource (which was recently acquired by CB Insights), see the summary table below and a more detailed writeup on my benchmarking study here.

Once I’ve collected the static data, I focus on potentially changing info such as LinkedIn, Twitter or product reviews and trace that change back to the original sources. I call these features the “growth metrics” and will refer to them again further down the line. For the simple enrichment, I collect the absolute measures such as the actual number of employees in the company, number of tweets, followers, etc. or number and rating of product reviews. Moreover, I complement the company profiles by collecting payment data, website traffic, Google News mentions, App Store info and many more data points across the web. Matching the same data point across sources oftentimes helps to verify or delete wrong info.
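The cross-source check mentioned above can be sketched in a few lines. This is a simplified illustration under my own assumptions: the function name `reconcile`, the use of the median as the consensus value and the 25% tolerance are all hypothetical choices, not a description of any provider's method.

```python
from statistics import median


def reconcile(values_by_source, tolerance=0.25):
    """Compare one numeric data point (e.g. employee count) across several sources.

    Returns (consensus, outliers): the median of all reported values, plus the
    sources whose value deviates from that median by more than `tolerance`
    (relative). The tolerance is an arbitrary illustrative default.
    """
    if not values_by_source:
        raise ValueError("need at least one source")
    consensus = median(values_by_source.values())
    outliers = {
        source: value
        for source, value in values_by_source.items()
        if abs(value - consensus) > tolerance * max(abs(consensus), 1)
    }
    return consensus, outliers
```

With three sources reporting 20, 22 and 200 employees, the median is 22 and only the source reporting 200 is flagged, so that value can be reviewed or dropped.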

Clearly, the purpose of the sourcing stage is to see as many companies as early as possible and then collect as much info as available. There seems to be no downside involved and thus the more sources, the better. Every source increases the potential to find another company and creativity becomes key. Unfortunately though, in speaking to 100+ VCs and external data providers who have been working on this topic, something became clear to me: Although data-driven sourcing might be a differentiator today, there are only so many identification sources available, and the majority can be identified through the use of pretty obvious methods described above. Consequently, I assume that in the mid- and long-term, the focus and major lever will shift from sourcing to screening.

🧐 SCREENING

Assuming a comprehensive coverage of startups and a wide spectrum of information per company, my team and I have built a sourcing tool at Earlybird that surfaces 30k+ potential opportunities per year in Europe alone (including non-incorporated concepts, beta products, etc.). Obviously, we don’t have the resources to look at all of them individually and thus needed to automate the screening process. I wrote a dedicated piece on this topic here and split the screening into deterministic and random/ML-based approaches.

1) Deterministic

Simple to implement but highly effective, deterministic scorecards help me to achieve a detailed scoring based on the above-described “growth metrics”. As described in the 2) Enrichment stage, I collect these growth metrics to compose the initial company profile. For the deterministic screening, however, I collect them on a recurring basis to move from static to dynamic data and measure changes between t1 and t2. As a result, I receive absolute growth measures, such as the number of employees jumping from 13 in t1 to 20 in t2 (+7), but also the relative growth in percent, roughly 54% in the described example. By combining the absolute figures with their relative growth, I already receive powerful signals pointing me to interesting companies. As the number of growth metrics and calculated signals increases, it becomes increasingly difficult to interpret all of them manually. Therefore, I created a scoring that puts a specific weight (based on our experiences) on each growth metric, leading to a single score. Although this approach lifts efficiency into another sphere, it’s still highly subjective as the individual weights are based on our own experiences.
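A deterministic scorecard of this kind can be sketched as follows. Treat it as a minimal illustration rather than our actual scoring: the function names, the metric names and the weights are made up for this example, and for brevity the score uses only relative growth, whereas in practice absolute figures matter too.

```python
def growth(t1, t2):
    """Absolute and relative change of one metric between two snapshots."""
    absolute = t2 - t1
    relative = absolute / t1 if t1 else None  # undefined when starting from zero
    return absolute, relative


def score(snapshot_t1, snapshot_t2, weights):
    """Weighted sum of relative growth across metrics.

    `weights` maps metric name -> weight (hypothetical, experience-based values).
    Metrics that are missing in either snapshot, or whose relative growth is
    undefined, contribute nothing to the score.
    """
    total = 0.0
    for metric, weight in weights.items():
        if metric in snapshot_t1 and metric in snapshot_t2:
            _, relative = growth(snapshot_t1[metric], snapshot_t2[metric])
            if relative is not None:
                total += weight * relative
    return total
```

For the headcount example above, `growth(13, 20)` gives an absolute change of 7 and a relative change of 7/13, i.e. about 54%. The subjectivity sits entirely in `weights`, which is exactly the limitation the ML-based approach tries to address.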

2) ML-based

As described in my post here, ML-based approaches might solve this issue as they remove the subjectivity. As Handelsblatt recently put it in its coverage of my research: “Instead of telling the algorithm to provide startups which fit into defined criteria, I turn it around and show some successful startups to the algorithm and ask it to pick similar ones.” I trained several supervised ML algorithms to do exactly that. In order to compare their performance to the status quo (screening by VCs), I conducted a benchmarking between 111 European VC investment professionals and the resulting algorithms. My results show that the XGBoost classifier performs at least as well as the best VC, 25% better than the median VC, and 29% better than the average VC in the test sample. (Read more here in case you missed it.)

Seems like the perfect solution. Almost. Unfortunately, there is one major problem with the ML-based approach: the algorithms need to be trained on historic data, which might not contain the patterns that will help to identify future success cases. What was a success pattern for enterprise software or consumer companies in the past might not hold for quantum computing startups in the future. If VCs relied solely on screening algorithms trained with historic data, they would mirror the past into the future without considering potential changes. Besides missing out on new, previously unseen patterns and innovations, this would also be a catastrophe for minorities who have not yet received financing, or were successful only in the past, as such models would assume that this will not change in the future. To mitigate this issue, we might re-introduce deterministic rules which overrule the ML-based feature importance in some dimensions. For example, although the XGBoost classifier might screen out a company with a minority founder, we might still include it and have it automatically selected for further evaluation.
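The override idea can be sketched like this. It is an assumption-laden illustration: the field names `ml_score` and `minority_founder`, the 0.5 threshold and the function `select_for_review` are hypothetical, and the model score is simply taken as given rather than computed by a real classifier.

```python
def select_for_review(companies, threshold=0.5, always_include=lambda c: False):
    """Hybrid screening: keep a company if the model score clears the threshold
    OR a deterministic override rule fires.

    `companies` is a list of dicts with an "ml_score" key (assumed to come from
    an upstream classifier). `always_include` is a rule that overrules the model.
    Each selected company is annotated with the reason it was kept.
    """
    selected = []
    for company in companies:
        by_model = company["ml_score"] >= threshold
        by_rule = always_include(company)
        if by_model or by_rule:
            selected.append({**company, "reason": "model" if by_model else "rule"})
    return selected
```

Passing `always_include=lambda c: c["minority_founder"]` keeps such companies in the funnel even when the model scores them low, and the `reason` field makes it visible afterwards which selections came from the rule rather than the model.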

In summary, the most promising screening approach seems to be a hybrid: ML-based approaches that are selectively amended by deterministic rules. (Un)fortunately, computers lack, at least as of today, one major component: the (real) human factor. VCs invest in entrepreneurs and most people agree that ultimately it comes down to the team, their motivation, their grit, their spirit and many more intangible assets that we as humans can assess way better than machines do. To leverage all strengths in the most efficient and effective interplay, I consequently suggest an augmented approach.

🤖 THE AUGMENTED VC 👱🏻‍♀️

I define an augmented approach as a hybrid between humans and machines. Data-driven sourcing tools help VCs to move closer to comprehensive coverage, and ML-based screening tools narrow the upper, steadily growing part of the deal funnel to a constant number of investment opportunities. As a result, investment professionals could save substantial time otherwise spent on less promising opportunities and focus it on properly evaluating a pre-selection of high-potential ones. They can use the freed-up resources to build stronger relationships with the selected entrepreneurial teams and to put themselves into a better position to secure the most competitive deals. Instead of going broad and shallow by allocating limited resources to an ever-growing number of opportunities, a venture capitalist using ML-based screening tools can go narrow and deep on a selected number of opportunities while still ensuring that promising deals are not overlooked. It’s a win-win for entrepreneurs and investors alike as both sides can get to know each other better. At Earlybird, my team and I have implemented such an augmented approach according to the following structure.

Based on the results and feedback from entrepreneurs and LPs, our approach is a clear differentiator today. It allows us to identify the most promising entrepreneurs more efficiently and more effectively at the right point in time and wherever they are. Geography and “warm intros via exclusive networks” will eventually become irrelevant (although it might take some time...)

Thinking about the mid-term future, however, the number of VCs leveraging similar tool stacks will increase. VCs will no longer wait for the best investment opportunities to land on their table but will rather reach out to the founders before their competitors do. Consequently, deal flow will shift from the mainly inbound model of the past (founders reaching out to VCs) to an increasingly outbound model (VCs proactively reaching out to founders) in the future.

So what might be the long-term implications? I expect they’ll be threefold:

Sourcing will become less of a differentiator as investors will gradually leverage the same identification web crawlers.
Creativity with respect to new data sources in the enrichment stage, in the subsequent feature engineering and in the (semi-)automated scoring (either ML-based or deterministic) will become core.
The time between a growth signal and an initial meeting will decrease across the industry. A promising growth signal, be it a jump in GitHub stars or upvotes/“Hunt of the Day” at Product Hunt, might lead to several automated “Hi Founder, let’s speak! Choose a slot in my Calendly here” emails from VCs. In this extreme (but IMHO likely) scenario, access to deals will become increasingly important as most investors will compete for very few high-potential deals at (more or less) the same time. Clearly, not all of them will be able to invest. While historically the best VCs could choose which founders they wanted to work with, today and even more so in the future, it will be the other way around: the best founders can choose which VCs they want to work with. As the VC industry becomes more efficient, two “access components” will become core, besides personal fit (which will remain key): capital availability (which VC is able to pay the highest valuation) and VC brand (firm brand + personal brand of the individual investor). Looking to our friends in the US, this seems to be true already today.

The future of VC is bright and I’m sure the above-described efforts are only the beginning of our industry’s transformation. Future efforts will likely involve more downstream stages of the process, ranging from Due Diligence to Exit. Let’s see what the future holds! 🚀🚀

Are you a founder, industry expert, VC or researcher interested in the fields of data-driven sourcing, AI, developer tools or open-source business models? I’d be more than happy to learn about your work, so feel free to reach out via andre@earlybird.com.

Liked this piece?
Follow me on Twitter for more AI, dev tools and VC-related stuff.
