The Rise of Algorithmic VC

Published in Specter Alternative Data · 8 min read · Jun 19, 2018

I remember one summer when I was still a teenager, my brother pitched me an idea. He said: “Dude, I have been looking at the stock market and I believe there is a causal relationship between what people search for, social sentiment, and stock price.” I know, I know. This doesn’t sound that groundbreaking in 2018, but at the time it just blew my mind. So I spent the whole summer trying to find some little piece of evidence of causation.

Every time I’d see the slightest connection between the number of positive tweets (or a rise in search volume index) and a subsequent rise in stock price, I’d just email my brother and proudly announce to him that I got it.

And every time my brother would just reply: “you don’t have sh**”. So after a while I realised that nobody is going to give me a Nobel Prize for this.

Fast forward like 6 or 7 years, and I’m co-founding my first data-mining startup. Just to make sure the business was off to a good start we named it Spaceship (#cuz-moonlambo). At the beginning, we used to sell batches of 100k emails for $99, then we hiked up the price like 1500x, but that’s a different story. The point I’m trying to make is that we went through an absolutely ridiculous amount of company data.

And that’s when the Nobel Prize aspirations started to kick in again. So fast forward like a year, and I’m joining a VC fund. With no real education, no real exit under my belt, and no real data science skills — I’m on a mission to change venture capital forever through data.

And this is what I learned…

First, a little bit of history. The first mention of a Moneyball approach to VC that I was able to find comes from November 9, 2012. Since then, many VCs have turned to data as a way to keep a competitive advantage. I was able to identify roughly 50 funds which have adopted this somewhat new quantitative approach, including top tier funds like Andreessen Horowitz, Sequoia, Index Ventures and Social Capital.

If you are interested in the full list, please don’t hesitate to reach out.

There are basically two main ways to classify these quantitative VC attempts:

Quantitative First vs. Quantitative as an Add-on
While some funds went all in, others are quietly building a data team on the side.
Using Data for Sourcing vs. Using Data for Screening
While some funds use data purely for sourcing new investments, others also use it during screening and for due diligence. This separation isn’t really that black and white, as some funds use it for both.

Secondly, there are hundreds of different data points that come into play. To simplify it a little bit, let me split them again into 6 categories.

Dead Data: data collected retrospectively, used as a benchmark or to build models. Typically it’s data about past industry performance, total funding, funding velocity, HQ location, etc.
Hard Data: real data typically collected directly from startups (like revenue, churn, etc.), either via deck, data room or some other form.
Soft Data: (imo) this isn’t really data, as it’s quite difficult to quantify something like a founder’s personality, how sharp they are, or whether the team is complementary.
Network Data: data about how the startup and its founders are connected to the ecosystem.
Signal Data: data collected via third-party sources, for example web traffic, funding, app downloads, reviews, employee data, or revenue.
Decision Data: data collected about (your own) past investment decisions, used for machine learning.

If you want to read something more advanced, I suggest checking out this paper by D.S. Hunter and T. Zaman. They have developed a framework for VC investment that is based on the randomness of Brownian motion coupled with large volumes of Dead Data.

Now, let’s dive into some of my attempts to change VC through data. During my time at Hummingbird Ventures, I was using data only for Sourcing, and I was relying mainly on Signal data.

Why?

I don’t really believe in the application of Dead data to either sourcing or screening. For me, the whole purpose of venture capital is to fund really new ventures. So ideally there isn’t already too much data available about that vertical. Secondly, tech adoption is significantly faster than even 4 or 5 years ago, so startups grow at a much faster pace than ever and benchmarks built on past data go stale quickly.

Partly it’s probably also caused by the fact that I’m by nature a very impatient and lazy person. So it doesn’t come naturally to me to analyse the past in order to predict the future.

So, why Signal data?

Hard data like revenue is obviously the most difficult to come by. So unless you are a top tier fund which receives a ton of inbound and referrals, you won’t really get ahold of too much real data.

Soft data is tremendously difficult to quantify. Sure, you can score educational background or the composition of the team. But unless you force each team member to take a personality test or spend hours with them, I don’t think soft data is really applicable.

Therefore, most funds I know rely on Signal data for their quantitative attempts.

The problem with Signal data is that it’s typically based on estimates (e.g. web traffic or downloads). So it doesn’t always paint an accurate picture of a startup’s traction. In addition, some Signal data like social engagement or follower base is really more of a vanity metric.

Even though it’s the easiest to come by, it’s still very imperfect. Most startup-data providers offer incomplete startup profiles, which are missing key Signal data (e.g. funding or employee count).

In addition, I’ve learned that almost every vertical has different main Signal metrics. A simple example: while traffic and social engagement might be important for Marketplace startups, they’re not necessarily helpful for SaaS.

One pretty interesting experiment that I’ve done in the past was called ‘Qote’. It was a simple website with a form in which you would input the startup’s URL and average pricing (or order value), then select the vertical / category. It would instantly provide you with a revenue estimate.

It was an extremely simplistic MVP, and most of the estimates were honestly significantly off. However, it taught me that if you combine enough Signal metrics with some basic user input/feedback, you can get fairly good at spotting interesting companies at scale. Each time someone requested a revenue estimate for a startup, I’d create a new profile for that startup and @mention it in an automated tweet.

And some startups would actually reply to the tweet, especially if the estimate was significantly off. This shows that Signal data can be used to gather additional Signals. (BTW, SimilarWeb is using a very similar strategy.)
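
To make the Qote idea concrete, here is a minimal sketch of how such an estimator could work. It is not the actual Qote code: the formula (visits × conversion rate × average order value), the `ASSUMED_CONVERSION` rates, and the `monthly_visits` input (a third-party traffic estimate looked up for the URL) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical visit-to-purchase conversion rates per vertical.
# These numbers are placeholders for illustration, not benchmarks.
ASSUMED_CONVERSION = {
    "marketplace": 0.02,
    "ecommerce": 0.015,
    "saas": 0.005,
}


@dataclass
class EstimateRequest:
    url: str
    avg_order_value: float  # user input: average pricing or order value
    vertical: str           # user input: vertical / category
    monthly_visits: int     # Signal data, itself a third-party estimate (assumed input)


def estimate_monthly_revenue(req: EstimateRequest) -> float:
    """Rough estimate: visits x assumed conversion rate x average order value."""
    rate = ASSUMED_CONVERSION[req.vertical]
    return req.monthly_visits * rate * req.avg_order_value


req = EstimateRequest("https://example.com", 40.0, "marketplace", 50_000)
print(round(estimate_monthly_revenue(req)))
```

With these made-up inputs the estimate comes out at roughly 40,000 per month. Since both the traffic figure and the conversion rate are guesses, the result is only useful as a rough signal, which matches the experience of the estimates being significantly off.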

Finally, I call it Signal data for a reason: you’ll get the most valuable insights if you track changes in those metrics over a certain period of time. This significantly increases your chances of identifying really high-growth startups. If you combine it with tracking the right metrics for each vertical, and include a relative-vertical score based on a couple of Signal metrics, you might actually build a pretty decent engine to spot interesting companies.
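
An engine like this could be sketched roughly as follows. The metric chosen per vertical (`MAIN_METRIC`), the field names, and the sample numbers are hypothetical; the score is a plain z-score of growth within each vertical, which is one simple choice among many.

```python
from statistics import mean, pstdev

# Hypothetical choice of the main Signal metric per vertical.
MAIN_METRIC = {"marketplace": "web_visits", "saas": "employees"}


def growth_rate(series):
    """Relative change between the first and last observation."""
    first, last = series[0], series[-1]
    return (last - first) / first if first else 0.0


def vertical_scores(startups):
    """Z-score each startup's growth against peers in the same vertical."""
    by_vertical = {}
    for s in startups:
        metric = MAIN_METRIC[s["vertical"]]
        rate = growth_rate(s["history"][metric])
        by_vertical.setdefault(s["vertical"], []).append((s["name"], rate))

    scores = {}
    for peers in by_vertical.values():
        rates = [rate for _, rate in peers]
        mu, sigma = mean(rates), pstdev(rates)
        for name, rate in peers:
            scores[name] = (rate - mu) / sigma if sigma else 0.0
    return scores


# Made-up monthly observations, oldest first.
startups = [
    {"name": "A", "vertical": "marketplace", "history": {"web_visits": [1000, 1500, 3000]}},
    {"name": "B", "vertical": "marketplace", "history": {"web_visits": [2000, 2100, 2200]}},
    {"name": "C", "vertical": "marketplace", "history": {"web_visits": [500, 550, 600]}},
    {"name": "D", "vertical": "saas", "history": {"employees": [10, 14, 20]}},
    {"name": "E", "vertical": "saas", "history": {"employees": [40, 42, 44]}},
]

scores = vertical_scores(startups)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 2))
```

Scoring within a vertical means a small marketplace that tripled its traffic can outrank a larger one that barely moved, without being compared against SaaS companies on a metric that says little about them.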

It’s not a perfect approach. And it’s very far from something you could call Algorithmic VC, but unless you have access to all 6 types of data that I’ve outlined above (and at a large scale), you won’t really be able to build a proper quantitative approach anyway.

So what would be the perfect way to build an Algorithmic VC? For a minute, try to imagine an algorithmic or quant hedge fund working only with revenue estimates and imperfect company data. It just sounds completely ridiculous, right? It couldn’t possibly work. And I believe the same applies to quantitative VC.

You can’t have a fully algorithmic venture ecosystem, unless there is a central stock exchange or platform which contains all the key data points.

Recently, I’ve spotted an interesting approach which was created by Follow[The]Seed. The fund basically asks startups to implement the fund’s SDK (plugin), which tracks users and identifies whether the startup has extraordinary traction and usage metrics.

I think it’s a killer idea. However, as it will be used only by startups interested in one fund, it very likely won’t have a bigger impact on the whole VC ecosystem. The same goes for other similar CaaS (capital-as-a-service) attempts, like the one from Social Capital, which receives ~5K applications per year.

I still think it’s a very clever way to track early stage startups & provide them with valuable data-driven feedback. However, as there are roughly 100 million businesses launched every year (that’s roughly 3 per second!), even if you receive 10,000 startup applications as a fund, your dataset is really, really tiny in comparison.
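
A quick back-of-the-envelope check of those numbers, using only the figures quoted above (100 million new businesses per year and a fund receiving 10,000 applications):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000

businesses_per_year = 100_000_000       # figure quoted in the text
applications_per_year = 10_000          # the fund from the example above

per_second = businesses_per_year / SECONDS_PER_YEAR
coverage = applications_per_year / businesses_per_year

print(f"{per_second:.2f} new businesses per second")          # 3.17
print(f"{coverage:.2%} of new businesses seen by the fund")   # 0.01%
```

So even a fund with 10,000 applications a year sees about one in every ten thousand newly launched businesses.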

Maybe it’s a little bit utopian. But I believe that what we need in order to build a truly Algorithmic VC ecosystem is a similar plugin that all startups can simply implement into their website or app. In exchange, they will get an overview of how well they are doing compared to their peers. Their data will be kept private/anonymous on one central platform or exchange, where investors can request to get in touch (purely based on performance data).

This will allow us to add a crucial data layer that is currently missing.

It will also allow us to create an ecosystem in which performance data is transparent for both founders and investors. And we can finally start drawing lessons from the global performance data of startups. Heck, we could even create new universal ratios that are applicable to early stage investing.
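
As a rough illustration of the peer-comparison part, a startup’s standing could be reported as a percentile rank among anonymised peers. The metric, ids, and values below are hypothetical, and this is only a sketch of one possible benchmark, not a description of any existing platform.

```python
from bisect import bisect_left


def percentile_rank(value, peer_values):
    """Share of peers with a strictly lower value, as a percentage."""
    ordered = sorted(peer_values)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


def benchmark(startup_id, metric_by_id):
    """Rank one startup against every other startup on the platform."""
    peers = [v for k, v in metric_by_id.items() if k != startup_id]
    return percentile_rank(metric_by_id[startup_id], peers)


# Hypothetical monthly user-growth rates reported by the plugin,
# keyed by an anonymous id rather than a company name.
peer_growth = {"s1": 0.05, "s2": 0.12, "s3": 0.30, "s4": 0.08, "s5": 0.20}

print(benchmark("s3", peer_growth))  # 100.0: grows faster than all peers
print(benchmark("s4", peer_growth))  # 25.0: grows faster than one of four peers
```

Founders would only ever see their own percentile, while investors could filter anonymous ids by percentile and then request an introduction.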

I think the timing is right for three main reasons:

Many VCs are already experimenting and spending a lot of resources on building a quantitative approach
Startups are tracking more competitive data than ever
There is a general trend among startups to be more open and transparent about their learnings and performance

What do you think?

If you are interested to read more about this topic, I recommend the following articles:

I’m also a big fan of the following open startup initiatives:

Written by Dominik Vacikar