Characterising Venture Funds using Machine Learning

Tyto.ai
H2 Ventures
Published in
6 min read · Feb 19, 2018


Predict the causal future, not the present¹ — Vincent Vanhoucke, Google Brain

At Tyto.ai we are working on interesting problems which occasionally bring us into contact with VCs and other investors.

Most of my machine learning background is in natural language processing (NLP). I love NLP, and I could talk all day about it, but sometimes it is difficult to explain how generalizable the techniques in modern NLP are to areas outside text.

It’s pretty easy to show how a deep learning model can tell the difference between a dog and a cat, and everyone understands how that is relevant to different image types.

But techniques developed for NLP are perhaps even more powerful. NLP is — at its heart — sequential data, and the same techniques used for NLP are equally powerful elsewhere.

One example I use to show this is how the Word2Vec algorithm² can be used to characterize venture capital investors.

The basic idea of Word2Vec is to use very large volumes of text to learn the patterns in language, and to express that by placing every word in a multi-dimensional vector space so that similar words are located close to each other.

It does this by trying to predict the words surrounding the current word. This provides implicit supervision for the modeling task, which means one doesn’t need to do expensive manual labeling of data. This process lets the model group together words which are used in similar ways.

As a practical example, imagine a partial sentence like “The cat sat on the XXX”. What should XXX be? Most English speakers would immediately say “mat”, but wouldn’t be surprised with sentences like:

The cat sat on the chair

The cat sat on the bed

The cat sat on the blanket

In these cases, mat, chair, bed and blanket are all things that can be sat on. Conversely, a sentence like “The cat sat on the running” makes no sense at all. “running” isn’t something that can be sat on, and that word just doesn’t work in that context.
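That context-prediction setup can be made concrete in a few lines of code. This is a simplified, hypothetical sketch of how skip-gram Word2Vec generates its training pairs (real implementations add subsampling and negative sampling on top):

```python
# For each word, emit (center, context) pairs for every word within
# `window` positions. Word2Vec trains a small network to predict the
# context word from the center word; this is the implicit supervision.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split())
# ("on", "mat") is one training pair; over millions of sentences the
# model learns that "mat", "chair" and "bed" appear in similar contexts.
```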

What about non-textual data?

Photo by Markus Spiske on Unsplash

So how does this relate to other — non-NLP — tasks? Take startup investing. There are typical patterns here which we see over and over again:

Accelerator program

Seed Round

Series A

Series B

The StitchFix example

I love the StitchFix Algorithms blog. Chris Moody’s work on multi-modal embeddings and models is an inspiration to me, and I love what they have achieved. So let’s look at StitchFix’s funding rounds up to their recent IPO.

Seed Round, Jan 2011, $750K, Baseline Ventures

Series A, Feb 2013, $4.8M, Baseline Ventures, Western Technology, Lightspeed Venture Partners

Series B, Oct 2013, $12M, Benchmark

Series C, June 2014, $25M, Benchmark, Baseline Ventures, Western Technology, Lightspeed Venture Partners

Those familiar with the field will see the VC funds involved there and say that is an altogether unsurprising pattern. Baseline Ventures frequently invests in seed rounds, and Benchmark more typically starts at Series A.

These patterns — which are well understood by humans — are exactly what NLP-like techniques can pick up.

To show this, I downloaded the open source Crunchbase data (sadly only available up to December 2015) and turned it into per-company sequences of investments. I then used the wonderful Gensim Word2Vec implementation to build a 10-dimensional representation of the data³. Finally, I used Tensorboard to visualize the result.

t-SNE attempts to discover structure in the multi-dimensional data.

This makes for some pretty graphics, which occasionally may be more eye-catching than useful…

Tensorboard isn’t just for being hypnotized by pretty patterns, though! Once all investment firms are placed in this 10-dimensional space, one thing we can do is discover the nearest neighbors of any given fund.
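The nearest-neighbor query itself is just cosine similarity in the embedding space. A self-contained sketch with toy, hand-made vectors (the real vectors come from the trained model):

```python
import math

# Invented 3-d embeddings standing in for the learned 10-d vectors.
embeddings = {
    "Baseline":   [0.9, 0.1, 0.0],
    "FirstRound": [0.8, 0.2, 0.1],
    "Benchmark":  [0.1, 0.9, 0.2],
    "KPCB":       [0.2, 0.8, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(name, k=2):
    # Rank every other fund by cosine similarity to the query fund.
    query = embeddings[name]
    scored = sorted(
        ((cosine(query, v), other) for other, v in embeddings.items() if other != name),
        reverse=True,
    )
    return [other for _, other in scored[:k]]
```

With these toy vectors, `nearest("Benchmark")` returns the funds whose vectors point in the most similar direction, which is exactly what Tensorboard’s nearest-neighbor panel shows for the real model.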

Baseline Ventures

Similar firms to Baseline Ventures

The first investor in StitchFix was Baseline Ventures. They frequently do Series A and later investments, but they also have a very substantial set of seed round investments. When we look for their nearest neighbors we find other firms with very similar profiles. The two slight surprises there might be Union Square Ventures (USV) and Khosla Ventures, but looking at the data it seems they invest at the seed stage very frequently too. Interestingly, in a recent Full Ratchet podcast, Nick Moran interviewed USV’s Rebecca Kaden, who characterized USV’s approach as putting small amounts of money in at the seed stage and then following on. That interview from early 2018 confirms a pattern we can see in the 2015 data, a nice confirmation that this pattern-detection technique is working correctly.

Benchmark Capital

The other lead investor in StitchFix was Benchmark, who entered at Series B with a $12M investment. Benchmark are one of Silicon Valley’s major VC firms, with investments in companies like Twitter, Dropbox, Instagram and Snapchat.

Similar firms to Benchmark Capital

Looking at Benchmark’s nearest neighbors, we see the other huge, well-known funds: Kleiner Perkins, Accel and Lightspeed are all there. A16Z isn’t… but this data finishes in 2015.

From these simple examples we can see how the model has learned to characterize funds.

Word2Vec can do more advanced things too: it understands relationships between words, so moving from England to London in the vector space is a similar direction and magnitude to moving from France to Paris. I experimented to see if this was possible in the investment fund vector space, but it doesn’t seem to work. I suspect the amount of data is insufficient at this point.
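The analogy mechanism itself is plain vector arithmetic. Here is a toy sketch with hand-built 2-d vectors chosen so the geometry works out (real Word2Vec vectors are learned, and as noted above the arithmetic only works when the data supports it):

```python
# vec("London") - vec("England") + vec("France") should land near
# vec("Paris") if the "capital-of" direction is consistent.
vectors = {
    "England": [1.0, 0.0],
    "London":  [1.0, 1.0],
    "France":  [2.0, 0.0],
    "Paris":   [2.0, 1.0],
    "Berlin":  [3.0, 1.0],
}

def analogy(a, b, c):
    # Compute b - a + c, then return the nearest remaining vector.
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]

    def sqdist(v):
        return sum((x - t) ** 2 for x, t in zip(v, target))

    candidates = {k: v for k, v in vectors.items() if k not in (a, b, c)}
    return min(candidates, key=lambda k: sqdist(candidates[k]))
```

`analogy("England", "London", "France")` gives `"Paris"` with these vectors; in the fund space the analogous arithmetic didn’t produce meaningful answers.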

So what is all this for? I’m currently using it to show people how machine learning models can learn from sequences of events, and how that model can then be used to characterize data within a system. It’s true that other techniques can derive these insights equally well, but I like using this to show the general applicability of many NLP-derived techniques.

I do think this approach has some utility on its own. One could use it to build predictive models for ICOs, for example, by treating the publicity campaign as a sequence of events (classified by impact, perhaps) and then measuring similarities to previous ICOs. Outside the financial space, I saw a great presentation by Casey Stella about using a similar technique on hospital admissions. Characterizing educational outcomes based on subjects taken, or pay based on work history, are other things I suspect could work well.

Perhaps someone has access to Crunchbase data past 2015 and would like to see how it has changed things?

The possibilities are endless.

Interested in talking more about this? I’m on Twitter as @nlothian, or reach out via email.

Footnotes

[1] Vincent Vanhoucke actually said:

I think people are finally getting that autoencoding is a Bad Idea, and that the difference between unsupervised learning that works (e.g. language models) and unsupervised learning that doesn’t is generally about predicting the causal future (next word, next frame) instead of the present (autoencoding).

and

Take NLP for example: the most basic form of autoencoding in that space is linear bottleneck representations like LSA and LDA, and those are being completely displaced by Word2Vec and the like, which are still linear but which use context as the supervisory signal. In acoustic modeling, we spent a lot of time trying to weigh the benefits of autoencoding audio representations to model signals, and all of that is being destroyed by LSTMs, which, again, use causal prediction as the supervisory signal. Even Yann LeCun has amended his ‘cherry vs cake’ statement to no longer be about unsupervised learning, but about predictive learning. That’s essentially the same message. Autoencoders bad. Future-self predictors good.

I think this is a very deep insight.

[2] Word2Vec was popularized in the 2013 paper by Mikolov et al. I think the best explanation is on Adrian Colyer’s amazing “The Morning Paper” blog. See The amazing power of word vectors.

[3] Those familiar with Word2Vec will note that much higher dimensionality is normally used, with 300 dimensions being the most common setting. That is needed because language is much, much richer than financial investment data, and so many more aspects of a word have to be recorded. For example, somehow the “sit-ability” of mats, beds, chairs and blankets (i.e., that cats can sit on them) has to be stored, as well as much more common attributes.

Nick Lothian writing about AI and Machine Learning