How VCs & LPs Apply Data Science to Make Informed Investment Decisions

A summary of discussions with data-driven VCs and LPs about using data in their daily jobs.

Published in VCpreneur · 10 min read · Sep 4, 2019

Why Bother?

Critics are quick to dismiss the use of data in venture capital. They say VC is more art than science, and that you can’t depend on an algorithm to decide whether to invest. But the question is: can you afford to ignore the science part of it?

A data team is becoming an essential part of many VC firms. Subscribing to a data feed or a data analysis platform that everyone else has access to doesn’t give firms an edge. They need more!

Over 100 VC firms are spending millions of dollars every year on data infrastructure and talent. Some have hired in-house data scientists and data engineers, and some have hired contractors or consultants. Some have built secretive internal software tools with cool (or weird) names, and some have subscribed to or purchased such tools from emerging service providers.

Today’s VCs use machine learning and AI models for due diligence, for sourcing, scoring, and supporting startups, and for raising capital. They also use them to explore markets, uncover trends, create market maps, and gather market intelligence.

None of this is meant to replace VCs in deciding on investment opportunities. Rather, it is meant to augment them: to help them quantify opportunities and avoid bias as much as possible.

Data science in venture is not about the very limited task of “invest or don’t invest”! It’s not a 0 or 1 outcome. It’s a collection of 0 to 1 outcomes that give you probabilities on multiple aspects of a deal, so that you can make truly informed decisions.

1. Collecting Data

To be truly data-driven, you need to collect as much data as you can from every available source.

The data you need to collect depends on the stage and vertical you’re focusing on, and collecting it on a daily basis adds a lot of value. The more data you collect and update regularly, the more powerful your data models will be.

CrunchBase, PitchBook, CB Insights, or MatterMark are a good start, but they don’t give you the detailed business and financial metrics you need. Only startups themselves can provide that data, typically in monthly updates. These platforms also don’t give you personal data on founders and key employees, customer satisfaction, social media discussions, market growth, and thousands of other data points you need.

The bigger the checks you offer, the more data you can demand from your portfolio companies. Also, the longer you stay in the business, or the more portfolio companies you have, the more data you can collect. If you run an accelerator with a fund that participates in follow-on rounds, you can collect much more data. This data will be a huge asset for you in the future.

You need to collect historical and present data on companies that have already exited or failed, new seed stage companies graduating from accelerators, new companies that are actively advertising or hiring, any startup of interest even if you’re not currently considering it, and of course, all companies that reach out to you.

You also need to collect data on LPs, other VC firms, and angel investors. This will help you in fundraising for your funds and for your portfolio companies and in co-investment deals. It will also help you in measuring your funds’ performance against the market.

Here is a sample of what you can collect. These sources can give you thousands of data points on founders, startups, LPs, and other VCs:

General startup funding data (CrunchBase, PitchBook, CB Insights, MatterMark, …) and VC market trends and insights reports
Detailed business and financial metrics collected from portfolio companies over time
Detailed market reports for every vertical of interest, including general market data and averages for key financial and business metrics (SaaS benchmarks, for example)
Public data on web and mobile app usage and rankings
Search and keyword usage data on specific problems/solutions, and on specific markets, industries, or trends
Government or academic data such as regulatory filings, patent registries, publications, and white papers
Social and professional content on startups and product reviews (Reddit, Quora, Twitter, ProductHunt, Youtube, Instagram, Startup and Tech News Websites, Financial News Websites, Blogs and Podcasts, …)
Startups’ public account activity and customer engagement metrics (Github/Gitlab for open-source contributions, LinkedIn, Twitter, Reddit, Quora, Facebook, Instagram, Medium, Youtube, SoundCloud, Blogs and Podcasts, …)
Startups’ hiring activity (LinkedIn, AngelList, Glassdoor, and other recruitment tools)
Founders’ and key employees’ backgrounds and social accounts (LinkedIn, AngelList, Github/Gitlab, Twitter, Reddit, Quora, Facebook, Instagram, Medium, Youtube, SoundCloud, Blogs and Podcasts, …)
Anonymized and aggregated startup metrics data sold by accelerators or seed-stage VC firms with hundreds of deals

This is a lot of data, and you will need a data engineer just to continuously collect, clean, and organize it into a usable database before a data scientist can start using it.
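To make the "clean and organize" step a bit more concrete, here is a minimal sketch of merging records about the same company that arrive from different sources. Everything in it is an assumption made for illustration: the `CompanyRecord` shape, the source names, and the choice of the website domain as the merge key. Real pipelines need fuzzier matching than this.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CompanyRecord:
    # Hypothetical record shape; real feeds will differ.
    name: str
    website: str
    source: str
    metrics: dict = field(default_factory=dict)


def domain_key(website: str) -> str:
    """Reduce a URL to a bare lowercase domain, used here as the merge key."""
    url = website.strip().lower()
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host


def merge_records(records):
    """Merge records sharing a domain. Earlier sources win; later ones fill gaps."""
    merged = {}
    for rec in records:
        key = domain_key(rec.website)
        entry = merged.setdefault(key, {"name": rec.name, "sources": [], "metrics": {}})
        entry["sources"].append(rec.source)
        for metric, value in rec.metrics.items():
            if value is not None:
                entry["metrics"].setdefault(metric, value)
    return merged


records = [
    CompanyRecord("Acme", "https://www.acme.io/about", "funding_db",
                  {"employees": 12, "raised": None}),
    CompanyRecord("ACME Inc.", "acme.io", "hiring_feed",
                  {"employees": 15, "raised": 2_000_000}),
]
database = merge_records(records)
```

Ordering the input by how much you trust each source is one simple way to decide which value wins when two sources disagree.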

2. Using Data

2.1. Due Diligence

The first and most straightforward use of collected data is due diligence on post-seed startups.

Most data-driven VCs send a data scientist or analyst to the startup’s office for a few days to get the raw data and conduct their own analysis. The goal is to find evidence of true growth or traction, to find evidence of a network effect, or just to make sure that the provided summaries are accurate.

To be able to make a decision, you need a benchmark. Whatever metrics you collected earlier on markets, verticals, or specific industries come in handy here as a basis for comparison.
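One simple way to compare against a benchmark is a percentile rank: where does this startup’s metric sit among comparable companies? The sketch below assumes you already have a list of the same metric for peers. The growth figures are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(value, benchmark):
    """Percent of benchmark values below `value`, counting ties as half."""
    ordered = sorted(benchmark)
    below = bisect_left(ordered, value)
    ties = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical monthly revenue growth rates for comparable companies.
peer_growth = [0.02, 0.04, 0.05, 0.07, 0.10]
startup_rank = percentile_rank(0.08, peer_growth)  # 80.0
```

A percentile is easier to discuss in a partner meeting than a raw number, but it is only as good as the peer set behind it.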

Some VC firms have gone the extra mile by creating, for example, a quantitative framework for assessing product-market fit. They perform a detailed analysis of growth metrics, run a cohort analysis, and then study how growth is distributed among cohorts.
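As a rough illustration of what a cohort analysis looks like in practice, the sketch below groups users by the month they were first active and tracks what share of each cohort is still active in later months. This is a generic retention table, not any particular firm’s framework, and the input format (pairs of user ID and month index) is an assumption.

```python
from collections import defaultdict


def cohort_retention(activity):
    """activity: iterable of (user_id, month) pairs, month as an int index.

    Returns {cohort_month: [share of cohort active at offset 0, 1, 2, ...]}.
    Expects at least one activity record.
    """
    months_by_user = defaultdict(set)
    for user, month in activity:
        months_by_user[user].add(month)

    last_month = max(m for months in months_by_user.values() for m in months)

    # A user's cohort is the first month in which they were active.
    cohorts = defaultdict(list)
    for months in months_by_user.values():
        cohorts[min(months)].append(months)

    table = {}
    for start, members in sorted(cohorts.items()):
        table[start] = [
            sum(1 for months in members if start + offset in months) / len(members)
            for offset in range(last_month - start + 1)
        ]
    return table


activity = [("a", 0), ("a", 1), ("a", 2), ("b", 0), ("b", 2),
            ("c", 1), ("c", 2), ("d", 1)]
retention = cohort_retention(activity)
# retention == {0: [1.0, 0.5, 1.0], 1: [1.0, 0.5]}
```

Comparing rows of this table shows whether newer cohorts retain better or worse than older ones, which is the kind of distribution the analysis is after.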

2.2. Exploring

Some VCs just play with the data, using supervised and unsupervised learning models to predict market trends or to create clusters of startups, other VC funds, or co-investors.

Combined with manual analysis and research, this helps answer questions like: How do I classify my fund’s portfolio companies to better estimate my IRR? Who’s focusing on what trends this year, and why? What are top-tier VCs focusing on this year? What’s the average seed valuation this year? What are the verticals with the highest number of seed deals this year?

This can get more specific with co-investor analysis, to answer questions like: How do I classify my co-investors to better understand their decisions? Who would be our best co-investors in this vertical? Who should we co-invest with, based on factors such as who else invested in their portfolio companies, how much was raised, at what valuations, and how quickly their companies grew?

Sometimes there are no specific questions to answer. Every Monday, the data scientist presents a new exploratory data analysis that piques the partners’ interest and prompts them to ask specific questions or to suggest new directions for further exploration: uncovering trends, creating market maps, or gathering market intelligence.

2.3. Sourcing

After you’ve played with the data for a while, you start seeing its hidden gems. Instead of being reactive and waiting for founders to reach out, you can analyze startups, predict when and how much they might raise, and go make the deal before it’s even there, or before it reaches your competitors.

This is the most common use of data science in venture nowadays. Almost every top tier VC firm has a sourcing or scouting algorithm.

Three years ago, I had a discussion with a partner at a top tier VC firm that was building a sourcing algorithm. The goal was: find hot startups that we don’t know about. The algorithm continuously crawls the web to find new company profiles and starts monitoring their activities, growth, news, user engagement, advertisements, and hiring. When it finds an interesting enough startup, it notifies the partners and suggests when to contact them, and what to offer.
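I never saw that firm’s code, so the following is only a toy sketch of the "notify when interesting enough" step, under my own assumptions: each monitored startup has a few time series of signals, each signal has a growth threshold, and a startup is flagged when enough signals cross their thresholds. The signal names and threshold values are invented.

```python
def growth_rate(series):
    """Relative change from the first to the last observation."""
    first, last = series[0], series[-1]
    return (last - first) / first if first else 0.0


def flag_startups(signals, thresholds, min_hits=2):
    """signals: {startup: {signal_name: [observations, oldest first]}}.

    Returns {startup: [signals that grew past their threshold]} for startups
    with at least `min_hits` such signals.
    """
    flagged = {}
    for startup, series_by_signal in signals.items():
        hits = sorted(
            name
            for name, series in series_by_signal.items()
            if name in thresholds
            and len(series) >= 2
            and growth_rate(series) >= thresholds[name]
        )
        if len(hits) >= min_hits:
            flagged[startup] = hits
    return flagged


# Hypothetical monitored signals and thresholds.
signals = {
    "alpha": {"headcount": [10, 16], "web_visits": [1000, 2500], "job_posts": [2, 2]},
    "beta": {"headcount": [40, 42], "web_visits": [5000, 5200]},
}
thresholds = {"headcount": 0.3, "web_visits": 1.0, "job_posts": 0.5}
to_review = flag_startups(signals, thresholds)
# to_review == {"alpha": ["headcount", "web_visits"]}
```

Returning the list of triggering signals, rather than a bare yes or no, gives the partner something to check before picking up the phone.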

2.4. Scoring

This might be the most complicated task in venture. But again, it’s not a 0 or 1 outcome. It’s a collection of 0 to 1 outcomes that give you probabilities on multiple aspects of a deal, so that you can make truly informed decisions.

At the core of it, data science is all about statistics and probabilities. It helps you answer questions like: What’s the probability for this startup to have a $X exit in the next Y years? What’s the probability for this startup to reach $X in revenue in Y years? What’s the probability for this startup to raise its next big round from a top-tier VC? What’s the probability for these three co-founders to stick together? Which of these startups will be raising funds next year, how much, and on what terms? How do we compare to other VC funds that started in the same year? And many other questions!

You list all important aspects of a deal for you and your fund, then you build a model to quantify the opportunity in each of the listed aspects. You can combine all those numbers into one score if needed, but having multiple scores for every startup helps partners continuously discuss what’s more important for them in each deal.
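Combining per-aspect scores into one number can be as simple as a weighted average. The sketch below assumes each aspect score is already a probability-like value between 0 and 1. The aspect names and weights are placeholders that each firm would choose for itself.

```python
def combined_score(aspect_scores, weights):
    """Weighted average of per-aspect scores, each expected in [0, 1]."""
    for aspect, score in aspect_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{aspect} score {score} is outside [0, 1]")
    total_weight = sum(weights[aspect] for aspect in aspect_scores)
    weighted = sum(score * weights[aspect] for aspect, score in aspect_scores.items())
    return weighted / total_weight


# Hypothetical aspects and weights for one deal.
deal_scores = {"team": 0.8, "market": 0.6, "traction": 0.4}
deal_weights = {"team": 2, "market": 1, "traction": 1}
overall = combined_score(deal_scores, deal_weights)  # about 0.65
```

Keeping `deal_scores` around alongside `overall` preserves the per-aspect view, so partners can still argue about which aspect matters most for a given deal.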

Aspects can vary from firm to firm, depending on stage, industry, fund size, fund thesis, fund metrics and term, and many other factors.

If you’ve collected data on startups’ funding history, you can explore things like: Who would most likely invest in the next round, how much, and at what valuation? What would be the most likely exit scenario and value? Is it an IPO or an acquisition? Who would most likely acquire this company?

If you’ve collected data on customer reviews, social media sentiment, employee reviews, and startups’ own published content and its engagement metrics, you can compare these data points to those of similar startups to get positive or negative signals to investigate further.

One firm asks for direct access to a startup’s cloud services months before investing, so that its scoring system can recommend when and how much to invest if it sees above-average growth in specific metrics.

Another firm has developed an algorithm to identify early-stage startups that will most likely go viral before they reach critical mass, sometimes as early as when they have a few thousand active users. Over a period of 2–6 weeks, the algorithm analyzes how customers interact with a given product, looking for specific behavioral patterns. Once it finds them, the firm decides to invest.

Pre-seed VC firms look at the founders’ and key team members’ educational backgrounds, startup vs. corporate employment experience, global vs. local work experience, previous entrepreneurial experience, age, managerial experience, and professional network.

I know a firm that scores pre-seed startups purely on founders’ personality traits, emotional intelligence, diversity, and equity distribution, regardless of the idea or the market.

2.5. Supporting

Imagine how such data teams at those VC firms can help new startups!

They help in fundraising for next rounds. They can recommend when to raise, from whom, how much to raise, on what terms, and for how long.

One firm uses its data platform to let portfolio companies compare their performance against a set of comparable companies, so they can see how well they are performing relative to the market across metrics like growth efficiency, customer retention, and operations.

Another firm has developed a software tool that aggregates data from many sources and lets its portfolio companies take advantage of the relationships the firm has with customers, talent, and investors.

Another firm is using an AI-based system for identifying and sourcing talent. It gives portfolio companies deep intelligence on nearly the entire talent ecosystem of the tech industry, including engineers, data scientists, product managers, designers, and business leaders. It ranks each person on dozens of quality dimensions, predicts in real time how likely they are to switch jobs, and even proactively pushes new candidates as they become available, to help portfolio companies recruit rising stars.

2.6. Fundraising

Data-driven VCs apply similar techniques to raise capital for their own funds from LPs.

They research LPs’ activities indirectly through data on other VC firms, or directly by collecting public data on funds of funds, pension funds, endowment funds, corporate funds, and other potential LPs.

They gather intelligence on how much capital specific LPs might have to deploy, who invested in which funds, when and how much, and they try to predict their next moves based on their history and on the performance of portfolio funds.

Moreover, some VCs are using predictive AI models and complex math to improve their funds’ performance in preparation for raising the next fund. For example, one firm calculates the optimal distribution of reserves for follow-on investments and plans capital calls accordingly. Another firm built a statistical model to optimize ticket sizes, valuations, and equity stakes, depending on the fund’s current performance, remaining capital, reserves, and term.

2.7. Data-Driven LPs

All of the above is also used by LPs, to a certain extent, to source and score VC fund managers. But instead of focusing on founders and startup metrics, they focus on VC partners and other team members. They also track a different set of fund metrics, such as DPI, RVPI, TVPI, and IRR.
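For readers less familiar with these fund metrics, here is a small worked sketch using the standard definitions: DPI is distributions divided by paid-in capital, RVPI is residual value divided by paid-in capital, TVPI is their sum, and IRR is the discount rate at which the net present value of the fund’s cash flows is zero. The numbers are made up, and the bisection solver assumes annual cash flows and a single sign change in NPV across the search range.

```python
def fund_multiples(paid_in, distributed, residual_value):
    """DPI, RVPI, and TVPI from cumulative fund figures."""
    dpi = distributed / paid_in
    rvpi = residual_value / paid_in
    return {"DPI": dpi, "RVPI": rvpi, "TVPI": dpi + rvpi}


def npv(rate, cash_flows):
    """Net present value of annual cash flows, the first one at year 0."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


def irr(cash_flows, low=-0.99, high=10.0, tol=1e-9):
    """Annual IRR found by bisection on NPV between `low` and `high`."""
    if npv(low, cash_flows) * npv(high, cash_flows) > 0:
        raise ValueError("IRR is not bracketed by the search range")
    for _ in range(200):
        mid = (low + high) / 2
        if npv(low, cash_flows) * npv(mid, cash_flows) <= 0:
            high = mid
        else:
            low = mid
        if high - low < tol:
            break
    return (low + high) / 2


# Hypothetical fund: 100 called, 60 distributed, 90 still held at current value.
multiples = fund_multiples(paid_in=100, distributed=60, residual_value=90)
# Hypothetical cash flows from the LP's side: pay 100 now, receive 121 in year 2.
fund_irr = irr([-100, 0, 121])  # about 0.10
```

For a fund that has not fully exited, a common simplification is to add the current residual value as a final positive cash flow before computing IRR. That is an assumption about how you value the remaining portfolio, not a realized return.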

They collect data on fund managers, such as who raised how much and when, who invested in what, when, and how much, and who invested in the same deals, at what stages and on what terms. Then they compare funds’ performance by vintage year and/or by focus.
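Comparing by vintage year simply means ranking each fund only against funds that started investing in the same year. A minimal sketch, with invented fund names and TVPI values:

```python
from collections import defaultdict


def rank_within_vintage(funds):
    """funds: list of (name, vintage_year, tvpi).

    Returns {name: (rank, number_of_peers)}, rank 1 being the highest TVPI
    among funds of the same vintage year.
    """
    by_vintage = defaultdict(list)
    for name, vintage, tvpi in funds:
        by_vintage[vintage].append((tvpi, name))

    ranks = {}
    for members in by_vintage.values():
        ordered = sorted(members, reverse=True)
        for position, (_, name) in enumerate(ordered, start=1):
            ranks[name] = (position, len(members))
    return ranks


# Hypothetical funds.
funds = [("A", 2014, 2.1), ("B", 2014, 1.4), ("C", 2014, 3.0), ("D", 2016, 1.2)]
vintage_ranks = rank_within_vintage(funds)
# vintage_ranks == {"C": (1, 3), "A": (2, 3), "B": (3, 3), "D": (1, 1)}
```

Fund D tops its vintage here only because it has no peers in the data, which is a reminder that a rank means little without a reasonable number of comparable funds.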

They also explore general trends in the VC industry, such as the number of new first-time VCs every year, the number of industry-specific funds, the number of stage-specific funds, market gaps, correlations between funds’ activities locally and globally, and many other insights.

At the end of the day, the data you collect and process over time is your trade secret in the venture business. You should have a clean, organized, and up-to-date database that includes everything you can get your hands on, and you should continuously create and update algorithms to help you make more informed decisions.