A VC checklist for technical due diligence on AI startups

JEFFREY LIM
PRORATA
Mar 12, 2023



As a venture capitalist, I meet founders building ML-based services and products who hope theirs will be the next AlphaGo or ChatGPT. My job is to see clearly what their moat is (against the big tech giants and the top-tier researchers around the world) by reading their decks and the data room (mostly financial documents). Founders rarely give VCs access to their code, data sets, or trained models, so I have to ask questions (and keep asking until I get answers) to understand their technology, their data, and what they are trying to solve.

Here are the key questions I ask:

Data

Do they have a data moat? Do they have a proprietary data set that no other players have?

💡 ECG data from patients with heart conditions, or data from Higgs boson events produced in proton collisions at the LHC, would be examples of a data moat, to name a few.

Do they collect data in a cost-efficient way?

💡 A commercial service won’t make sense if it is not economical to collect and process the data, though it may still be worth doing for academic purposes. Note that pre-processing and cleaning data can be expensive: ask how they conduct clean-up, pre-processing, feature engineering, dimensionality reduction, tagging, labelling, and transformation so that the data is ready for the model to learn from.
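
To make the question concrete, here is a minimal sketch of what such a pre-processing pipeline might look like for tabular data, assuming pandas and scikit-learn; the file name and column names are illustrative assumptions, not any startup’s actual schema.

```python
# A minimal pre-processing sketch: clean-up, imputation, scaling, encoding,
# and dimensionality reduction for a hypothetical tabular data set.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("raw_data.csv").drop_duplicates()   # illustrative raw dump

numeric_cols = ["age", "heart_rate"]                 # illustrative features
categorical_cols = ["device_type"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False),
     categorical_cols),
])

# Reduce dimensionality after encoding so the model sees a compact input.
pipeline = Pipeline([("prep", preprocess), ("pca", PCA(n_components=2))])
X = pipeline.fit_transform(df)
print(X.shape)   # rows x 2 components, ready for training
```

Each step above is a line item in the startup’s cost structure; the point of the question is to find out which steps are automated and which require expensive human labelling.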

How big is the data? Do they have enough data to launch a service?

💡 If they’re building a product using NLP, they’ll need a large-scale corpus of language data. If they’re using IMDb to analyze movies, for example, there are only about 7.9M titles listed on IMDb. If they’re trying to identify rare cancers in children, an even smaller data set will probably be available.

Computing

How is computation done?

How many GPUs (or TPUs) are required? Does the computation itself act as a moat against competitors?

How do they pay for the computing cost? Is the cost of computing reasonable and sustainable?
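
A back-of-the-envelope estimate helps sanity-check the answers. The sketch below is illustrative only; the GPU count, run length, hourly price, and retraining cadence are assumptions to be replaced with the startup’s own figures.

```python
# Rough training-cost estimate; every number here is an assumption.
num_gpus = 64                # assumed cluster size
hours_per_run = 24 * 14      # assumed two-week training run
price_per_gpu_hour = 2.50    # assumed cloud price (USD) per GPU-hour

cost_per_run = num_gpus * hours_per_run * price_per_gpu_hour
runs_per_year = 6            # assumed retraining cadence

print(f"Cost per training run: ${cost_per_run:,.0f}")
print(f"Annual training cost:  ${cost_per_run * runs_per_year:,.0f}")
```

With these assumed numbers, a single run costs about $53,760 and a year of retraining about $322,560, which is the kind of figure to compare against the startup’s gross margin.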

Model

What ML model do they use?

Is the model open source or proprietary?

💡 Most of the time, startups take an open-source model (from OpenAI, Google, Meta, Microsoft, or academia) and may fine-tune it to better suit their own purpose.
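
As a rough illustration, fine-tuning an open-source model often looks like the sketch below, assuming the Hugging Face transformers and datasets libraries; the base model, the CSV files, and the task (binary text classification) are hypothetical placeholders.

```python
# A minimal fine-tuning sketch on a hypothetical labelled text data set.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"   # small open-source base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Hypothetical CSV files with "text" and "label" columns owned by the startup.
data = load_dataset("csv", data_files={"train": "train.csv",
                                       "eval": "eval.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["eval"],
)
trainer.train()
```

The follow-up question is how much of the startup’s value lives in this fine-tuning step versus in the base model that anyone can download.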

Is the service critically dependent on the model?

Competency

Where do they compete?

Infrastructure (the data cloud and computing layer)? The model and training layer? The middle layer (MLOps, optimization)? B2B or B2C?

What is their core competency? Data moat, workflow, or user engagement?

Do they have a lock-in strategy to retain their users? (UX arbitrage is NOT a core competency.)

People

Does the team have core competency in the domain and ML?

💡 The team should include ML experts AND a management team capable of transforming the technology into a lucrative business.

Is the team balanced between academic and business development?

💡 In some cases an academically oriented team is acceptable (example: pharmaceutical research to develop new drugs using ML); at other times a more business-oriented team is the better fit (example: building a social networking site with NLP/ML features).

Ethics

Is the data collected in an ethical and legal way?

💡 Example: collecting healthcare data in an illegal way is NOT acceptable.

Is the data itself legal?

Is the data unbiased?

💡 Examine the startup’s data to determine whether it covers the necessary ethnicities, genders, ages, political views, etc. Test whether the model produces biased answers by asking it challenging questions.
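
A first pass on the data side can be as simple as the sketch below, assuming a pandas DataFrame with hypothetical demographic columns; in practice this complements, rather than replaces, testing the model with challenging prompts.

```python
# Quick distribution check over hypothetical demographic columns.
import pandas as pd

df = pd.read_csv("training_data.csv")          # illustrative file name

# A heavily skewed share for any group is a red flag: the model may
# underperform or behave differently for under-represented groups.
for column in ["gender", "ethnicity"]:
    print(df[column].value_counts(normalize=True).round(3))

# Bucket ages and check that no bucket is nearly empty.
age_buckets = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 120])
print(age_buckets.value_counts(normalize=True).round(3))
```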

Example

LLaMA by Meta (blog, GitHub model card)

  • says it is open source and intended for non-commercial use.
  • targets specific users (researchers) by making the model smaller but training it on more tokens (pieces of words), which makes it easier to retrain and fine-tune for specific potential product use cases.
  • is an auto-regressive language model based on the transformer architecture, and comes in four sizes: 7B, 13B, 33B, and 65B parameters (see the generation sketch after this list).
  • requires far less computing power, and therefore less energy consumption.
  • addresses the risks of bias, toxic output, and hallucination in large language models.
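
For readers who want to see what “auto-regressive” means in practice, here is a minimal generation sketch using the Hugging Face transformers library. LLaMA weights are gated, so GPT-2 is used here as a freely available stand-in; the decoding idea is the same.

```python
# Auto-regressive generation: the model predicts one token at a time,
# each conditioned on everything generated so far.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A key question in AI due diligence is"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=40,
                            do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```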
