10 ways to leverage LLMs as a Data-Driven VC

Abel Samot
Red River West
14 min read · Sep 18, 2023


Since my last articles on Data-Driven VC, OpenAI released ChatGPT, creating an unprecedented wave of interest in all AI-related topics.

As most of the data available in the VC world is textual and unstructured, at Red River West we have been leveraging transformers for multiple years (with BERT or GPT-3). But in the past year, everything accelerated, and since we already had a robust architecture in place, we decided to take a step back and deep dive into what this new technology wave could mean for data-driven VCs.

Fast forward, after months of testing different ideas, approaches, algorithms, and products, we have accumulated a lot of knowledge on the subject.

I wrote this article with my colleague Olivier Huez in order to provide a “go-to place” for VCs interested in using LLMs and generative AI. I’ll share with you the resources, learnings, and ideas we’ve accumulated, which have allowed us to considerably increase our productivity, source a lot of high-quality deals, and much more.

I’ll begin by defining some terms & giving you an architecture to build your tools, but if you want to look at the 10 examples of what you can do with LLMs, feel free to jump straight to the second part of the article 😉

If you are still a beginner at using data to improve VC workflows, or want to know more about the basics before diving into these more elaborate techniques, be sure to read my previous articles on the matter.

Ready?

Let’s dive into it!

The components of an AI app based on LLMs

I’ll assume that you’ve all used ChatGPT and have all heard about LLMs, so you know how they work. But have you heard about the other bricks of most generative AI applications?

I’ll begin by defining them, explaining how they can be used by data-driven VCs, and giving some examples of tools and libraries to use for each brick.

Data collection

The first step in building any AI application is getting the data that feeds it. As VCs, you can either use native APIs from startup databases like Crunchbase, Dealroom, or harmonic.ai, or scrape data from any startup website, LinkedIn, etc. The more textual data you have about a startup, the better (LLMs are… well… about language, so we’re talking mostly about text here).

Data workflows

Now that you have your data, you will need to ingest it and connect your different services, from your database to your CRM, passing through your different AI modules, etc.

To do so, you can build your data pipeline with Airflow, use the 300+ data connectors of Airbyte to process your data, or, for less technical people, use Zapier or n8n (which I prefer, as its code is open source).

Embeddings

Vector embeddings are a central aspect of most NLP algorithms including LLMs. Embeddings are a way of representing words, sentences, or even entire documents as vectors in a high-dimensional space. The idea is that semantically similar items will be closer in this vector space, while dissimilar items will be farther apart.

They can be used, for example, to represent the description of each company in your dataset as a vector. You will then be able to calculate the similarity between those descriptions and spot startups that might be competitors.
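To make this concrete, here is a minimal sketch of that idea using the open-source sentence-transformers library and scikit-learn. The model name and the company descriptions are just placeholders; any embedding model (including OpenAI’s or Cohere’s) would follow the same pattern.

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical company descriptions -- replace with your own dataset
descriptions = [
    "API-first platform for carbon accounting and ESG reporting",
    "SaaS tool helping enterprises measure and reduce their CO2 footprint",
    "Marketplace connecting restaurants with local food producers",
]

# A small open-source embedding model, used here only as an example
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(descriptions)

# Pairwise cosine similarity: higher values mean more semantically similar descriptions
print(cosine_similarity(vectors).round(2))  # the two carbon-related companies should score highest together
```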

We tested more than 6 embedding models at Red River West for our different use cases. You can find the best open-source models & their performance in this embedding leaderboard. As for closed-source ones, we loved the OpenAI & Cohere models, which both have their pros and cons.

Vector Database

You can see vector databases as databases made for AI & LLM applications. They allow you to store embeddings and use them easily through multiple approaches, like question answering or similarity search, which I will explain a bit further down.

Vector databases are what allow LLM applications to have memory. As a data-driven VC, you could use one to store the content of a deck and then query that content with natural-language questions.

3 clear leaders have emerged among vector databases: Pinecone, Weaviate, and Qdrant. They all have their pros and cons, and all are optimized to scale. But as a first step, you can also use pgvector if you already have a PostgreSQL database.
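As an illustration, here is a minimal sketch with the qdrant-client Python library running in its in-memory mode. The collection name, payloads, and descriptions are made up; Pinecone, Weaviate, or pgvector would follow the same store-then-search logic.

```python
# pip install qdrant-client sentence-transformers
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # produces 384-dimensional embeddings
client = QdrantClient(":memory:")                    # local in-memory instance, fine for experimenting

# Create a collection sized for the embedding model
client.recreate_collection(
    collection_name="startups",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Store one embedding per company, keeping metadata as a payload
companies = [
    {"id": 1, "name": "CarbonCo", "text": "Carbon accounting platform for mid-size companies"},
    {"id": 2, "name": "FreshFleet", "text": "Marketplace connecting restaurants with local producers"},
]
client.upsert(
    collection_name="startups",
    points=[
        PointStruct(id=c["id"], vector=embedder.encode(c["text"]).tolist(), payload={"name": c["name"]})
        for c in companies
    ],
)

# Similarity search: the companies closest to a free-text query
query = embedder.encode("ESG and CO2 reporting software").tolist()
for hit in client.search(collection_name="startups", query_vector=query, limit=2):
    print(hit.payload["name"], round(hit.score, 3))
```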

Foundational Models

When I talk about foundational models, I’m talking about the models behind the well-known ChatGPT. Of course, OpenAI has multiple models, like GPT-4 (quite expensive) or GPT-3.5, that are available via an API.

But you can also use open-source models, which you can find on Hugging Face, like Llama 2 or Falcon, that are getting better and better. If you want to play with these models in Python, a good idea could be to use a library called LangChain, which is supposed to simplify their deployment, integration, etc. (In our case, even though we were early fans of LangChain, we dropped it because it became too complex to handle for the value it created.)

To improve these models for specific use cases, you can use different methods:

  • The simplest one is few-shot learning. It consists of writing a few examples of the output you want directly in your prompt (or, if you are using the OpenAI API, by using function calling). It’s the first step to significantly improve your results, and for a lot of use cases it’s often enough (see the sketch after this list).
  • The other approach, which can be a bit more complex, is fine-tuning: further training an already pre-trained model on specific data for a specific purpose. To do so, you will need a large set of labeled data and either an open-source LLM like Llama or the OpenAI API for GPT-3.5. Fine-tuning for specific tasks such as industry classification can make the results of your tools 10x better.
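Here is a minimal few-shot sketch using the OpenAI Python client (v1-style syntax; older versions of the library use openai.ChatCompletion instead). The labels and example descriptions are invented for illustration.

```python
# pip install openai   (expects OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

few_shot = [
    # A couple of hand-written examples showing the model the output format we want
    {"role": "user", "content": "Company: 'Cloud platform automating payroll for SMBs'"},
    {"role": "assistant", "content": "Industry: HR Tech"},
    {"role": "user", "content": "Company: 'At-home blood testing kits with doctor follow-up'"},
    {"role": "assistant", "content": "Industry: HealthTech"},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You classify startups into a single industry label."},
        *few_shot,
        {"role": "user", "content": "Company: 'Fraud detection API for online payments'"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)  # e.g. "Industry: FinTech"
```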

LLMOps tools

The last brick I want to talk about is LLMOps, which allows for the efficient deployment, monitoring, and maintenance of large language models. Without this brick, you might have a very unpleasant surprise when you receive the OpenAI bill at the end of the month 😉

One of the tools we use for that is Helicone.

The different types of apps that you can build leveraging LLMs, embeddings, and vector databases

Now that you know about the different bricks that will be used for your app, it’s time to explain what types of approaches you could use (before I give you some concrete examples in the last part of this article).

Similarity search

Description: Similarity search focuses on retrieving items (like documents or profiles) that are semantically similar to a given input. You can use it to find competitors for example.

Implementation: You can efficiently run similarity searches by converting textual data into embeddings and storing them in vector databases. Given a query vector, the system retrieves the nearest vectors, signifying semantic similarity.

Question-answering

Description: Question-answering systems are designed to provide precise answers to user queries. Leveraging LLMs, these systems can understand the context and semantics of a query, search a vast dataset, and retrieve the most relevant answer. For example, you can use them to ask questions about a PDF document, like a startup’s deck, and easily find specific information.

Implementation: You can simply provide any LLM with the text you want to ask questions about (e.g. a startup description) and put your question in the prompt.

The issue is that this greatly increases the price of every request (as you put a lot of tokens in your prompts), and you can quickly reach the context size limit, preventing you from using this approach on long texts.

To implement question-answering on top of an investment deck or the entire text of a website, you want:

  1. To embed the underlying text in a vector database (in chunks of text that you embed individually)
  2. To search that database by embedding your question, asked in free-text format, and finding the n closest records in terms of similarity
  3. To feed these records to the LLM alongside your question and get an appropriate answer (a minimal end-to-end sketch follows this list)
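Putting the three steps together, here is a minimal retrieval-augmented sketch. It assumes the deck has already been chunked and embedded into a hypothetical “deck_chunks” Qdrant collection (step 1), and reuses the same libraries as the earlier sketches.

```python
# pip install sentence-transformers qdrant-client openai
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(url="http://localhost:6333")   # wherever your populated "deck_chunks" collection lives
llm = OpenAI()

question = "What is the startup's revenue model?"

# Step 2: embed the question and retrieve the most similar chunks of the deck
hits = qdrant.search(
    collection_name="deck_chunks",
    query_vector=embedder.encode(question).tolist(),
    limit=3,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)

# Step 3: feed the retrieved chunks to the LLM alongside the question
answer = llm.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context. Say 'not found' otherwise."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0,
)
print(answer.choices[0].message.content)
```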

Classification

Description: Classification involves categorizing items into predefined classes or labels based on their features. It could be used to automatically label startups into categories like “Early Stage” and “Growth Stage”, or by industry sectors like “FinTech” or “HealthTech”.

Implementation: Here are 2 different ways to implement classification:

  1. The simplest path is to train an LLM on labeled examples to perform classification (using few-shot learning or fine-tuning) and let it classify your data
  2. You can also embed your data (e.g. a startup description) and use the distances between the resulting vector embeddings as features of a classification algorithm like a decision tree, a logistic regression, etc. (see the sketch after this list)
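As a sketch of the second option, here is a common variant that feeds the embeddings themselves into a scikit-learn classifier. The descriptions and stage labels are toy data; in practice you would want far more labeled examples.

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy labeled dataset: description -> stage label (replace with your own labeled data)
descriptions = [
    "Two founders prototyping an AI note-taking app, pre-revenue",
    "Pre-seed team building a crypto wallet MVP",
    "Series C logistics platform with 400 employees across Europe",
    "Scale-up selling HR software to 2,000 enterprise customers",
]
labels = ["Early Stage", "Early Stage", "Growth Stage", "Growth Stage"]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(descriptions)          # the embeddings become the feature vectors

clf = LogisticRegression(max_iter=1000).fit(X, labels)

new_company = "Seed-stage startup with a beta product and 3 pilot customers"
print(clf.predict(embedder.encode([new_company])))  # -> ['Early Stage'], hopefully
```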

Clustering

Description: Unlike classification, clustering is an unsupervised technique that groups items based on their similarity, without predefined labels. It could also be used to map companies into industries, but this time by letting the algorithm find the patterns and industry groups.

Implementation: Use embeddings to represent your data (e.g. some startup descriptions), then apply a clustering algorithm like K-means on top of those embeddings to find clusters in your data (see the sketch below).
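A minimal sketch of that pipeline, with made-up descriptions and a fixed number of clusters:

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

descriptions = [
    "Neobank offering business accounts for freelancers",
    "Payment orchestration API for online merchants",
    "Telehealth platform for dermatology consultations",
    "AI triage assistant for hospital emergency rooms",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(descriptions)

# Pick the number of clusters you expect (or tune it with the elbow method / silhouette score)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
for desc, cluster in zip(descriptions, kmeans.labels_):
    print(cluster, desc)   # fintech descriptions should land in one cluster, health in the other
```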

10 examples of how to use LLMs as a VC

🧐 Spotting startups amongst thousands of regular companies

One of the huge challenges for data-driven VCs is differentiating between a classic company (a bakery, a services company, etc.) and a startup (especially at seed or pre-seed).

Solution: Train a classifier and give it all the textual data you collected about a company (from a database like Crunchbase) so it can classify the company as a startup or not. You can also use a question-answering algorithm and ask it whether the company seems to be a startup based on your description of it.

🔍 Searching & filtering specific startups

Each fund has its own thesis, and even people working in generalist funds often want to search for startups in a certain industry, leveraging a particular business model, etc. When you use databases like Crunchbase to spot these companies, you can try to add filters based on the specific taxonomy of these databases to find startups complying with your thesis.

But most of the time, these taxonomies aren’t good enough, nor precise enough, to filter for features, specific clients, etc. Besides, if you invest at pre-seed and rely mostly on commercial registers & LinkedIn (because the companies you are searching for aren’t yet in these databases), you need a way to filter through all of those companies.

Solution: For me, you have 2 good options: a classification algorithm, or a combination of clustering & question-answering (on top of the company data stored in your vector database).

  • With a classification algorithm, you define all the different industries or business models you want your startups to be classified into, then train your model to classify them into these specific categories.
  • Another approach is to use clustering to find groups of companies, then use a question-answering model to give each cluster a name. This gives you an always-evolving, up-to-date taxonomy (but the clusters that emerge might not match the way you build your own investment thesis, so there is a tradeoff).

Building this kind of product could 10x the productivity of your analysts when they do market searches, deep dives & more.

🆚 Finding startup competitors

When you analyze a company, perform a market search, or just want to help your portfolio companies stand out from the competition, it’s very important to know about their competitors.

The problem is that it’s almost impossible to find exhaustive competitor databases for early-stage startups.

Solution:

  1. Collect a huge quantity of data about the companies you are targeting, using their website, articles, their G2 profile, etc.
  2. Then embed that data in your vector database and automatically run a similarity search whenever you want to know a company’s competitors. The results won’t be perfect from day one, but they will be quite good!

You can see how EQT Ventures built their quite advanced version of this tool in this paper, and also the simpler (and easier to implement) version that Andre Retterath proposed in his newsletter.

💰 Finding M&A targets for the companies in your portfolio

When your portfolio companies begin to grow, they often turn to M&A to continue their fast development.

As a data-driven VC, you can help them with that too!

Solution: Use the competitor-mapping algorithm we just talked about, and add filters by country or by size & number of employees to spot similar companies that might be good acquisition targets (see the sketch below).
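Here is a sketch of how that filtering could look with Qdrant, assuming a hypothetical “companies” collection whose payloads already contain a country and an employee count; the query description and the filter values are invented.

```python
# pip install qdrant-client sentence-transformers
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")   # assumes a populated "companies" collection

# Same similarity search as for competitors, restricted to plausible acquisition targets:
# here, hypothetically, German companies with at most 150 employees
query = embedder.encode("Warehouse robotics for e-commerce fulfilment").tolist()
hits = client.search(
    collection_name="companies",
    query_vector=query,
    query_filter=Filter(
        must=[
            FieldCondition(key="country", match=MatchValue(value="DE")),
            FieldCondition(key="employees", range=Range(lte=150)),
        ]
    ),
    limit=10,
)
for hit in hits:
    print(hit.payload["name"], round(hit.score, 3))
```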

💾 Extracting data from decks & websites

As VCs, we ingest a huge quantity of data every day from various sources to compare and analyze companies. Understanding what’s behind the marketing wording of a website, or comparing the specific metrics of a startup with those of the startups you saw 2 years ago, can be tedious. Unfortunately, we don’t have infinite memory 🥲

That’s why data extraction was one of the first use cases most VCs thought about when they heard about ChatGPT and its code interpreter.

How to do it?

  1. Use LangChain to extract the content of a PDF file or a website and transform it into a series of vector embeddings that you can save in your vector database (see the sketch after this list).
  2. Then ask specific questions about the company by running a question-answering algorithm on top of your database.
  3. Finally, save the extracted data in your CRM using your favorite workflow tool.
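A minimal sketch of step 1 with LangChain. The import paths shown are the classic pre-0.1 ones (recent releases moved loaders to the langchain_community package), and the file path is hypothetical.

```python
# pip install langchain pypdf
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the deck and cut it into overlapping chunks small enough to embed
loader = PyPDFLoader("startup_deck.pdf")          # hypothetical file path
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

# Each chunk can now be embedded and stored in your vector database (see the
# question-answering sketch earlier), then queried and the answers pushed to your CRM.
print(len(chunks), chunks[0].page_content[:200])
```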

Be careful: the results provided by your algorithms can be wrong because of hallucinations. That’s why we added a way to see the source of each answer rather than the answer alone.

🌍 Evaluating the ESG policy of a startup

At Red River West, we have strong ESG convictions: we believe that investing in companies that put ESG matters at the core of their operations is not only the right thing to do, but also that these companies will perform better.
Companies that have best-in-class governance, take care of their employees, promote diversity, and monitor and reduce their carbon footprint are more likely to attract the best talent and retain employees better. They will improve lead conversion and customer retention too.

The startups we invest in typically aren’t required to report on ESG (and even the reports of the ones that do aren’t always reliable), but despair not! There are a few ways to spot startups with a great ESG policy.

Solution:

1. Embed companies’ social media posts, as well as their mentions in other social media posts, into your vector database.

2. Then use a question-answering LLM to ask what ESG policies they promote and whether they focus on a particular angle. You can also detect whether they refer to general concepts or to practical implementations.

This, however, would not necessarily mean that they walk the talk (it’s very difficult for LLMs to spot greenwashing). There are a few smart ways to limit this risk, but they go beyond the scope of this article, and we’re keeping some of what we work on for ourselves.

⚙️ Finding any information in your internal tools

At Red River West, we use multiple tools internally, from Notion to Slack, through Ramp or our CRM. The information and knowledge we create is therefore scattered across all of those tools, which means that gathering all our internal knowledge about a particular topic can become quite a headache.

So we decided to build a tool plugged into all of our different internal data sources, working like a search bar and allowing us to gather all the intel we have about a particular topic (e.g. “carbon accounting”).

How we did it:

  1. First, build a webhook plugged into all of your different tools to capture every new piece of information that is added, and embed it in a vector database (to embed images, you can use open-source multimodal models such as Multimodal GPT, as OpenAI’s models don’t understand images yet). A minimal sketch of this ingestion step follows this list.
  2. Then build a search bar plugged into this database so you can perform a similarity search between a query and all the content you have (to make the tool more robust, you can also train an algorithm to recognize the type of search you are doing: a market search, a search about a company, etc.)
  3. For every piece of content retrieved, provide the source (e.g. Slack) and a link to the context where the information was found (e.g. the link to a deck in Dropbox).
  4. If you want to go further, you can even feed these results to a question-answering algorithm and ask it to summarize the information available on the topic.
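Here is a minimal, hypothetical sketch of the ingestion side (step 1), as a Flask webhook. The payload fields and collection name are made up, and each tool (Slack, Notion, your CRM…) would need its own small adapter to post content in this shape.

```python
# pip install flask sentence-transformers qdrant-client
import uuid
from flask import Flask, request
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

app = Flask(__name__)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient(":memory:")   # in production, point this at your persistent instance

# Create the collection once at startup (384 dims for the model above)
qdrant.recreate_collection(
    collection_name="internal_knowledge",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

@app.route("/webhook", methods=["POST"])
def ingest():
    # Hypothetical payload: {"text": "...", "source": "slack", "url": "https://..."}
    payload = request.get_json()
    vector = embedder.encode(payload["text"]).tolist()
    qdrant.upsert(
        collection_name="internal_knowledge",
        points=[PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload={"source": payload["source"], "url": payload["url"], "text": payload["text"]},
        )],
    )
    return {"status": "ok"}

if __name__ == "__main__":
    app.run(port=5000)
```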

💸 Finding the best VCs for the next round

Helping companies raise their next round is one of the areas where, as VCs, we can all be very helpful. Of course, most VCs have already developed relationships with later-stage investors, but you can’t cover them all. There might be VCs out there that are an even better fit for your portfolio companies thanks to their specialization or geographical presence. But how do you find them?

Solution:

  1. You can use Crunchbase or Dealroom to get a list of VCs, their descriptions, and their website links. With these links, you can also gather more textual data from their websites.
  2. Once you have that data, you can use a question-answering LLM and ask it to extract the stage of investment, the geography, and the sectors each VC fund focuses on (see the sketch after this list).
  3. Then, embed that data into a vector database.
  4. Finally, you just have to run a similarity search in this database, using your research prompt as the query, e.g. “I’m looking for VCs investing in Series B hardware companies”.
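For step 2, here is a minimal extraction sketch with the OpenAI client, asking for JSON you can then embed and store. The fund description is invented, and in practice you may need to validate or retry when the model wraps the JSON in prose.

```python
# pip install openai
import json
from openai import OpenAI

client = OpenAI()

# e.g. a description pulled from Crunchbase or the fund's website (made up here)
fund_text = "Acme Capital backs Series A and B hardware and deep-tech companies across Europe."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Extract investment criteria as JSON with keys: stages, geographies, sectors. Return JSON only."},
        {"role": "user", "content": fund_text},
    ],
    temperature=0,
)
criteria = json.loads(response.choices[0].message.content)   # may need a retry if the model adds extra text
print(criteria)
```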

A similar approach was used to build this tool: an AI-powered VC Sheet.

💌 Automating startup reach out

Like salespeople, VCs often have to reach out to hundreds of companies using cold e-mails. Writing these e-mails can be quite time-consuming.

At Red River West, we chose to keep writing these emails ourselves, because we believe that deeply personalizing them and using our knowledge of the sector has a great effect not only on the response rate but also on our relationships with founders. But since we invest at Series B, we have fewer startups to contact, so it’s still manageable, which may not be the case for every VC firm.

Solution: This one is quite simple. You can provide any question-answering LLM with 2 things:

  • Examples of e-mails you previously sent to companies you reached out to
  • A description of the startup that you are trying to contact (e.g.: you can use the Crunchbase description).

Then ask the model to write a personalized e-mail for you (see the sketch below). Beware: most of the time it’s not perfect, and it’s never great for a founder to receive messages written by robots.
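If you do want to automate it, a minimal sketch could look like this; the file of past e-mails and the startup description are placeholders, and the output should always be reviewed and personalized before sending.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()

past_emails = open("past_outreach_examples.txt").read()       # a few e-mails you wrote yourself
startup_description = "Acme Robotics builds autonomous forklifts for mid-sized warehouses."  # e.g. from Crunchbase

draft = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You draft short, personal cold e-mails from a VC to a founder, matching the tone of the examples."},
        {"role": "user", "content": f"Examples of my past e-mails:\n{past_emails}\n\nStartup to contact:\n{startup_description}\n\nWrite a first draft."},
    ],
)
print(draft.choices[0].message.content)   # always review before sending
```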

⚙️ Getting to know the tech stack of any company

Since VCs invest in tech companies, it’s crucial to know more about the underlying technology startups used to build their products. Knowing more about it also lets you compare the tech stacks of different competitors and make a better-informed investment decision. But how do you do it?

Solution:

  1. First (as always 😉), collect textual data on the company from your favorite startup database and from technical job posts, which often mention the tech stack used by the company.
  2. Then ask a question-answering algorithm to extract the entire tech stack, and there you go, it’s as easy as that!

Conclusion

In this article, I only gave you 10 examples, but there are a lot of other use cases you could come up with by scratching your head, like spotting tech trends, finding potential clients for your portfolio startups, extracting the features of company products, generating investment memos, and much more.

I also only scratched the surface with the solutions I provided (we have to keep a little bit of our secret sauce 😌). But you can go much deeper by adding extra steps to prevent hallucinations, other data sources, other classification methods, etc.

You can also group all of these different approaches as well as more conventional productivity use cases into a single agent that would be an AI-powered VC assistant. If well done, it could 10x the productivity of each member of your team.

At Red River West, we’ve implemented many of these use cases, often by leveraging additional libraries or tools and testing multiple approaches! I hope this article will help you shape your own approach :)

We would be happy to discuss all of that and other use cases you might have found, so don’t hesitate to contact us at: abel@redriverwest.com or olivier@redriverwest.com

Special thanks to all the people who helped us build these tools, including Olivier, Maxime, Daniel, Louis-Alexandre, and many others.
