Hey VCs, Your Outdated AI Investment Strategy Will Cost You and the Ecosystem Dearly

Freedom Preetham · Published in The Simulacrum · 13 min read · Jul 3, 2024

Dear VCs, if you are still evaluating AI startups in scientific computing solely on the basis of large, high-quality, proprietary data moats, then either you have a very outdated hypothesis or you have outdated experts on your due diligence panel. Both are detrimental to your future and to the future of the AI ecosystem.

Why do you care so much about the size of an AI startup's proprietary data anyway? Shouldn't you measure startups on the proof of the pudding? That is, how close are they to experimental validation?

When I say scientific computing, I include Life Sciences & Drug Discovery (Genomics, Transcriptomics, Proteomics, Metabolomics), Material Design, Aerospace, Nuclear Fission/Fusion, Autonomous Systems, Robotics, Space Science, Finance (high-frequency trading, hedge funds), Insurance, and a whole gamut of domains where mathematical modeling, synthetic data, and simulation are crucial.

In fact, this extends to all AI startups, not just those in scientific computing. Most inputs and outputs can be modeled with Partial Differential Equations (PDEs) and their initial and boundary conditions, so that discrete readings become quantized samples of a continuous function that can be represented through operators. No AI startup will survive in the future if it has not embraced math modeling, synthetic data, and simulations.

The number of pitches I have delivered, participated in, or overheard where VCs ask questions as if it were still 2019 is astonishing.

Don’t get me wrong, this isn’t merely a VC oversight. It’s also a self-perpetuating cycle, fueled by outdated AI startups clinging to the “large proprietary data” narrative — often because their models are simply fine-tuned versions of existing Large Language Models (LLMs).

While this approach might yield short-term gains, it’s a dead end in the long run. The future of AI lies in developing reasoning and planning capabilities, a paradigm shift that many startups have yet to embrace. (But more on that in another article…)

Here is my assertion: Math, and not data, is the proprietary knowledge. Data is, and will continue to be, democratized. What will truly differentiate the wheat from the chaff is a profound understanding of domain constraints and the ability to learn and model those constraints efficiently.

How is data going to be democratized, you ask? Through data banks. For example, in biology, it will happen through large public datasets and equal access to data via pay-to-play models with BioBanks and aggregators. This process is already underway.

What is the bottom line? Simply put, it is ‘speed to scale’. Imagine that two startups, BigCorp and SmallDog, achieve the same performance, let’s say specificity in Gene Prediction. If BigCorp requires 100x more resources and 10x more time (due to their focus on building large proprietary datasets) compared to SmallDog, then I would place my bets on SmallDog.

This is not a narrative about ‘zero proprietary data’. Instead, we need to consider the appropriate quantity of high-fidelity proprietary data that is sufficient to complement mid-to-low-quality public datasets (which are non-proprietary). This differs from the ‘data moat’ narrative, in which large quantities of proprietary data are believed necessary to match the performance of a constrained-model approach.

“Ok, How Much Proprietary Data Is Truly Needed? Isn’t That a Data Moat?”

This is a good question to ask. Having fractional high-fidelity data is not a data moat at all. Often these fractions consist of imputations and synthetic data, along with a negligibly small number of high-fidelity measurements.

In Biology, take proteomics for example. Assume a protein sequence of length n = 100 (a protein with 100 amino acids):

  • Total sequence space: 20¹⁰⁰ ≈ 1.27 × 10¹³⁰.
  • Probability of stability: p ≈ 10⁻³⁰ (some estimates put it closer to 10⁻⁷⁷, which is far too aggressive).

Number of stable sequences: 20¹⁰⁰ × 10⁻³⁰ ≈ 1.27 × 10¹⁰⁰.

That's roughly 1 followed by 100 zeroes, just for a sequence length of 100.
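A quick back-of-the-envelope check of these numbers using log arithmetic (the stability probability p ≈ 10⁻³⁰ is the assumption stated above):

```python
import math

# Back-of-the-envelope check of the sequence-space arithmetic above.
n = 100                                   # protein length (amino acids)
alphabet = 20                             # standard amino acids
log10_space = n * math.log10(alphabet)    # log10 of 20^100  ->  ~130.1
log10_p_stable = -30                      # assumed probability that a random sequence is stable

print(f"total sequence space ~ 10^{log10_space:.1f}")                     # ~10^130
print(f"stable sequences     ~ 10^{log10_space + log10_p_stable:.1f}")    # ~10^100
```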

If you had to actually solve the problem of protein folding through ‘large, high-fidelity data moats’, this problem would have been a non-starter until you had 10% of the proteins for n = 100, which would have been 10⁹⁹ protein sequences. Even 0.000001% of those 10¹⁰⁰ stable sequences is still 10⁹² protein sequences. Good luck with those numbers.

AlphaFold3 solved this with 10⁵ proteins (177,000 to be exact) from public datasets, in addition to roughly 10⁶ synthetic and imputed structures and a few thousand high-fidelity proprietary data points. That is 2 × 10⁶, a couple of million sequences at most! THIS is the point underpinning the rest of this article.

There is an equivalent of an AlphaFold3 bound to happen at the genomic layer. The current amount of skepticism and negation in debates will not stop this from happening. Someone will figure out how to simulate the Gene Regulatory Network in silico using existing public datasets, synthetic data, and fractionally small proprietary data and imputations, along with physics-informed models.

This fractionally small proprietary data is not a data moat. It is the overall modeling, math, physics, and acute understanding of the domain constraints in biology applied during the AI modeling that will collectively form the moat.

“But Genomics and Gene Regulatory Networks Are Different From Physics”.

If this is your argument, then you may not have studied Physics or Math, and I forgive you :) Everything can be approximated in Math and Physics. Cellular functions are no exception.

I have a 6-part series on how to think about genomics from the point of view of Math here:

To stay competitive, it is imperative to shift our focus and adopt a more nuanced, forward-looking, and open-minded approach. Otherwise, we will be like crabs pulling down every voice and idea to normalize them to the old-school rhetoric that is so resounding and prevalent (especially in Biology).

Allow me to tease out some key advancements that should redefine the evaluation criteria:

Key Areas of Expertise

To stay ahead, you must seek experts with a deep understanding of the latest developments in these areas:

  • Multi-Fidelity Learning: Techniques that combine various data fidelities to optimize model performance and insight extraction.
  • Deep Operator Learning: Innovations that leverage PDEs and stochastic operators to advance scientific computing.
  • Physics-Informed Neural Networks: Models that incorporate physical laws to enhance predictive accuracy and efficiency.
  • Diffusion Models: Generative AI models that harness stochastic diffusion processes to produce high-quality samples from complex data distributions, revolutionizing applications like image synthesis and text generation.
  • Stochastic Operator Models & Novel Architectures (FNOs, HNNs, Bistable Chains): These cutting-edge approaches enable AI models to tackle complex systems with uncertainty, multiple stable states, and underlying physical laws.

Let me break down each of these key areas in the following sections.

Multi-Fidelity Learning

Today’s leading-edge deep learning models leverage a blend of low- and high-fidelity data sources to extract valuable insights. Public datasets often provide this mix, negating the necessity for startups to invest heavily in proprietary data acquisition. Startups that have mastered multi-fidelity learning techniques can derive significant value from existing data, making them promising investment candidates.

Paper:

Applications:

  • Climate Modeling: Multi-fidelity approaches combine high-resolution climate simulations with lower-fidelity, large-scale models to predict weather patterns and climate change impacts more accurately. A large amount of low-fidelity satellite imagery is available as public data and can be combined with in-situ, high-fidelity readings. You do not need large, high-fidelity proprietary data moats anymore!
  • Aerospace Engineering: Engineers use multi-fidelity models to integrate detailed simulations of aircraft components with broader, less detailed models of entire systems, optimizing performance and safety without excessive computational costs. Small high-fidelity datasets can be combined with large low-fidelity synthetic data from simulators.
  • Genomics: Combining high-resolution genomic data with lower-fidelity, broader biological datasets allows for a more comprehensive understanding of gene functions and interactions, facilitating breakthroughs in personalized medicine. When it comes to genomics, there is already a large corpus of low-, medium-, and some high-fidelity public datasets in the form of FANTOM, GEO, ENCODE, the 1000 Genomes Project, etc. Genomics and proteomics are areas where VCs should stop asking for large, high-fidelity proprietary datasets from wet labs! The narrative should change to multi-fidelity data and accurate modeling of domain constraints. Or just ask for proof of the pudding and steer clear of questions about what dataset moats exist (which is immaterial if the outcome is what you are focusing on).
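To make the multi-fidelity idea concrete, here is a minimal sketch of one common pattern: fit a model on abundant, cheap, low-fidelity data, then learn a small correction from the scarce high-fidelity samples. The toy data, network sizes, and two-stage training are illustrative assumptions, not any specific published method.

```python
import torch
import torch.nn as nn

# Toy multi-fidelity setup: many cheap low-fidelity samples, few accurate high-fidelity ones.
x_lo = torch.rand(5000, 1)
y_lo = torch.sin(6 * x_lo) + 0.3 * torch.randn_like(x_lo)   # noisy, biased proxy labels
x_hi = torch.rand(50, 1)
y_hi = torch.sin(6 * x_hi) + 0.1 * x_hi                     # accurate labels with a small shift

def mlp(width=64):
    return nn.Sequential(nn.Linear(1, width), nn.Tanh(),
                         nn.Linear(width, width), nn.Tanh(),
                         nn.Linear(width, 1))

f_lo = mlp()        # captures the broad trend from plentiful low-fidelity data
f_corr = mlp(32)    # captures the small high-fidelity correction

def fit(model, x, y, steps=2000, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

fit(f_lo, x_lo, y_lo)                         # stage 1: learn the low-fidelity surrogate
with torch.no_grad():
    residual = y_hi - f_lo(x_hi)              # what the low-fidelity model gets wrong
fit(f_corr, x_hi, residual)                   # stage 2: learn the correction from 50 points

def predict(x):                               # combined multi-fidelity predictor
    return f_lo(x) + f_corr(x)
```

The point is that the expensive, high-fidelity set only has to be large enough to pin down the correction, not to cover the whole function.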

Deep Operator Learning

The field of scientific computing is undergoing a transformative shift away from data-heavy methodologies. Deep Operator Learning, which builds on physics-informed partial differential equations (PDEs) and stochastic operators, learns maps between infinite-dimensional function spaces instead of between finite-dimensional vectors. This revolutionary approach is streamlining scientific discovery, especially in domains like proteins, enzymes, genomes, and physics, with minimal data requirements.

Paper:

Applications:

  • Drug Discovery: By leveraging Deep Operator Learning, researchers can predict the interactions between proteins and potential drug compounds with greater accuracy, accelerating the development of new medications.
  • Material Science: Scientists can explore the properties of new materials by modeling their behaviors at the atomic level, drastically reducing the need for extensive experimental data.
  • Genomic Research: This approach enables the modeling of complex genetic networks and interactions, leading to deeper insights into gene regulation, expression, and potential therapeutic targets.
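As an illustration of what "mapping function spaces" means in practice, here is a minimal DeepONet-style sketch under toy assumptions (the target operator, an antiderivative, and all sizes are illustrative): a branch net encodes the input function sampled at fixed sensor points, a trunk net encodes the query coordinate, and their dot product gives the output function's value at that coordinate.

```python
import torch
import torch.nn as nn

class TinyDeepONet(nn.Module):
    """Minimal operator learner: (u at sensor points, query y) -> G(u)(y)."""
    def __init__(self, n_sensors=100, p=64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, 128), nn.Tanh(),
                                    nn.Linear(128, p))       # encodes the input function u
        self.trunk = nn.Sequential(nn.Linear(1, 128), nn.Tanh(),
                                   nn.Linear(128, p))        # encodes the query location y

    def forward(self, u_sensors, y):
        # u_sensors: (batch, n_sensors), y: (batch, 1)
        return (self.branch(u_sensors) * self.trunk(y)).sum(dim=-1, keepdim=True)

# Toy target operator: G(u)(y) = integral of u from 0 to y (the antiderivative).
n_sensors, batch = 100, 256
u = torch.randn(batch, n_sensors).cumsum(-1) / n_sensors     # random input functions on a grid
y = torch.rand(batch, 1)                                     # random query points in [0, 1]
U = u.cumsum(-1) / n_sensors                                 # running integral on the grid
idx = (y.squeeze(-1) * (n_sensors - 1)).long()
target = U[torch.arange(batch), idx].unsqueeze(-1)           # integral of u up to y

model = TinyDeepONet(n_sensors)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    nn.functional.mse_loss(model(u, y), target).backward()
    opt.step()
```

Once trained, the same network can be queried at any coordinate for any new input function sampled at the same sensors, which is exactly the data-efficiency argument made above.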

Physics Constrained Models

Physics-constrained neural networks integrate physical laws directly into the learning process, embedding the governing equations of physical systems into the network architecture (rather than learning them from data). Models like PINNs (Physics-Informed Neural Networks) offer more accurate and reliable predictions with less data.

Paper:

Applications:

  • Structural Engineering: PINNs can predict stress distributions and failure points in complex structures, improving safety and efficiency in construction and manufacturing.
  • Fluid Dynamics: These models are used to simulate fluid flow in various contexts, from aerodynamics in automotive design to predicting ocean currents for environmental monitoring.
  • Genomics/Proteomics: PINNs can model the dynamic processes of gene regulation and interaction, providing insights into cellular functions and disease mechanisms with higher precision. Physics-constrained models have also been used extensively in AlphaFold3 to capture protein structure and function (example at the end of this article).
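A minimal sketch of the PINN idea on a toy problem (the damped oscillator ODE u'' + 2u' + u = 0 with u(0) = 1, u'(0) = 0 is an illustrative assumption, not drawn from any of the applications above): the governing equation enters the loss through automatic differentiation, so almost no labeled data is needed.

```python
import torch
import torch.nn as nn

# Physics-informed net for a toy ODE: u'' + 2 u' + u = 0, with u(0) = 1, u'(0) = 0.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def d(outputs, inputs):
    """First derivative of `outputs` w.r.t. `inputs`, keeping the graph for higher orders."""
    return torch.autograd.grad(outputs, inputs, grad_outputs=torch.ones_like(outputs),
                               create_graph=True)[0]

for _ in range(3000):
    opt.zero_grad()
    t = (torch.rand(128, 1) * 5.0).requires_grad_(True)   # collocation points in [0, 5]
    u = net(t)
    u_t = d(u, t)
    u_tt = d(u_t, t)
    physics_loss = ((u_tt + 2 * u_t + u) ** 2).mean()     # residual of the governing equation

    t0 = torch.zeros(1, 1, requires_grad=True)            # enforce the initial conditions
    u0 = net(t0)
    ic_loss = (u0 - 1.0).pow(2).mean() + d(u0, t0).pow(2).mean()

    (physics_loss + ic_loss).backward()
    opt.step()
```

The only "data" here is the equation itself plus two initial-condition constraints; the same structure carries over to PDEs with boundary conditions.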

Diffusion Models

Diffusion models are a powerful class of generative models that have recently gained significant attention in the AI community. By leveraging stochastic processes, these models can learn to generate high-quality samples from complex data distributions, making them valuable tools for tasks such as image synthesis, text generation, and audio synthesis.

Paper:

At the heart of diffusion models lies the concept of diffusion, a stochastic process that describes the gradual spread of particles or information over time. In the context of AI, diffusion models use a Markov chain of diffusion steps to gradually transform a simple noise distribution into the target data distribution. This process involves iteratively adding noise to the data and then training a neural network to reverse this process, effectively learning to generate data from noise.
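A minimal sketch of that forward/reverse recipe in the DDPM style, on toy 2-D data (the noise schedule, placeholder MLP denoiser, and crude timestep encoding are all illustrative assumptions): noise the data with a closed-form Gaussian at a random timestep and train the network to predict the noise that was added.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # noise schedule (illustrative values)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal-retention factor

# Placeholder denoiser: predicts the added noise from (noisy sample, timestep).
denoiser = nn.Sequential(nn.Linear(3, 128), nn.SiLU(),
                         nn.Linear(128, 128), nn.SiLU(),
                         nn.Linear(128, 2))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

# Toy "complex data distribution": 2-D points on a ring.
theta = torch.rand(512, 1) * 6.2832
x0 = torch.cat([theta.cos(), theta.sin()], dim=-1)

for _ in range(2000):
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise          # forward diffusion, in closed form
    t_feat = (t.float() / T).unsqueeze(-1)                  # crude timestep embedding
    loss = nn.functional.mse_loss(denoiser(torch.cat([x_t, t_feat], dim=-1)), noise)
    opt.zero_grad()
    loss.backward()
    opt.step()                                              # network learns to reverse the noising
```

Sampling then runs the chain in reverse, repeatedly subtracting the predicted noise starting from pure Gaussian noise.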

Examples of Hybrid Models in Diffusion

  • Denoising Diffusion Probabilistic Models (DDPMs): These models utilize a series of diffusion steps to gradually corrupt the input data with Gaussian noise, and then train a neural network to denoise the corrupted data.
  • Score-Based Generative Models: These models learn the gradient of the data distribution’s log-likelihood (the “score”), which can then be used to guide the generation process through a stochastic differential equation (SDE).
  • Variational Diffusion Models (VDMs): These models combine diffusion processes with variational inference techniques to learn flexible and expressive generative models.

Applications:

  • Image Synthesis: Diffusion models have demonstrated remarkable success in generating high-resolution images with intricate details and realistic textures.
  • Text Generation: These models can be used to generate coherent and diverse text samples, such as poems, code snippets, and even entire articles.
  • Audio Synthesis: Diffusion models are being explored for generating realistic speech, music, and sound effects.
  • Drug Discovery: By generating novel molecules that adhere to desired properties, diffusion models can accelerate the drug discovery process.
  • Material Design: These models can aid in designing materials with specific properties, such as strength, conductivity, or thermal stability.

Stochastic Operator Learning

Stochastic operator learning (as distinct from the stochastic processes in diffusion models) plays a crucial role in modeling systems with inherent randomness and uncertainty when you have only small data to learn from. By incorporating stochastic elements into the learning process, these operators provide more robust and reliable models of complex phenomena. Fourier Neural Operators, Hamiltonian Neural Networks, and bistable chains are examples of these hybrid models.

Paper:

Applications:

  • Financial Modeling: Stochastic operators help in predicting market trends and managing investment risks by accounting for the unpredictable nature of financial markets.
  • Epidemiology: These techniques are used to model the spread of infectious diseases, incorporating randomness in transmission rates and population behavior to improve outbreak predictions and intervention strategies.
  • Genomics: Stochastic models are essential for understanding the variability in gene expression and mutation rates, offering a deeper comprehension of genetic diseases and evolution.

Novel Architectures Powering the Stochastic Operator Learning:

  • Fourier Neural Operators (FNOs): These powerful deep operator learning models leverage the Fourier transform to efficiently learn complex operators, enabling them to solve PDEs and model intricate physical systems with remarkable accuracy and speed.
  • Hamiltonian Neural Networks (HNNs): By encoding Hamiltonian mechanics into their architecture, HNNs can learn to conserve energy and other physical quantities, making them ideal for modeling dynamical systems.
  • Bistable Chains: These innovative architectures offer a novel approach to modeling complex systems with multiple stable states, opening up new avenues for applications in robotics, control systems, and materials science.
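To ground the FNO entry in the list above, here is a minimal 1-D spectral convolution layer in the FNO spirit (channel counts, grid size, and mode cutoff are illustrative assumptions): transform to Fourier space, multiply the lowest modes by learned complex weights, transform back, and add a pointwise path.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Core FNO building block: learned multiplication of the lowest Fourier modes."""
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_ch * out_ch)
        self.weights = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat))

    def forward(self, x):                       # x: (batch, in_ch, n_grid)
        x_ft = torch.fft.rfft(x, dim=-1)        # to Fourier space
        out_ft = torch.zeros(x.shape[0], self.weights.shape[1],
                             x_ft.shape[-1], dtype=torch.cfloat)
        # Multiply only the lowest `modes` frequencies by learned complex weights.
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weights)
        return torch.fft.irfft(out_ft, n=x.shape[-1], dim=-1)   # back to grid space

# One FNO-style layer = spectral conv + pointwise linear path + nonlinearity.
class FNOLayer(nn.Module):
    def __init__(self, width=32, modes=16):
        super().__init__()
        self.spectral = SpectralConv1d(width, width, modes)
        self.pointwise = nn.Conv1d(width, width, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.spectral(x) + self.pointwise(x))

x = torch.randn(8, 32, 128)        # batch of functions sampled on a 128-point grid
y = FNOLayer()(x)                  # same shape out: (8, 32, 128)
```

Because the learned weights act on frequencies rather than grid points, the same layer can be evaluated on finer or coarser discretizations of the input function.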

Example: DeepMind AlphaFold’s Physics Constrained AI Modeling

AlphaFold3 was initially trained on a vast dataset of protein sequences and their corresponding experimentally determined structures, primarily derived from the Protein Data Bank (PDB), a public dataset. This dataset contained over 170,000 protein structures, a tiny fraction of the effectively limitless number of possible protein structures. So, how did DeepMind achieve this feat?

Paper:

AlphaFold utilizes a technique called “self-distillation” to enhance its accuracy. In this process, the model generates predictions for a large number of protein sequences for which no experimental structures are available (which means no large proprietary data moats). These predictions are then filtered to a high-confidence subset and added to the training data. This expanded dataset, which includes both labeled (experimental) and unlabeled (predicted) data, contributes to AlphaFold’s impressive performance.
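Roughly, the self-distillation loop described above looks like the following sketch; `model`, `predict_with_confidence`, and the 0.9 threshold are hypothetical placeholders for illustration, not DeepMind's actual pipeline.

```python
# Rough sketch of a self-distillation loop (placeholder names, not DeepMind's code).

def self_distill(model, labeled_data, unlabeled_sequences,
                 predict_with_confidence, confidence_threshold=0.9, rounds=1):
    """Grow the training set with the model's own high-confidence predictions."""
    train_set = list(labeled_data)                     # experimental structures
    for _ in range(rounds):
        model.fit(train_set)                           # train on the current data
        for seq in unlabeled_sequences:                # sequences with no known structure
            structure, confidence = predict_with_confidence(model, seq)
            if confidence >= confidence_threshold:     # keep only confident predictions
                train_set.append((seq, structure))     # add them as pseudo-labels
    model.fit(train_set)                               # final pass on the expanded set
    return model
```

The data moat, in other words, is partly manufactured by the model itself.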

It’s important to note that the exact size and composition of AlphaFold’s training data are not publicly disclosed. However, it is estimated that the model has been trained on millions of protein sequences (PDB, Uniprot Ref, BFD), making it one of the largest datasets ever used for protein structure prediction.

The combination of experimentally determined structures, self-distilled predictions, and diverse sequence databases underpins AlphaFold’s claim of training on millions of sequences of varying length, fidelity, and sparsity, most of which are available in the public domain.

AlphaFold 3 Model Details

AlphaFold 3 incorporates physical and geometric constraints in a few key ways:

Structure Module: The heart of AlphaFold 3 is its Structure Module, which predicts the 3D coordinates of each atom in the protein. This module is designed to produce structures that adhere to known physical constraints:

  • Bond Lengths and Angles: The model is trained on a vast amount of protein structure data, learning the typical distances and angles between bonded atoms. It then uses this knowledge to predict bond lengths and angles that are consistent with known chemical properties.
  • Steric Clashes: The model also learns to avoid steric clashes, which occur when atoms are too close together. This is achieved by incorporating a loss function during training that penalizes structures with unrealistic atomic overlaps.
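To illustrate the kind of clash penalty described in the bullet above, here is a toy version (the 3.0 Å cutoff stands in for proper per-atom van der Waals radii and is an illustrative assumption):

```python
import torch

def clash_penalty(coords, min_dist=3.0):
    """Toy steric-clash loss: penalize atom pairs that come closer than `min_dist` angstroms."""
    dists = torch.cdist(coords, coords)                   # pairwise atom-atom distances
    mask = ~torch.eye(coords.shape[0], dtype=torch.bool)  # ignore each atom's distance to itself
    violation = torch.relu(min_dist - dists[mask])        # positive only when atoms overlap
    return (violation ** 2).mean()

coords = torch.randn(50, 3) * 10          # stand-in for a predicted structure's atom positions
loss = clash_penalty(coords)              # added to the training loss alongside other terms
```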

Evoformer: The Evoformer module is responsible for integrating evolutionary information from multiple sequence alignments (MSAs). This module helps to identify conserved regions and patterns in the protein sequence, which can provide valuable clues about the likely structure.

  • Coevolution: The Evoformer learns to identify pairs of residues that coevolve, meaning they mutate together to maintain the protein’s function. These co-evolving pairs often indicate residues that are in close proximity in the 3D structure, providing additional geometric constraints.

Refinement: After the initial structure prediction, AlphaFold 3 employs a refinement process that further optimizes the structure. This process involves:

  • Energy Minimization: The model uses an energy function to assess the stability of the predicted structure. It then iteratively adjusts the atom positions to minimize the energy, resulting in a more physically plausible structure.
  • Side Chain Optimization: The model also refines the positions of the amino acid side chains to ensure they are properly oriented and interact with their environment in a realistic manner.
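The refinement step is, at its core, an optimization over atom coordinates. A toy version of that idea, with a made-up energy containing only a bond-length term (3.8 Å is the approximate Cα-to-Cα spacing; the chain topology and all constants are illustrative assumptions):

```python
import torch

def bond_energy(coords, bonds, ideal_len=3.8):
    """Illustrative energy: squared deviation of bonded distances from an ideal spacing."""
    bond_vecs = coords[bonds[:, 0]] - coords[bonds[:, 1]]
    return ((bond_vecs.norm(dim=-1) - ideal_len) ** 2).sum()

coords = (torch.randn(50, 3) * 10).requires_grad_(True)                # toy predicted structure
bonds = torch.stack([torch.arange(49), torch.arange(1, 50)], dim=1)    # simple chain topology
opt = torch.optim.SGD([coords], lr=1e-2)

for _ in range(500):                        # iteratively nudge atoms toward lower energy
    opt.zero_grad()
    bond_energy(coords, bonds).backward()   # a real refiner adds clash, angle, and side-chain terms
    opt.step()
```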

By incorporating these physical and geometric constraints throughout the modeling process, AlphaFold 3 can generate protein structures that are not only accurate but also biologically plausible. This is a significant advancement over previous methods, which often produced structures that violated basic physical principles.

Embrace the New Frontier

Broadening your investment criteria to include these cutting-edge techniques will position your firm at the forefront of technological innovation. Please engage experts with current knowledge in these domains to ensure that your evaluations are based on the most advanced and effective methodologies available.

Let’s foster an open discussion about evolving our investment strategies and criteria to keep pace with the rapid advancements in AI. Your insights and experiences are invaluable as we navigate this new frontier together.
