Technical Perspective — ⚙️ Machine Learning Models

An academic research analysis on Machine Learning Models

João Nunes
6 min read · Sep 21, 2022

During the process of developing the “Machine Learning & Data-Driven VC” article previously posted, I analysed the vast majority of research papers available in the academic community that intend to explore the application of Machine Learning models in deal screening.

In this article I intend to dive into some of the academic studies that I found particularly interesting in this space.

The first era of research applying Machine Learning to Venture Capital focused directly on the deal screening phase: taking historical data on companies that succeeded or failed over a given period of time and training ML algorithms to predict venture success. These studies relied on a binary definition of outcomes, where success was defined as a company reaching an IPO. Later studies started to also include an “acquired” status as an additional class of success.

In 2019, a study led by the Department of Software Engineering and Artificial Intelligence at Complutense University of Madrid in Spain (Arroyo et al. 2019) analyzed a Crunchbase dataset of more than 120,000 early-stage companies in a realistic setting, attempting to predict their development over a 3-year time frame.

In the study, the team went beyond the previously observed classes and focused on 5 different classes (or possible outcomes): Acquired (AC, the company was acquired during the 3-year period), Funding Round (FR, the company raised at least one additional funding round during the simulation time frame), IPO, Closed (CL, the company closed), and No Event (NE, none of the previous events happened). Suspecting that most NE companies had actually closed without their status being updated on Crunchbase, the team decided to also treat the NE class as a failed investment.
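The labeling scheme above can be sketched in a few lines. This is an illustrative encoding of the five outcome classes and of the study's decision to fold NE into the failure bucket; the field values and helper name are hypothetical, not taken from the paper's code.

```python
# Hypothetical sketch: the five outcome classes from the Arroyo et al.
# study, with "No Event" folded into the failure (CL) bucket, matching
# the authors' assumption that most NE companies had quietly closed.

OUTCOMES = ["AC", "FR", "IPO", "CL", "NE"]  # acquired, funding round, IPO, closed, no event

def label_for_training(outcome: str) -> str:
    """Map a raw outcome to the label used for model training."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return "CL" if outcome == "NE" else outcome

labels = [label_for_training(o) for o in ["AC", "NE", "FR", "IPO"]]
# labels == ["AC", "CL", "FR", "IPO"]
```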

Results showed that ML can indeed support investors in their decision-making processes to find opportunities, and that it can help increase investors’ success rate. In terms of per-class prediction results, the models proved to be of little use in predicting the CL class, since many companies that had failed never had their status updated to CL. IPO companies are also present in such low numbers (only 143 companies out of more than 120,000) that no generalizable conclusions can be drawn for that class. For the NE and FR classes, ensemble classifiers yielded the best results (especially Gradient Tree Boosting), while Extremely Randomized Trees and Random Forests proved to be the best classifiers for the AC class, even though their Recall was low.
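The per-class comparison described above can be sketched with scikit-learn's implementations of the three ensembles. The dataset below is synthetic and deliberately imbalanced to mimic the rarity of some outcomes; it is a stand-in, not the study's Crunchbase data or feature set.

```python
# Illustrative comparison of the three ensemble classifiers mentioned
# above on synthetic, imbalanced multi-class data (not the study's data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              ExtraTreesClassifier, RandomForestClassifier)
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=4, weights=[0.55, 0.3, 0.1, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "gbt": GradientBoostingClassifier(random_state=0),
    "ert": ExtraTreesClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # recall computed separately for each class, as in the per-class analysis
    per_class = recall_score(y_te, model.predict(X_te), average=None)
    print(name, np.round(per_class, 2))
```

Looking at Recall per class, rather than a single aggregate score, is what surfaces findings like "good on FR, weak on the rare classes".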

In 2021, a team of experts set out to build an ML model that they called CapitalVX (“Capital Venture eXchange”) (Ross et al. 2021). They investigated 2 classes for the funding models (follow-on funding or none) and 3 classes for the exit models (failure, acquisition, and IPO), and also studied the effect of adding a fourth, private class to the exit models. The model tried to predict whether a company would get access to additional funding, and the results were quite promising: the three-way classification model achieved an Accuracy of around 90%, whereas the four-way model achieved an Accuracy of 80%.

Like the aforementioned study, the team used Crunchbase data to train their model. A combination of XGBoost, Random Forests, and K-Nearest Neighbors was implemented as an ensemble approach to predicting outcomes. IPO Recall was low for the earlier stages, where false negatives dominate, given the logical correlation between the last funding round a company raised and the likelihood of a great exit: a company that has reached later rounds of funding is naturally more likely to reach IPO status than a company still at Seed stage. Conversely, predictions of venture failure become worse over the later rounds, because the model naturally expects such a company to be less likely to close.
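A hedged sketch of that ensemble idea: soft-voting over a boosted-tree model, a random forest, and k-nearest neighbours. GradientBoostingClassifier stands in for XGBoost here so the example needs only scikit-learn, and the data is synthetic; this illustrates the combination strategy, not the paper's actual pipeline.

```python
# Sketch of a CapitalVX-style ensemble: soft-voting over three learners.
# GradientBoostingClassifier is a stand-in for XGBoost; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import (VotingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=15, n_classes=3,
                           n_informative=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

ensemble = VotingClassifier(
    estimators=[("boost", GradientBoostingClassifier(random_state=1)),
                ("rf", RandomForestClassifier(random_state=1)),
                ("knn", KNeighborsClassifier(n_neighbors=15))],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_tr, y_tr)
accuracy = ensemble.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.2f}")
```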

Source: CapitalVX: A machine learning model for startup selection and exit prediction (Ross, et al. 2021)

Finally, to build explainability into the model’s predictions, the team experimented with both LIME (Local Interpretable Model-agnostic Explanations) and Shapley values (SHAP). Given that the SHAP approach yielded better results, the team settled on it.
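SHAP libraries approximate Shapley values efficiently at scale, but for a handful of features they can be computed exactly by enumerating feature coalitions, which makes the idea concrete. The model, features, and baseline below are toy values chosen so the result is easy to check: for a linear model with a mean baseline, feature i's Shapley value is w_i * (x_i − mean_i).

```python
# Minimal, stdlib-only illustration of exact Shapley values for one
# prediction. "Switched-off" features are replaced by baseline values.
from itertools import combinations
from math import factorial

weights = [2.0, -1.0, 0.5]   # toy linear model
baseline = [1.0, 0.0, 2.0]   # "average company" feature values
x = [3.0, 4.0, 2.0]          # the prediction being explained

def predict(active):
    """Model output with non-active features set to their baseline."""
    return sum(w * (x[i] if i in active else baseline[i])
               for i, w in enumerate(weights))

def shapley(i, n=3):
    """Average marginal contribution of feature i over all coalitions."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        for subset in combinations(others, size):
            s = set(subset)
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            total += weight * (predict(s | {i}) - predict(s))
    return total

values = [shapley(i) for i in range(3)]
# For a linear model: values[i] == weights[i] * (x[i] - baseline[i])
```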

However, one of the stronger shortcomings of this model is that “some of the missing fields are replaced by sensible defaults such as zero in the absence of an amount for total funding”. In reality, many of the companies on Crunchbase with no available funding-round data may actually have raised bridge rounds, or rounds with angels and other sources that are simply not public.
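One common alternative to the zero-default criticised above is to impute with a summary statistic and add an explicit missingness flag, so the model can distinguish "no funding disclosed" from "raised zero". This is a hedged sketch of that option, not what the paper did; the column and values are illustrative, not Crunchbase's actual schema.

```python
# Impute undisclosed funding with the median and keep a missingness
# indicator, instead of silently defaulting to zero. Values are made up.
import numpy as np
from sklearn.impute import SimpleImputer

# total funding (USD) for four hypothetical companies; NaN = undisclosed
funding = np.array([[1_000_000.0], [np.nan], [250_000.0], [np.nan]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
features = imputer.fit_transform(funding)
# column 0: imputed amount; column 1: 1.0 where the value was missing
print(features)
```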

Some studies went even further and started to compare the performance of ML models to the performance of actual investors.

In 2020, Torben Antretter and a team of researchers from the University of St. Gallen (HSG) (Antretter, et al. 2020) developed an algorithm and pitted it against 255 angel investors in a simulation, asking it to select the most promising investment opportunities among 623 deals from one of the largest European angel networks.


The most interesting aspect was that the algorithm’s results were then compared to actual investment decisions that had been taken by these angel investors. The algorithm achieved an average IRR (Internal Rate of Return) of 7.26%, whereas the business angels achieved an average IRR of only 2.56%. In other words, the algorithm performed almost 3 times better than the human investors.
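For readers less familiar with the metric: IRR is the discount rate at which the net present value (NPV) of a cash-flow sequence is zero. A small bisection solver makes the definition concrete; the cash flows below are made up for illustration and have nothing to do with the study's portfolios.

```python
# IRR: the rate r with NPV(r) = 0, found here by bisection.

def npv(rate, cashflows):
    """Net present value; cashflows[t] occurs at the end of year t."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    """Bisection on NPV; assumes exactly one sign change in [lo, hi]."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(lo, cashflows) * npv(mid, cashflows) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# invest 100 today, receive 40 per year for three years
print(f"{irr([-100, 40, 40, 40]):.4f}")
```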

Nonetheless, it is even more interesting to observe that if we zoom into the angel investor group and focus only on the top tier of experienced investors, this group actually vastly outperformed the algorithm, achieving an average IRR of 22.75%. It is also worth noting that experience alone is not a strong enough factor to achieve such an IRR. The results showed that experienced investors with high levels of cognitive bias (the tendency to simplify information processing through a filter of personal experience and preferences) still performed slightly better than the average business angel (with an average IRR of 2.87%), but much worse than their peers who were able to suppress their cognitive biases.

Finally, another really interesting example of a similar approach is the one taken by Andre Retterath (Retterath 2020). He also used Crunchbase data to train his model, but he complemented this dataset with Pitchbook and LinkedIn information.

When training the model, Retterath set the performance metrics to maximize Accuracy and Recall. Given that the XGBoost model showed the best performance on those metrics, he selected it for the analysis.
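That model-selection step can be sketched as scoring candidate models with cross-validation on both accuracy and (macro-averaged) recall and keeping the best. The data is synthetic and GradientBoostingClassifier stands in for XGBoost so the example needs only scikit-learn; this is a generic illustration, not Retterath's actual setup.

```python
# Rank candidate models by cross-validated accuracy and macro recall.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=12, random_state=2)

candidates = {
    "boosting": GradientBoostingClassifier(random_state=2),
    "forest": RandomForestClassifier(random_state=2),
    "logreg": LogisticRegression(max_iter=1000),
}
scores = {}
for name, model in candidates.items():
    cv = cross_validate(model, X, y,
                        scoring=["accuracy", "recall_macro"], cv=5)
    scores[name] = (cv["test_accuracy"].mean(),
                    cv["test_recall_macro"].mean())

# rank lexicographically by (accuracy, recall)
best = max(scores, key=lambda n: scores[n])
print(best, scores[best])
```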

Human versus computer: Who’s the better startup investor? — Insights from an empirical benchmarking study (Retterath 2020)

After training the model, he fed it data on 10 anonymized European startups right after they had raised their Seed rounds in 2015/2016. Retterath then surveyed 111 VC investors, providing them with the same information on the same 10 companies.

In reality, 5 of these startups had been successful and 5 had not. The algorithm’s predictions of which of the 10 companies would be successful were then compared with the VC investors’ picks, and the algorithm was found to outperform the average VC by 29%. Even though these results seem to indicate that AI generally performs better than humans, the sample of just 10 companies is small and therefore potentially skewed; it is nonetheless a good use case.

Feel free to check out the full original article on “Machine Learning & Data-Driven VC” here: (Link)

References

Retterath, Andre. 2020. The Future of VC: Augmenting Humans with AI.

Arroyo, Javier, Francesco Corea, Guillermo Jimenez-Diaz, and Juan A. Recio-Garcia. 2019. Assessment of machine learning performance for decision support in venture capital investments. Institute of Electrical and Electronics Engineers.

Ross, Greg, Sanjiv Das, Daniel Sciro, and Hussain Raza. 2021. “CapitalVX: A machine learning model for startup selection and exit prediction.” The Journal of Finance and Data Science 94–114.

Antretter, Torben, Ivo Blohm, Charlotta Siren, Dietmar Grichnik, Malin Malmstrom, and Joakim Wincent. 2020. “Do Algorithms Make Better — and Fairer — Investments Than Angel Investors?” Harvard Business Review.

Retterath, Andre. 2020. “Human versus computer: Who’s the better startup investor? Insights from an empirical benchmarking study.”


João Nunes

VC at @PlayfairCapital | Enthusiast about all things venture. @Techstars @IncludedVC @DraperVentureNetwork