Artificial Intelligence Predicts the Success of Startups With Up to 80% Certainty, Using Publicly Available Data

Making the right investment decisions is critical to the success of VC funds and Accelerators.

Because there is so much data to be screened, processed, and analyzed, investors at times forgo tapping into data products. Even if one is sitting on mountains of great data, human analysis of this information is frequently too time-intensive to be of use. In the age of AI, why not just let machines have a crack at the problem? They take fewer coffee breaks anyway. That’s exactly what researchers from Carnegie Mellon University did, and in this blog post we present their findings.

So Much Data, so Little Time

The amount of data out there is growing at an exponential rate, with 90% of the data currently available created in the last two years alone. How does one keep up with this growing complexity? Imagine having to screen tens or hundreds of thousands of articles, companies, and profiles just to select a few that are worthy of consideration. Impossible, right? Not really. With machine learning techniques we might have a chance to do something wonderful for all the venture capital investment firms out there: build the kind of tool that will bring forth the “signal from the noise,” to paraphrase FiveThirtyEight’s Nate Silver.

The Results

The model in the study learned to identify true positives (companies that went on to be acquired) with an accuracy of 60% to 79%, depending on the topic the investment fell under. The more data there was on a specific firm, the more accurate the acquisition prediction. The false positive rate was between 0% and 8.3%. Considering the large number of companies out there, the amount of data, and the current high failure rate (4 out of 5 funded startups fail), the study looks very promising. Having more analysts and people go through the data does not necessarily guarantee higher rates of success, and such complex, talent-intensive projects are costly to run.
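To make the two headline numbers concrete, here is a minimal sketch of how a true positive rate and a false positive rate are computed from predictions. It assumes the 60% to 79% figure refers to the true positive rate; the labels below are made up for illustration and are not data from the study, and the function names are our own.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    true_pos: int   # acquired, predicted acquired
    false_pos: int  # not acquired, predicted acquired
    true_neg: int   # not acquired, predicted not acquired
    false_neg: int  # acquired, predicted not acquired


def confusion_counts(actual, predicted):
    """Tally outcomes; True means 'the company was acquired'."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif not a and p:
            fp += 1
        elif not a and not p:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def true_positive_rate(c):
    """Share of actually acquired companies that the model flagged."""
    denom = c.true_pos + c.false_neg
    return c.true_pos / denom if denom else 0.0


def false_positive_rate(c):
    """Share of non-acquired companies that the model wrongly flagged."""
    denom = c.false_pos + c.true_neg
    return c.false_pos / denom if denom else 0.0


# Hypothetical labels for ten companies (illustration only).
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, False, True]

counts = confusion_counts(actual, predicted)
print(f"TPR: {true_positive_rate(counts):.2f}")
print(f"FPR: {false_positive_rate(counts):.2f}")
```

A low false positive rate matters to an investor because each false positive is a company the model recommends that never gets acquired.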

Data Used

The only sources of data used in this study were CrunchBase and TechCrunch data on the companies. The data covered the period from 1970 to 2007.
The following amounts of data were analyzed:

Here it is important to mention that the disparity in popularity is relatively high: only 5,075 out of 59,631 companies (about 8.5%) had articles written about them; the rest had none. The top 10 companies accumulated 13,874 articles in total, more than a third of the total collection, backing up the idea that the world of startups is dominated by a few winner-take-all unicorn companies.

Attributes analyzed

The features analyzed for each company covered basic, financial, and managerial aspects. All data were obtained directly from company profiles on CrunchBase, and the statistics were collected prior to each company’s failure or acquisition.

Basic Features, quantified in numbers:

  • Number of employees
  • Age of company
  • Milestones
  • Company revisions on CrunchBase
  • TechCrunch articles about the company
  • Competitors
  • Competitors acquired
  • Headquarters location
  • Offices
  • Products
  • Providers

Financial Features, quantified in numbers:

  • Investments by the company
  • Acquisitions by the company
  • Venture capital and Private Equity Investments
  • People with financial background investing in the company
  • Key persons in the company with financial background
  • Investors per funding round
  • Amount of investment per funding round

Managerial Features, in numbers:

  • Companies founded by founders previously
  • Successful companies by founders
  • Founder experience (months)

According to the study, these seemed to be the characteristics that make or break a tech startup on its path to success.
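As a rough illustration of how features like these could be packed into one numeric record per company, here is a small sketch. The class name, field names, units, and the choice of a subset of features are all our assumptions, and the example values are invented; the study’s actual data layout is not described here.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CompanyFeatures:
    """Hypothetical subset of the features listed above, all numeric."""
    # Basic features
    employees: int
    age_years: float
    milestones: int
    techcrunch_articles: int
    competitors: int
    # Financial features
    investors_per_round: float
    amount_per_round_usd: float
    # Managerial features
    companies_founded_previously: int
    successful_companies_by_founders: int
    founder_experience_months: int


def to_vector(features: CompanyFeatures) -> List[float]:
    """Flatten a record into a list of floats in field order."""
    return [float(getattr(features, f.name)) for f in fields(features)]


# Invented example company (illustration only).
example = CompanyFeatures(
    employees=12,
    age_years=2.5,
    milestones=3,
    techcrunch_articles=4,
    competitors=6,
    investors_per_round=2.0,
    amount_per_round_usd=1_500_000.0,
    companies_founded_previously=1,
    successful_companies_by_founders=0,
    founder_experience_months=84,
)

vector = to_vector(example)
print(vector)
```

A fixed field order means every company produces a vector of the same length and layout, which is what most classifiers expect as input.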

Algorithm Architecture

The core idea is that a finite set of articles from TechCrunch and CrunchBase can be screened for topics that reveal particular features of each company. Each topic consists of a bundle of particular words, which the algorithm learns and categorizes accordingly. The categorization was performed using Latent Dirichlet Allocation (LDA). The results are shown below:

Top 10 words from each topic learned by LDA. Topic 3 coincides with mobile, and topic 5 relates closely to ads.
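For readers curious how LDA arrives at such word lists, below is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python. It is not the study’s implementation: the hyperparameters `alpha` and `beta`, the iteration count, the number of topics, and the tiny toy documents are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=10):
    """Return the n most frequent words assigned to each topic."""
    return [
        [w for w, c in counts.most_common(n) if c > 0]
        for counts in topic_word
    ]


# Toy documents (invented); real input would be tokenized articles.
docs = [
    "mobile phone app iphone android mobile app".split(),
    "ads advertising network publishers ads revenue".split(),
    "iphone app mobile carrier phone".split(),
    "advertising revenue ads publishers campaign".split(),
]

topic_word, doc_topic = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word, n=3)):
    print(f"topic {k}: {words}")
```

In practice one would use a tested library and a much larger corpus; the sketch only shows the mechanics of assigning each word to a topic and reading off the most frequent words per topic.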

No method is flawless

While there is a large number of companies in the list, most of the activity, measured by articles written and edited, was concentrated around the most popular companies, which makes the others less predictable. Traditional features such as the price-to-earnings ratio and return on average assets, which likely would have helped the M&A prediction task, were not used in this work. To prevent misclassification, companies that went public via IPO were excluded from the sample.


The study and the algorithm it produced achieved fantastic results while using only publicly available data. Richer data locked away in private sources will likely hold even greater potential in future endeavors. While any claims of an all-knowing AI capable of besting humans on every front are likely overblown (for now at least), it’s indisputable that artificial intelligence will disrupt investment management and money allocation in many ways, from playing the helpful assistant that sorts through stacks of company profiles to bring you a selection you would be interested in, to vigilantly watching industry developments and letting you know when to pay attention to a specific trend. AI and other machine learning technologies are already revolutionizing healthcare, marketing, governance, and engineering, and investing may very well be their next major beneficiary. If you are a curious investor looking to step up your data game, or just curious about what we are working on, drop us a line.