Watch Me Predict Startup Winners with Artificial Intelligence and Machine Learning

I returned to the west coast about five years ago amid a flurry of investments, buzzwords, and exciting innovations all around me. Every enclave in California had adopted a “Silicon” moniker: “Silicon Beach”, “Silicon Coast”, “Silicon Shore”… the list is long. Equally prevalent was the foreboding feeling that “the other shoe” was going to drop: a financial collapse like 2008 or, worse, 2001.

Fortunately, we have some new tools at our disposal. Can machine learning and artificial intelligence help minimize the risks of a bursting bubble?

There are countless online tutorials on how to use historical stock data to predict market performance. Let’s look at how we can apply these same techniques to predict venture-backed start-up success — or failure.

For data, let’s scoot over to CrunchBase, where they correlate investor and company data with a stellar search interface. (A subscription is pretty affordable, but they allow free queries as well.)

I searched for VC-funded companies that were formed after 2010 and before 2015. I pulled down 1000 companies that had a successful exit (IPO or acquisition) and then 1000 companies listed with a status of “closed” (typically indicating a failure). I also pulled down a set of 1000 companies formed after 2015 and funded in 2017: the newbies. We’ll use that last data set to predict potential investments.

Here’s another way to look at our data:

  • Qualified Data: the list of companies which successfully exited. A successful exit is our key target field and value.
  • Disqualified Data: the list of companies which failed.
  • Target Data: the new data where we will try to qualify or disqualify companies based on the previous two data sets.
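To make the three buckets concrete, here is a minimal sketch of how they could be labeled before modeling. The records and field names (`name`, `status`, `exited`) are made up for illustration; they are not CrunchBase's actual export schema.

```python
# Hypothetical records; field names are illustrative, not CrunchBase's schema.
qualified = [
    {"name": "Acme Analytics", "status": "acquired"},
    {"name": "Bolt Robotics", "status": "ipo"},
]
disqualified = [{"name": "Cinder Apps", "status": "closed"}]
target = [{"name": "Delta Health", "status": "operating"}]


def label(rows, value):
    """Return copies of the rows with an added 'exited' label."""
    return [{**row, "exited": value} for row in rows]


# Qualified rows get 1, disqualified rows get 0: this is the training set.
training = label(qualified, 1) + label(disqualified, 0)

# Target rows stay unlabeled; filling in this value is the model's job.
unlabeled = label(target, None)

print(len(training), "training rows,", len(unlabeled), "rows to predict")
```

Keeping the target set physically separate from the training set is the point here: the newbies never get a label until a model assigns one.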

Let’s get some visuals on this. Then we’ll try a completely different approach: applying Artificial Intelligence to the text fields.

The data which CrunchBase provides is pretty extensive, so narrowing the fields down is vital to our success. We’ve used this method before when trying to predict cluster demographics around the 2016 Presidential election. If you’re just getting started, that’s a great place to start.

I’ve narrowed the fields down to variables covering money, employees, investors, founders and categories. Too few variables and the model could have been done in Excel; too many variables and the results are just noise.

Winnowed down. Here’s what I used in the dataset:

Looking at the model, things can get a little out of hand, but you can see some of the trends nonetheless. The model below was refined to show a cascade of field relationships to find the best predictor of a company exiting or failing.

First, the exit. Acquired companies seem to share an interesting set of fields grouped around total funding amounts, number of investors, and number of founders:

To translate this, follow the path from top to bottom on the right. Fields at the top are more vital than fields lower on the totem. I’ve set the “confidence” slider just below 80% to catch the best paths on the visual. The thicker the connecting line between nodes, the bigger the sample set down that path. In layman’s terms: if you have a company with total funding between $800K and $4.4MM, minimal investors, fewer than 3 founders and a last funding influx of $1.3MM, you’re more likely to exit through acquisition. Other paths indicate that number of employees is a good indicator as well.
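A path through a decision tree is really just a chain of conditions, so the layman's version can be sketched as a plain rule. This is not the model itself, only one path written out by hand. The field names are hypothetical, `max_investors=2` is my assumed stand-in for “minimal investors”, and I've left the $1.3MM last-round figure out because the path description gives a single value rather than a range.

```python
def matches_acquisition_path(company, max_investors=2):
    """Return True if a company sits on the acquisition path described above.

    max_investors is an assumed threshold for "minimal investors";
    the field names are illustrative only.
    """
    return (
        800_000 <= company["total_funding_usd"] <= 4_400_000
        and company["num_investors"] <= max_investors
        and company["num_founders"] < 3
    )


# Made-up companies for illustration.
on_path = {"total_funding_usd": 2_000_000, "num_investors": 2, "num_founders": 2}
off_path = {"total_funding_usd": 9_000_000, "num_investors": 6, "num_founders": 4}

print(matches_acquisition_path(on_path))
print(matches_acquisition_path(off_path))
```

A real tree holds many such paths, each with its own confidence and sample size, which is what the line thickness in the visual is showing.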

Another view is the sunburst. Below we tried the “closed” (failed) prediction using the same dataset and model:

Here the prediction (the steel blue ray bursts to the north) shows that failed companies had fewer employees, fewer investors and (shocker) little money.

Let’s go further.

One of my favorite models is “clustering”: groupings of field averages where the data items coalesce.

This first cluster (the large red dot in the middle) shows 284 companies that exited through acquisition. They had on average 2 founders, fewer than 50 employees, and just under 2 rounds of funding. Note this cluster also found that a “seed” round with an early exit was typical.

The next cluster (or “centroid”) looks at the closed accounts and finds some other interesting data points.

The rare IPO (4 instances in this centroid) tells a very different story.
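Since a centroid is just the average of each field across a cluster's members, the idea is easy to sketch by hand. To be clear, this is not a clustering algorithm: a real one discovers the groups on its own, whereas here I hand it the groups and only compute the averages and the nearest match. The rows are made up, and the fields are left unscaled, so the employee count dominates the distance.

```python
import math

FIELDS = ("founders", "employees", "funding_rounds")


def centroid(rows):
    """Average each field across the rows in one group."""
    return {f: sum(r[f] for r in rows) / len(rows) for f in FIELDS}


def distance(row, center):
    """Euclidean distance between a row and a centroid over FIELDS."""
    return math.sqrt(sum((row[f] - center[f]) ** 2 for f in FIELDS))


# Made-up rows for illustration only.
acquired = [
    {"founders": 2, "employees": 30, "funding_rounds": 2},
    {"founders": 2, "employees": 40, "funding_rounds": 1},
    {"founders": 2, "employees": 20, "funding_rounds": 2},
]
closed = [
    {"founders": 1, "employees": 5, "funding_rounds": 1},
    {"founders": 3, "employees": 9, "funding_rounds": 1},
]

centers = {"acquired": centroid(acquired), "closed": centroid(closed)}

# Assign a hypothetical new company to whichever centroid is closest.
newcomer = {"founders": 2, "employees": 25, "funding_rounds": 2}
nearest = min(centers, key=lambda name: distance(newcomer, centers[name]))
print(nearest)
```

In practice you would normalize each field first so that a difference of 20 employees doesn't drown out a difference of one funding round.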

Beyond the data points we can actually use the text in the company descriptions, the categories and even the locations assigned to that company to gain some insights. We can filter the dataset just to include positive exits and see what kind of common dictionary exists around exits:
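A rough way to build that kind of dictionary yourself is to count the words across the descriptions of the companies that exited. The sketch below uses made-up descriptions and a tiny stopword list I chose for illustration; a real pass would want a fuller stopword list and probably some stemming.

```python
import re
from collections import Counter

# A deliberately tiny, assumed stopword list.
STOPWORDS = {"a", "an", "and", "the", "for", "of", "to", "in", "that", "is"}


def top_terms(descriptions, n=5):
    """Return the n most common non-stopword terms across all descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Made-up descriptions standing in for the positive-exit companies.
exits = [
    "A mobile analytics platform for retailers",
    "Cloud analytics for mobile games",
    "A payments platform for mobile merchants",
]

for term, count in top_terms(exits, n=3):
    print(term, count)
```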

For my part, though, I thought we should try some manual approaches to mix up our results a bit.

Last year @perborgen at Xeneta published a fascinating piece on Medium using machine learning algorithms to train a model on positive and negative datasets (qualified and disqualified) and then apply it to a new dataset to predict “winners” based solely on the company description. Using our 3 datasets, I booted up Terminal on my Mac and dusted off my Python skills to give it a go.

With the model trained and the 1000 newly funded companies primed, the solution identified 145 companies that, based solely on their company descriptions, have a text “profile” matching that of successful start-ups.
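If you want a feel for how a description-only classifier works, here is a minimal sketch. Fair warning: this is a small naive Bayes classifier I wrote with the standard library as a stand-in, not the pipeline from the piece mentioned above and not the exact code behind the 145 matches. The training descriptions are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(docs_by_label):
    """Build word counts and log-priors from {label: [descriptions]}."""
    model = {"counts": {}, "totals": {}, "priors": {}, "vocab": set()}
    n_docs = sum(len(docs) for docs in docs_by_label.values())
    for label, docs in docs_by_label.items():
        counts = Counter(w for d in docs for w in tokenize(d))
        model["counts"][label] = counts
        model["totals"][label] = sum(counts.values())
        model["priors"][label] = math.log(len(docs) / n_docs)
        model["vocab"].update(counts)
    return model


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["counts"].items():
        score = model["priors"][label]
        for word in tokenize(text):
            if word in model["vocab"]:  # skip words never seen in training
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log(
                    (counts[word] + 1) / (model["totals"][label] + vocab_size)
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training descriptions for illustration only.
model = train({
    "qualified": [
        "mobile analytics platform for retailers",
        "cloud security platform for enterprises",
    ],
    "disqualified": [
        "social photo sharing app",
        "daily deals app for students",
    ],
})

print(predict(model, "analytics platform for hospitals"))
print(predict(model, "photo sharing app"))
```

With four training sentences this is a toy, of course. The interesting part is that nothing but the wording of the description drives the prediction, which is exactly what makes the text-centric pass a different lens from the funding and headcount fields.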

Tomorrow, we’ll use our data “centroids” to predict success, but here are the results of the “text-centric” pass:

— about the author:

Justin Hart is a senior executive consultant.
His primary objective: plumb the depths of cutting-edge technologies and translate those into C-suite strategies to improve marketing and sales teams.
shorter version: mktg + bizdev + ai
Justin is a recognized industry speaker on modern marketing trends. He is currently working with several companies applying advanced tech tools like machine learning and artificial intelligence to business funnel basics.

Justin has over 20 years of experience as a senior executive of established and start-up companies and even political campaigns (as senior digital director to the Mitt Romney campaign). He currently resides in Southern California.