Factors behind the valuations of 1,200 unicorn startups

SHAP values of industries, investors, locations, and more

Dmytro Iakubovskyi
Published in Data And Beyond
3 min read, Mar 1, 2023


Photo by June Gathercole on Unsplash

In this article, I use a public dataset of over 1,200 unicorn startups (private companies valued at over USD 1 billion) from around the world. The dataset comes from the CB Insights (Technology Market Intelligence) website and is publicly available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.

Step 1 — data preprocessing

Here, data preprocessing consists of the following steps:

  • selecting labels (unicorn valuations in billions of USD) and log10-transforming them (x -> np.log10(x), so that 1.0 billion USD maps to 0.0, 10 billion USD to 1.0, etc.);
  • excluding the high-end tail of unicorn valuations that exceed 20 billion USD;
  • replacing some of the duplicate values in the industries and investors columns;
  • extracting information about investors with CountVectorizer, keeping only the items that appear at least 20 times across the dataset;
  • encoding rare categories in the categorical variables (country, city, and industry), keeping no more than 50 distinct categories per column and at least 20 records per category.

Dmytro Iakubovskyi

Top writer in AI, Movies | Senior data scientist | Editor in Data And Beyond | https://www.linkedin.com/in/dima806/