Factors behind the valuations of 1,200 unicorn startups
SHAP values of industries, investors, locations, and more
3 min read · Mar 1, 2023
In this article, I use the public dataset of over 1,200 unicorn startups (private companies with a valuation of over USD 1 billion) around the world. The dataset is taken from the CB Insights — Technology Market Intelligence website and is publicly available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.
Step 1 — data preprocessing
Here, data preprocessing consists of the following steps:
- selecting labels (unicorn valuations in billions of USD) and log10-transforming them (x -> np.log10(x), so that 1.0 billion USD transforms to 0.0, 10 billion USD to 1.0, etc.);
- excluding the high-end tail of unicorn valuations that exceed 20 billion USD;
- replacing some of the duplicate values in the industries and investors columns;
- extracting information about investors with the help of CountVectorizer, keeping only the items with at least 20 appearances across the dataset;
- encoding the categorical variables (country, city, and industry) so that each column has no more than 50 distinct categories and each category has at least 20 records;