Predicting a Startup Valuation with Data Science

Sebastian Quintero
Jan 30, 2019

The following is a condensed and slightly modified version of a Radicle working paper on the startup economy in which we explore post-money valuations by venture capital stage classifications. We find that valuations have interesting distributional properties and then go on to describe a classical statistical model for estimating an undisclosed valuation with considerable ease. We would suggest reading the entirety of this article before using the model: this is not magic, and the details matter. With that said, grab some coffee and get comfortable. We’re going deep.


It’s often difficult to comprehend the significance of numbers thrown around in the startup economy. If a company raises a $550M Series F at a valuation of $4 billion [3] — how big is that really? How does that compare to other Series F rounds? Is that round approximately average when compared to historical financing events, or is it an anomaly?

At Radicle, a disruption research company, we use data science to better understand the entrepreneurial ecosystem. In our quest to remove opacity from the startup economy, we conducted an empirical study to better understand the nature of post-money valuations. While it’s popularly accepted that seed rounds tend to be at valuations somewhere in the $2m to the $10m valuation range [18], there isn’t much data to back this up, nor is it clear what valuations really look like at subsequent financing stages. Looking back at historical events, however, we can see some anecdotally interesting similarities.

Google and Facebook, before they were household names, each raised Series A rounds with valuations of $98m and $100m, respectively. More recently, Instacart, the grocery delivery company, and Medium, the social publishing network on which you’re currently reading this, raised Series B rounds with valuations of $400m and $457m, respectively. Instagram wasn’t too dissimilar at that stage, with a Series B valuation of $500m before its acquisition by Facebook in 2012. Moving one step further, Square (NYSE: SQ), Shopify (NYSE: SHOP), and Wish, the e-commerce company that is mounting a challenge against Amazon, all raised Series C rounds with valuations of exactly $1 billion. Casper, the privately held direct-to-consumer startup disrupting the mattress industry, raised a similar Series C with a post-money valuation of $920m. Admittedly, these are probably only systematic similarities in hindsight because human minds are wired to see patterns even when there aren’t any, but that still makes us wonder if there exists some underlying trend. Our research suggests that there is, but why is this important?

We think entrepreneurs, venture capitalists, and professionals working in corporate innovation or M&A would benefit greatly from having an empirical view of startup valuations. New company financings are announced on a daily cadence, and having more data-driven publicly available research helps anyone that engages with startups make better decisions. That said, this research is solely for informational purposes and our online tool is not a replacement for the intrinsic, from the ground up, valuation methods and tools already established by the venture capital community. Instead, we think of this body of research as complementary — removing information asymmetries and enabling more constructive conversations for decision-making around valuations.

Making Sense of Startup Valuations

We obtained data for this analysis from Crunchbase, a venture capital database that aggregates funding events and associated meta-data about the entrepreneurial ecosystem. Our sample consists of 8,812 financing events since the year 2010 with publicly disclosed valuations and associated venture stage classifications. Table I below provides summary statistics.

[Table I: summary statistics]
The sample size for the median amount of capital raised at each stage is much higher [N=84k] because round sizes are more frequently disclosed and publicly available.

To better understand the nature of post-money valuations, we assessed their distributional properties using kernel density estimation (KDE), a non-parametric approach commonly used to approximate the probability density function (PDF) of a continuous random variable [8]. Put simply, KDE draws the distribution for a variable of interest by analyzing the frequency of events much like a histogram does. Non-parametric is just a fancy way of saying that the method does not make any assumption about the data being normally distributed, which makes it perfect for exercises where we want to draw a probability distribution but have no prior knowledge about what it actually looks like.
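To make the idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written with only the Python standard library. The sample is synthetic (drawn from a log-normal distribution with a made-up median of $16m), and the function name `make_kde` and the normal-reference bandwidth rule are choices made for illustration. None of it reproduces the code or the data behind the analysis.

```python
import math
import random
import statistics

def make_kde(samples, bandwidth=None):
    """Return a function that estimates the density of `samples` (Gaussian kernel)."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb: 1.06 * sd * n^(-1/5).
        # This is an assumption for the sketch, not the article's setting.
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Average of Gaussian bumps centred on each observation.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Synthetic valuations in $ millions (hypothetical, log-normal around $16m).
random.seed(42)
valuations = [random.lognormvariate(math.log(16), 1.0) for _ in range(2000)]
log_valuations = [math.log(v) for v in valuations]

kde = make_kde(log_valuations)

# Evaluate the estimated density on a grid over the log scale.
step = 0.1
grid = [i * step for i in range(-20, 81)]        # log-valuation from -2.0 to 8.0
densities = [kde(x) for x in grid]
peak = grid[densities.index(max(densities))]
print(f"density peaks near ${math.exp(peak):.1f}m")
```

Working on the log scale, as here, is what makes the curves comparable across stages whose dollar values differ by orders of magnitude.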

The two plots immediately above and further down below show the valuation probability density functions for venture capital stages on a logarithmic scale, with vertical lines indicating the median for each class. Why on a logarithmic scale? Well, post-money valuations are power-law distributed, as most things are in the venture capital domain [5], which means that the majority of valuations are at low values but there’s a long tail of rare but exceptionally high valuation events. Technically speaking, post-money valuations can also be described as being log-normally distributed, which just means that taking the natural logarithm of valuations produces the bell curves we’re all so familiar with. Series A, B, and C valuations may be argued as being bimodal log-normal distributions, and seed valuations may be approaching multimodality (more on that later), but technical fuss aside, this detail is important because log-normal distributions are easy for us to understand using the common language of mean, median, and standard deviation — even if we have to exponentiate the terms to put them in dollar signs. More importantly, this allows us to consider classical statistical methods that only work when we make strong assumptions about normality.
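As a small illustration of why the log-normal view is convenient, the sketch below draws synthetic valuations from a log-normal distribution, summarizes them on the log scale, and exponentiates the results to get back to dollars. The $16m median and the log-scale spread of 1.0 are made-up parameters, not estimates from the sample.

```python
import math
import random
import statistics

random.seed(7)
# Hypothetical valuations in $ millions: log-normal with median 16 and log-sd 1.0.
valuations = [random.lognormvariate(math.log(16), 1.0) for _ in range(10_000)]

logs = [math.log(v) for v in valuations]
log_mean = statistics.fmean(logs)
log_sd = statistics.stdev(logs)

# Exponentiating the log-scale mean gives the geometric mean, which for a
# log-normal variable coincides with the median.
geometric_mean = math.exp(log_mean)
sample_median = statistics.median(valuations)
arithmetic_mean = statistics.fmean(valuations)

# One log-sd either side of the mean becomes a multiplicative band in dollars.
low, high = math.exp(log_mean - log_sd), math.exp(log_mean + log_sd)

print(f"geometric mean ${geometric_mean:.1f}m, median ${sample_median:.1f}m")
print(f"arithmetic mean ${arithmetic_mean:.1f}m (pulled up by the long tail)")
print(f"about 68% of values fall between ${low:.1f}m and ${high:.1f}m")
```

The gap between the arithmetic mean and the median is the long tail at work, which is why the medians are the more informative summary for each stage.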

Founders who seek venture capital to get their company off the ground usually start by raising an angel or a seed round. An angel round consists of capital raised from their friends, family members, or wealthy individuals, while seed rounds are usually a startup’s first round of capital from institutional investors [18]. The median valuation for both angel and seed is $2.2m USD, while the median valuation for pre-seed is $1.9m USD. While we anticipated some overlap between angel, pre-seed and seed valuations, we were surprised to find that the distributions for these three classes of rounds almost completely overlap. This implies that these early-stage classifications are remarkably similar in reality. That said, we think it’s possible that the angel sample is biased towards the larger events that get reported, so we remain slightly skeptical of the overlap. And as mentioned earlier, the distribution of seed stage valuations appears to be approaching multimodality, meaning it has multiple modes. This may be due to the changing definition of a seed round and the recent institutionalization of pre-seed rounds, which are equal to or less than $1m in total capital raised and have only recently started being classified as “Pre-seed” in Crunchbase (and hence the small sample size). There’s also a clear mode in the seed valuation distribution around $7m USD, which overlaps with the Series A distribution, suggesting, as others recently have, that some subset of seed rounds are being pushed further out and resemble what Series A rounds were 10 years ago [1].

Around 21 percent of seed stage companies move on to raise a Series A [16] about 18 months after raising their seed, and approximately 50 percent of Series A companies move on to a Series B a further 18–21 months out [17]. In that time the median valuation jumps to $16m at the Series A and leaps to $130m at the Series B stage. Valuations climb further to a median of $500m at Series C. In general, we think it’s interesting to see the bimodal nature as well as the extent of overlap between the Series A, B, and C valuation distributions. It’s possible that the overlap stems from changes in investor behavior, with the general size and valuation at each stage continuously redefined. Just like some proportion of seed rounds today are what Series A rounds were 10 years ago, the data suggests, for instance, that some proportion of Series B rounds today are what Series C rounds used to be. This was further corroborated when we segmented the data by decades going back to the year 2000 and compared the resulting distributions. We would note, however, that the changes are very gradual, and not as sensational as is often reported [12].

The median valuation for startups reaches $1b between the Series D and E stages, and $1.65 billion at Series F. This answers our original question, putting Peloton’s $4 billion appraisal at the 81st percentile of valuations at the Series F stage, far above the median, and indeed above the median $2.4b valuation for Series G companies. From there we see a considerable jump to the median Series H and Series I valuations of $7.7b and $9b, respectively. The Series I distribution has a noticeably lower peak in density and higher variance due to a smaller sample size. We know companies rarely make it that far, so that’s expected. Lyft and SpaceX, at valuations of $15b and $27b, respectively, are recent examples of companies that have made it to the Series I stage. (Note: In December 2018 SpaceX raised a Series J round, which is a classification not analyzed in this paper.)
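For readers who want to approximate this kind of percentile lookup without the underlying sample, one rough approach is to assume a stage's valuations are exactly log-normal. The sketch below does that for Series F using the $1.65b median quoted above. The log-scale standard deviation of 1.0 is a hypothetical value picked for illustration, and the function name is our own, so the result only approximates the empirical percentile reported in the text.

```python
import math
from statistics import NormalDist

def lognormal_percentile(value, median, log_sd):
    """Percentile rank of `value` under a log-normal with the given median and log-sd."""
    z = (math.log(value) - math.log(median)) / log_sd
    return NormalDist().cdf(z)

# Median from the article; log_sd is an assumed, illustrative parameter.
series_f_median_bn = 1.65
assumed_log_sd = 1.0

rank = lognormal_percentile(4.0, series_f_median_bn, assumed_log_sd)
print(f"A $4b Series F sits at roughly the {rank:.0%} mark under these assumptions")
```

With the real sample in hand, the empirical share of Series F valuations below $4b would be the better number. The parametric shortcut is only useful when the raw data is unavailable.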

We classified each stage into higher level classes using the distributions above, as one of Early (Angel, Pre-Seed, Seed), Growth (Series A, B, C), Late (Series D, E, F, G), or Private IPO (Series H, I). With these aggregate classifications, we further investigated how valuations have fared across time and found that the medians (and means) have been more or less stable on a logarithmic scale. What has changed, since 2013, is the appearance of the “Private IPO” [11, 13]. These rounds, described above with companies such as SpaceX, Lyft, and others such as Palantir Technologies, are occurring later and at higher valuations than have previously existed. These late-stage private rounds are at such high valuations that future IPOs, if they ever occur, may end up being down rounds [22].
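The grouping described above is easy to express as a lookup table. The sketch below is one way to encode it. The dictionary and function names are our own, and stage labels outside the twelve listed (Series J, for example) deliberately return `None` because they are not analyzed here.

```python
# Stage-to-class mapping as described in the text.
STAGE_CLASS = {
    "Angel": "Early", "Pre-Seed": "Early", "Seed": "Early",
    "Series A": "Growth", "Series B": "Growth", "Series C": "Growth",
    "Series D": "Late", "Series E": "Late", "Series F": "Late", "Series G": "Late",
    "Series H": "Private IPO", "Series I": "Private IPO",
}

def classify_stage(stage):
    """Return the aggregate class for a stage label, or None if it is not covered."""
    return STAGE_CLASS.get(stage)

for label in ("Seed", "Series F", "Series I", "Series J"):
    print(label, "->", classify_stage(label))
```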

Approximating an Undisclosed Valuation

Given the above, we designed a simple statistical model to predict a round’s post-money valuation by its stage classification and the amount of capital raised. Why might this be useful? Well, the relationship between capital raised and post-money valuation is true by mathematical definition, so we’re not interested in claiming to establish a causal relationship in the classical sense. A startup’s post-money valuation is equal to an intrinsic pre-money valuation calculated by investors at the time of investment plus the amount of new capital raised [19, 21]. However, pre-money valuations are often not disclosed, so a statistical model for estimating an undisclosed valuation would be helpful when the size of a financing round is available and its stage is either disclosed as well or easily inferred.

We formulated an ordinary least squares log-log regression model after considering that we did not have enough stage classifications and complete observations at each stage for multilevel modeling and that it would be desirable to build a model that could be easily understood and utilized by founders, investors, executives, and analysts. Formally, our model is of the form:

log(y) = α + β · log(c · r) + ε

where y is the output post-money valuation, c is the amount of capital raised, r is a binary term that indicates the financing stage, and epsilon is the error term. log(c · r) is, therefore, an interaction term that specifies the amount of capital raised at a specific stage. The model we present does not include stage main effects because the model remains the same, whether they’re left in or pulled out, while the coefficients become reparameterizations of the original estimates [23]. In other words, boolean stage main effects adjust the constant and coefficients while maintaining equivalent summed values — increasing the mental gymnastics required for interpretation without adding any statistical power to the regression. Capital main effects are not included because domain knowledge and the distributions above suggest that financing events are always indicative of a company’s stage, so the effect is not fixed, and therefore including capital by itself results in a misspecified model alongside interaction terms. Of course, whether or not a stage classification is agreed upon by investors and founders and specified on the term sheet is another matter.
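To show what a specification like this looks like in practice, the sketch below builds the design matrix (an intercept plus one stage-specific log-capital column per stage, with no stage or capital main effects) and fits it by ordinary least squares on noise-free synthetic data. Everything here is an assumption made for illustration: the three-stage subset, the "true" coefficients, the choice to code the interaction as log(c) when the stage indicator is 1 and 0 otherwise, and the pure-Python normal-equations solver. It is not the code used for the paper.

```python
import math

STAGES = ["Seed", "Series A", "Series B"]   # a subset, for brevity

def design_row(capital, stage):
    """Intercept plus one log(capital) column per stage; zero for non-matching stages."""
    log_c = math.log(capital)
    return [1.0] + [log_c if stage == s else 0.0 for s in STAGES]

def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical coefficients used only to generate noise-free synthetic data.
TRUE_ALPHA = 1.8
TRUE_BETA = {"Seed": 0.7, "Series A": 0.9, "Series B": 1.0}

observations = [(c, s) for s in STAGES for c in (1.5, 3.0, 8.0, 20.0, 60.0)]
X = [design_row(c, s) for c, s in observations]
log_y = [TRUE_ALPHA + TRUE_BETA[s] * math.log(c) for c, s in observations]

alpha_hat, *beta_hat = fit_ols(X, log_y)
print(alpha_hat, dict(zip(STAGES, beta_hat)))
```

Because the synthetic data has no noise, the fit recovers the generating coefficients, which is a quick sanity check that the design matrix encodes one shared constant and one slope per stage.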

As is standard practice, we estimated the model with heteroscedasticity-robust standard errors, and residual analysis via a fitted values versus residuals plot confirms that the model satisfies the general assumptions of ordinary least squares regression. There is no multicollinearity between the variables, and a Q-Q plot further confirmed that the data is log-normally distributed. The results are statistically significant at the p < 0.001 level for all terms with an adjusted R² of 89 percent and an F-statistic of 5,900 (p < 0.001). Table II outlines the results. Monetary values in the model are specified in millions, USD.
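The idea behind a Q-Q check can be sketched without any plotting library: sort the standardized values, pair them with the matching standard-normal quantiles, and see how close the pairs are to a straight line. The example below uses synthetic log-normal values (the $130m median and 0.9 log-sd are made up) and summarizes straightness with a correlation coefficient, which is our own shortcut standing in for a visual inspection. In a real analysis the check would typically be run on the model's residuals.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def qq_points(values):
    """Pair theoretical standard-normal quantiles with sorted, standardized values."""
    n = len(values)
    mu, sd = fmean(values), stdev(values)
    ordered = sorted((v - mu) / sd for v in values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson(pairs):
    """Pearson correlation of the (x, y) pairs; near 1 means a straight Q-Q line."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

random.seed(3)
# Synthetic valuations in $ millions (hypothetical parameters).
valuations = [random.lognormvariate(math.log(130), 0.9) for _ in range(1000)]

r_log = pearson(qq_points([math.log(v) for v in valuations]))
r_raw = pearson(qq_points(valuations))
print(f"Q-Q correlation: log scale {r_log:.3f}, raw scale {r_raw:.3f}")
```

The log-scale values hug the straight line while the raw dollar values bend away from it, which is the pattern one would expect from log-normally distributed data.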

[Table II: regression results]

The model can be interpreted by solving for y and differentiating with respect to c to get the marginal effect. Therefore, we can think of percentage increases in c as leading to some percentage increase in y. At the seed stage, for example, for a 10 percent increase in money raised a company can expect a 6.6 percent increase in their post-money valuation, ceteris paribus. That premium increases as companies make their way through the venture capital funnel, peaking at the Series I stage with a 12.4 percent increase in valuation per 10 percent increase in capital raised. In practice, an analyst could approximate an unknown post-money valuation by specifying the amount of capital raised at the appropriate stage in the model, exponentiating the constant and the beta term, and multiplying the values, such that:
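The percentage interpretation can be checked with a few lines of arithmetic. In a log-log model the exact effect of raising capital by some proportion is (1 + change)^β − 1, and multiplying β by the change is the first-order approximation that comes from the derivative. The coefficient of 0.67 below is a hypothetical value for illustration, not a Table II estimate, and the function name is our own.

```python
def valuation_change(beta, capital_change):
    """Exact proportional change in y when c changes by `capital_change` (log-log model)."""
    return (1 + capital_change) ** beta - 1

beta_example = 0.67            # hypothetical coefficient, not a Table II value
exact = valuation_change(beta_example, 0.10)
approx = beta_example * 0.10   # first-order (marginal-effect) approximation
print(f"exact: {exact:.2%}, linear approximation: {approx:.2%}")
```

For small changes the two figures are close, and the gap widens as the change in capital grows, so the exact form is the safer one for large differences in round size.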

y = exp(α) · exp(β · log(c)) = exp(α) · c^β

Using the first equation and the values in Table II, the estimated undisclosed post-money valuation of a startup after a $2m seed round is approximately $9.4m USD — for a $35m Series B, it’s $224m — and for a $200m Series D, it’s $1.7b. Subtracting the amount of capital raised from the estimated post-money valuation would yield an estimated pre-money valuation.
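The calculation described above fits in a couple of small functions. In the sketch below the constant and the seed-stage coefficient are hypothetical placeholders, picked only so that a $2m seed round lands near the article's example. They are not the published Table II estimates and should be replaced with those values before any real use. The function names are our own.

```python
import math

def estimate_post_money(capital_m, alpha, beta):
    """Post-money estimate in $ millions from y = exp(alpha) * c ** beta."""
    return math.exp(alpha) * capital_m ** beta

def estimate_pre_money(capital_m, alpha, beta):
    """Pre-money estimate: post-money minus the new capital raised."""
    return estimate_post_money(capital_m, alpha, beta) - capital_m

# Hypothetical constant and seed-stage coefficient; replace with Table II values.
ALPHA, BETA_SEED = 1.78, 0.66

post = estimate_post_money(2.0, ALPHA, BETA_SEED)
pre = estimate_pre_money(2.0, ALPHA, BETA_SEED)
print(f"post-money ~ ${post:.1f}m, pre-money ~ ${pre:.1f}m")
```

Each stage would use its own coefficient, so in practice the beta argument comes from a per-stage lookup rather than a single constant.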

Can it really be that simple? Well, that depends entirely on your use case. If you want to approximate a valuation and don’t have the tools to do so, and can’t get on the phone with the founders of the company, then the calculations above should be good enough for that purpose. If instead, you’re interested in purchasing a company, this is a good starting point for discussions, but you probably want to use other valuation methods, too. As mentioned earlier, this research is not meant to supplant existing valuation methodologies established by the venture capital community.

As for estimation errors, you can infer from the scatter plot above that predictions at the early stages can be off by a few million dollars, growth-stage predictions by a few hundred million, and late and private IPO stage predictions by a few billion. Of course, the accuracy of any prediction depends on the reliability of the estimated means, i.e., the credible intervals of the posterior distributions under a Bayesian framework [6], as well as the size of the error from omitted variable bias, which is not insignificant. We can reformulate our model in a directly comparable probabilistic Bayesian framework, in vector notation, as:

log(y) | β, σ², X ~ N(Xβ, σ²I)

where the distribution of log(y) given X, an n × k matrix of interaction terms, is normal with a mean that is a linear function of X, observation errors are independent and of equal variance, and I represents an n × n identity matrix. We fit the model with a non-informative flat prior using the No-U-Turn Sampler (NUTS), an extension of the Hamiltonian Monte Carlo MCMC algorithm [9], for which our model converges appropriately and has the desirable hairy caterpillar sampling properties [6].

The 95 percent credible intervals in Figure V suggest that posterior distributions from angel to Series E, excluding pre-seed, have stable ranges of highly probable values around our original OLS coefficients. However, the distributions become more uncertain at the later stages, particularly for Series F, G, H, and I. This is expected given the small sample sizes for the pre-seed class and for the later stages. Because predictions must be transformed back to the original scale, and because late-stage rounds tend to be very large, small changes in the exponent lead to dramatically different predictions. As with any simple tool, then, your mileage may vary. For more accurate and precise estimates, we’d suggest hiring a data scientist to build a more sophisticated machine learning algorithm or Bayesian model to account for more features and hierarchy. If your budget doesn’t allow for it, the simple calculation using the estimates in Table II will get you in the ballpark.
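A quick calculation shows why coefficient uncertainty bites hardest at the late stages. The sketch below pushes the two ends of an interval for β through the back-transformed prediction, once for a large round and once for a small one. The constant, the interval endpoints, and the round sizes are all hypothetical and are not read off Figure V or Table II.

```python
import math

def predict(capital_m, alpha, beta):
    """Back-transformed prediction in $ millions: exp(alpha) * c ** beta."""
    return math.exp(alpha) * capital_m ** beta

# Hypothetical values for illustration only.
ALPHA = 1.8
beta_low, beta_high = 1.1, 1.3
capital = 500.0   # $ millions

low, high = predict(capital, ALPHA, beta_low), predict(capital, ALPHA, beta_high)
print(f"${low / 1000:.1f}b to ${high / 1000:.1f}b, a {high / low:.1f}x spread")

# The same interval width matters far less for a small round.
small_low, small_high = predict(2.0, ALPHA, beta_low), predict(2.0, ALPHA, beta_high)
print(f"${small_low:.1f}m to ${small_high:.1f}m, a {small_high / small_low:.2f}x spread")
```

The spread between the two predictions equals the round size raised to the width of the interval, so the larger the round, the more a given amount of uncertainty in β is amplified.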

Concluding Remarks

This paper provides an empirical foundation for how to think about startup valuations and introduces a statistical model as a simple tool to help practitioners working in venture capital approximate an undisclosed post-money valuation. That said, the information in this paper is not investment advice, and is provided solely for educational purposes from sources believed to be reliable. Historical data is a great indicator but never a guarantee of the future, and statistical models are never correct — only useful [2]. This paper also makes no comment on whether current valuation practices result in accurate representations of a startup’s fair market value, as that is an entirely separate discussion [7].

This research may also serve as a starting point for others to pursue their own applied machine learning research. We translated the model presented in this article into a more powerful learning algorithm [8] with more features that fills in the missing post-money valuations in our own database. These estimates are then passed to Startup Anomaly Detection™, an algorithm we’ve developed to estimate the plausibility that a venture-backed startup will have a liquidity event such as an IPO or acquisition given the current state of knowledge about them. Our machine learning system appears to have some similarities with others recently disclosed by GV [15], Google’s venture capital arm, and Social Capital [14], with the exception that our probability estimates are available as part of Radicle’s research products.

Companies will likely continue raising even later and larger rounds in the coming years, and valuations at each stage may continue being redefined, but now we have a statistical perspective on valuations as well as greater insight into their distributional properties, which gives us a foundation for understanding disruption as we look forward.

Bibliographic references are available in the full PDF.

Journal of Empirical Entrepreneurship
