A Quantitative Approach to Sourcing Deal Flow
Link to repository: https://github.com/noah40povis/Build-Week-2.git
VC firms have been using sourcing algorithms for a few years now because these algorithms make them more efficient at identifying potential investments. I do not know to what extent these firms rely on predictive modeling to find the next Uber, because VCs like to keep their secret sauce internal. I thought that building a basic predictive model would help me better understand the potential upside and downside these models could provide for these firms.
I would assume top-tier VC firms gather data from the startups they invest in or consider investing in, from advanced web-crawling techniques, and from platforms like PitchBook and Crunchbase. Knowing this, and understanding my limited capabilities (and limited funds) as a young data scientist, I settled on using Crunchbase's trial period and retrieved all the companies that raised a Seed or Series A round from Dec 31, 2018 to the present.
I wanted to see if the data I gathered could help me predict whether or not a company that raised a Seed round will go on to raise a Series A. My original dataframe contained 27,131 rows and 18 columns. To prepare my data for modeling, besides the standard cleaning and wrangling tasks, I had to create my target vector, 'Raised Series A', by identifying all the companies that raised a Series A and labeling them as 1 and the companies that did not as 0. In addition, I took the investor and industry columns, each of which contained multiple comma-separated investors/industries, and split them into their own columns in order to potentially extract more information from them through categorical encoding. I also knew that I would run into leakage, so I ran a bare-bones XGB classifier a few times to pick off the columns that were giving me an accuracy of over 99%. Two columns that stuck out as leakage columns were 'Total Funding Raised' and 'Total Funding Amount Currency (in USD)'. At this point my dataframe contained 12,623 rows and 24 columns, and the model's accuracy score was 95%.
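The labeling and column-splitting steps described above might look roughly like the following sketch. The column names `Funding Type` and `Investor Names` and the toy rows are assumptions for illustration, not the actual Crunchbase export schema.

```python
import pandas as pd

# Toy stand-in for the Crunchbase export; column names and rows are hypothetical.
rounds = pd.DataFrame({
    "Organization Name": ["Acme", "Acme", "Birdly", "Cortex"],
    "Funding Type": ["Seed", "Series A", "Seed", "Seed"],
    "Investor Names": ["Alpha Fund, Beta Capital", "Gamma Ventures",
                       "Alpha Fund", "Delta Partners, Beta Capital, Angel X"],
})

# Companies with at least one Series A round on record.
series_a_orgs = set(
    rounds.loc[rounds["Funding Type"] == "Series A", "Organization Name"]
)

# Keep one row per Seed round and label it 1 if the company also raised a Series A.
seed = rounds[rounds["Funding Type"] == "Seed"].copy()
seed["Raised Series A"] = seed["Organization Name"].isin(series_a_orgs).astype(int)

# Split the comma-separated investor string into Investor1, Investor2, ...
investors = (
    seed["Investor Names"]
    .str.split(",", expand=True)
    .apply(lambda col: col.str.strip())
)
investors.columns = [f"Investor{i + 1}" for i in range(investors.shape[1])]
seed = pd.concat([seed.drop(columns="Investor Names"), investors], axis=1)

print(seed)
```

Companies with fewer investors than the widest row end up with missing values in the trailing columns, which is one reason this kind of split produces sparse, high-cardinality features.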
The first issue I came across after cleaning up my data was determining how to deal with the inherent skewness of my target vector. In general, startups are known to fail at an alarmingly high rate, which is why it is so important for VCs to source multiple deals to diversify their risk.
According to my data, of the startups from around the globe that raised a Seed round over the last two years, only 6% went on to raise an additional round of funding.
I considered rebalancing my target vector to increase my prediction accuracy, but that would be unrealistic and would devalue my model's real-world application. Instead, I looked into how a binary classification model should be assessed when the data is skewed.
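To see why accuracy alone is misleading at this level of imbalance, consider a toy sample with the same 6% positive rate. The 1,000-row sample below is illustrative only and is not the real dataset.

```python
# Illustrative labels with a 6% positive rate (not the actual dataset).
n_total = 1000
n_positive = 60
y_true = [1] * n_positive + [0] * (n_total - n_positive)

# A "model" that always predicts the majority class: no company raises a Series A.
y_pred = [0] * n_total

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_positive

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.94, recall=0.00
```

A model that never flags a single company still scores 94% accuracy, so a high accuracy number says very little about whether the model finds the companies that actually go on to raise.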
I found three models worth looking into for this project: CatBoost, XGBoost, and Logistic Regression. I would measure their performance by their recall and precision scores. I chose these two metrics because of their real-world implications for identifying potential investments in venture capital. Foremost, a false negative is far more costly than a false positive in venture capital, because it is often a single investment that returns 100x or even 1000x the initial investment (take Accel's initial investment in Facebook compared to its other investments around that time). Thus my goal will be to optimize for recall over precision.
Furthermore, VCs invest thousands to millions of dollars in individual funding rounds, only for a majority of those companies to fail. VCs promise their investors a return, so the consequences of false positives can also be very costly. Take WeWork or Theranos as prime examples of investments that cost private investors millions of dollars as well as a hit to many of their reputations. This also validates the need for models that take an additional look at a company, because (historically) individuals get blindsided by the emotions of investing in the "next big thing."
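As a concrete reference for the two metrics, the sketch below computes precision and recall by hand and shows how lowering the decision threshold trades precision for recall. The labels and probabilities are made up for illustration, and `precision_recall` is a hypothetical helper, not a function from the project.

```python
def precision_recall(y_true, y_prob, threshold):
    """Return (precision, recall) for predictions made at the given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = raised a Series A) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

strict = precision_recall(y_true, y_prob, threshold=0.5)    # 2 of 3 positives found
lenient = precision_recall(y_true, y_prob, threshold=0.25)  # all 3 positives found

print("threshold 0.50:", strict)
print("threshold 0.25:", lenient)
```

At the lower threshold every true positive is caught (recall of 1.0), but two extra companies are flagged incorrectly, so precision falls from 2/3 to 3/5. That is the trade being accepted when recall is prioritized.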
The first model I ran was an XGB classifier, chosen for its speed and its built-in regularization parameters, which help prevent the model from overfitting. I also added RandomizedSearchCV to randomly search for the best-fitting parameters, particularly alpha and lambda, which both contribute to the regularization of my model.
The results did not surpass my baseline, but they were a great starting point for identifying potential reasons why.
I had hypothesized that my model would do poorly on the first try, so the next thing I wanted to add was an NLP technique, topic modeling, on my company description column. I came to this conclusion mainly because I felt I was lacking information that could provide true predictive power.
After preprocessing the Organization Description column, I plotted a distribution plot because I wanted to see how description length compared between companies that did and did not go on to raise a Series A. Companies that went on to raise an additional round had, on average, lengthier descriptions than companies that did not.
Continuing the exploratory analysis, I wanted to see which words were used most frequently, to spot any underlying trends in companies' business descriptions. I expected to see words related to the hot topics of the last few years, which I assumed would be Big Data, AI, Social Media, Healthcare, and Mobile. All of these topics were referenced in the word count above.
After fitting the description column with LDA (LatentDirichletAllocation), I obtained 10 topics (columns), which I attached to my original dataframe.
Original Dataframe Shape: (12623, 24) New Dataframe Shape: (12623, 34)
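The topic-modeling step, which adds ten columns to the dataframe, can be sketched as follows. The four descriptions, the `Topic1` to `Topic10` column names, and the use of `CountVectorizer` with English stop words are assumptions for illustration. Only `LatentDirichletAllocation` with 10 topics comes from the write-up.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical descriptions standing in for the 'Organization Description' column.
df = pd.DataFrame({
    "Organization Description": [
        "AI platform for healthcare data analytics",
        "Mobile app for social media marketing",
        "Big data analytics for retail companies",
        "Healthcare software for mobile patient monitoring",
    ]
})

# Bag-of-words counts, then a 10-topic LDA model.
counts = CountVectorizer(stop_words="english").fit_transform(
    df["Organization Description"]
)
lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_weights = lda.fit_transform(counts)  # one row per company, one column per topic

# Attach the per-document topic weights as ten new columns.
topic_cols = [f"Topic{i + 1}" for i in range(topic_weights.shape[1])]
topics = pd.DataFrame(topic_weights, columns=topic_cols, index=df.index)
df_with_topics = pd.concat([df, topics], axis=1)

print(df.shape, df_with_topics.shape)  # (4, 1) (4, 11)
```

Each row of `topic_weights` is a probability distribution over the ten topics, so the new columns are numeric features that a tree-based model can consume directly.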
This time around my model's precision and recall scores were worse. At this point I wanted to know whether there was still any potential leakage and whether my topic modeling was actually effective. Therefore, I ran a feature importance analysis.
From the looks of the above graph, there seemed to be issues with the high cardinality of the investor and industry columns, because it would not make sense for their importance to rank above a column like 'Money Raised'. As a result, I realized that the way I had wrangled the original 'Investor' and 'Industry' columns, which both contained comma-separated strings, was a mistake. I had thought that by splitting each column into one column per individual string I could capture more information than the single column held. Yet when I went back, undid this step, and instead ran the two columns through LDA, the model produced an even worse result than my initial score. Additionally, 'Organization Name' should not hold such high importance unless it is leaking information, so I dropped this column as well.
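One quick way to spot near-identifier columns such as 'Organization Name' is to compare each column's number of distinct values with the number of rows. The sketch below is a complementary diagnostic rather than the feature importance analysis itself. The rows and the 0.9 cutoff are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows; the column names follow the ones mentioned in this post.
df = pd.DataFrame({
    "Organization Name": ["Acme", "Birdly", "Cortex", "Dynamo", "Echo", "Fable"],
    "Investor1": ["Alpha Fund", "Alpha Fund", "Beta Capital",
                  "Gamma", "Delta", "Alpha Fund"],
    "Primary Industry": ["Software", "Health", "Software",
                         "Software", "Health", "FinTech"],
})

# Fraction of rows that hold a distinct value in each column.
# A ratio near 1.0 means the column is close to a unique row identifier.
cardinality_ratio = (df.nunique() / len(df)).sort_values(ascending=False)
print(cardinality_ratio)

# Flag near-identifier columns; the 0.9 cutoff is an arbitrary illustrative choice.
suspect_columns = cardinality_ratio[cardinality_ratio > 0.9].index.tolist()
print(suspect_columns)
```

A column in which nearly every value is unique lets a tree-based model memorize individual rows, which can inflate its apparent importance without adding real predictive power.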
Therefore, after adjusting my feature list, I ended up using the following columns: 'Money Raised Currency (in USD)', 'Organization Location', 'Primary Industry', 'Investor1', and the topic-modeled columns 1 through 10.
To my delight I saw a huge gain in both precision and recall scores as well as a reduction in accuracy (reduced overfitting?).
When I redid my visualizations, all three showed significant improvements and gave a clearer picture of why. My model predicted 23 more true positives, which greatly improved recall and precision, the two metrics I was most focused on improving.
I continued to experiment with a CatBoost classifier as well as Logistic Regression, but I was unable to find anything that beat the score above.
After reviewing my results and thinking about how I could improve this model, I have come up with a game plan for next steps. I want to gather another round of US startups that have raised a Seed round, but also get information on their founders as well as information about each company that may be available on Twitter. Both of these require web crawling and scraping techniques, which I will focus on learning so I can continue to build on the progress I made with this model. I want to keep testing my hypotheses about which feature combinations help determine whether or not a company will successfully grow past the Seed stage. All this exploration and testing is so I can eventually deploy a model that allows a VC to plug in a company's information and see its likelihood of succeeding in comparison to companies in a similar domain. In addition, it will feed the VC information on any companies the model considers relevant for further investigation. This model will continually crawl the web for information on new and current companies, and will also allow the user to update information from their end. I am excited to continue to grow in my DS journey!