How to Assess Startups Using Machine Learning: Part III — The GASP For Predictive Modeling
How to Assess Startups Using Machine Learning
Part I — Introduction
Part II — The GASP Open Source Framework
Part III — The GASP for Predictive Modelling
The GASP framework to standardize data collection
In Part II, we explained why data collection needs to be standardized and how to do it with our GASP open-source framework. The ultimate goal of this framework is to let any investor build their own database of quantifiable startup information. But no matter how rich and up to date your startup database is, if you only use it to store information and glance at it from time to time, you are missing half the value you can generate from it. Always remember the golden rule: data is only valuable when transformed into insights, so don’t be a data hoarder. Indeed, once you have collected a good amount of startup data with the GASP, your ability to derive insights manually naturally diminishes. This is where machine learning enters the arena!
In this article we are going to look at how machine learning can help you make sense of all of your startup information, a task better suited to machines than to manual analysis. Check Part I of our series to weigh the pros and cons of using predictive modeling in this particular context! Today, we will illustrate the behind-the-scenes predictive modeling process using the PreSeries Analyst Platform (powered by BigML). Investors and other business users usually prefer to use our Dashboard or to have the predictive models working directly inside their in-house software, connected to their CRM or other tools. More about how results can be accessed at the end of the article.
The PreSeries platform abstracts most of the complexity inherent to predictive modeling and allows us to focus on what matters most: easily generating predictions, not data engineering!
Step 1 — from GASP spreadsheets to datasets
If you are already using the GASP open-source framework, you should already have a good number of startup metrics stored internally, in spreadsheets or in a database. Below is an example of what a customized GASP can look like in a spreadsheet. For the sake of simplicity, we have reduced the number of features to 20 and the number of time periods covered to 2 (see Part II for the whole picture).
The particular example above covers one startup that raised a Seed round. In real life, you’ll be dealing with many startups at different points in time, and that’s where spreadsheets clearly show their limitations. Spreadsheets (or forms in general) are great for inputting data, but if you have hundreds of GASPs or more to deal with, they become impossible to manage. That is why we recommend that the information collected ultimately be stored in your own private database, like the ones we deploy at PreSeries (ask us, we’ll automate it for you). With everything under one roof, you can combine it all into one dataset, a necessary step before any predictive modeling task can take place.
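As a rough illustration of what "combining everything in one dataset" can mean in practice, the sketch below merges several per-startup GASP exports into a single table with one row per startup and period. The column names (period, monthly_revenue, team_size), the startup names and the CSV layout are assumptions made up for this example; they are not the actual GASP schema.

```python
import csv
import io

# Hypothetical GASP exports, one CSV per startup (inlined so the sketch runs as-is).
GASP_EXPORTS = {
    "acme": "period,monthly_revenue,team_size\n2018-Q1,12000,4\n2018-Q2,15000,5\n",
    "globex": "period,monthly_revenue,team_size\n2018-Q1,,3\n2018-Q2,4000,3\n",
}


def combine_gasp_exports(exports):
    """Merge per-startup CSV exports into one list of rows, tagging each row with its startup."""
    rows = []
    for startup, csv_text in exports.items():
        for record in csv.DictReader(io.StringIO(csv_text)):
            row = {"startup": startup}
            row.update(record)
            rows.append(row)
    return rows


def write_dataset(rows, handle):
    """Write the combined rows as a single CSV file-like object."""
    fieldnames = ["startup", "period", "monthly_revenue", "team_size"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


dataset = combine_gasp_exports(GASP_EXPORTS)
buffer = io.StringIO()
write_dataset(dataset, buffer)
print(buffer.getvalue())
```

Missing values (such as the empty revenue cell for the second startup) are kept as empty strings rather than guessed, so they can be handled explicitly later on.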
Now that you have everything ready to roll, we need an objective. What do we want to predict? The lack of a clear question to answer is the 4th most common challenge faced by machine learning practitioners. The objective needs to be clearly defined for any modeling approach to make sense, and the answer will depend on your investment criteria as a VC.
For example, if you are a seed investor, maybe the most important question you want answered is: “How likely is this applicant to raise a Seed round?”
For the purpose of this article, let’s pretend you are a seed investor (Wait a minute! Not a lot of data is available about seed startups… Bear with us!) and you are flooded with requests from entrepreneurs desperate to get an interview with you. “What’s the problem?” you might ask yourself. “The more applicants, the more to choose from. Abundance isn’t an issue!” Well, getting in touch these days is easy, but processing dealflow at the speed required not to miss a good investment opportunity is not.
It doesn’t matter how many opportunities you have coming your way if you miss hidden gems. Moreover, you cannot interview everybody: you need a reliable pre-selection process so that you only allocate time to the most promising applicants. In other words, the question you want to answer is: Should I interview this startup? To answer this question, let’s start with the data. The dataset we’ll use as an example is partly based on Angel List data (San Francisco startups) retrieved from Kaggle.
Step 2 — From dataset to predictive model
At PreSeries we automate all of the following for you, but here’s a sneak peek at our machine learning approach using the Analyst Platform. PreSeries users have access to the Analyst Platform (Dashboard & API), including all the online data and the predictive models privately generated for you. For more accurate predictive models, you are invited to include your own data, and if you are in the market for a custom private deployment, ask for our PreSeries OS solution here.
First things first: importing your dataset manually is really simple on the PreSeries Analyst Platform (an API is available for automation). You just go to “Sources” and click on “Create a source”. For the sake of simplicity, let’s assume your historical dealflow data has already been exported to a .csv file (see previous step). Now you need to transform your source into a machine-learning-ready dataset by selecting the features you want to influence your predictive models. The dataset we’re using already includes all the variables we want, so no preparation work is needed for this example. PreSeries can handle different types of fields: text, numerical, categorical, and more. Take a look at the dataset and see!
At this point, you probably want to take some time to explore your dataset: check for missing values and errors, or analyze the distribution of each field. If you want to create a new variable out of the current ones (such as a ratio or a label) or import features from another dataset, this is also the right moment. Everything can be done on the platform by point-and-click, no coding required.
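For readers who like to see what this exploration amounts to outside of a point-and-click interface, here is a minimal sketch that counts missing values per field and derives a ratio feature. The field names and values are hypothetical, and this is not how the platform does it internally; it only illustrates the idea.

```python
# Hypothetical application rows; None marks a missing value.
sample_rows = [
    {"startup": "acme", "monthly_revenue": 15000.0, "team_size": 5},
    {"startup": "globex", "monthly_revenue": None, "team_size": 3},
    {"startup": "initech", "monthly_revenue": 8000.0, "team_size": 4},
]


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


def add_revenue_per_employee(rows):
    """Return new rows with a derived ratio; the ratio stays None when an input is missing or zero."""
    enriched = []
    for row in rows:
        revenue, team = row["monthly_revenue"], row["team_size"]
        ratio = revenue / team if revenue is not None and team else None
        enriched.append({**row, "revenue_per_employee": ratio})
    return enriched


print(missing_counts(sample_rows))
enriched_rows = add_revenue_per_employee(sample_rows)
print(enriched_rows)
```

Leaving the derived ratio empty when an input is missing, instead of filling in a default, keeps the gap visible to whatever model is trained afterwards.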
Always remember: before generating a predictive model, you want to split the dataset into a training dataset and a testing dataset. The model will be “trained” on the training dataset and put to the test on the testing one in order to assess its accuracy. PreSeries offers a 1-click Training|Testing split, so you don’t have to worry about doing it manually.
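If you were to do the split by hand, it could look roughly like the sketch below. The 80/20 ratio, the fixed random seed and the field name invited_to_interview are assumptions chosen for the example, not a description of what the 1-click split does.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical historical applications.
applications = [{"id": i, "invited_to_interview": i % 3 == 0} for i in range(50)]
training, testing = train_test_split(applications)
print(len(training), len(testing))
```

Shuffling before splitting matters when the rows are ordered (for instance by application date); otherwise the testing set may not look like the training set.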
Now we are ready to train a model. There is a wide variety of predictive models you can generate on the PreSeries platform. Because our problem is a classification problem (Interview? Yes or No), we need to opt for a classification model, here a decision tree. Not sure which model to choose? The OptiML feature will automatically generate a wide range of predictive models (ensembles, logistic regression, deepnets, …) and benchmark them to figure out which one performs best, all in one click.
Now, let’s make sure we select the appropriate objective field, aka what you want to predict. In this case, it is the field “invited to interview”. Then, simply click on “1-click model” to generate a decision tree without having to think about configuration parameters.
Et voilà! Below you can see the decision tree model that we just created from our historical dealflow application dataset. On the left side is a representation of the model, showing the probabilities as you walk down the nodes. On the right is a model summary that helps you understand which variables seem to be best at predicting our objective variable (that is, whether we should invite them to an interview).
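To give a feel for what happens under the hood when a decision tree is trained, here is a minimal sketch of how a single split on a numeric field can be chosen by minimizing Gini impurity. This is a generic textbook criterion and not necessarily the one used by the platform; the field name team_size and the four sample rows are made up for the illustration.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means all labels agree)."""
    if not labels:
        return 0.0
    positive_share = sum(labels) / len(labels)
    return 2 * positive_share * (1 - positive_share)


def best_threshold(rows, field, target):
    """Find the threshold on a numeric field that minimizes the weighted Gini impurity."""
    best = (None, float("inf"))
    values = sorted({row[field] for row in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [row[target] for row in rows if row[field] <= threshold]
        right = [row[target] for row in rows if row[field] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical training rows.
tree_rows = [
    {"team_size": 1, "invited": False},
    {"team_size": 2, "invited": False},
    {"team_size": 3, "invited": True},
    {"team_size": 5, "invited": True},
]
threshold, impurity = best_threshold(tree_rows, "team_size", "invited")
print(threshold, impurity)
```

A full tree repeats this search on each resulting branch and across all fields until a stopping rule is met; the fields that produce the most useful splits are the ones a model summary tends to rank as most important.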
Ok, now that we have a model to predict… is it any good? We need to make an evaluation using our testing dataset to see how good the predictions actually are.
As you can see in the confusion matrix above, there are many metrics you can look at to decide whether your model is good enough to put into production. Which metric should you focus on to evaluate the quality of the model? Well, that will depend on what represents a greater cost for you. In our example the “concept” of cost is related to the resources required to run the interviews: time, money, etc.
We can encounter two types of errors. The first is missing an interview with a startup that we never should have passed on: a False Negative, also known as a Type II error (failing to reject the null hypothesis when it is false). The second is holding an interview with a startup that is really not worth our time: a False Positive, also known as a Type I error (rejecting the null hypothesis when it is true). Here the null hypothesis is that the startup is not worth interviewing.
If we want to minimize the number of False Negatives, we should focus on the Recall metric (the higher the better). If instead our main concern is to reduce the number of False Positives, we should focus on the Precision metric (again, the higher the better).
If you want to focus on the overall performance of the model, that is, the best trade-off between both types of errors, you should focus on the F-measure or the Phi Coefficient. The F-measure is the harmonic mean of Recall and Precision. The Phi Coefficient also explicitly takes the True Negatives into account, giving significant weight to not scheduling interviews with startups that we shouldn’t.
Looking at the Precision, 17 out of every 20 interviews predicted by the model are ones we should pursue. And if we look at the Recall, we can see that we are losing 3 interview opportunities for every 10 startups we should have interviewed.
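All of these metrics can be computed directly from the four cells of a confusion matrix, as in the sketch below. The counts are hypothetical: they were picked only so that Precision comes out at 17 out of 20 and Recall at 7 out of 10, in line with the figures quoted in the text, and they are not the actual evaluation results. The Phi Coefficient is computed here with the standard formula, also known as the Matthews correlation coefficient.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute Precision, Recall, F-measure (F1) and Phi from confusion-matrix counts.

    Assumes every denominator is non-zero (each class is both present and predicted at least once).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-measure: harmonic mean of Precision and Recall.
    f_measure = 2 * precision * recall / (precision + recall)
    # Phi also uses the true negatives, unlike the three metrics above.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denominator
    return {"precision": precision, "recall": recall, "f_measure": f_measure, "phi": phi}


# Hypothetical counts: true positives, false positives, false negatives, true negatives.
metrics = classification_metrics(tp=119, fp=21, fn=51, tn=309)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a Precision of 0.85 and a Recall of 0.70, the F-measure lands at roughly 0.77, closer to the lower of the two values, which is what a harmonic mean is designed to do.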
Based on these numbers, we could enter a spiral of optimization, training the model again and again with different combinations of features, different model parameters, additional features, etc. But this is beyond the scope of this post.
Step 3 — Seamlessly make predictions about your inbound dealflow
Now we know that we can trust our model, so let’s put it into practice. There are multiple ways in which PreSeries can work at your organization (get in touch here to learn more!).
PreSeries models are easily exportable to Google Spreadsheets, Microsoft Excel, or standalone. Here is how it works in Google Spreadsheets. First, open the spreadsheet where your dealflow data lives and load the PreSeries add-on. Note that, of course, there should be a field named “Invited to interview?”, as in your historical dataset.
You log in using your PreSeries API credentials and you can now access your model library. Find the appropriate model and click on predict.
The model automatically adds predictions to the column left blank, recommending whether or not to interview each company, along with the confidence of each prediction. Now you can filter those companies and start scheduling the meetings. Ready to save time and free your analysts from tedious work? Get in touch here!
You made it, you read it all. Congratulations! You might now be thinking: is it a good fit for me? Well, if you are an angel investor with a large volume of deals to take care of, an early-stage venture capital firm, or a growth venture firm, we have the right solution for you. You can save on management fees by leveraging more data and predictive modelling, while making sure no good investment opportunity gets lost. It’s not very expensive and, as you can see, it works in very concrete ways, away from the machine learning hype.
We will be developing more case studies like this one on how to leverage the PreSeries data to train recommendation models to find investors, competitors or acquirers for your portfolio companies. Opportunities are endless!
We offer a discount to VCs interested in a data partnership with us. Contact us here.