A Quantitative Approach to Sourcing DealFlows Pt. 2
Testing Natural Language Processing and Latent Dirichlet Allocation (Topic Modeling/Unsupervised Learning)
This is part 2 of my journey to build a machine learning model that predicts whether a startup will go on to raise an additional round of funding after raising a seed round. This time around I decided to use natural language processing (NLP) and topic modeling with Latent Dirichlet Allocation (LDA).
Quick recap: Venture capital is a form of private equity financing that investors provide to startups and small businesses that are believed to have long-term growth potential. After researching the industry, I noticed that machine learning is underutilized at most institutions, largely because of a lack of knowledge of, or dedication to, research. I took this observation as an opportunity to do some research of my own and test how feasible it is to answer this type of question.
In my last blog post I wrote about how my best results came from an XGBoost classifier combined with topic modeling of a single text column. After reviewing those results, I thought it could be beneficial to apply topic modeling to additional text columns and see whether it would increase my recall score.
Topic modeling is an unsupervised method that scans a set of documents, detects word and phrase patterns within them, and automatically clusters the word groups and similar expressions that best characterize those documents into a given number of topics. In this experiment, I am using topic modeling (LDA) for dimensionality reduction because my last model's accuracy (90%) pointed to an overfitting issue. In that model, I categorically encoded my text columns, which produced a very large number of features and failed to capture the predictive power that I believed these columns hold because of their real-world implications.
LDA works by assigning topics to arrangements of words, and it rests on two main assumptions. The first is that documents are written with arrangements of words, and those arrangements determine the topics of the document. The second is that every word in a document can be assigned a probability of belonging to a topic. However, a major downside of LDA is that it ignores syntactic information such as word order and treats each document as a bag of words.
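To make the bag-of-words limitation concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the same tool used in step 3 below). The two sentences are made up for illustration and are not from my dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words in a different order.
docs = ["startup acquires bank", "bank acquires startup"]

counts = CountVectorizer().fit_transform(docs).toarray()

# Word order is discarded, so both documents get the same count vector.
identical = bool((counts[0] == counts[1]).all())
print(identical)  # True
```

Because the two rows are identical, any model built on these counts, LDA included, cannot tell who acquired whom.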
Step-by-step guide to building my model, with results:
I used the same data that I originally trained my previous model on, which was a dataset from Crunchbase that contained 27,131 rows of different companies and 18 columns with information on each one.
1. The first thing I did was apply a function I created to wrangle the data in order to set it up with the correct features and target vector I planned on inputting into my model.
My resulting data frame contained 7 columns.
2. The next step was to apply some preprocessing NLP techniques on the text columns.
In this step I imported WordNetLemmatizer from the NLTK library. Lemmatization is the natural language preprocessing technique that takes a word such as "tackling" and converts it to its base form, "tackle". This is useful for machine learning because it maps the inflected forms of a word to a common base form, so related words are counted as the same token instead of as separate features. Note that WordNetLemmatizer does not read the surrounding sentence; it relies on a part-of-speech tag and treats every word as a noun unless told otherwise, so "tackling" only becomes "tackle" when it is lemmatized as a verb.
3. I then imported CountVectorizer from sklearn and applied it to the Organization Description column.
The purpose of CountVectorizer is to convert the text into a matrix of token counts, with one row per document and one column per word in the vocabulary. This is a mandatory preprocessing step before I put the column through the LDA model, because sklearn's LDA expects a numeric document-term matrix rather than raw text.
4. Now that I had the text prepared for LDA, I imported LatentDirichletAllocation from sklearn and fit the LDA model to my document-term matrix (DTM). I chose 10 topics somewhat arbitrarily after testing a few different numbers. Once the model was fitted, I transformed the DTM into per-document topic weights and converted the result to a data frame that I concatenated to my wrangled data frame from above.
5. I repeated steps 3 and 4 for Investor Name and Organization Industries.
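Repeating the two steps per column is easier with a small helper. The function name `add_topic_features`, the column names, and the sample rows are all hypothetical; this is a sketch of the idea rather than my exact code.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def add_topic_features(df, column, n_topics=10, random_state=42):
    """Replace one text column with n_topics LDA topic-weight columns."""
    dtm = CountVectorizer().fit_transform(df[column].fillna(""))
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=random_state)
    weights = lda.fit_transform(dtm)
    names = [f"{column}_topic_{i}" for i in range(n_topics)]
    topics = pd.DataFrame(weights, columns=names, index=df.index)
    return pd.concat([df.drop(columns=column), topics], axis=1)


# Hypothetical rows; the real data has far more companies.
df = pd.DataFrame({
    "organization_description": [
        "Mobile payments platform for small businesses",
        "Telehealth platform connecting patients and doctors",
        "Fraud analytics for online payments",
        "Remote patient monitoring for hospitals",
    ],
    "investor_name": [
        "Alpha Ventures, Beta Capital",
        "Gamma Partners",
        "Alpha Ventures",
        "Beta Capital, Delta Fund",
    ],
    "organization_industries": [
        "FinTech, Payments",
        "Health Care, Telehealth",
        "FinTech, Analytics",
        "Health Care, Medical Device",
    ],
    "raised_next_round": [1, 0, 1, 0],
})

text_columns = ["organization_description", "investor_name",
                "organization_industries"]
features = df
for column in text_columns:
    features = add_topic_features(features, column)

print(features.shape)
```

Each text column gets its own vectorizer and its own LDA model, so the topics learned from investor names stay separate from the topics learned from descriptions.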
6. Now that I had run topic modeling on all three of my text columns, I was ready to split the data frame into train and test observations.
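A sketch of the split is shown here on random placeholder data. The 80/20 ratio, the stratification, and the `random_state` are assumptions for illustration, not settings taken from my notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder features and target standing in for the real data frame.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 5)),
                 columns=[f"topic_{i}" for i in range(5)])
y = pd.Series((rng.random(200) < 0.3).astype(int), name="raised_next_round")

# Stratifying keeps the share of positive examples similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Stratifying matters when recall is the target metric, because a test set with too few positive examples makes the recall estimate noisy.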
7. Once I had my data test-ready, I created my machine learning pipeline that utilized XGBClassifier and Randomized Search Cross-validation.
8. Results after running the model on my test data are below.
Here are the results from my previous model.
As you can see, by running LDA on two additional text columns I was able to get a better recall score, which was the metric I was optimizing for. In addition, my accuracy came down to a more realistic 83%, which suggests a reduction in variance. Overall, these results make me much more inclined to use my text data as fully as possible when creating predictive models.
In conclusion, I am very pleased with my findings. They have opened my eyes to exploring more natural language processing techniques, as well as additional features related to the founders of each company. While researching what information I should use for my model, I noticed that there is a black hole around information on the founders of these companies. As someone who has spent time in the startup scene and has met many successful founders, I am beginning to hypothesize that information on the founders might be the biggest thing holding my model back from providing substantial value to investors. Thus, my next iteration will focus on sourcing information about the founders of the companies: personality, background, career, education, etc.