Natural Language Processing To Predict Success of Kickstarter Campaigns (92% Precision)
TL;DR This project aims to predict, with high precision, which Kickstarter campaigns are likely to succeed. A precision of 92% was achieved when only campaigns with a predicted probability of success above 90% were considered.
Brief Overview of Kickstarter
Kickstarter is a global crowdfunding site that gives entrepreneurs a platform to raise funds for their projects from the public. In return, investors might be given an interest in the entrepreneur’s business, free products or other financial rewards. Hence, Kickstarter serves both budding entrepreneurs and investors. The catch is that Kickstarter charges a 5% platform fee for every project, and there is a strict no-refund policy for failed projects.
The possibility of losing one’s capital through a failed investment certainly presents high risks for investors. For example, Yogventures, a game from a popular YouTuber, raised $567,000 before it was cancelled!
Additionally, there are a few other traits of such Kickstarter products which present high risks. Firstly, products are often new and novel and have not been tested in a mature market. This makes assessing whether the product will succeed that much harder.
Secondly, these entrepreneurs may be inexperienced and lack the skills necessary to create and launch the products. Also, these start-ups are often small, and may lack financial backing and manpower.
Lastly, there may be little information or track record for such products. Hence, investors here seem to take on the job of a Venture Capitalist.
Plan of Attack
For this project, I assume the role of a potential conservative investor of Kickstarter products, only investing in campaigns that have a high likelihood of success. I am also using the dataset on historical successes and failures of Kickstarter campaigns from 2017, which you can access here.
This dataset contains 215,000 campaign descriptions, each labelled with whether the campaign succeeded or not. The classes are balanced: about 50% of the campaigns succeeded and 50% failed.
As an investor, I would only care about investing in the projects which have the highest probability of success. Moreover, given the large number of projects on Kickstarter, there would be more than enough campaigns with a high probability of success for an investor to choose from!
Hence, I argue that the more important evaluation metric to look at would be the precision of the model. That is, out of the projects that the model predicted would be successes, how many turned out to be actual successes.
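To make the metric concrete, here is a minimal sketch of precision computed from true and predicted labels. The function name `precision` and the toy labels are illustrative and not taken from the project.

```python
def precision(y_true, y_pred):
    """Share of predicted successes (1) that were actual successes."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    predicted_pos = true_pos + false_pos
    # With no predicted successes precision is undefined; 0.0 is returned here by convention
    return true_pos / predicted_pos if predicted_pos else 0.0


# Toy example: 4 campaigns predicted to succeed, 3 of them actually did
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]
print(precision(y_true, y_pred))  # 0.75
```

Note that campaigns predicted to fail do not enter the calculation at all, which is why a conservative investor can afford to ignore them.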
Part 1: Data Preprocessing
First, we read the dataset into Python and remove any rows with missing values. Unlike with many other datasets, we can’t impute missing values here, since each product is distinct and there is only one feature (the description) used for prediction.
Second, we assign the label “1” if a campaign was successful and “0” if it was not. Before doing that, we can list all the unique values in the column “state” to check that only the words “successful” and “failed” were used as labels.
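The two steps above could look roughly like the sketch below. It assumes the data is held in a pandas DataFrame with the columns `blurb` and `state`; the four-row frame and the `label` column name are made up for illustration, standing in for the real CSV.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Kickstarter dataset
dataset = pd.DataFrame({
    "blurb": ["A board game about cats", None,
              "Smart mug that keeps coffee hot", "An indie film"],
    "state": ["successful", "failed", "failed", "successful"],
})

# Drop rows with a missing description or outcome (no imputation)
dataset = dataset.dropna(subset=["blurb", "state"]).reset_index(drop=True)

# Check which outcome labels are present before encoding them
print(dataset["state"].unique())

# 1 for successful campaigns, 0 otherwise
dataset["label"] = (dataset["state"] == "successful").astype(int)
y = dataset["label"].to_numpy()
```

Resetting the index after `dropna` keeps row positions contiguous, which matters if rows are later looked up by position.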
Next, we can preprocess the text descriptions. We start by removing all non-alphabetical characters and then convert the words to lowercase. Here, we assume that non-alphabetical tokens, such as numbers and punctuation, play a minimal role in prediction. Converting the words to lowercase also helps to standardise the text.
The words are then stemmed to their root forms, and Stop Words are removed. Stop Words are words that are essential to sentence structure but don’t add much meaning to the sentence, such as “the”, “its” and “to”.
One key aim of such text preprocessing is to reduce non-essential words. For example, by stemming the word “liking” to the root word “like”, we reduce the number of unique words in our corpus of descriptions. The same logic explains why we remove Stop Words and non-alphabetical characters.
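As a rough sketch of this cleaning step: the helper `clean_description`, the tiny `STOP_WORDS` set and the pluggable `stem` argument are all assumptions made for illustration. The stemmer defaults to doing nothing so that the example runs with the standard library only.

```python
import re

# Small illustrative stop word list; a real run would use a fuller one
STOP_WORDS = {"the", "its", "to", "a", "an", "is", "on", "and", "of"}


def clean_description(text, stem=lambda word: word):
    """Keep letters only, lowercase, drop stop words, then stem each word."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(stem(w) for w in words if w not in STOP_WORDS)


print(clean_description("The Dog is on the table, liking its 2 toys!"))
# dog table liking toys
```

A real stemmer, for example the `stem` method of NLTK's `PorterStemmer`, can be passed as the `stem` argument to collapse words such as “liking” toward “like”. Applying the function to every description would produce the `corpus` list used in the next part.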
Part 2: Creating the Bag of Words Model
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=2500) #Keep the 2,500 most frequently used words
X=cv.fit_transform(corpus).toarray() #Build the word-count matrix (rows: campaigns, columns: words) as a dense array
Next, a Bag of Words model is created. The descriptions are transformed into a large matrix made up mostly of zeros, with rows representing each observation (campaign) and columns representing each unique word present in our entire corpus of descriptions. If a word is present in a particular description, a non-zero value is recorded in that column. By default, CountVectorizer stores the number of times the word occurs, so a word that appears once gets a “1”.
For example, let’s take “the dog is on the table” as the description of our first observation. The numbers in the image represent the first row of our Bag of Words model. In this simplified picture, a value of 1 is returned if a particular word is present in our description and 0 otherwise.
However, the model we created would have many more columns than this, since we have hundreds of thousands of unique words across all our descriptions. Hence, to simplify the model, our Bag of Words model only considers the 2,500 most used words. Given over 200,000 different campaign descriptions, 2,500 words may seem too few. However, in my experiments, using more than 2,500 words made little difference.
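The toy description can be run through CountVectorizer directly. This is only a sketch: the second sentence in `corpus` is invented, and `binary=True` is an assumption used to match the 0/1 picture described here, whereas the call in this post uses the default setting, which stores word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the dog is on the table", "the cat is on the mat"]

# binary=True records presence (1) or absence (0) instead of raw counts
cv = CountVectorizer(binary=True)
X = cv.fit_transform(corpus).toarray()

# Column order follows the alphabetically sorted vocabulary
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(vocab)  # ['cat', 'dog', 'is', 'mat', 'on', 'table', 'the']
print(X[0])   # [0 1 1 0 1 1 1]
```

Without `binary=True`, the last entry of the first row would be 2, because “the” occurs twice in that description.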
At this stage, we can also use a word cloud to visualise the most used words in our dataset.
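# Word Cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
for i in range(len(dataset.blurb)):
    if i==44343 or i==62766 or i==97999:
        text = dataset.blurb.iloc[i]  # assumed: description at row i (this line was missing)
        wordcloud = WordCloud(max_font_size=50, max_words=40,background_color="white").generate(text.lower())
        plt.imshow(wordcloud, interpolation="bilinear")  # assumed display step
        plt.axis("off")
        plt.show()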
Part 3: Fitting our Classifier
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Finally, we split our data into a training and test set, with a test size of 20% and a training size of 80%. A Logistic Regression model is then fitted.
Part 4: Model Results
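from sklearn.metrics import confusion_matrix
# Assumed reconstruction: label 1 only if the predicted probability of success exceeds 90%
pred = (classifier.predict_proba(X_test)[:, 1] > 0.9).astype(int)
cm = confusion_matrix(y_test, pred)
precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])  # true positives / all predicted positives
print("Precision is " + str(precision))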
Lastly, we use our model to make predictions on the test set. As mentioned earlier, I chose to look only at projects with a predicted probability of success above 90%, returning a value of 1 if this condition is met and 0 otherwise.
Here, a Precision of 92% was achieved, meaning that 92% of the projects predicted to have more than a 90% probability of success turned out to be successful. As an investor, I would be pretty satisfied with such results!
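For readers who want to try the thresholding idea end to end, here is a self-contained sketch. It uses synthetic data from `make_classification` as a stand-in for the Bag of Words matrix, so its numbers will not match the 92% reported here. The variable `prob_success` and the `max_iter` setting are my own choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real X would be the Bag of Words matrix
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

classifier = LogisticRegression(random_state=0, max_iter=1000)
classifier.fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated probability of class 1 (success)
prob_success = classifier.predict_proba(X_test)[:, 1]
pred = (prob_success > 0.90).astype(int)

# zero_division=0 avoids a warning if no observation clears the threshold
precision = precision_score(y_test, pred, zero_division=0)
print("Campaigns above threshold:", int(pred.sum()))
print("Precision is " + str(round(precision, 3)))
```

Raising the threshold generally trades away the number of flagged campaigns in exchange for higher precision, which suits an investor who only needs a few good picks.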
Moreover, out of more than 40,000 test observations, close to 1,500 were predicted to have more than a 90% probability of success. Extrapolating that rate to the roughly 200,000 projects in the original dataset, about 7,500 would be flagged as having a really high chance of success. I’m sure that out of so many projects to choose from, you would be able to find a few that you would be interested in and willing to fund :)
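The back-of-the-envelope extrapolation can be written out as follows. The figures are the rounded ones quoted in this post, and the calculation assumes the rest of the dataset behaves like the test set, which is only an approximation since most of it was used for training.

```python
test_size = 40_000        # approximate number of test observations
flagged_in_test = 1_500   # predicted to have > 90% probability of success
total_projects = 200_000  # approximate size of the original dataset
precision = 0.92          # share of flagged campaigns that actually succeeded

# Scale the flagged count from the test set up to the full dataset
flagged_total = flagged_in_test * total_projects / test_size
# Of those, the share expected to be genuine successes
expected_successes = flagged_total * precision

print(flagged_total)              # 7500.0
print(round(expected_successes))  # 6900
```

So roughly 7,500 campaigns would clear the 90% threshold, of which around 6,900 would be expected to succeed at 92% precision.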
Part 5: Other Methods Tried
Despite the simplicity of the final model, I did try more sophisticated methods. For example, I tried running a neural network on the Bag of Words model. I also tried converting the words to vectors using pre-trained GloVe embeddings (a Word2vec-style representation) and then running a neural network on those. I even attempted to cluster the observations with K-Means based on their descriptions to achieve better results. Dimensionality reduction techniques did not help much either.
Despite their greater sophistication and purported effectiveness with text, these models could not beat the accuracy of a simple Logistic Regression. In fact, they took much longer to run.
Part 6: Further Research / Improvements
I’m sure this model can be improved further. Bi-directional RNNs might be much better at such text classification problems, and I plan to explore this alternative soon.
By chance, I also discovered that using only the first 20,000 observations from the dataset produced a much better accuracy. This seems to suggest that there are specific clusters in the dataset, which is intuitive, considering that the products have likely been grouped by product type. Sadly, this categorisation isn’t available in the data.
In this post, I went through a simple NLP prediction through Kickstarter descriptions. I have tried to keep this post succinct, so I apologise if the tone sounds a bit cold, or if more elaboration would be desired.
In any case, I would love to hear feedback, so please leave your comments :)