Kickin’ it with Kickstarter

Predicting the success of a Kickstarter campaign at its inception

Sagar Krishnaraj
10 min read · Dec 13, 2017

Team Members: Sonia Taneja, Kevin Liang, Matthew Tan, Sagar Krishnaraj

GitHub repository: https://github.com/soniakt23/KickstarterFundingPredictor

Introduction

Backing a project on Kickstarter can be a daunting task. When a new project is created, limited information is provided on the profile — for example, who created it, what it is about, and its funding goal. While many projects fail to meet their goal by the funding deadline, a significant number of projects are successful; these ideas either strike a chord with a niche audience, have over-the-top marketing campaigns, or are refreshingly innovative. For the average person browsing through the newest Kickstarter projects, it can be hard to tell which projects have potential and which ones are busts. Our project aims to clear up this confusion through the use of data science and machine learning techniques.

Goal

Our goal is to predict the success or failure of a Kickstarter project. For the purposes of our project, we define a successful Kickstarter project as one that meets or surpasses its funding goal by the funding deadline. Conversely, we define a failed Kickstarter project as one that falls short of its funding goal by the funding deadline.

Context

Kickstarter is regarded as one of the most popular and coveted crowdfunding websites. The projects crowdfunded on Kickstarter play heavily into public perception and expectation, which makes this an interesting avenue to explore through a data science project. To elaborate, the model we build must determine whether or not users are willing to provide sufficient donations to fund a project. From the perspective of an investor, or even a Kickstarter user, it is useful to have some predictive power about whether a certain product is going to be successful over the duration of its fundraising. If a project is likely to be unsuccessful, perhaps it is not worth investing money in; conversely, if a project is likely to succeed, it may be more appealing to make a large donation to help it reach its stretch goals. A note of caution: such predictions can be dangerous, since they may encourage or discourage users from backing projects against existing trends and thereby significantly influence outcomes. On the flip side, creators of new Kickstarter projects can leverage our tool to gauge how realistic their project's goals and constraints are. This gives project creators more certainty in answering the questions: is the project worth pursuing, and what resources must be allocated to reasonably ensure its success?

Data

Building the dataset

To accurately predict the success rate of a Kickstarter project, we searched for a large dataset to train our model on. We found a site (https://webrobots.io/kickstarter-datasets/) that hosts Kickstarter datasets collected every month. Each dataset is composed of all Kickstarter projects, up to a certain maximum per category. We utilized the data from October 2015, which comprised approximately 150,000 Kickstarter projects.

Figure 1: Screenshot of the website with the datasets

The dataset consists of 35 features pertaining to each Kickstarter project. All features that we used from the dataset were both publicly available and provided at the time of the project's creation. Many of the features — including creation date, funding deadline, and pledged amount — were numerical. However, other features — such as creator profile, location, and category — were non-numerical and hence required preprocessing.

Pre-Processing

Our first step was to filter out unrelated data points from our dataset. We combed through all of the headers to decipher their relationship to Kickstarter projects; this required significant research, since our team had limited previous experience with Kickstarter. Second, projects that were still live, suspended, or cancelled were removed from our data, because our goal was to predict the success of only completed projects.

In order to get a better understanding of the distribution of the data, we analyzed the features prior to processing. In this initial analysis, we discovered that the split between successes and failures was approximately 50/50, so our prediction label was evenly distributed and we treated the dataset as balanced. Additionally, when we plotted the distribution of goal amounts for all of the projects, we found that more than half of the goals for Kickstarter projects were less than $10,000. The distribution plot of the goal feature shown below represents 95% of the data; the remaining 5% of project goals lie above $40,000.

Figure 2: Distribution of successful and failed projects
Figure 3: Histogram of funding goals found in the dataset
Figure 4: Summary of the data for related features

From the summary above, we found that the average goal set by creators was approximately $3,000, while the average pledged amount was almost $9,000. Although the pledged amount is not known at launch (and therefore could not be used as a feature), it provided some interesting insight when analyzing our data.
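
As a rough illustration, the filtering and label check described above can be sketched in pandas as follows; the file name and column names (state, goal) are assumptions based on the standard Web Robots export:

```python
import pandas as pd

# File name and column names ("state", "goal") are assumptions based
# on the standard Web Robots Kickstarter export.
df = pd.read_csv("kickstarter_october_2015.csv")

# Keep only completed projects: drop live, suspended, and canceled ones.
df = df[df["state"].isin(["successful", "failed"])]

# Check that the success/failure label is roughly balanced.
print(df["state"].value_counts(normalize=True))

# Summary statistics for the funding goal, and the share under $10,000.
print(df["goal"].describe())
print((df["goal"] < 10_000).mean())
```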

Once we were satisfied with the data cleanup, we decided to engineer/extract some of our own features by using Natural Language Processing (NLP) and data scraping techniques.

NLP

There were three text features that provided information about each project: name, blurb, and slug. The name feature contained the name of the Kickstarter project chosen by the creator. The blurb feature was a short description of the project. Finally, the slug feature was a hyphen-delimited version of the name feature with a maximum length of 50 characters. As this information was already represented by name, we decided to remove slug.

Figure 5: Text features on which NLP was performed

We decided to take a bag-of-words approach for the name and blurb features. First, we cleaned the data by converting each of the text fields into a list of lowercase words. We removed stopwords — words that did not provide significant information about the project — and also lemmatized each list of words, converting each word to its dictionary form; this allows for more effective grouping of words. We then utilized scikit-learn's CountVectorizer to tokenize the text and count the occurrences of each word for both features. However, this produced approximately 8,010 unique words, far too many to be practical. To keep our project feasible, we decided to work with the top 100 words for each feature.

Figure 6: Illustration of the bag of words approach
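
A minimal sketch of this pipeline, assuming a DataFrame df with a blurb column and using NLTK for stopword removal and lemmatization, might look like this:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, and lemmatize.
    words = re.findall(r"[a-z]+", str(text).lower())
    return " ".join(lemmatizer.lemmatize(w) for w in words
                    if w not in stop_words)

# "df" and the "blurb" column are assumed from the preprocessing step.
blurbs = df["blurb"].apply(clean)

# Keep only the 100 most frequent tokens, as described above.
vectorizer = CountVectorizer(max_features=100)
blurb_counts = vectorizer.fit_transform(blurbs)
```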

One Hot Encoding

Since many of the available features were categorical, we performed one-hot encoding. In their original string formats, these features would have been unusable by our model, and features such as country and currency provided valuable information, so dropping them would have hurt our predictive accuracy. We considered both integer encoding and one-hot encoding; since our categories have no natural ordering, integer encoding could have misled the model, so we chose one-hot encoding.
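
A minimal sketch using pandas is below; the column names are assumptions, and the raw category field is JSON in the export, so it would first need to be flattened to a plain string:

```python
import pandas as pd

# One-hot encode the categorical columns; column names are assumptions,
# and "category" is assumed to already be flattened to a string.
categorical = ["country", "currency", "category"]
df = pd.get_dummies(df, columns=categorical, prefix=categorical)
```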

Twitter Scraping

In order to analyze public interest in each Kickstarter project, we scraped tweets about each project. Utilizing Selenium, a browser-automation tool that mimics human interaction with a web page, we were able to measure the public popularity and sentiment of each project. To do so, we built a scraper that issued a query with the project name as the keyword and searched for tweets posted between the launch date and the deadline of the project.

Figure 7: Selenium web scraper automating Twitter scraping using a query
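
A hypothetical sketch of such a scraper is shown below; the search URL format and the CSS selector are assumptions, since Twitter's markup has changed repeatedly over time:

```python
from urllib.parse import quote

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_tweets(project_name, launched, deadline):
    # Twitter's advanced-search operators restrict results to the
    # project's funding window; dates are "YYYY-MM-DD" strings.
    query = f"{project_name} since:{launched} until:{deadline}"
    url = "https://twitter.com/search?q=" + quote(query)
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # The selector below is a placeholder; the real markup differs.
        tweets = driver.find_elements(By.CSS_SELECTOR, "div.tweet")
        return [t.text for t in tweets]
    finally:
        driver.quit()
```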

Although this provided pertinent information about the projects and a potentially strong feature, we came to the conclusion that these tweets would not have been available at the start of a project, and are thus not a usable feature for predicting its success at inception.

Profile Scraping

One of the features contained the URL for the project creator's profile. It was stored in JSON format, so we parsed it and extracted the user profile URL. To scrape the user profiles we used Beautiful Soup, a library designed to parse the HTML of static websites. We used Beautiful Soup to crawl the user profile pages and gather data. We looked for elements under the body tag, which gave us the data contained in the body of the Kickstarter profile. We then looked for the class name count to find the number of projects backed and the number of projects created by each Kickstarter profile, which allowed us to extract both of those as features. We were also able to extract a comments feature: the number of comments each creator had left on other Kickstarter projects. We believed this would give us insight and act as a measure of a creator's activeness and involvement in the Kickstarter community.
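
A simplified sketch of this scrape, assuming requests for fetching pages and the count class mentioned above (the exact page structure is an assumption), might look like:

```python
import requests
from bs4 import BeautifulSoup

def scrape_profile(profile_url):
    # Beautiful Soup only parses HTML; requests fetches the page.
    html = requests.get(profile_url).text
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("body")
    # Elements with the class "count" hold the backed/created totals;
    # the exact markup here is an assumption.
    return [el.get_text(strip=True) for el in body.find_all(class_="count")]
```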

Models

We chose four different types of models to build our final model: Random Forest, AdaBoost, XGBoost, and SVM. We chose these classification models due to their predictive power in binary classification problems. AdaBoost and XGBoost are both boosting methods, which build one strong learner by training many weak learners (underfitting decision trees) in sequence. To avoid overfitting on our limited dataset, we initially implemented K-fold cross-validation with ten folds; however, we found that ten folds provided only small improvements in scoring relative to the significant increase in training time. Additionally, we chose to implement a stacking model that combined the predictive power of our top three models in order to build a more robust model. The stacked model used the predictions of our three most accurate models as features, overlaid by an XGBoost model.
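
A minimal sketch of the stacking setup, assuming a preprocessed feature matrix X and label vector y (and using five folds for brevity), might look like this:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

# "X" and "y" are assumed from the preprocessing steps above.
base_models = {
    "rf": RandomForestClassifier(n_estimators=200),
    "ada": AdaBoostClassifier(),
    "xgb": XGBClassifier(),
}

# Out-of-fold predicted probabilities from each base model become the
# features for the stacked XGBoost model on top.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

stacker = XGBClassifier()
stacker.fit(meta_features, y)
```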

Evaluation

In order to evaluate our trained models, we chose the ROC AUC scoring metric, because it considers the trade-off between true positives and false positives across all classification thresholds. The results are plotted below:

Figure 8: The ROC_AUC scores for the top three models and stacked model
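
For reference, scores of this kind can be produced with scikit-learn's cross-validation utilities; this sketch reuses the base_models dictionary and the X and y assumed in the modeling section:

```python
from sklearn.model_selection import cross_val_score

# Mean ROC AUC per model, averaged over cross-validation folds.
for name, model in base_models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.2f}")
```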

The results indicate that the XGBoost model had the highest predictive power, with an ROC AUC score of 0.91; the AdaBoost and Random Forest models scored 0.89 and 0.86, respectively. However, the stacked model achieved a significantly lower ROC AUC score, leading us to believe an error was made in its development, since a well-constructed stacked model should not fall so far below its base models.

Feature Importance

Figure 9: The Random Forest model had the launch date, deadline date, and number of comments made by the creator as its top three most important features
Figure 10: The XGBoost model had the launch date, goal amount, and number of comments made by the creator as its top three most important features
Figure 11: The AdaBoost model had the goal amount, launch date, and number of comments made by the creator as its top three most important features

Across the three models, the three most prominent features were the number of comments the creator had given to other projects on Kickstarter, the funding goal, and the project launch date. We hypothesize that the number of comments is related to the success of a project because a more active user may be more committed to the success of their own projects and may achieve more visibility in the Kickstarter community. Furthermore, setting a reasonable goal may give a creator a higher chance of success.
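
For tree ensembles like these, importances come directly from the fitted models; a small sketch, assuming a fitted model and a feature_names list matching the columns of X:

```python
import pandas as pd

# "model" is a fitted tree ensemble (e.g. the Random Forest) and
# "feature_names" is assumed to match the columns of X.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```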

Future Work

The current project predicts the success of a project by classification (each output is either success or failure as determined by the model). A more interesting related project would be predicting the actual dollar amount a project raises. The flexibility of this output provides significantly more information than a binary success label; however, it is harder to predict consistently. The models used would likely need to be updated, as classification and regression require different approaches. The data we use is useful but not complete for such an application; more expressive data from the project description page itself would be valuable.

We would also like to look into utilizing neural networks to analyze the trustworthiness of images provided on the project and creator pages, which would allow us to make decisions based on the software's perception of those images. An alternative and very interesting scraping approach would be to monitor a set of Kickstarter projects through their lifespans, sampling at set intervals. This sort of time-series data would allow us to implement an LSTM.

Conclusion

In conclusion, we were able to build a successful machine learning model that took in only the data available at the initial stage of a project (e.g., project descriptions, creator information, and category information) and predicted whether or not the project would be successful by its deadline, with an ROC AUC score of 0.91. As a team, we defined a focused problem, scraped relevant features from Kickstarter and Twitter, employed several data preprocessing techniques, implemented natural language processing, and trained ensemble models to achieve this score. We believe this model can have a significant impact in the entrepreneurship community by providing creators, investors, and customers with a metric for evaluating the likely success of a Kickstarter project — thus saving valuable time and money for creators, backers, and everyday users.
