Predicting the success of the Kickstarter project


Abstract

Having reviewed many articles on predicting the success of Kickstarter projects, we found that most of them use a balanced dataset. In reality, however, only about 30% of projects succeed, which makes the real dataset strongly imbalanced. In this article, we present our work on such a dataset.

Dataset

The homepage of Kickstarter displays only 200 projects per category. In order to collect a large amount of data, we needed to use other sources where project URLs are stored. In the end, we collected 19798 projects from Kicktraq and 16384 projects from Webrobots, for a total of 36182 projects.

Note: our analysis is based on the collected data; it may not be representative of all projects on Kickstarter.

Distribution of Kickstarter projects

Because the status of a project is only determined after its campaign finishes, projects that are still LIVE cannot be used for investigation. Cancelled and suspended statuses are decided personally by the creator, and since we want to classify only between successful and failed projects, cancelled and suspended projects were also removed from our dataset. This leaves 23320 failed projects and 7496 successful projects, for a total of 30816 projects.
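In pandas, this filtering step can be sketched as follows (the column name `State` and the status strings are assumptions based on the feature list later in this article):

```python
import pandas as pd

# Toy frame with hypothetical values standing in for the scraped data.
df = pd.DataFrame({
    "State": ["successful", "failed", "live", "canceled", "suspended", "failed"],
    "Goal_amount_USD": [500, 10000, 2000, 750, 300, 99],
})

# Keep only finished campaigns whose outcome is known.
df = df[df["State"].isin(["successful", "failed"])].reset_index(drop=True)
print(df["State"].tolist())  # → ['successful', 'failed', 'failed']
```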

Ratio of successful projects per category
Ratio of successful projects per child category of the top three categories with the highest success ratio (in order: Film & Video, Art, Design)

Our dataset contains the following types of features:

Project features:

Year: year of launching

Goal_amount_USD: goal amount set by creator (amount of money the creator would like to get funded)

Duration: time from the start date to the end date of the campaign

ContentImageCount: number of images used in the content of the project

ContentVideoCount: number of videos used in the content of the project

PackageCount: Number of offered packages in the project

DescriptionWordCount: Count of words in the project’s description

ContentWordCount: Count of words in the project’s content

RiskWordCount: Count of words in the project’s risk part

MinPackageAmount: Minimum USD amount of offered packages

MaxPackageAmount: Maximum USD amount of offered packages.

MeanPackageAmount: Mean USD amount of offered packages

Pledged_amount_USD: pledged amount the project received after the campaign finished.

Category: Category of the project. e.g. Art, Food, Technology

ChildCategory: e.g. a project belonging to category Art may belong to one of following categories: Illustration, Public Art, Painting. etc.

Goal_currency: the currency in which the goal is set.

Creator features:

BackedProjCount: number of projects the creator has backed in the past

CreatedProjCount: number of projects the creator has created in the past

NLP features:

Besides project and creator features, we also extract NLP features from the textual parts of a project. To do so, we concatenated the name, description, content, and risk sections of each project; the text was then preprocessed to remove numbers and special characters and to keep only word stems. We extracted 300 word-embedding features using Google's pre-trained model (https://github.com/mmihaltz/word2vec-GoogleNews-vectors).
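A minimal sketch of the cleaning step, assuming a helper named `clean_text` (the real pipeline additionally applies stemming, e.g. with NLTK's PorterStemmer, and then averages the 300-dimensional word2vec vectors of the remaining words per project):

```python
import re

def clean_text(*parts):
    """Concatenate text parts, lowercase, and strip numbers/special characters."""
    text = " ".join(p for p in parts if p).lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits and special characters
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical name, description, content (missing), and risk section.
cleaned = clean_text("My Project!", "A board game for 2-4 players.", None,
                     "Risks: shipping delays")
print(cleaned)  # → "my project a board game for players risks shipping delays"
```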

Finally, we separated our dataset into three different subsets: one containing only project and creator features, one containing only NLP features, and one containing the combination of all the features above.

Feature Engineering

Outlier removal:

Among Kickstarter projects, there are a few in which both the goal amount and the pledged amount are extremely high. Such projects are not representative of the typical case, but we kept them and did not treat them as outliers: these cases do occur in reality, and we want our model to learn them as well.

Impute missing values:

Missing values in dataset

As can be seen from the figure above, missing values are found in the following features: ChildCategory, BackedProjCount, and CreatedProjCount. Missing values in ChildCategory are imputed using the mode of ChildCategory within the project's Category, e.g. if the project's category is Journalism, Web becomes its ChildCategory. Missing values in BackedProjCount and CreatedProjCount occur because the creator has not created or backed any projects in the past, so they are filled with 0.
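The two imputation rules can be sketched with pandas as follows (the values are hypothetical; the column names match the dataset's features):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Journalism", "Journalism", "Journalism", "Art"],
    "ChildCategory": ["Web", "Web", None, "Painting"],
    "BackedProjCount": [3, None, 1, None],
    "CreatedProjCount": [None, 0, 2, 1],
})

# ChildCategory: fill with the most frequent child category of the same Category.
df["ChildCategory"] = df.groupby("Category")["ChildCategory"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

# Creator counters: a missing value means "no activity yet", so fill with 0.
cols = ["BackedProjCount", "CreatedProjCount"]
df[cols] = df[cols].fillna(0)
print(df.loc[2, "ChildCategory"])  # → Web
```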

One-hot encoding categorical features

Goal_currency, Category, and ChildCategory are transformed into numerical variables using the one-hot encoding technique.
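With pandas this is a one-liner via `get_dummies`, sketched here on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Art", "Food", "Art"],
    "Goal_currency": ["USD", "EUR", "USD"],
})

# One-hot encode the categorical columns; each distinct value becomes a 0/1 column.
encoded = pd.get_dummies(df, columns=["Category", "Goal_currency"])
print(sorted(encoded.columns))
# → ['Category_Art', 'Category_Food', 'Goal_currency_EUR', 'Goal_currency_USD']
```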

Exploratory data analysis

Number of projects per year:

Number of projects per year

The figure above shows the number of projects per year. In our dataset, 2018 has no successful projects at all, while in other years the number of successful projects is relatively high, especially during the period 2010–2013.

It is always useful to look at the description of the dataset.

It is interesting to note that the maximum BackedProjCount among 75% of the projects is 2, but across all projects it is 907. There is a big difference here. The same happens with the CreatedProjCount feature.
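Statistics like these come straight from pandas' `describe`, sketched here on a hypothetical sample:

```python
import pandas as pd

# Toy stand-in; in the real dataset these columns hold the collected values.
df = pd.DataFrame({
    "BackedProjCount": [0, 1, 2, 2, 907],
    "Duration": [30, 34, 34, 40, 32],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary.loc["75%", "BackedProjCount"])  # 75th percentile
print(summary.loc["max", "BackedProjCount"])  # → 907.0
```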

Some takeaway notes from the analysis of the dataset description:

Mean duration is 34 days.
Mean ContentVideoCount is 0.1 and 75% of projects have no video, reflecting the fact that most projects do not show a video in their content.
There is a project with 104 packages, while the maximum value among 75% of projects is 9.
The minimum goal is $0.62 (we do not know why the creator set this goal), while the maximum goal amount is $100 million, which is extremely high. After checking the State of these two projects, we found that both of them failed. Really interesting.
Still, there are projects which did not receive any pledged amount. The maximum PercentFunded is 3320; which project is this? It belongs to the Film & Video category, has no image or video in its content, a goal amount of only $1, and a single package, yet it still succeeded with a pledged amount of $3320.0. Its creator has backed 8 projects and created 1 project in the past.
There is a project with Pledged_amount_USD = $10266845.0. Its creator has backed 109 projects of other users and created 3 projects. The project is in the Design category with a goal amount of $100000. It received the highest pledged amount, which may be explained by the creator's past support of other users.
Comparison between successful projects and failed projects

As shown in the image above, there are big differences between the two project States in terms of the means of BackedProjCount, Goal, RiskWordCount, ContentWordCount, and PackageCount. Successful projects tend to have a higher BackedProjCount and PackageCount compared to failed projects. In contrast, failed projects have a higher GoalAmount and RiskWordCount. The duration of failed projects is also longer than that of successful projects, but the difference is not significant.

Let's check the correlation between features.

Year is the feature with the highest correlation with the State of the project (-0.69), followed by PackageCount (0.31), BackedProjCount (0.23), and ContentWordCount (0.15). The other features do not correlate strongly with State.

Apart from the obvious correlation of MaxPackageAmount and MinPackageAmount with MeanPackageAmount, ContentImageCount and ContentWordCount form the pair with the highest correlation among the features.
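Correlations like these are easy to reproduce with `DataFrame.corr`; a sketch on hypothetical values, with `State` encoded as 1 for successful and 0 for failed:

```python
import pandas as pd

df = pd.DataFrame({
    "State": [1, 1, 0, 0, 0],
    "PackageCount": [9, 7, 3, 2, 1],
    "GoalAmountUSD": [500, 1000, 20000, 50000, 8000],
})

# Pearson correlation of every numeric feature with the target.
corr_with_state = df.corr(numeric_only=True)["State"].drop("State")
print(corr_with_state.sort_values(ascending=False))
```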

Scatter plot between ContentImageCount and ContentWordCount

The scatter plot above shows that the more images a project has, the shorter its content tends to be.

Playing around with text features:

Let's check the top 50 most used words in successful projects:

Top 50 used words in successful projects

How about the top used words in failed projects?

Top 50 used words in failed projects
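Word frequencies like these can be computed with a plain `collections.Counter`; a sketch assuming the cleaned text of each project is already available:

```python
from collections import Counter

# Hypothetical cleaned texts of a few successful projects.
texts = [
    "new board game for the whole family",
    "a new album from the band",
    "new art book with original paintings",
]

# Count every word across all projects in the group.
counter = Counter(word for text in texts for word in text.split())
print(counter.most_common(3))  # the real analysis uses most_common(50)
```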

Looking at those words, we cannot draw any clear conclusion from the top used words. It seems that the NLP features may not play an important role in detecting the success of a project; we will examine this in the next section.

Building predictive model

Training-testing ratio

The dataset is separated into a training set and a testing set using an 80–20 ratio. For model selection, we used LightGBM, a fast, high-performance gradient boosting framework.

Metric for evaluation

Three subsets are evaluated for comparison. Since our dataset is imbalanced, we cannot use accuracy as a metric: by always predicting the majority class, we would already be correct about 75% of the time. In this experiment, Average Precision (AP) and the Matthews Correlation Coefficient (MCC) are used as the primary evaluation metrics.

The value of MCC ranges from -1 to 1 and indicates how correct the model is on a binary classification task: an MCC of -1 means the model is totally wrong in predicting the success of Kickstarter projects, while an MCC of 1 means it is completely correct.
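Both metrics are available in scikit-learn; a sketch on hypothetical labels (MCC takes hard predictions, AP takes probability scores):

```python
from sklearn.metrics import average_precision_score, matthews_corrcoef

y_true  = [1, 1, 0, 0, 0, 1]               # hypothetical ground-truth labels
y_pred  = [1, 0, 0, 0, 0, 1]               # hard predictions, for MCC
y_score = [0.9, 0.4, 0.2, 0.1, 0.3, 0.8]   # predicted probabilities, for AP

print(matthews_corrcoef(y_true, y_pred))       # → 0.7071...
print(average_precision_score(y_true, y_score))  # → 1.0
```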

Result

Hyperopt is used as the framework for fine-tuning; each subset is optimized using a different combination of hyperparameters.

The final result is described in the figure below.

Comparison of result between different subset of features

Using NLP features shows poor results compared to the normal features. One reason is that each project uses its own words; there is no common vocabulary shared among the texts of successful projects, nor among those of failed projects.

The Precision and Recall of the normal features (the combination of project and creator features) and of the NLP + normal features are equal; however, the MCC and AP of the normal features are slightly better, and MCC and AP are better metrics for evaluating classification results than Precision and Recall (https://clevertap.com/blog/the-best-metric-to-measure-accuracy-of-classification-models/). Besides, the NLP features require a long preprocessing time. We therefore settled on using the normal features for our predictive model.

Feature Importance

SHAP values are a powerful tool for interpreting tree-based models. In this blog, the author evaluated five feature attribution methods and found that Tree SHAP is one of the three methods whose individual and global contributions are consistent. For this reason, we use SHAP to interpret our model.

The order of feature importance

The following conclusions can be drawn from the figure above (we analyzed the top 10 features):

  • Year is the most important factor in predicting the success of a Kickstarter project; successful projects mostly fall into earlier years, while recent years see more failed projects.
  • If the creator has backed a high number of projects in the past, they have a better chance of success with a new project.
  • RiskWordCount and GoalAmount are two factors that negatively affect the success of a Kickstarter project: the higher these values are, the smaller the chance the project succeeds.
  • A project is more likely to succeed if it has more images and its content is explained in detail.
  • A project has a better chance of success if it belongs to the Food category, whereas a Comedy project is more likely to fail.
  • Besides the features above, duration also matters: projects with a short campaign have a higher chance of success.

Conclusion

In this article, we presented our work on a Kickstarter dataset collected by ourselves. The dataset is strongly imbalanced, so we used the AP and MCC metrics to evaluate our model instead of merely using accuracy.

Our study evaluated different subsets of features and found that the project and creator features have more predictive power than the NLP features in predicting the success of a Kickstarter project.

Based on the conclusions from the Feature Importance section, we suggest that creators should support other projects before creating their own. It is also helpful to set a reasonable goal amount: our research indicated that many projects failed because of an unrealistically high goal. RiskWordCount is another factor important to a project's success, but further investigation is needed to examine this. Adding more figures and describing the project in more detail is also helpful. Creators should also consider the duration of their campaign, since a short duration comes with a higher chance of success.

Although Year is the most important factor, creators cannot adjust this feature; hence, if they are going to create a new project, it is suggested that they start as soon as they can.

Additionally, after investigating many features, we found that the success of a project depends largely on the interaction between the creator and the users on Kickstarter. During the research, we used features such as update count, comment count, and IsProjectWeLove (assigned by the Kickstarter team when a project stands out). These features improved the performance of the predictive model significantly. However, since our goal is to predict the success of a Kickstarter project right from the time it launches, these features were removed from the input features.

In addition, a feature indicating whether the creator has connected a social network profile to their Kickstarter profile would also be valuable: with this information, we could track the number of followers or friends the creator has, as well as their last login time. These are good indicators for evaluating the interaction between creators and backers.