KickAssist
An Interactive Dashboard to help maximize the probability of successfully funding your Kickstarter campaign
Contents
- Introduction/Context
- Objective
- Domain Knowledge
- Data Source
- Framing into ML Problem
- Methodology and results
- Limitations
All of the code for the project can be accessed in my GitHub Repository
The dashboard can be accessed here
Introduction/Context
Kickstarter is a crowdfunding platform for ideas and projects across diverse categories, from films, games, and music to art, design, and technology.
Individuals and entrepreneurs with ideas or products start a campaign by creating a project on the platform and explaining what the idea is about.
- The creator must specify a deadline and a goal (amount) when creating a project; these are not flexible.
- Creators can offer rewards to backers (the people who donate to the project) based on the amount donated. Rewards can be anything depending on the project, but mostly it is the product being funded itself (early access if it is a course, a limited edition if it is a tech product, etc.).
- If the project does not reach the goal amount by the deadline, the money collected up to that point is returned to the backers and the project gets nothing (all-or-nothing funding model).
- So the idea behind Kickstarter is that people bring projects to life, which is what makes it interesting.
Objective
As discussed above, there are a few decisions the creator needs to make when starting the campaign, which include the following.
- Deciding when to start the campaign(Launch Date).
- When to set the deadline.
- Goal Amount.
- Deciding the right amount for the rewards.
Since it is an all-or-nothing funding model, the decisions the creator makes on the above variables play a key role in the success of the project.
Therefore, the objective of the project is to assist the campaign creator in choosing values for the above four variables that maximize the probability of a successful campaign.
Domain Knowledge
What goes into a successful campaign can be summarized as follows (based on tips and tricks shared by successful campaigners in the past):
- Having a well-polished landing page with videos and images.
- Interaction with donors (through the comment section and other channels).
- Marketing and networking as a whole.
- Well-planned rewards.
- Feasible duration for the given goal.
- Delivering rewards without delay.
Data Source
Web Robots, a web crawling platform, offers a few free data projects. One of them is a Kickstarter dataset, which is scraped from the platform every month.
It provides a set of CSV files in which data is stored as dictionaries (JSON). The following features can be extracted from the source:
- Status (successful or failed)
- Category and subcategory of the project
- Launch and deadline dates
- Author and creator information
- Country of origin and currency to be used for a transaction
- and a few other columns with metadata
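The extraction step can be sketched as follows. This is a minimal illustration, not the project's actual loader: it assumes that some CSV cells hold JSON strings, and the column names (`state`, `goal`, `launched_at`, `deadline`, `category`) and the two sample rows are hypothetical.

```python
import csv
import io
import json

# Hypothetical two-row sample; the "category" cell holds a JSON dictionary.
SAMPLE_CSV = '''state,goal,launched_at,deadline,category
successful,5000,1577836800,1580428800,"{""name"": ""Tabletop Games"", ""parent_name"": ""Games""}"
failed,20000,1580515200,1585699200,"{""name"": ""Gadgets"", ""parent_name"": ""Technology""}"
'''

SECONDS_PER_DAY = 86400


def load_campaigns(handle):
    """Flatten each CSV row, unpacking the JSON cell into plain columns."""
    rows = []
    for raw in csv.DictReader(handle):
        category = json.loads(raw["category"])
        launched = int(raw["launched_at"])   # assumed to be Unix timestamps
        deadline = int(raw["deadline"])
        rows.append({
            "state": raw["state"],
            "goal": float(raw["goal"]),
            "category": category.get("parent_name"),
            "subcategory": category.get("name"),
            "duration_days": (deadline - launched) / SECONDS_PER_DAY,
        })
    return rows


campaigns = load_campaigns(io.StringIO(SAMPLE_CSV))
for campaign in campaigns:
    print(campaign)
```

Reading from a real file only requires passing an open file handle instead of the `io.StringIO` wrapper.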
Framing into ML Problem
- Based on the objective and the data source, there are two kinds of features (variables):
- Variables that describe the product itself and cannot be changed to increase the chances of success, such as category and subcategory. They are settled once we decide what the campaign is going to be. Let's call them fixed variables.
- Variables that are decided strategically and can (and should) be altered if doing so increases the probability of being successfully funded, such as rewards, goal, deadline, and launch date. Let's call them flexible variables.
- As the target feature (success or failure) is categorical, this is a classification problem.
- Our objective is not to predict a campaign's success from its features but to suggest the feature values that give the best chances of success. This variation in the problem statement requires the model to be interpretable.
- The interpretability requirement affects not only the choice of ML model but the whole machine learning pipeline, from preprocessing to model deployment. It mainly constrains feature engineering: we cannot apply transformations such as dimensionality reduction to our variables. Some accuracy/performance is therefore the price to pay for interpretability.
- The model is served to the end user as an interactive dashboard, where creators can adjust both their fixed and flexible variables and see not only the probability of success but also how each flexible variable has affected that probability.
- This way creators can move those variables to the closest plausible values that increase the chances of a successful campaign.
- A model with 80% accuracy (given a balanced dataset) seems acceptable, as we will be using the model for interpretation, not for the predictions themselves.
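The fixed/flexible split can be illustrated with a small sketch: hold the fixed variables constant, vary one flexible variable over plausible candidates, and compare the predicted probabilities. The scoring function below is an invented stand-in for a trained classifier, and the function names, feature names, and coefficients are all hypothetical.

```python
import math


def toy_success_probability(campaign):
    """Stand-in for a trained classifier; the coefficients are invented."""
    score = 2.0
    score -= 0.3 * math.log10(campaign["goal"])
    score -= 0.02 * abs(campaign["duration_days"] - 30)
    if campaign["category"] == "Games":
        score += 0.4
    return 1.0 / (1.0 + math.exp(-score))


def sweep(predict, fixed, flexible, name, candidates):
    """Vary one flexible variable while everything else stays unchanged."""
    results = []
    for value in candidates:
        campaign = {**fixed, **flexible, name: value}
        results.append((value, predict(campaign)))
    return results


fixed = {"category": "Games"}                     # cannot be tuned
flexible = {"goal": 10000, "duration_days": 45}   # creator's current plan

curve = sweep(toy_success_probability, fixed, flexible,
              "duration_days", [15, 30, 45, 60])
best_duration, best_probability = max(curve, key=lambda pair: pair[1])
print(curve)
print(best_duration, round(best_probability, 3))
```

In the dashboard the creator performs this exploration interactively rather than through an automated search.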
Methodology and results
All the code and a clear procedure are provided here
- Extracted the data from the source mentioned above.
- Explored the variables to get a feel for the data.
- Performed basic cleaning: removed null values (beyond a threshold) and duplicate values.
- Performed basic preprocessing (label encoding, one-hot encoding) and modelling.
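As a rough sketch of the cleaning step, the snippet below drops columns whose share of missing values exceeds a threshold and then removes duplicate rows. Applying the threshold per column, the 0.5 default, and the sample rows are assumptions made for illustration; the source does not state how the threshold was applied.

```python
def clean(rows, max_null_fraction=0.5):
    """Drop mostly-empty columns, then drop exact duplicate rows."""
    if not rows:
        return []
    columns = list(rows[0])
    total = len(rows)
    # Keep a column only if its fraction of None values is within the threshold.
    keep = [
        column for column in columns
        if sum(row.get(column) is None for row in rows) / total <= max_null_fraction
    ]
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row.get(column) for column in keep)
        if key not in seen:          # first occurrence wins
            seen.add(key)
            cleaned.append({column: row.get(column) for column in keep})
    return cleaned


raw_rows = [
    {"goal": 5000, "country": "US", "notes": None},
    {"goal": 5000, "country": "US", "notes": None},   # exact duplicate
    {"goal": 800, "country": None, "notes": None},
]
cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```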
Feature Engineering: The features present in the data source are not sufficient to fulfil our objective. As identified in the Domain Knowledge section, rewards are an important factor, yet no feature in the dataset gives any measure of rewards. Marketing and networking also increase the chances of success, but there is no direct way of extracting that information for each campaign. Therefore, scraping is performed on the Kickstarter website, and the following features are scraped for each campaign:
- Rewards
- Number of Campaigns the Creator already had
- Number of Campaigns the creator has already funded
- When would the rewards be delivered
EDA along with statistical tests (t-test, ANOVA) is performed to find interesting insights and validate various assumptions about the data, as follows:
- One important finding is that data older than 3 to 4 years from the current date can be considered stale, as it does not show similar patterns. This not only reduces the training set size by a large factor, but also means that when the model is updated each year it need not be trained on the aggregated data of all previous years.
- An interesting observation is the use of the word “Help” in the description of most failed campaigns.
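To make the statistical testing concrete, here is a minimal sketch of a two-sample t statistic comparing a numeric feature between successful and failed campaigns. Welch's unequal-variance variant is an assumption of this sketch (the source only says "t-test"), and the duration samples are made up.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    var_a = variance(sample_a) / len(sample_a)   # squared standard error of mean A
    var_b = variance(sample_b) / len(sample_b)   # squared standard error of mean B
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Made-up campaign durations in days.
successful_durations = [28, 30, 31, 29, 32]
failed_durations = [45, 60, 40, 55, 50]

t_stat, dof = welch_t(successful_durations, failed_durations)
print(t_stat, dof)
```

A large absolute t value relative to the t distribution with `dof` degrees of freedom suggests the two groups differ in their mean; a statistics library would normally be used to turn this into a p-value.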
As discussed previously, due to the interpretability constraints, transformations such as dimensionality reduction (PCA) and polynomial interactions cannot be performed. That said, a clear distinction has been made between “fixed” and “flexible” features, and fixed variables are not required to be interpretable according to our business objective. Feature encoding is therefore used on the categorical variables of fixed type.
The categorical variables have a natural order, which is the ideal case for Helmert encoding.
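For readers unfamiliar with Helmert encoding, the sketch below builds one common form of the contrast matrix, in which each level after the first is compared with the levels that precede it. Conventions differ between libraries, so treat this layout as an assumption; the level names are hypothetical and are not taken from the project.

```python
def helmert_matrix(n_levels):
    """Contrast matrix with n_levels rows and n_levels - 1 columns."""
    rows = []
    for level in range(n_levels):
        row = []
        for column in range(n_levels - 1):
            if level <= column:
                row.append(-1)            # levels up to the contrast's position
            elif level == column + 1:
                row.append(column + 1)    # the level being contrasted
            else:
                row.append(0)             # later levels are not involved
        rows.append(row)
    return rows


def helmert_encode(values, ordered_levels):
    """Map each categorical value to its row of the contrast matrix."""
    matrix = helmert_matrix(len(ordered_levels))
    index = {level: position for position, level in enumerate(ordered_levels)}
    return [matrix[index[value]] for value in values]


# Hypothetical ordered levels.
levels = ["low", "medium", "high", "very_high"]
encoded = helmert_encode(["low", "high", "very_high"], levels)
print(helmert_matrix(len(levels)))
print(encoded)
```

Each column of the matrix sums to zero, which is what makes it a contrast coding rather than a plain ordinal mapping.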
Modelling is performed at this stage. As the relationships are clearly not linear, as observed in EDA, Random Forest and XGBoost models are trained. XGBoost is chosen because it produces the highest accuracy (86%) and is compatible with the tool used for interpretation.
The ELI5 package is used to interpret the XGBoost model. It suits our objective because it generates feature contributions for each predicted instance, allowing creators/users to experiment with various values of the variables.
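The idea of per-instance feature contributions can be sketched on a single hand-built tree. To my understanding, ELI5 explains tree ensembles by following the decision path and crediting each split feature with the change in node score from parent to child; the code below is only a toy version of that idea, not ELI5's or XGBoost's API. The `Node` class, the tree, and the feature names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float                     # score of the rows reaching this node
    feature: Optional[str] = None    # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when row[feature] < threshold
    right: Optional["Node"] = None


def explain(root, row):
    """Return (bias, per-feature contributions, final score) for one row."""
    contributions = {}
    node = root
    while node.feature is not None:
        child = node.left if row[node.feature] < node.threshold else node.right
        # Credit the split feature with the change in score along the path.
        change = child.value - node.value
        contributions[node.feature] = contributions.get(node.feature, 0.0) + change
        node = child
    return root.value, contributions, node.value


# Hypothetical tree: smaller goals and shorter campaigns score higher.
TREE = Node(0.5, "goal", 10000,
            left=Node(0.7, "duration_days", 35,
                      left=Node(0.8), right=Node(0.6)),
            right=Node(0.3))

bias, contributions, score = explain(TREE, {"goal": 5000, "duration_days": 30})
print(bias, contributions, score)
```

The bias plus the contributions always adds up to the final score, which is what lets a dashboard show how much each variable moved the prediction.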
The model (XGBoost) and the ELI5 tool are wrapped in an API that serves a dashboard built with Plotly. The app is hosted on Heroku so that creators can validate and tune their decisions to maximize the probability of success. The dashboard can be accessed here
Limitations
- Content such as the videos and images used to present the project influences donors to some degree, and this signal is not taken into consideration.
- The experience of the person hosting the project (how many backed and how many pledged) is not taken into consideration, although it gives an insight into how well the creator can market the product.
- The utility and relevance of the overall products are not quantified.