Kickstarter Projects — Do They Succeed?

Aditya Patkar
Nov 1 · 14 min read

A Data Science Perspective

“Fundraising is the gentle art of teaching the joy of giving.” — Henry Rosso

Kickstarter. Most of us have seen Kickstarter projects, be it in social media ads, blog posts or search results. Hell, you might even be one of those who treat Kickstarter like Instagram and browse it daily.

For the small portion of readers who don’t know what Kickstarter is: it is a platform for launching your own products or services through crowdfunding. You create a page with information about your creative product, which can be anything from a tech product to an art masterpiece, and set a funding goal. The money raised is then used to bring that product to life.

As cool as it sounds, Kickstarter projects are not guaranteed to succeed. Through this article, we’ll analyze the past performances of Kickstarter projects based on year of release, category, funding goal etc. The dataset we’re using is a Kaggle dataset comprising information about more than 300,000 Kickstarter projects up to the year 2018. Special thanks to Mickaël Mouillé for making this dataset a reality.

A quick shoutout to and its founder Aakash NS, who is also the instructor of ‘Data Analysis with Python: Zero to Pandas’, a free course for getting started with data science. The course is extremely thorough and easy to understand. In fact, this project is part of the same course. Do check it out if you want to get started with data science.

I have compiled my class notes in a Notion notebook folder which can be a handy guide for you. Do check it for a reference while going through the code.

If you want to play around with the code, an executable Jupyter Notebook can be found here. Without further ado, let’s get started.

Downloading the Dataset

Let's begin by downloading the data, and listing the files within the dataset.

import opendatasets as od

dataset_url = ''
# Download and extract the dataset into the current directory
od.download(dataset_url)

The dataset has been downloaded and extracted.

import os

data_dir = './kickstarter-projects'
# List the files that were downloaded
os.listdir(data_dir)

Output: the downloaded files are listed below. We will use the more recent 'ks-projects-201801.csv' file for our analysis.

Out: ['ks-projects-201612.csv', 'ks-projects-201801.csv']

Data Preparation and Cleaning

Let’s start by converting the dataset into a Pandas Data Frame. Pandas is a Python library that provides handy functions for cleaning, merging and operating on data. Its central object is the Data Frame, which is basically data arranged in tabular form. If you know a little bit of coding, a Data Frame can be thought of as a dictionary of lists, with each key naming a column.

We can read different types of files, e.g. CSV, JSON, XLSX etc., and create a Data Frame from them. To know more about Pandas, check my Notion notebook.
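To make the "dictionary of lists" picture concrete, here is a minimal sketch. The rows are made up for illustration and the name toy_projects is hypothetical; nothing here comes from the Kaggle dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not taken from the Kaggle dataset
toy_projects = pd.DataFrame({
    'name': ['Board game', 'Indie album', 'Smart mug'],
    'main_category': ['Games', 'Music', 'Technology'],
    'usd_goal_real': [5000.0, 2000.0, 30000.0],
})

# Each dictionary key becomes a column and each list becomes that column's values
print(toy_projects)
print(toy_projects['main_category'].tolist())
```

Selecting a column with square brackets gives back a Series, which behaves much like the list we started with.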

# Importing the pandas library
import pandas as pd

# Reading the CSV file that we downloaded
kickstarter_data = pd.read_csv('./kickstarter-projects/ks-projects-201801.csv')

Let's take a look at our dataset.

We can check the shape of our dataset using the .shape attribute. It returns a tuple in the form (number of rows, number of columns). As we can see, the dataset contains information about more than 300,000 Kickstarter projects.
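As a rough sketch of what this inspection step can look like, the snippet below uses a small invented Data Frame (toy is a hypothetical stand-in, not the real data) together with the standard .head() method and .shape attribute.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are invented)
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'state': ['failed', 'successful', 'failed', 'canceled'],
    'backers': [3, 120, 0, 15],
})

# First two rows of the table
print(toy.head(2))

# .shape is an attribute, so it is read without parentheses
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```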

We don’t really need the ID and name columns as they don’t play any part in our analysis. We can remove columns by passing a list of their names to the .drop() method of the Data Frame. We also have to provide the axis argument, where 0 = rows and 1 = columns.

kickstarter_data = kickstarter_data.drop(['ID','name'], axis =1)

Going through the Kaggle discussion, it seems that the pledged column contains the amount pledged by the crowd in the currency the project is listed in. For an accurate analysis, the amounts in all rows should be converted to one currency. The amount converted to USD appears in two columns, usd pledged and usd_pledged_real, and the same goes for the goal column. According to further discussions on the Kaggle page, the usd pledged column holds a conversion made by Kickstarter that contains a lot of errors, while usd_pledged_real and usd_goal_real were converted using a more accurate API. We will use the latter two, so we can drop the pledged, goal and usd pledged columns.

kickstarter_data = kickstarter_data.drop(['pledged', 'goal', 'usd pledged'], axis =1)

Now let’s see what the Data Frame looks like.

The exact time, down to the second, is not really important for us in the launched and deadline columns. We can just extract the year, month and day of the week of the launch. First let's convert the columns to datetime objects, which Python recognizes as dates and which make further operations easier.

# Converting to datetime objects and storing the result back in the Data Frame
kickstarter_data['launched'] = pd.to_datetime(kickstarter_data['launched'], errors='coerce')
kickstarter_data['deadline'] = pd.to_datetime(kickstarter_data['deadline'], errors='coerce')

# Extracting year, month and day of the week
kickstarter_data['launched_year'] = kickstarter_data['launched'].dt.year
kickstarter_data['launched_month'] = kickstarter_data['launched'].dt.month
kickstarter_data['launched_day'] = kickstarter_data['launched'].dt.weekday

# Dropping the deadline column as we don't need it
kickstarter_data = kickstarter_data.drop(['deadline'], axis=1)
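Here is a small sketch of the same idea on three invented timestamps, one of them deliberately malformed. It shows how errors='coerce' behaves and how the weekday is numbered. The variable names are hypothetical.

```python
import pandas as pd

# Invented launch timestamps; the last one is deliberately malformed
launched = pd.Series(['2015-08-11 12:12:28', '1970-01-01 01:00:00', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising an error
parsed = pd.to_datetime(launched, errors='coerce')

parts = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
    'weekday': parsed.dt.weekday,  # Monday is 0, Sunday is 6
})
print(parts)
```

Rows that could not be parsed end up with missing values in all three derived columns, so they are easy to spot and filter out later.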

With this, we have completed preparing our data for the analysis. Now, let’s explore the characteristics of the data.

Exploratory Analysis and Visualization

Learn more about matplotlib and seaborn here. You can also go through the documentation for both by simply visiting their respective websites.

Data insights using .describe()

This gives us some interesting preliminary insights into our data. As we can see,

  1. The average goal is around 45000 USD.
  2. The average amount pledged is around 9000 USD, which is far less than the average goal.
  3. The data for goal varies more with respect to the mean than the data for amount pledged.
  4. The launch years range all the way from 1970 to 2018, though 75% of the data lies between 2013 and 2018.
  5. On average, around 105 people have backed a Kickstarter project, but the variance is very high. In fact, 75% of projects have had fewer than 56 backers.
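For readers who have not used .describe() before, here is a minimal sketch on invented numbers (the Data Frame toy is hypothetical). It shows where figures such as the mean and the 75% mark come from.

```python
import pandas as pd

# Invented numbers, chosen only to show the shape of the output
toy = pd.DataFrame({
    'usd_goal_real': [1000.0, 5000.0, 20000.0, 150000.0],
    'usd_pledged_real': [50.0, 5200.0, 300.0, 12000.0],
    'backers': [2, 110, 7, 240],
})

# One column of summary statistics per numeric column
summary = toy.describe()
print(summary)

# Individual statistics can be read off by row label
print(summary.loc['mean', 'usd_goal_real'])
print(summary.loc['75%', 'backers'])
```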

Let’s import matplotlib.pyplot and seaborn.

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Comparing number of projects per year using bar chart

# Sorting the data into groups
kickstarter_obj = kickstarter_data.groupby(['launched_year', 'launched_month'])

# Getting counts of projects in a group using .size()
kickstarter_by_year = kickstarter_obj.size().reset_index(name='counts')

This is what the newly created Data Frame looks like:

Now that we have the Data Frame, let’s plot the bar graph. Here we’re using the barplot() function from the seaborn library. As arguments, we'll provide the launched_year series for the x-axis, the counts series for the y-axis and a data argument. Seaborn has good support for pandas: given a Data Frame in the data argument, it looks up the x and y series in that Data Frame by column name.
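A minimal sketch of such a call is shown here, assuming one row per year. The counts are invented and toy_by_year is a hypothetical name, so the resulting chart will not match the real data.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch also runs outside a notebook
import pandas as pd
import seaborn as sns

# Invented yearly counts, not the real Kickstarter numbers
toy_by_year = pd.DataFrame({
    'launched_year': [2013, 2014, 2015, 2016],
    'counts': [40, 65, 75, 55],
})

# x and y are column names that seaborn looks up in the data argument
ax = sns.barplot(x='launched_year', y='counts', data=toy_by_year)
ax.set_title('Projects per year (toy data)')
```

If a year appears in several rows (for example one row per month), barplot() shows the average of those rows by default, so it is worth checking that the frame has the granularity you intend to plot.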

[Figure: bar chart of the number of projects per launch year]

As we can see above,

  1. The number of projects on Kickstarter seems to have peaked in 2015.
  2. The data for 2018 is incomplete, so we should not take it as definitive. This assumes that all Kickstarter projects from 2009 onwards are included in the dataset.
  3. The data from 1970 seems to be bad or insignificant data.

Plotting a heatmap for yearly and monthly distribution of projects

# Sorting the data into groups by year and month
kickstarter_by_year_month = kickstarter_obj.size().reset_index(name='counts')

Now we use a function provided by pandas called .pivot(), which reshapes the Data Frame into a 2D matrix: the index argument names the column whose values become the rows, the columns argument names the column whose values become the columns, and the values argument names the column that fills the cells. Check it out below.

# Creating a 2D matrix (recent pandas versions require keyword arguments here)
kickstarter_by_year_month = kickstarter_by_year_month.pivot(index='launched_year', columns='launched_month', values='counts')

Now we plot a heatmap by passing the matrix to the heatmap() function provided by seaborn. The annot argument can be set to True if you want to show the values in the blocks.
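The two steps can be sketched together on a tiny invented table. The names toy_counts and matrix are hypothetical, and the numbers are not from the dataset.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch also runs outside a notebook
import pandas as pd
import seaborn as sns

# Invented counts for two years and two months
toy_counts = pd.DataFrame({
    'launched_year': [2014, 2014, 2015, 2015],
    'launched_month': [1, 2, 1, 2],
    'counts': [10, 12, 9, 15],
})

# Rows become years, columns become months, cells hold the counts
matrix = toy_counts.pivot(index='launched_year', columns='launched_month', values='counts')
print(matrix)

# annot=True writes each value into its block; fmt='d' formats it as an integer
ax = sns.heatmap(matrix, annot=True, fmt='d')
```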

[Figure: heatmap of the number of projects by launch year and month]

The heatmap above tells us more about the data.

  1. The data collection started in April 2009.
  2. It seems that entries without a launch date were initialized to January 1970, the start of “Unix Time”. We should ignore those for statistics involving dates; for other statistics, these entries are still relevant.
  3. The data available is only up to January 2018.
  4. As the bar plot suggested, the number of projects peaked between 2014 and 2015, with the most projects launched in July 2014.
  5. There seems to be an increase in the number of projects during October and November compared to August of every year, and a decrease in December. Finding out why would take further research.

Goal amount vs pledged amount using line chart

# Creating groups and computing the mean of the numeric columns for every group
kickstarter_data_mean = kickstarter_data.groupby(kickstarter_data.launched_year).mean(numeric_only=True)
# Resetting the index
kickstarter_data_mean = kickstarter_data_mean.reset_index('launched_year')

This is how our newly created data frame looks.

We’ll use seaborn to draw the line plots by passing both series to the sns.lineplot() function. We can draw both lines in the same graph by calling sns.lineplot() twice in the same cell. We will limit the x-axis to the years 2009 to 2018, since we found that rows with launched_year = 1970 were defaulted to that value and would distort the results.

[Figure: mean goal and mean pledged amount per launch year (line chart)]

This is a very interesting chart. It tells us that,

  1. As the lines do not intersect at any point, the mean amount pledged does not exceed the mean goal for any year.
  2. The mean goal, just like number of projects, seems to have peaked in 2015 before going down in 2016, 2017.
  3. The line for the mean amount pledged is fairly flat, indicating that even though the number of projects on the platform increased in 2013–2015, the amount pledged per project stayed roughly the same.

To be more sure about this, let’s quickly plot the yearly totals instead of the means.

kickstarter_data_sum = kickstarter_data.groupby(kickstarter_data.launched_year).sum(numeric_only=True)
kickstarter_data_sum = kickstarter_data_sum.reset_index('launched_year')

[Figure: total goal and total pledged amount per launch year (line chart)]

As we can see, even the line for the total amount pledged remains flat across the years where the goal amount peaked. This might have made many projects fail in the peak years, which in turn may have pushed the goal amounts down again.

Yearwise status comparison

# Grouping the data and getting counts
kickstarter_state = kickstarter_data.groupby([kickstarter_data.state, kickstarter_data.launched_year]).size().reset_index(name='counts')

[Figure: number of projects in each state per launch year]

As we can see,

  1. Failed projects outnumber successful projects in almost every year.
  2. As the previous graphs told us, while the goal amounts increased during 2014–2015, the amount pledged remained pretty much at the level of previous years. This in turn widened the gap in this graph between the number of successful and failed projects, as more goals were not met.
  3. Until 2014, the number of successful projects was on par with the number of failed projects.
  4. However, if we count cancelled + suspended + failed as unsuccessful, the number of unsuccessful projects exceeds the number of successful projects in every single year.

Pie chart to analyze the overall success rate

First we will group the dataset based on state and then count the number of projects for each state before plotting the pie chart.

We should drop the live projects as we don’t know if they eventually succeeded. We will also create a percentage column to display the percentage of every state in the chart. The round() function rounds a value to the nearest number with the specified number of decimals (here 2).

Now let’s plot the pie chart.
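The preparation and the plot could look roughly like the sketch below. It uses ten invented projects, hypothetical names (toy, by_state), and matplotlib's own pie() function as one possible way to draw the chart.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented states for ten projects
toy = pd.DataFrame({'state': ['failed'] * 5 + ['successful'] * 3 + ['canceled', 'live']})

# Count projects per state, then drop the ones that are still running
by_state = toy.groupby('state').size().reset_index(name='counts')
by_state = by_state[by_state['state'] != 'live']

# Share of each state in percent, rounded to 2 decimals
by_state['percentage'] = (by_state['counts'] / by_state['counts'].sum() * 100).round(2)
print(by_state)

fig, ax = plt.subplots()
ax.pie(by_state['percentage'], labels=by_state['state'], autopct='%.2f%%')
```

Note that the percentages are computed after dropping the live projects, so they add up to 100 over the remaining states.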

[Figure: pie chart of the share of projects in each state]

We can now say that out of all the projects in the dataset, 35.64% have succeeded while 52.6% have failed. Since the share of ‘undefined’ projects is negligible, the other states can be grouped with failed as “unsuccessful”. On Kickstarter, unsuccessful projects outnumber successful ones.

Deep Dive

Now we can answer a few more interesting questions about the dataset which can help us understand the data even better.

Q1: Which main categories have more successful projects than failed ones?

To answer this, we can group the data based on main_category and state and count the items in each group. We’ll call that column projects. Then we can plot a bar chart with projects on the X axis and main_category on the Y axis, with state set as hue, to compare and find out the answer.

kickstarter_category = kickstarter_data.groupby([kickstarter_data.main_category, kickstarter_data.state]).size().reset_index(name='projects')

[Figure: number of projects per main category, split by state (bar chart)]

Answer: Comics, Dance, Music and Theatre are the main categories where successful projects outnumber failed projects.

Insights:

  1. It looks like performing arts projects succeed more often than they fail.
  2. Art, Crafts, Design, Fashion, Film and Video, Food, Games, Journalism, Photography, Publishing and Technology have more failed projects than successful ones.
  3. Film and Video has the largest number of failed projects.

Q2: Which category had the most projects?

Let’s group the data based only on main_category and then use the size() function to count the projects in each category. Then we'll plot a basic bar chart.
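The grouping step is not shown in the post, so the following is only a sketch of how it might look. The rows are invented and toy_category1 is a hypothetical name; the projects column name is borrowed from the code used for the next question.

```python
import pandas as pd

# Invented projects, one row each
toy = pd.DataFrame({
    'main_category': ['Music', 'Film & Video', 'Music', 'Games', 'Film & Video', 'Film & Video'],
})

# One row per category with its number of projects, largest first
toy_category1 = (
    toy.groupby('main_category')
       .size()
       .reset_index(name='projects')
       .sort_values('projects', ascending=False)
)
print(toy_category1)
```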

[Figure: number of projects per main category (bar chart)]

Answer: Film and Video clearly has the most projects, followed by Music, which in turn is followed by Publishing and Technology.

Q3: Which category has the highest goal amount per project?

For that, we will group the data by main_category and compute the sum of usd_goal_real for every category before dividing it by the number of projects in that category. Finally, we will plot a bar chart to compare the results.

kickstarter_category2 = kickstarter_data.groupby(kickstarter_data.main_category).sum(numeric_only=True).reset_index('main_category')
kickstarter_category2['goal_per_project'] = kickstarter_category2['usd_goal_real'] / kickstarter_category1['projects']

[Figure: goal amount per project for each main category (bar chart)]

Answer: The goal amount per project in USD is highest for Technology. This is a really interesting chart.

  1. Even though the number of projects is highest for Film and Video, the goal set by project creators in Technology exceeds that of any other category.
  2. Journalism comes second with Film and Video following it.
  3. The lowest goal is for Dance.

Let’s compute the same for amount pledged and add dots for goals using sns.scatterplot() for comparing.

[Figure: pledged amount per project for each main category (bars), with goal per project marked as dots]

  1. Interestingly, the pledged amount per project for Design exceeds that of Technology.
  2. Games, which was not prominent in previous statistics, suddenly rises to the third spot in the graph.
  3. The goal amount per project is higher than the pledged amount per project in every category.
  4. It is still not clear where the difference between goal per project and pledged per project is largest.

Q4: Which category has the maximum difference in goal amount and pledged amount per project?

We will simply compute the difference between goal_per_project and pledged_per_project and then plot the graph of difference.

kickstarter_category2['Goal_pledged_difference'] = kickstarter_category2['goal_per_project'] - kickstarter_category2['pledged_per_project']

[Figure: difference between goal per project and pledged per project for each main category (bar chart)]

Answer: The difference between goal per project and pledged per project is largest for Journalism.

  1. Journalism is followed by Technology and Film and Video.
  2. The difference is lowest for Dance, followed by Photography.
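The arithmetic behind the last two questions can be sketched on a handful of invented projects. Here the per-project values are computed with mean(), which gives the same result as dividing the category sum by the number of projects. The names toy and per_project are hypothetical.

```python
import pandas as pd

# Invented goals and pledges, not taken from the dataset
toy = pd.DataFrame({
    'main_category': ['Dance', 'Dance', 'Technology', 'Technology', 'Journalism'],
    'usd_goal_real': [2000.0, 4000.0, 50000.0, 150000.0, 90000.0],
    'usd_pledged_real': [2500.0, 1500.0, 40000.0, 20000.0, 3000.0],
})

# mean() per category is the same as sum / number of projects
per_project = toy.groupby('main_category')[['usd_goal_real', 'usd_pledged_real']].mean()
per_project.columns = ['goal_per_project', 'pledged_per_project']

# Gap between what creators asked for and what backers pledged
per_project['Goal_pledged_difference'] = (
    per_project['goal_per_project'] - per_project['pledged_per_project']
)
print(per_project.sort_values('Goal_pledged_difference', ascending=False))
```

In this toy table the gap is largest for Journalism, but that is only because the numbers were picked that way.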

Q5: Which category has maximum pledged amount per backer?

We will calculate this by dividing usd_pledged_real by backers for every category. Finally we will plot a bar chart.

# Calculating amount pledged per backer
kickstarter_category3['pledged_per_backer'] = kickstarter_category3['usd_pledged_real'] / kickstarter_category3['backers']

[Figure: pledged amount per backer for each main category (bar chart)]

Answer: The pledged amount per backer is highest for Technology.

  1. The Technology category is a clear standout here with more than USD 120 per backer.
  2. Comics has the lowest pledged amount per backer, at just over USD 50.
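How kickstarter_category3 was built is not shown, so here is one plausible sketch, assuming it holds per-category sums of usd_pledged_real and backers. The data is invented and toy_category3 is a hypothetical name.

```python
import pandas as pd

# Invented pledges and backer counts
toy = pd.DataFrame({
    'main_category': ['Comics', 'Comics', 'Technology', 'Technology'],
    'usd_pledged_real': [1000.0, 4000.0, 30000.0, 6000.0],
    'backers': [40, 60, 200, 100],
})

# Sum pledged amounts and backers per category, then divide the totals
toy_category3 = toy.groupby('main_category')[['usd_pledged_real', 'backers']].sum().reset_index()
toy_category3['pledged_per_backer'] = toy_category3['usd_pledged_real'] / toy_category3['backers']
print(toy_category3)
```

Dividing the category totals weights every backer equally. Averaging the per-project ratios instead would give a different number and would run into projects with zero backers.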


The bigger picture is that we learned a lot about data today: how raw data can be made sense of, how visualizations play an important role in analyzing data and how great Python libraries are! :)

Future Work

  1. Comparison based on country.
  2. Kickstarter vs Indiegogo
  3. Predicting if a Kickstarter project would succeed or not.
