One Way to do a Descriptive Analysis: A Step-By-Step Guide

Jeremy Pagirsky
Published in The Startup · 7 min read · Jan 24, 2021

As I’m starting to work on my second project for the course, I can’t help but feel nostalgic for the simplicity of my last project. We did our first major descriptive analyses on movie data and were asked to make recommendations for a large company that wanted to start making movies. I love films (I feel like everyone says that), so it was exciting to blend familiar interests with newly acquired Python skills. I’m hoping this can be a helpful guide for anyone starting off with their first descriptive analyses.

The main goal of this project was quite open-ended: to see which factors are related to movie success. As such, these guidelines presented us with good news and bad news. Good news: we defined “success.” Bad news: we defined “success.” We had the liberty of asking and answering our own questions on what makes a film successful. While this autonomy was exciting, it was inextricably overwhelming. There are so many factors that contribute to these massive creative projects and an infinite number of lenses in which a film can be perceived as “successful.”

Me at the beginning of this project

After much deliberation, I was able to narrow my focus onto three measures of movie success across movie genres: gross domestic income (USD), online ratings (out of 10), and popularity (number of ratings). These measures existed across three separate dataframes. The majority of my preprocessing steps were performed using Pandas, one dataframe at a time. I began with the Box Office Mojo data and started removing any unnecessary columns with

df1 = df1.drop(columns=[...])

Afterwards, I checked for null values throughout the dataframe with

df1.isna().sum()

For this dataframe, there were only 28 null values out of the 3387 observations, which accounts for 0.8% of the data. To remove observations with null values, I ran

df1 = df1.dropna(subset=['domestic_gross'])

It is especially important here to consider the magnitude of the data loss and how large an effect it will have on your dataset. I am usually comfortable removing no more than 5% of the observations.
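That check can be made explicit before dropping anything. Here is a minimal sketch with toy data standing in for the Box Office Mojo dataframe (the column names match the ones above; the threshold is the 5% rule of thumb):

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the Box Office Mojo data
df1 = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "domestic_gross": [1_000_000, np.nan, 3_000_000, 4_000_000],
})

# Fraction of rows that would be lost by dropping nulls
pct_missing = df1["domestic_gross"].isna().mean() * 100
print(f"{pct_missing:.1f}% of rows have a missing domestic_gross")

# Only drop if the loss is within the 5% rule of thumb
if pct_missing <= 5:
    df1 = df1.dropna(subset=["domestic_gross"])
```

In this toy case 25% of rows are missing, so the guard leaves the dataframe untouched; on the real data (0.8% missing) the drop would go through.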

Afterwards, I checked for duplicates with

df1.duplicated().sum()

which returned 0 values. To see if there were any undetected duplicates, I ran

df1[df1['title'].duplicated()]

It is usually a good idea to check for duplicates using your unique identifiers (in this case, title, with year as a secondary reference).
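The difference between checking the identifier alone and checking it alongside a reference column matters for things like remakes. A small sketch (toy data; the column names mirror the ones used above):

```python
import pandas as pd

df1 = pd.DataFrame({
    "title": ["Alien", "Alien", "Heat"],
    "year": [1979, 2019, 1995],
    "domestic_gross": [80_000_000, 500_000, 67_000_000],
})

# Exact-row duplicates: none here, since the rows differ in year/gross
exact_dupes = df1.duplicated().sum()

# Duplicates on title alone flag both "Alien" rows, which may just be
# a remake or re-release rather than a true duplicate...
dupe_titles = df1[df1.duplicated(subset=["title"], keep=False)]

# ...so checking title together with year is safer before dropping
dupe_title_year = df1[df1.duplicated(subset=["title", "year"], keep=False)]
```

Here title alone flags two rows while title-plus-year flags none, so a blind drop on title would have discarded a legitimate movie.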

I then repeated these steps — removing unnecessary columns, nulls, and duplicates — for the two IMDB dataframes, which prepared me to merge everything together. The two IMDB dataframes contained the same unique movie identifier ('tconst'), so I was able to merge them seamlessly on that identifier using

imdb_df = df2.merge(df3, on='tconst', how='outer')

The Box Office Mojo dataframe, on the other hand, did not contain that identifier. I ran an outer merge (which, with no key specified, falls back to the columns the two dataframes share) to join all of the data together:

joined_df = imdb_df.merge(df1, how='outer')
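When merging without an explicit key it is worth auditing how well the join matched. A sketch with toy data (the shared column is assumed to be `title`, which is the common column in this project; `indicator=True` is a standard pandas option, not from the original post):

```python
import pandas as pd

imdb_df = pd.DataFrame({"tconst": ["t1", "t2"],
                        "title": ["Alien", "Heat"],
                        "rating": [8.5, 8.3]})
df1 = pd.DataFrame({"title": ["Alien", "Blow Out"],
                    "domestic_gross": [80_000_000, 12_000_000]})

# With no `on=`, pandas merges on all shared columns (here, just `title`).
# `indicator=True` adds a `_merge` column recording where each row came from,
# which makes it easy to see how many titles actually matched.
joined_df = imdb_df.merge(df1, how="outer", indicator=True)
print(joined_df["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the titles that failed to match and would carry nulls into the analysis.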

At this point, the newly joined dataframe contained all of the columns that I needed for analysis. However, one tricky thing I ran into was the genres column. Each row contained a string of all of the movie's genres, but I needed to isolate the genres for the analysis. One way to handle this is to pull the genres column out of the joined dataframe as its own Pandas series (I called mine new_genres_col). To separate each element, I ran

new_genres_col = new_genres_col.str.split(',')

Now here’s where it really gets fun. You can create a nested for loop to iterate each item in each row. I used this code:

list_of_genres = []
for row in new_genres_col:
    for x in row:
        list_of_genres.append(x)

I then converted this list to a set, using the code below, to keep only its unique values.

list_of_genres = list(set(list_of_genres))
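The same unique list can also be built without the explicit nested loop, using pandas' `Series.str.split` and `Series.explode` (a sketch with toy data; `explode` is an alternative I'm suggesting here, not the method used in the post):

```python
import pandas as pd

# Toy series standing in for the genres column of the joined dataframe
genres = pd.Series(["Action,Thriller", "Drama", "Action,Drama"])

# split each string into a list, flatten to one genre per row,
# then de-duplicate; sorting makes the result order deterministic
list_of_genres = sorted(genres.str.split(",").explode().unique())
print(list_of_genres)
```

This stays entirely inside pandas and avoids growing a Python list by hand.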

We’ll save this for later and return back to our joined dataframe.

List of genres

Even with a joined dataframe in hand, our preprocessing is not over yet. Now that each tconst is matched with additional features (i.e., title and genre), it is important to remove any other potential duplicates. I used:

joined_df = joined_df.drop_duplicates(subset=['title', 'genres'], keep='first')

And because there are differences in null values for each measure, I created separate corresponding dataframes (to be seen later).

One of the last preprocessing steps is to remove outliers. To help ascertain a range for outliers, you can use df.describe() to look at the interquartile range for numerical data. Based on those numbers and the range you feel is best for your analysis, you can use .loc to select values matching the criteria you choose. An easy one I started off with was year, keeping only movies up until 2020.

joined_df = joined_df.loc[joined_df['year'] <= 2020]
Summary of new dataframe
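For numeric columns like gross, the interquartile range from df.describe() can be turned into an explicit cutoff. A sketch using the common 1.5×IQR convention (the multiplier is a standard rule of thumb, not a choice from the original analysis):

```python
import pandas as pd

# Toy gross figures with one obvious outlier
gross = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Quartiles and interquartile range, as reported by describe()
q1, q3 = gross.quantile([0.25, 0.75])
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

filtered = gross[(gross >= lower) & (gross <= upper)]
```

The same boolean mask works with `.loc` on the full dataframe, just like the year filter above.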

Now, back to more exciting things: the list of genres. For each genre, I wanted to gather its average rating, median domestic gross, and median number of votes. One way to do this is to create a function that takes a genre as input and outputs the desired associated value. Here is the code I used as a template, plugging in the respective columns for each measure.

def med_gross(genre):
    """Finds the median domestic gross for a genre.
    Input: Genre as a string.
    Output: Domestic gross as an integer."""
    genre_dom_gross = []
    for x in dom_gross_df.index:
        if genre in dom_gross_df['genres'][x]:
            genre_dom_gross.append(dom_gross_df['domestic_gross'][x])
    return np.median(genre_dom_gross)
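Applied across the genre list, the template yields one value per genre. A self-contained sketch with toy data standing in for `dom_gross_df` (here I store genres as lists rather than comma-separated strings, but the `in` membership check works the same way):

```python
import numpy as np
import pandas as pd

# Toy stand-in for dom_gross_df; gross figures are arbitrary
dom_gross_df = pd.DataFrame({
    "genres": [["Action", "Thriller"], ["Drama"], ["Action"]],
    "domestic_gross": [10.0, 20.0, 30.0],
})

def med_gross(genre):
    """Median domestic gross across movies tagged with `genre`."""
    values = [dom_gross_df["domestic_gross"][i]
              for i in dom_gross_df.index
              if genre in dom_gross_df["genres"][i]]
    return np.median(values)

# One median per genre, in the same order as the genre list
genre_med_gross = [med_gross(g) for g in ["Action", "Drama", "Thriller"]]
```

The resulting list lines up index-for-index with the genre list, which is what makes the dataframe construction in the next step work.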

Afterwards, I created new dataframes from list_of_genres and the output values. Even though this is quite repetitive, this method generated what I was looking for. I defined my new dataframe as med_dom_gross_df by using

med_dom_gross_df = pd.DataFrame({"genres": list_of_genres, "median_domestic_gross": genre_med_gross})

Afterwards, I created new dataframes for each measure using the above code as a template.

As you can see, there are null values. While we initially got rid of our null values before merging, removing outliers could have eliminated all of the remaining values for these genres. Regardless, let's hold onto these for now until we repeat these steps for mean rating and median number of votes. Afterwards, you can create new columns using the code below

med_dom_gross_df['mean_rating'] = rating_of_genre

and then decide which genres to remove in the newly redefined new_joined_df.

It looks like there are only null values in the median_domestic_gross column. I removed these with

new_joined_df = new_joined_df.dropna()

Now, onto visualizations; these are crucial for our stakeholders to understand what's under the hood. Matplotlib.pyplot (imported as plt) on its own is totally fine for visualizations; its plots are simple and effective. At the same time, I'm one for aesthetics. Luckily, there is Seaborn (imported as sns), which has a wide array of options to give your visualizations a bit of character. Here they are side-by-side (left is matplotlib and right is seaborn). What's neat is that you can pull from plt and sns together to generate a visualization. I like to use plt code for the size dimensions of the figure because I find it more intuitive, but it's all personal preference!

plt.figure(figsize=(15, 8))

# bar plot with matplotlib
plt.bar('genres', 'median_domestic_gross', data=gross_df)
plt.xlabel("Genres", size=15)
plt.xticks(rotation=90)
plt.ylabel("Gross (USD)", size=15)
plt.title("Median Domestic Gross by Genre", size=18)
plt.show()

plt.figure(figsize=(15, 8))

# bar plot with seaborn
sns.set_style('darkgrid')
ax = sns.barplot(x='genres', y='median_domestic_gross',
                 palette='viridis_r', saturation=.75, data=gross_df)
plt.title('Median Domestic Gross by Genre')
plt.xlabel('Genre')
plt.ylabel('Median Gross (Million USD)')
plt.xticks(rotation=90)
plt.show()

To pull all three together, plotly express is able to generate scatterplots that incorporate color and size. While not shown in the photo, you can hover over the points and it will display the precise values for each measure.

df = new_joined_df
fig = px.scatter(
    df,
    x="mean_rating",
    y="median_domestic_gross",
    text='genres',
    log_x=True,
    size='median_numvotes',
    color="median_numvotes",
    labels={
        "genres": "Genre",
        "mean_rating": "Mean Rating (out of 10)",
        "median_numvotes": "Median Number of Votes per Movie",
        "median_domestic_gross": "Median Domestic Gross (Million USD)",
    },
    title="Median Number of Votes, Average Rating, and Median Domestic Gross per Genre",
)
fig.update_layout(margin=dict(l=50, r=50, t=50, b=50),
                  paper_bgcolor="LightSteelBlue")
fig.show()

I hope this has been helpful for your descriptive analysis, from first wrangling the data to creating visualizations and making recommendations for your stakeholders. I understand this may not be the most efficient route for a descriptive analysis, but it can certainly surface the necessary information. Please feel free to check out the GitHub repository for further info. Eager to hear your thoughts on this process!

Me at the end of this project
