
A Preliminary Analysis with Python

Understand your data before applying the machine learning techniques

Reza Rajabi
Jul 24, 2019 · 10 min read

In the previous article, I showed how to clean a messy dataset (I used a subset of the IMDB movies dataset). The next phase after cleaning is analyzing the data, with the hope of reaching your goal or even making a discovery. In fact, the type of analysis you perform depends heavily on that goal. Do you want to classify the data or just cluster it? Is your data labeled or not? Do you want to predict a value in your data or just perform a statistical analysis? There are other factors to take into consideration as well, including the number of features (columns) in your dataset, the size of your dataset, the types of the features, and so on.

Running a Python program in the cloud

This may not relate directly to the analysis, but to run a Python program at any time and from any place you need to have your code in the cloud. What about running your program on Google Cloud? If you have a Google account, you just need to create a notebook in Colaboratory. It is just like a Jupyter notebook, but hosted in the cloud. It is convenient to run your examples or simple projects in the cloud; however, note that the program is not running on your fast local machine. Another thing: installing Python libraries in the cloud is a bit different from installing them in a local Jupyter notebook, as it requires a few additional steps or commands. All in all, it is GOOD to run your code in Colaboratory when:

  • you want to share your Python code with someone else
  • you want to run a code at any time on any machine
  • your projects and data are allowed to be published on a cloud (privacy!)

Another plus of running your code on a cloud service is having access to your data in the cloud. For example, if you want to read a CSV file, you just need to copy it to your Google Drive. Although an authentication step is needed to access your Google Drive, it is worthwhile. As you can see in the figure below, you need to mount your drive to access the main folder of your Google Drive. The only thing you have to do is authorize Colaboratory to access it.
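In a Colaboratory cell, mounting the drive is a one-liner. This is environment-specific setup code: the google.colab helper module only exists inside a Colab notebook, so it will not run anywhere else.

```python
# Runs only inside a Colaboratory notebook: it prompts for Google
# authorization, then makes your Drive available under /content/drive
from google.colab import drive

drive.mount('/content/drive')
```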

[Figure: Google authentication for accessing Google Drive]

Then you can access your CSV file!

[Figure: Access to data in Google Drive]

Dataset analysis

First of all, let’s see what we have in the dataset.

[Figure: dataset columns]

It is a sample of the IMDB dataset including each movie's director name, duration, gross amount, genre, title, the year it was produced, the country, budget, its IMDB score, its number of Facebook likes, the actors' names (only the three main actors), and its GOB (the gross-over-budget metric that we created in the previous article).
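As a reminder from the previous article, GOB is simply the gross divided by the budget. A minimal sketch with made-up numbers (the column names mirror the dataset described above):

```python
import pandas as pd

# Tiny illustrative frame with made-up gross/budget figures
movies = pd.DataFrame({
    'movie_title': ['Movie A', 'Movie B'],
    'gross': [300.0, 50.0],
    'budget': [100.0, 100.0],
})
# GOB: gross over budget
movies['GOB'] = movies['gross'] / movies['budget']
print(movies['GOB'].tolist())  # → [3.0, 0.5]
```

A GOB above 1 means the movie earned back more than its budget.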

Analysis questions

To perform a preliminary analysis, we should outline a set of questions that we want to answer as a result of our analysis.

Note: Our data is a subset of the IMDB dataset. The reports show the results on this sample, not on all movies!

You can comment on this article if you have more interesting questions. My questions are:

  • Who are the directors of the top GOB movies?
  • Which year had the highest GOB?
  • Can we cluster the movies based on their GOB/IMDB scores?
  • Is there any relationship between IMDB score and Facebook likes?

Who are the directors of the top GOB movies?

To get the highest GOB scores, we first sort the dataset by GOB in descending order. In the following code, I take the 15 highest GOB scores and store them in the top_GOB dataframe.

top_GOB = dataset_imdb.sort_values('GOB', ascending=False).head(15)

As we have the directors' full names, I add another column that shortens each name, to better fit the labels on a graph. To do that, I just take the part after the first name, which is usually the family name.

top_GOB['director_familyName'] = top_GOB['director_name'].str.split(" ", n=1, expand=True)[1]
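To see what the split does, here is a quick sketch on a couple of made-up names; column 1 of the expanded result holds everything after the first space:

```python
import pandas as pd

# Made-up director names, just to illustrate the split
names = pd.Series(['Steven Spielberg', 'James Cameron'])
# expand=True returns a dataframe of name parts; column 1 is
# everything after the first space (which might be the family name)
family = names.str.split(" ", n=1, expand=True)[1]
print(family.tolist())  # → ['Spielberg', 'Cameron']
```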

The next step is simply visualizing them in a simple bar chart, as follows:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(7, 5))
# Draw a bar graph
ax = sns.barplot(x="director_familyName", y="GOB", data=top_GOB, ci=None)
# Rotate the directors' names 45 degrees
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
# Title the graph
fig.suptitle('Top movie directors with highest GOB', fontsize=12)
# Set the font size of the axis labels
ax.set_xlabel('Director name', fontsize=20)
ax.set_ylabel('Gross over Budget', fontsize=20)
# Set the tick size of each axis
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=15)
# Show the graph
plt.show()
[Figure: Top GOB movie directors]

Which year had the highest GOB?

The next report finds the years with high gross-over-budget scores across the movies. To do that, we group the records by title_year. A pandas pivot table groups rows and aggregates them over numeric columns, as shown in the following code.

# Group data based on movie year
dataset_pivot = pd.pivot_table(dataset_imdb, index=['title_year'],
                               values=['GOB'], aggfunc=np.mean, margins=True)
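To see what the pivot table produces, here is a toy version with made-up years and GOB values (aggfunc='mean' is the string equivalent of np.mean):

```python
import pandas as pd

# Toy frame standing in for the movie data
df = pd.DataFrame({'title_year': [2000, 2000, 2001],
                   'GOB': [1.0, 3.0, 4.0]})
# Mean GOB per year, plus an 'All' summary row added by margins=True
pivot = pd.pivot_table(df, index=['title_year'], values=['GOB'],
                       aggfunc='mean', margins=True)
# Rows: 2000 → 2.0, 2001 → 4.0, All → 8/3
print(pivot)
```

The 'All' row is the summary record that we drop below before plotting.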

The year column in the dataset includes some missing values; we will ignore them.

# Some cells are empty in the year column
dataset_pivot = dataset_pivot.loc[dataset_pivot['title_year'] > 0]

Finally, we will draw a bar chart that simply shows the year and average value of GOB for each year.

# We do not need the last record, which holds the summary metric
dataset_pivot = dataset_pivot[:-1]
# As the index changed after creating the pivot table, we reset it
dataset_pivot = dataset_pivot.reset_index()
# Some cells are empty in the year column
dataset_pivot = dataset_pivot.loc[dataset_pivot['title_year'] > 0]
fig, ax = plt.subplots(figsize=(9, 8))
ax = sns.barplot(x="title_year", y="GOB", data=dataset_pivot)
fig.suptitle('GOB over the years', fontsize=12)
plt.xlabel('Movie year')
plt.show()
[Figure: GOB over the years]

Can we cluster the movies based on their GOB/IMDB scores?

As our data is not labeled (meaning the records are not already assigned to categories or classes), we can use an unsupervised learning algorithm to cluster them. We would like to know whether the machine can find groups or categories in the data for us, instead of a user adding the labels. To this end, we can choose a clustering algorithm. The algorithm we use depends on several factors, including the nature of the data, its size, the number and types of features in our dataset, and so on. I use the K-Means algorithm, one of the most popular clustering algorithms, which works iteratively to assign each data point to one of K groups based on the provided features. The figure below shows how the algorithm works, and you can find more information about clustering here.

[Figure: K-Means clustering algorithm]
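To make the assign/update loop concrete, here is a hand-rolled sketch of K-Means on toy one-dimensional data. This is only an illustration of the iteration; the actual analysis below uses scikit-learn's KMeans.

```python
import numpy as np

# Toy 1-D data with two obvious groups
points = np.array([1.0, 1.2, 0.8, 8.0, 8.2, 7.8])
centroids = np.array([0.0, 10.0])  # initial guesses for K=2 centers

for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([points[labels == k].mean() for k in range(2)])

print(centroids)  # → [1. 8.]
```

The two steps repeat until the centroids stop moving; here they settle on the means of the two groups.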

In our dataset, we are looking for a relation between the GOB and IMDB scores, to see whether or not we can cluster movies based on these two metrics. First, we need to specify the number of clusters we are looking for; I chose three. If you visualize your data in a scatter plot, you may be able to guess the number of groups the algorithm will find, but if not, do not worry: you can change the number of clusters afterwards and run the algorithm again to see whether you get a better result. I use a simple scatter plot in Python to visualize these two metrics (the GOB and IMDB score) on a graph.

plt.scatter(x=dataset['imdb_score'], y=dataset['GOB'])

And get this as the result:

[Figure: IMDB score and GOB in a scatter plot]

Now, I ask the algorithm to cluster my points into three groups.

from sklearn.cluster import KMeans

# Exclude the missing values and select only GOB and IMDB score
selected_dataset = dataset.loc[(dataset['GOB'] > 0) & (dataset['imdb_score'] > 0)][['imdb_score', 'GOB']]
# Cluster the dataset using the K-Means algorithm
cls = KMeans(n_clusters=3)
# Fit the model
cls.fit(selected_dataset)
# Get the center point and label of each group
centroids = cls.cluster_centers_
labels = cls.labels_

As the next step, I would like to color the points based on their groups and specify the center of each group with a marker (x).

# We have three clusters, one color for each
colors = ["g.", "r.", "b."]
# Convert the dataset to an array to traverse over the points
dataset_array = np.array(selected_dataset)
# Plot each point on a scatter plot, colored by its cluster
for i in range(len(dataset_array)):
    plt.plot(dataset_array[i][0], dataset_array[i][1], colors[labels[i]], markersize=25)
# Mark the center of each cluster with an x
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150)

The output looks like the following:

[Figure: K-Means clustering visualization]

We see three clusters in the figure. The first cluster (blue) contains the movies with a low GOB score (they did not sell well) and a low IMDB score (they were not ranked well by IMDB users). The second cluster includes the movies with a very good IMDB score but not a very high GOB score. The third cluster includes a few movies with IMDB scores between 7 and 8 and a very high GOB. As I said, this is a subset of the movies; you can reach a better interpretation when you apply the algorithm to all of them.

Is there any relationship between IMDB score and Facebook likes of movies?

To find a relationship between an IMDB score and Facebook likes, we can use a regression algorithm. Simple linear regression models the relationship between a single input independent variable (in our case, the IMDB score) and an output variable (the number of Facebook likes) using a linear model, i.e., a line. This is just an example; in reality we may not find a linear relationship between two variables in a dataset. Find more information about regression algorithms here.
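As a minimal illustration of what LinearRegression learns, here is a fit on made-up points that lie exactly on the line y = 2x + 1; the fitted slope and intercept recover those coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
# Fit a line and read back its slope and intercept
model = LinearRegression().fit(x, y)
print(round(model.coef_[0], 6), round(model.intercept_, 6))  # → 2.0 1.0
```

With real data the points will not sit on a line, and the fit gives the least-squares best approximation instead.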

First of all, we scale down the Facebook likes, as the counts are in the tens of thousands, to better visualize the numbers on the graph.

selected_dataset['movie_facebook_likes'] = dataset['movie_facebook_likes'].apply(lambda row: row / 10000)

Then we specify the X and y values to be used in the regression (X is the IMDB score and y is the scaled Facebook likes):

X = selected_dataset['imdb_score'].values[:, np.newaxis]
y = selected_dataset['movie_facebook_likes'].values

The Seaborn library has a regression plot, which is a nice and powerful tool for plotting a linear regression graph. You just need to specify the x and y columns along with the actual dataset, as follows:

ax = sns.regplot(x='imdb_score', y='movie_facebook_likes', data=selected_dataset)

Then we set a few other options to make the graph more readable:

fig = ax.get_figure()
fig.suptitle('Relation between IMDB score and Facebook likes', fontsize=15)
# Set x-axis label
plt.xlabel('IMDB score')
# Set y-axis label
plt.ylabel('Facebook likes (10K)')
# Set the font family and size
font = {'family': 'Arial',
        'size': 14}
plt.rc('font', **font)

And get this as the result:

[Figure: Linear regression using Seaborn regplot]

You can adjust the settings in the function call to change the marker shape (marker='+'), the color (color='g'), or the marker size (scatter_kws={'s': 100}). If you do not have a large dataset, you can specify ci as well: ci is the size of the confidence interval for the regression estimate, drawn as a translucent band around the regression line. We can also set fit_reg=True to estimate and plot a regression model relating the x and y variables. There are other parameters in the documentation that you can set, based on the type of analysis you want.

ax = sns.regplot(x='imdb_score', y='movie_facebook_likes', data=selected_dataset, ci=70, scatter_kws={"s": 100}, order=2, fit_reg=True, color="g")

After using the Python code above, I redrew the graph, and this is what I got:

[Figure: Linear regression between IMDB score and Facebook likes]

Based on this graph, there is a linear relationship between movies' IMDB scores and their Facebook likes, which means a higher IMDB score results in more Facebook likes.

Having said that, we can use a simple linear regression algorithm to predict the Facebook likes of a movie based on its IMDB score. In the following code, a movie with an IMDB score of 8.3 gets an estimate of about 111,000 likes.

from sklearn.linear_model import LinearRegression

# Assign imdb_score to X as our input variable
X = selected_dataset['imdb_score'].values[:, np.newaxis]
# Assign Facebook likes to y
y = selected_dataset['movie_facebook_likes'].values
# Specify the type of model
model = LinearRegression()
# Fit the model with X and y, y)
# The IMDB score input for our prediction
X_test = [[8.3]]
# Make a prediction based on the linear regression model
y_predict = model.predict(X_test)
print('Facebook likes estimation (10K):', y_predict)
fig, ax = plt.subplots(figsize=(9, 8))
# Set x-axis label
plt.xlabel('IMDB score')
# Set y-axis label
plt.ylabel('Facebook likes (10K)')
# Visualize the data and the regression line
plt.scatter(X, y, color='b')
plt.plot(X, model.predict(X), color='g')
# Plot the prediction point with a distinct color and size
plt.scatter(X_test, y_predict, color='r', s=100)

The scatter plot:

[Figure: Linear regression with prediction]

I am wrapping up the second part of the article here. The type of analysis depends strongly on the goal(s) you want to achieve, and setting the goal before performing the actual analysis is crucial. In any case, I always start with a preliminary analysis using visualizations such as a scatter plot, bar chart, or line graph to see what is going on in my data. Even if you do not yet have a goal and want to publish a paper or start a new project, this approach gives you an idea about WHAT you can do with your dataset.

Well Red

Exploring the tech industry one bit at a time. Blog by REDspace.
