# A Preliminary Analysis with Python

## Understand your data before applying the machine learning techniques

In the previous article, I showed how to clean a messy dataset (I used a subset of the IMDB movies dataset). The next phase is analyzing the data with the hope of reaching your goal, or even making a discovery. In fact, the type of analysis you want to do depends highly on that goal. Do you want to classify the data or just cluster it? Is your data labeled or not? Do you want to predict a value in your data or just perform a statistical analysis? There are other factors you should take into consideration, including the number of features (columns) your dataset has, the size of your dataset, the types of the features, and so on.

# Running a Python program in the cloud

This may not relate to the analysis itself, but to run a Python program at any time and in any place you need to have your code in the cloud. What about running your program on Google Cloud? If you have a Google account, you just need to create a notebook in Colaboratory. It is just like a Jupyter notebook, but in the cloud. It is convenient to run your examples or simple projects in the cloud; however, note that the program is not running on your high-speed local machine. Also, installing Python libraries in the cloud is a bit different from installing them in a local Jupyter notebook, as it requires a few additional steps or sometimes extra commands! All in all, it is GOOD to run your code in Colaboratory when:

- you want to share your Python code with someone else
- you want to run your code at any time on any machine
- your projects and data are allowed to be published on a cloud (privacy!)

Another plus of running your code on a cloud service is having access to your data in the cloud. For example, if you want to read a CSV file, you just need to copy it to your Google Drive. Although an authentication step is needed to access your Google Drive, it is worthwhile. As you can see in the figure below, you need to mount your drive to access the main folder of your Google Drive. The only thing you should do is authorize Colaboratory to access your Google Drive.

Then you can access your CSV file!

# Dataset analysis

First of all, let’s see what we have in the dataset.

It is a sample of the IMDB dataset including each movie's director name, duration, gross amount, genre, title, the year it was produced, the country, budget, its IMDB score, the number of its Facebook likes, the actors' names (only the first three main actors), and its GOB (the gross-over-budget metric that we created in the previous article).
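Before writing any queries, it helps to take a quick look at the dataframe with `head()` and `dtypes`. Below is a minimal sketch with two made-up rows; the column names are assumptions based on the description above and may differ slightly from the real CSV:

```python
import pandas as pd

# A hypothetical stand-in for the article's dataframe; the column
# names are assumed and the values are made up for illustration.
sample_imdb = pd.DataFrame({
    "director_name": ["Ana Lopez", "Ben Carter"],
    "duration": [120, 95],
    "gross": [300_000_000, 80_000_000],
    "genres": ["Action|Sci-Fi", "Drama"],
    "movie_title": ["Movie A", "Movie B"],
    "title_year": [2009, 2012],
    "country": ["USA", "UK"],
    "budget": [100_000_000, 40_000_000],
    "imdb_score": [7.9, 6.4],
    "movie_facebook_likes": [33000, 5400],
    "GOB": [3.0, 2.0],
})

# A quick look at the first rows and the column types
print(sample_imdb.head())
print(sample_imdb.dtypes)
```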

# Analysis questions

To perform a preliminary analysis, we should outline a set of questions that we want to answer as a result of our analysis.

Note: Our data is a subset of the IMDB dataset. The reports show the results on this sample, not all the movies!

You can comment on this article if you have more interesting questions. My questions are:

- Who are the directors of the top GOB movies?
- Which year had the highest GOB?
- Can we cluster the movies based on their GOB/IMDB scores?
- Is there any relationship between IMDB score and Facebook likes?

# Who are the directors of the top GOB movies?

To get the highest GOB scores, we first sort the dataset by GOB in descending order. In the following code, I take the top 15 GOB scores and store them in the `top_GOB` dataframe.

`top_GOB = dataset_imdb.sort_values('GOB', ascending=False).head(15)`

As we have the directors' full names, I add another column that shortens their names to better visualize them in a graph. To do that, I just take the second part of the name, which is usually the family name.

`top_GOB['director_familyName'] = top_GOB['director_name'].str.split(' ', n=2, expand=True)[1]`
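As a quick sanity check, the sort-and-split steps above can be sketched on a toy frame (the director names and GOB values here are made up):

```python
import pandas as pd

# Toy stand-in for dataset_imdb with made-up names and GOB values
toy = pd.DataFrame({
    "director_name": ["Ana Lopez", "Ben Carter", "Cara Diaz"],
    "GOB": [1.2, 5.4, 3.3],
})

# Sort by GOB descending and keep the top 2
toy_top = toy.sort_values("GOB", ascending=False).head(2).copy()

# Take the second token of the full name as a rough family name
toy_top["director_familyName"] = (
    toy_top["director_name"].str.split(" ", n=2, expand=True)[1]
)

print(toy_top["director_familyName"].tolist())  # ['Carter', 'Diaz']
```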

The next step is visualizing them in a simple bar chart, as follows:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(7, 5))
# Draw a bar graph
ax = sns.barplot(x="director_familyName", y="GOB", data=top_GOB, ci=None)
# Rotate the directors' names 45 degrees
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
# Title the graph
fig.suptitle('Top movie directors with highest GOB', fontsize=12)
# Set the font size of the axis labels
ax.set_xlabel('Director name', fontsize=20)
ax.set_ylabel('Gross over Budget', fontsize=20)
# Set the tick size of each axis
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=15)
# Show the graph
plt.show()
```

# Which year had the highest GOB?

The next report finds the years with high gross-over-budget scores across all the movies. To do that, we group the records by `title_year`. A pivot table in pandas is used to group rows and aggregate them based on numeric columns, as shown in the following code.

```python
import numpy as np

# Group data based on movie year
dataset_pivot = dataset_imdb.pivot_table(index=['title_year'], values=['GOB'],
                                         aggfunc=np.mean, margins=True)
```

The year column in the dataset includes some missing values, which we will filter out:

```python
# Some cells are empty in the year column
dataset_pivot = dataset_pivot.loc[dataset_pivot['title_year'] > 0]
```

Note that this filter assumes `title_year` is a regular column, so in the full code below it runs after `reset_index()`.

Finally, we draw a bar chart that shows the average GOB value for each year.

```python
# We do not need the last record, which includes the summary metric
dataset_pivot = dataset_pivot[:-1]
# As the index changed after creating the pivot table, we reset it
dataset_pivot.reset_index(inplace=True)
# Some cells are empty in the year column
dataset_pivot = dataset_pivot.loc[dataset_pivot['title_year'] > 0]

fig, ax = plt.subplots(figsize=(9, 8))
ax = sns.barplot(x="title_year", y="GOB", data=dataset_pivot)
fig.suptitle('GOB over the years', fontsize=12)
plt.xlabel('Movie year')
plt.show()
```
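The whole pivot-and-clean sequence can be sketched end to end on toy data (the years and GOB values below are made up):

```python
import pandas as pd
import numpy as np

# Toy data: three movies across two years (values are made up)
movies = pd.DataFrame({
    "title_year": [2009, 2009, 2010],
    "GOB": [2.0, 4.0, 3.0],
})

# Mean GOB per year; margins=True appends an 'All' summary row at the end
pivot = movies.pivot_table(index=["title_year"], values=["GOB"],
                           aggfunc=np.mean, margins=True)

pivot = pivot[:-1]               # drop the 'All' summary row
pivot.reset_index(inplace=True)  # turn title_year back into a column

print(pivot)  # 2009 -> GOB 3.0, 2010 -> GOB 3.0
```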

# Can we cluster the movies based on their GOB/IMDB scores?

As our data is not labeled (meaning the records do not belong to predefined categories or classes), we can use an unsupervised learning algorithm to cluster them. We would like to know whether the machine can find groups or categories in the data for us, instead of a user adding the labels. To that end, we can choose a clustering algorithm to find groups of data. The algorithm we use depends on several factors, including the nature of the data, its size, the number and types of features in the dataset, etc. I use the K-Means algorithm, one of the most popular clustering algorithms, which works iteratively to assign each data point to one of *K* groups based on the features provided. The figure below shows how this algorithm works. You can also find more information about clustering here.

In our dataset, we are looking for a relation between the GOB and IMDB score to see whether or not we can cluster movies based on these two metrics. First, we need to specify the number of clusters we are looking for; I chose three. If you visualize your data in a scatter plot, you may be able to guess the number of groups you might get from the algorithm, but if not, do not worry: you can change the number of clusters afterwards and run the algorithm again to see whether you get a better result. I use a simple scatter plot in Python to visualize these two metrics (the GOB and IMDB score) on a graph.

```python
plt.scatter(x=dataset['imdb_score'], y=dataset['GOB'])
plt.show()
```

And get this as the result:

Now, I ask the algorithm to cluster my points into three groups.

```python
from sklearn.cluster import KMeans

# Exclude the missing values and select only GOB and IMDB score
selected_dataset = dataset.loc[
    (dataset['GOB'] > 0) & (dataset['imdb_score'] > 0)][['imdb_score', 'GOB']]
# Cluster the dataset using the K-Means algorithm
cls = KMeans(n_clusters=3)
# Fit the model to the data
cls.fit(selected_dataset)
# Get the center point and the label of each group
centroids = cls.cluster_centers_
labels = cls.labels_
```

As the next step, I color the points based on their groups and mark the center of each group with an "x".

```python
# We have three clusters, one color for each cluster
colors = ["g.", "r.", "b."]
# Create an array from the dataset to traverse over the points
dataset_array = np.array(selected_dataset)
# Plot the points on a scatter plot
for i in range(len(dataset_array)):
    plt.plot(dataset_array[i][0], dataset_array[i][1],
             colors[labels[i]], markersize=25)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150)
```

The output looks like the following:

We see three clusters in the figure. The first cluster (blue) contains the movies with a low GOB score (they did not sell well) and a low IMDB score (they were not ranked well by IMDB users). The second cluster includes the movies with a very good IMDB score but not a very high GOB score. The third cluster includes a few movies with IMDB scores between 7 and 8 and a very high GOB. As I said, this is a subset of the movies; you can get a better interpretation when you apply the algorithm to all of them.
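For reference, here is a self-contained sketch of the same clustering pipeline on made-up (imdb_score, GOB) points, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (imdb_score, GOB) points forming three well-separated groups
points = np.array([
    [3.0, 0.5], [3.2, 0.6], [2.8, 0.4],   # low score, low GOB
    [8.5, 1.0], [8.7, 1.2], [8.3, 0.9],   # high score, modest GOB
    [7.2, 9.0], [7.5, 9.5], [7.8, 8.8],   # mid score, very high GOB
])

# Fit K-Means with three clusters; fixed seed for reproducibility
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(points)

print(km.cluster_centers_.shape)  # (3, 2): one centre per cluster
print(sorted(set(km.labels_)))    # [0, 1, 2]
```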

# Is there any relationship between IMDB score and Facebook likes of movies?

To find a relationship between the IMDB score and Facebook likes, we can use a regression algorithm. A simple linear regression can model the relationship between a single independent input variable (in our case, Facebook likes) and an output variable (e.g., IMDB score) using a linear model, i.e., a line. This is just an example; in reality, we may not find a linear relationship between two variables in a dataset. You can find more information about regression algorithms here.

First of all, we scale the Facebook likes down (dividing by 10,000), as they all exceed 10,000, to better visualize the numbers in the graph.

`selected_dataset['movie_facebook_likes']=dataset['movie_facebook_likes'].apply(lambda row: row/10000)`

Then we specify the X and Y axes to be used in the regression graph:

```python
X = selected_dataset['imdb_score'].values[:, np.newaxis]
y = selected_dataset['movie_facebook_likes'].values
```

The Seaborn library has a regression plot, which is a very nice and powerful tool for plotting a linear regression graph. You just need to specify the X and Y axes along with the actual dataset, as follows:

`ax = sns.regplot(x='imdb_score', y='movie_facebook_likes', data=selected_dataset)`

Then we set the other configurations to make the graph more readable and fancy:

```python
fig = ax.figure  # grab the figure that holds the regression plot
fig.suptitle('Relation between IMDB score and Facebook likes', fontsize=15)
# Set x-axis label
plt.xlabel('IMDB score')
# Set y-axis label
plt.ylabel('Facebook likes (10K)')
fig.set_size_inches(9, 8)
# Set the font family and size
font = {'family': 'Arial', 'size': 14}
plt.rc('font', **font)
plt.show()
```

And get this as the result:

You can adjust the graph settings in the function to change the marker's shape (`marker='+'`), change the color with `color='g'`, or change the markers' size with `scatter_kws={'s': 100}`. If you do not have a large dataset, you can specify `ci` in the configuration as well; `ci` is the size of the confidence interval for the regression estimate, which is drawn as a translucent band around the regression line. We can also set `fit_reg=True` to estimate and plot a regression model relating the `x` and `y` variables. There are other parameters in the documentation that you can set, based on the type of analysis you want.

`ax = sns.regplot(x='imdb_score', y='movie_facebook_likes', data=selected_dataset, ci=70, scatter_kws={"s": 100}, order=2, fit_reg=True, color="g")`

After using the Python code above, I redrew the graph, and this is what I got:

Based on this graph, there is a linear relationship between movies' IMDB scores and their Facebook likes: a higher IMDB score is associated with more Facebook likes.

Having said that, we can use a simple linear regression algorithm to predict a movie's Facebook likes based on its IMDB score. In the following code, a movie with an IMDB score of `8.3` gets an estimate of `111,000` likes.

```python
from sklearn.linear_model import LinearRegression

# Assign imdb_score to X as our input variable
X = selected_dataset['imdb_score'].values[:, np.newaxis]
# Assign Facebook likes to y
y = selected_dataset['movie_facebook_likes'].values
# Specify the type of model
model = LinearRegression()
# Fit the model with X and y
model.fit(X, y)
# IMDB score input for our prediction
IMDB_score = 8.3
X_test = np.array([IMDB_score]).reshape(1, -1)
# Make a prediction based on the linear regression
y_predict = model.predict(X_test)
print('Facebook likes estimation (10K):', y_predict)

fig, ax = plt.subplots(figsize=(9, 8))
# Set x-axis label
plt.xlabel('IMDB score')
# Set y-axis label
plt.ylabel('Facebook likes (10K)')
# Visualize the data and the regression line
plt.scatter(X, y, color='b')
plt.plot(X, model.predict(X), color='g')
# Plot the prediction point with a distinct color and size
plt.scatter(X_test, y_predict, color='r', s=100)
plt.show()
```
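The prediction pipeline above can be verified on toy data that lies exactly on a line (the numbers below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on the line likes = 2 * score + 1
toy_X = np.array([6.0, 7.0, 8.0, 9.0])[:, np.newaxis]  # IMDB scores
toy_y = np.array([13.0, 15.0, 17.0, 19.0])             # likes (10K)

toy_model = LinearRegression()
toy_model.fit(toy_X, toy_y)

# Predict for an unseen score of 8.3: 2 * 8.3 + 1 = 17.6
prediction = toy_model.predict(np.array([[8.3]]))
print(round(float(prediction[0]), 2))  # 17.6
```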

The scatter plot:

I’m wrapping up the second part of the article here. The type of analysis depends strongly on the goal(s) you want to achieve, and setting the goal before performing the actual analysis is crucial. In any case, I always start with a preliminary analysis using some visualizations, such as a scatter plot, bar chart, or line graph, to see what is going on in my data. If you do not have any goals yet and want to publish a paper or start a new project, this approach gives you an idea about WHAT you can do with your dataset.