# A Preliminary Analysis with Python

## Understand your data before applying the machine learning techniques

In the previous article, I showed how to clean a messy dataset (I used a subset of the IMDB movies dataset). The next phase is analyzing the data with the hope of reaching your goal, or even making a discovery. The type of analysis you want to do depends highly on that goal. Do you want to classify the data or just cluster it? Is your data labeled or not? Do you want to predict a value in your data or just perform a statistical analysis? There are other factors you should take into consideration as well, including the number of features (columns) your dataset has, the size of your dataset, the types of the features, and so on.

# Running a Python program in the cloud

Running your Python program in the cloud makes sense if:

• you want to share your Python code with someone else
• you want to run your code at any time on any machine
• your project and data are allowed to be published on a cloud (mind privacy!)

After uploading your dataset, you can access your CSV file!
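Reading the uploaded CSV with pandas can be sketched as follows. The filename `movie_dataset.csv` is hypothetical (the article does not state the real one), so a tiny inline sample stands in for the file here to keep the snippet self-contained:

```python
import io
import pandas as pd

# In the cloud notebook you would read the uploaded file directly, e.g.:
# dataset_imdb = pd.read_csv('movie_dataset.csv')   # hypothetical filename
# A tiny inline sample stands in for the real file in this sketch:
csv_text = """director_name,title_year,imdb_score,GOB
James Cameron,2009,7.8,7.2
Christopher Nolan,2010,8.8,5.1
"""
dataset_imdb = pd.read_csv(io.StringIO(csv_text))
print(dataset_imdb.shape)           # → (2, 4)
print(list(dataset_imdb.columns))
```

A quick `dataset_imdb.head()` at this point is a good sanity check that the columns were parsed the way you expect.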

# Dataset analysis

First of all, let’s see what we have in the dataset.

It is a sample of the IMDB dataset, including each movie's director name, duration, gross amount, genre, title, the year it was produced, the country, budget, IMDB score, number of Facebook likes, the actors' names (only the first three main actors), and GOB (the gross-over-budget metric that we created in the previous article).

# Analysis questions

To perform a preliminary analysis, we should outline a set of questions that we want to answer as a result of our analysis.

Note: Our data is a subset of the IMDB dataset. The reports show results on this sample, not on all the movies!

You can comment on this article if you have more interesting questions. My questions are:

• Who are the directors of the top GOB movies?
• Which year had the highest GOB?
• Can we cluster the movies based on their GOB/IMDB scores?
• Is there any relationship between IMDB score and Facebook likes?

# Who are the directors of the top GOB movies?

To get the highest GOB scores, we first sort the dataset by GOB in descending order. In the following code, I take the top 15 GOB scores and store them in the `top_GOB` dataframe.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

top_GOB = dataset_imdb.sort_values('GOB', ascending=False).head(15)
```

As we have the directors’ full names, I add another column that shortens each name to better visualize it in a graph. To do that, I just take the second part of the name, which is usually the family name.

```python
# Keep only the part after the first space, which is usually the family name
top_GOB['director_familyName'] = top_GOB['director_name'].str.split(' ', n=1, expand=True)[1]
```

The next step is visualizing them in a simple bar chart, as follows:

```python
fig, ax = plt.subplots(figsize=(7, 5))
# Draw a bar graph
ax = sns.barplot(x="director_familyName", y="GOB", data=top_GOB, ci=None)
# Rotate the directors' names 45 degrees
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
# Title the graph
fig.suptitle('Top movie directors with highest GOB', fontsize=12)
# Set the font size of the axis labels
ax.set_xlabel('Director name', fontsize=20)
ax.set_ylabel('Gross over Budget', fontsize=20)
# Set the tick size of each axis
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=15)
# Show the graph
plt.show()
```

# Which year had the highest GOB?

The next report finds the years with high gross-over-budget scores across all the movies. To do that, we group the records by `title_year`. A pivot table in pandas groups rows and aggregates them over numeric columns, as shown in the following code.

```python
# Group data based on movie year
dataset_pivot = dataset_imdb.pivot_table(index=['title_year'],
                                         values=['GOB'],
                                         aggfunc=np.mean,
                                         margins=True)
```

The year column in the dataset includes some missing values, so we will ignore those rows. Finally, we draw a bar chart that shows the average GOB for each year.

```python
# We do not need the last record, which holds the summary ("All") margin
dataset_pivot = dataset_pivot[:-1]
# The pivot moved title_year into the index, so reset it back to a column
dataset_pivot.reset_index(inplace=True)
# Some cells are empty in the year column
dataset_pivot = dataset_pivot.loc[dataset_pivot['title_year'] > 0]

fig, ax = plt.subplots(figsize=(9, 8))
ax = sns.barplot(x="title_year", y="GOB", data=dataset_pivot)
fig.suptitle('GOB over the years', fontsize=12)
plt.xlabel('Movie year')
plt.show()
```

# Can we cluster the movies based on their GOB/IMDB scores?

As our data is not labeled (meaning the records are not already assigned to categories or classes), we can use an unsupervised learning algorithm to cluster them. We would like to know whether the machine can find groups or categories in the data for us, instead of a user adding the labels. In this light, we can choose a clustering algorithm to find groups of data. The algorithm we use depends on several factors, including the nature of the data, the data size, the number and types of features in our dataset, etc. I use the K-Means algorithm, one of the most popular clustering algorithms, which works iteratively to assign each data point to one of K groups based on the features that are provided. Below, you can see how this algorithm works in the figure. You can also find more information about clustering here.
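The iterative assign/update loop described above can be made concrete with a minimal sketch (toy 2-D points, K=2, not the article's dataset):

```python
import numpy as np

# Toy points: two obvious groups near (1, 1.5) and (8.5, 8.25)
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = points[:2].copy()   # naive initialisation: first two points

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # → [0 0 1 1]
```

Libraries like scikit-learn run exactly this loop (with smarter initialisation) under the hood, which is what we rely on next.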

In our dataset, we are looking for a relation between the GOB and IMDB scores, to see whether or not we can cluster movies based on these two metrics. First, we need to specify the number of clusters we are looking for; I chose three. If you visualize your data in a scatter plot, you may be able to guess the number of groups you will get from the algorithm, but if not, do not worry: you can change the number of clusters afterwards and run the algorithm again to see whether you get a better result! I use a simple scatter plot in Python to visualize these two metrics (the GOB and IMDB scores) on a graph.
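If you want something more systematic than re-running by eye, a common sanity check for the number of clusters is the "elbow" method: fit K-Means for several values of K and watch the inertia (within-cluster sum of squares) drop. A minimal sketch on made-up data (not the movie dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three well-separated groups
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                 for c in ([0, 0], [4, 4], [8, 0])])

inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias.append(model.inertia_)

# Inertia falls sharply up to the true number of groups (3 here),
# then flattens out -- the "elbow"
print([round(v, 1) for v in inertias])
```

The K where the curve bends is a reasonable choice; for our movies, three turned out to work well.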

```python
plt.scatter(x=dataset_imdb['imdb_score'], y=dataset_imdb['GOB'])
plt.show()
```

And get this as the result:

Now, I ask the algorithm to cluster my points into three groups.

```python
from sklearn.cluster import KMeans

# Exclude the missing values and select only the GOB and IMDB scores
selected_dataset = dataset_imdb.loc[(dataset_imdb['GOB'] > 0) &
                                    (dataset_imdb['imdb_score'] > 0)][['imdb_score', 'GOB']]
# Cluster the dataset using the K-Means algorithm
cls = KMeans(n_clusters=3)
# Fit the data to the model
cls.fit(selected_dataset)
# Get the center point and the label of each group
centroids = cls.cluster_centers_
labels = cls.labels_
```

As the next step, I would like to color the points based on their groups and specify the center of each group with a marker (x).

```python
# We have three clusters, one color for each cluster
colors = ["g.", "r.", "b."]
# Create an array of the dataset to traverse over the points
dataset_array = np.array(selected_dataset)
# Plot each point in the color of its cluster
for i in range(len(dataset_array)):
    plt.plot(dataset_array[i][0], dataset_array[i][1],
             colors[labels[i]], markersize=25)
# Mark the center of each cluster with an 'x'
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150)
plt.show()
```

The output looks like the following:

We see three clusters in the figure. The first cluster (blue) contains the movies with a low GOB score (they did not sell well) and a low IMDB score (they were not ranked well by IMDB users). The second cluster includes the movies with a very good IMDB score but not a very high GOB score. The third cluster includes a few movies with IMDB scores between 7 and 8 and a very high GOB. As I said, this is a subset of the movies; you can get a better interpretation when you apply the algorithm to all of them.

# Is there any relationship between IMDB score and Facebook likes of movies?

To find a relationship between the IMDB score and Facebook likes, we can use a regression algorithm. Simple linear regression models the relationship between a single independent input variable (in our case, the IMDB score) and an output variable (e.g., Facebook likes) using a linear model, i.e., a line. This is just an example; in reality, we may not find a linear relationship between two variables in a dataset. You can find more information about regression algorithms here.
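In other words, the model is a line `y = a*x + b` whose slope `a` and intercept `b` are chosen to minimize the squared error. A minimal sketch on made-up points that lie exactly on `y = 2x + 1` (so the fit recovers those coefficients):

```python
import numpy as np

# Made-up points lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Least-squares fit of a degree-1 polynomial, i.e., a line
a, b = np.polyfit(x, y, deg=1)
print(round(a, 2), round(b, 2))  # → 2.0 1.0
```

With real data the points scatter around the line, and the fitted `a` and `b` give the best straight-line summary of the trend.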

First of all, we scale the Facebook likes down, as the movies all have more than 10,000 likes, to better visualize the numbers in the graph.

```python
# Express likes in units of 10,000
selected_dataset['movie_facebook_likes'] = dataset_imdb['movie_facebook_likes'].apply(lambda row: row / 10000)
```

Then we specify the X and Y axes to be used in the regression graph:

```python
X = selected_dataset['imdb_score'].values[:, np.newaxis]
y = selected_dataset['movie_facebook_likes'].values
```

The Seaborn library has a regression plot, which is a very nice and powerful tool for plotting a linear regression graph. You just need to specify the x and y columns along with the actual dataset, as follows:

```python
ax = sns.regplot(x='imdb_score', y='movie_facebook_likes', data=selected_dataset)
```

Then we set the other configurations to make the graph more readable and fancy:

```python
# Get the figure that holds the regplot axes
fig = ax.get_figure()
fig.suptitle('Relation between IMDB score and Facebook likes', fontsize=15)
# Set x-axis label
plt.xlabel('IMDB score')
# Set y-axis label
plt.ylabel('Facebook likes (10K)')
fig.set_size_inches(9, 8)
font = {'family': 'Arial',
        'size': 14}
plt.rc('font', **font)
plt.show()
```

And get this as the result:

You can set options in the function to change the marker's shape (`marker='+'`), change the color (`color='g'`), or change the markers' size (`scatter_kws={'s': 100}`). If you do not have a large dataset, you can specify `ci` as well: `ci` is the size of the confidence interval for the regression estimate, and it is drawn as translucent bands around the regression line. We can also set `fit_reg=True` to estimate and plot a regression model relating the `x` and `y` variables. There are other parameters in the documentation that you can set, based on the type of analysis you want.

```python
ax = sns.regplot(x='imdb_score', y='movie_facebook_likes', data=selected_dataset,
                 ci=70, scatter_kws={"s": 100}, order=2, fit_reg=True, color="g")
```

After using the Python code above, I redrew the graph, and this is what I got:

Based on this graph, there is a linear relationship between movies' IMDB scores and their Facebook likes, which means a higher IMDB score results in more Facebook likes.

Having said that, we can use a simple linear regression algorithm to predict the Facebook likes of a movie based on its IMDB score. In the following code, a movie with an IMDB score of `8.3` gets an estimate of `111,000` likes.

```python
from sklearn.linear_model import LinearRegression

# Assign imdb_score to X as our input variable
X = selected_dataset['imdb_score'].values[:, np.newaxis]
# Assign Facebook likes to y
y = selected_dataset['movie_facebook_likes'].values
# Specify the type of model
model = LinearRegression()
# Fit the model with X and y
model.fit(X, y)
# IMDB score input for our prediction
IMDB_score = 8.3
X_test = np.array([IMDB_score]).reshape(1, -1)
# Make a prediction based on the linear regression
y_predict = model.predict(X_test)
print('Facebook likes estimation (10K):', y_predict)

fig, ax = plt.subplots(figsize=(9, 8))
# Set x-axis label
plt.xlabel('IMDB score')
# Set y-axis label
plt.ylabel('Facebook likes (10K)')
# Visualize the data and the linear regression line
plt.scatter(X, y, color='b')
plt.plot(X, model.predict(X), color='g')
# Plot the prediction point with its own color and size
plt.scatter(X_test, y_predict, color='r', s=100)
plt.show()
```

The scatter plot:

I’m wrapping up the second part of the article here. The type of analysis depends strongly on what goal(s) you want to achieve, and setting the goal before performing the actual analysis is crucial. In any case, I always start with a preliminary analysis using some visualizations, such as a scatter plot, bar chart, or line graph over my dataset, to see what is going on in my data. If you do not have a goal yet and want to publish a paper or start a new project, this approach gives you an idea about WHAT you can do with your dataset.

Written by

## Reza Rajabi

#### Data Scientist @REDspace 