Exploratory Data Analysis on Facebook Utilization Data

Abhinav Chintale
Published in Analytics Vidhya
9 min read · Aug 28, 2021

Performing EDA on whatever raw data you are given is one of the most important steps in analyzing the data and uncovering the story behind it.

Without further ado, let’s dive right into it.

The Excel file I am using for the data is given here. If anybody wants to use it, please feel free to do so.

I am using Google Colab to write the code, but you can use any IDE you prefer. The Excel file was on my Google Drive, so I first had to mount the drive on Colab.

Mounting the drive on Colab
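A minimal sketch of the mounting step, assuming the standard google.colab helper and the conventional /content/drive mount point. The import is guarded so the cell does nothing harmful outside Colab; the in_colab flag is a name introduced here for illustration.

```python
# Mounting Google Drive only works inside a Colab runtime, so the import is
# guarded to keep the cell harmless when it is run anywhere else.
try:
    from google.colab import drive  # module is only available on Colab
    drive.mount("/content/drive")   # asks for authorization on first run
    in_colab = True
except ImportError:
    in_colab = False

print("Drive mounted" if in_colab else "Not running on Colab; skipping mount")
```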

Mounting Drive

The above code mounts the drive onto Colab.

IMPORTING LIBRARIES:
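The four imports this walkthrough relies on, written with their usual aliases, look roughly like this:

```python
import numpy as np               # arrays and numerical routines
import pandas as pd              # tabular data structures (DataFrame, Series)
import matplotlib.pyplot as plt  # MATLAB-like plotting interface
import seaborn as sns            # statistical graphics built on matplotlib
```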

Libraries needed for EDA

Given below is some information about the libraries. Feel free to skip it if you already know about them.

import numpy as np

NumPy is a Python library used for working with arrays. It also has functions for linear algebra, Fourier transforms, and matrices. It was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python. In Python, lists serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. Arrays are used very frequently in data science, where speed and resources matter.

import pandas as pd

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. Before Pandas, Python was mostly used for data munging and preparation and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.

import matplotlib.pyplot as plt

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to be as usable as MATLAB, with the ability to use Python and the advantage of being free and open-source. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.

import seaborn as sns

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and is closely integrated with pandas data structures. Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on data frames and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

The code below reads the Excel file.

pd.read_excel is a built-in Pandas function that reads an Excel file into a data frame. Here the data frame is stored in a variable named df.

The head() function is used to get the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. By default, n=5.
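A minimal sketch of loading and previewing the data. The helper name load_facebook_data and the stand-in rows are hypothetical, introduced only so the snippet runs without the workbook. Of the column names, gender, tenure, and friend_count are the ones mentioned in this article.

```python
import pandas as pd


def load_facebook_data(path):
    """Read an Excel workbook (first sheet by default) into a DataFrame."""
    return pd.read_excel(path)


# Hypothetical stand-in rows; in the notebook, df would come from
# load_facebook_data(<path to the workbook on the mounted drive>).
df = pd.DataFrame({
    "gender": ["male", "female", None, "male", "female", "male"],
    "tenure": [266.0, 6.0, 13.0, None, 82.0, 15.0],
    "friend_count": [0, 12, 0, 35, 7, 3],
})

print(df.head())   # first 5 rows by default
print(df.head(2))  # first 2 rows
```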

Reading the data

To check that the data has been loaded and read, the code below is executed. As the output suggests, the data has been loaded correctly.

df.info() gives information about the columns present, i.e. each column's name, non-null count, and dtype.

Output of df.info()

The describe() function computes summary statistics for the columns of the data frame: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. By default it excludes character columns and summarizes only the numeric ones.

In the output below, we can see that the gender column is excluded because it contains non-numeric values.
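The two inspection calls can be tried on a small stand-in frame. The rows here are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical stand-in rows with one text column and two numeric columns.
df = pd.DataFrame({
    "gender": ["male", "female", "male", None],
    "tenure": [266.0, 6.0, None, 82.0],
    "friend_count": [0, 12, 35, 7],
})

df.info()                # prints column names, non-null counts, and dtypes
summary = df.describe()  # summarizes numeric columns only by default
print(summary)
```

The gender column does not appear in summary because it is not numeric.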

MISSING VALUES

Sometimes the data has missing values or inconsistencies, which can create problems later in the analysis, so it is important to replace the missing (null) values.

Normally there are two ways to handle them:

  1. Delete the entire row that has missing data.
  2. Fill the missing value with the mean, mode, or median of the data present in the column.

I personally prefer the second way, because deleting rows loses data, and for analysis or training it is good to have as much data as possible.

The code below shows how many null values are present in each column of the data frame.

Gender: 175 Null values

Tenure: 2 Null values

Null values for every single column

If you would like to check the null values for one particular column, you can use the command below.
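A short sketch of both checks, again on made-up rows rather than the real sheet:

```python
import pandas as pd

# Hypothetical stand-in rows with one null in gender and one in tenure.
df = pd.DataFrame({
    "gender": ["male", None, "female", "male"],
    "tenure": [10.0, 20.0, None, 40.0],
    "friend_count": [1, 2, 3, 4],
})

null_counts = df.isnull().sum()              # null count for every column
print(null_counts)

gender_nulls = df["gender"].isnull().sum()   # null count for a single column
print(gender_nulls)
```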

First I would like to explain how I replaced the null values in the gender column, and then how I implemented it in code. I replaced all the null values in the gender column with the mode of the data. The logic is that gender is a categorical variable in this data, i.e. it is either male or female. When a column has just two categories, the most frequently occurring value is a reasonable replacement for the null values.

Number of male users and number of female users

Here I replaced the null values in the gender column with the mode of data.
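One way to sketch the mode replacement, using made-up rows. The variable name most_common is introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows: two known males, one known female, two nulls.
df = pd.DataFrame({"gender": ["male", "female", None, "male", None]})

print(df["gender"].value_counts())    # nulls are not counted

# mode() returns a Series (there can be ties), so take its first value.
most_common = df["gender"].mode()[0]
df["gender"] = df["gender"].fillna(most_common)

print(df["gender"].isnull().sum())
```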

replacing with mode

Finally, after the replacement, we can see the null values are 0.

Number of null values after replacing with mode.

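Tenure is the number of days since the user joined Facebook. It is a numeric column, so it is logical to replace its missing values with the median, which is not distorted by a few very large or very small values. In the four blocks of code below we can see that:

a) Number of null values in tenure: 2

b) Median value = 412.0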

And finally, when the tenure is replaced with the median the null values turn to 0.
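The median replacement follows the same pattern. The rows below are made up, so the median differs from the value reported above.

```python
import pandas as pd

# Hypothetical stand-in rows with two missing tenure values.
df = pd.DataFrame({"tenure": [100.0, None, 300.0, 2000.0, None]})

print(df["tenure"].isnull().sum())    # nulls before the fill

median_value = df["tenure"].median()  # NaN values are skipped by default
df["tenure"] = df["tenure"].fillna(median_value)

print(df["tenure"].isnull().sum())    # nulls after the fill
```

Note how the single large value (2000) does not pull the median upward the way it would pull the mean.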

Replacing tenure with the median.

Finally, just to check whether we have handled all the missing values, let us run the df.isnull() check once more.

No null values present

Correlation Matrix

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

Correlation Matrix

Plotting a heat map for the correlation matrix.

Code to plot the heat map
Heat map
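A minimal sketch of computing the correlation matrix and drawing it with seaborn. The rows and the likes column are stand-ins, and the colour map and value range are choices made here, not necessarily those used for the original figure.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows; likes rises with friend_count, tenure falls.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "friend_count": [1, 2, 3, 4],
    "likes": [2, 4, 6, 8],
    "tenure": [40.0, 30.0, 20.0, 10.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr)

# annot=True writes each coefficient inside its cell.
# In a notebook the figure is rendered at the end of the cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
```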

Let us say our target variable is gender, and do the rest of the analysis with respect to it.
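Counting users per gender can be sketched with value_counts and a pandas bar chart. This is a stand-in for a seaborn count plot, and sns.countplot(data=df, x="gender") would draw an equivalent chart. The rows are made up.

```python
import pandas as pd

# Hypothetical stand-in rows.
df = pd.DataFrame({"gender": ["male", "female", "male", "male", "female"]})

counts = df["gender"].value_counts()  # number of rows per category
print(counts)

ax = counts.plot.bar(rot=0)           # pandas plotting uses matplotlib
ax.set_xlabel("gender")
ax.set_ylabel("number of users")
```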

Count plot male vs female

From the count plot above, we can clearly see that the number of male users exceeds the number of female users by a large margin.

In the given data sheet, friend_count refers to the number of friends each user has, so let’s see which gender has more friends.

Here I used the groupby function.

A little info about groupby aggregation: DataFrameGroupBy.agg(arg, *args, **kwargs)

The argument arg is the function to use for aggregating the data. If it is a function, it must work either when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, you can also pass a dict whose keys are DataFrame column names.

In the block of code below, groupby is used to find the total number of friends for each gender, along with a bar plot of the result.

Total number of friends for each gender
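A minimal sketch of the group-and-sum step, on made-up rows:

```python
import pandas as pd

# Hypothetical stand-in rows.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "friend_count": [10, 25, 5, 15],
})

# Split the rows by gender, then add up friend_count within each group.
friends_by_gender = df.groupby("gender")["friend_count"].sum()
print(friends_by_gender)

ax = friends_by_gender.plot.bar(rot=0)
ax.set_ylabel("total friend_count")
```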

Observation:

  • The total friend counts of male and female users are almost the same, as we can see from the bar graph above.
  • According to the values, though, there is a slight difference between the two totals.
  • Female: 9740258
  • Male: 9699059

Now we will check which gender has initiated more friendships or, in simpler words, which gender has sent more friend requests.

In the block of code below, groupby is used to find the total number of friendships initiated by each gender, along with a bar plot of the result.

Number of friendships initiated by each gender

Observations:

  • From the above code we can see that male users have initiated more friendships than female users.
  • The graph also shows a clear difference between the friendships initiated.
  • Female: 4584894
  • Male: 6053223

Now we will check which gender has spent more days on Facebook.

In the block of code below, groupby is used to find the total tenure (days on Facebook) for each gender, along with a bar plot of the result.
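Because a group total grows with the number of users in the group, it can help to look at the user count and the mean next to the sum. The sketch below uses pandas named aggregation on made-up rows. The output column names users, total_days, and mean_days are introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows: four male users and two female users.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "male", "female", "male"],
    "tenure": [100.0, 250.0, 150.0, 200.0, 300.0, 150.0],
})

# Named aggregation: output_column=(input_column, aggregation function).
tenure_by_gender = df.groupby("gender").agg(
    users=("tenure", "size"),
    total_days=("tenure", "sum"),
    mean_days=("tenure", "mean"),
)
print(tenure_by_gender)
```

In this toy data the male total is larger only because there are more male rows, while the female mean is higher.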

Observations:

  • From the above code we can see that the total number of days on Facebook is higher for male users than for female users. These are sums, so the larger number of male users contributes to the gap.
  • The graph also shows a clear difference between the total tenure of the two groups.
  • Female: 23637975.0
  • Male: 29614237.0

Analysis based on the least active users on Facebook

There are mainly three questions about the least active users:

  1. How many users have no friends?
  2. How many users did not like any posts?
  3. How many users did not receive any likes?

Here the analysis is user-wise.

Finding this is quite simple.

For each column, check whether the value is 0 and count the rows where it is.
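The zero check can be sketched as a boolean comparison followed by a sum. The column names likes and likes_received are assumptions about the sheet (only friend_count is named in this article), and the rows are made up.

```python
import pandas as pd

# Hypothetical stand-in rows.
df = pd.DataFrame({
    "friend_count": [0, 3, 0, 7],
    "likes": [0, 0, 0, 2],
    "likes_received": [1, 0, 0, 5],
})

cols = ["friend_count", "likes", "likes_received"]

# (df == 0) gives True/False per cell; summing counts the True values.
inactive = (df[cols] == 0).sum()
print(inactive)
```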

So, we get

a. 1962 users do not have any friends.

b. 22308 users did not like any posts.

c. 24428 users did not receive any likes.

Analysis based on the user accessibility (Mobile Devices vs. Web Devices)

  1. What is the average number of posts liked by users (based on gender) through web vs. mobile devices?

Here we can see that in both cases the average number of posts liked by female users is higher than that of male users. Below are two bar graphs that compare mobile device likes with web device likes for each gender.

mobile_likes_female vs web_likes_female
mobile_likes_male vs web_likes_male

  2. What is the average number of likes received by users (based on gender) through web vs. mobile devices?

Here we can see that in both cases the average number of likes received by female users is higher than that of male users. Below are two bar graphs that compare likes received on mobile devices with likes received on web devices for each gender.

mobile_likes_received_female vs web_likes_received_female
mobile_likes_received_male vs web_likes_received_male
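Both questions come down to a per-gender mean over the mobile and web columns. In the sketch below, the names mobile_likes, www_likes, mobile_likes_received, and www_likes_received are assumptions about how the sheet labels these columns, and the rows are made up.

```python
import pandas as pd

# Hypothetical stand-in rows.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "mobile_likes": [10, 40, 20, 60],
    "www_likes": [5, 15, 15, 25],
    "mobile_likes_received": [4, 30, 6, 50],
    "www_likes_received": [2, 20, 4, 10],
})

like_cols = ["mobile_likes", "www_likes",
             "mobile_likes_received", "www_likes_received"]

# Average of each like column within each gender.
avg_by_gender = df.groupby("gender")[like_cols].mean()
print(avg_by_gender)

# One grouped bar chart per question: likes given, then likes received.
ax_given = avg_by_gender[["mobile_likes", "www_likes"]].plot.bar(rot=0)
ax_received = avg_by_gender[
    ["mobile_likes_received", "www_likes_received"]
].plot.bar(rot=0)
```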

This is some of the EDA I have done. If you have anything to ask, please feel free to contact me.
