A Beginner's Guide to Exploratory Data Analysis (EDA) with Python

Taimoor Muhammad
9 min read · Jan 17, 2023


In this article, we will be diving into the world of Exploratory Data Analysis (EDA) for a customer churn dataset using Python libraries.

Source: https://analyticsarora.com/exploratory-data-analysis-in-python-beginners-guide-for-2021/

Churn, also known as customer attrition, is a crucial metric for any business as it measures the rate at which customers are leaving a company. By understanding and analyzing customer churn, companies can take steps to reduce it and improve their bottom line. In this post, we will discuss the importance of EDA in understanding customer churn, the different techniques used in EDA, and how to apply them to a customer churn dataset. By the end of this post, you will have a better understanding of EDA and how to use it to analyze data.

The Dataset

Found on Kaggle.com, the dataset we will be using for this project is called “Predict the Churn Risk Rate” and contains data on customers of a telecom company. The dataset is in .csv format and consists of 36,992 rows × 23 columns. We can start our analysis by importing the dataset with Pandas’ pd.read_csv function.

#Importing the dataset
import pandas as pd

path='churn.csv'
df=pd.read_csv(path,index_col=0)
df.head()

Although not shown in the snapshot above, the target variable for this project is churn_risk_score which indicates whether the customer has left the company (1: Yes, 0: No).

Understanding The Data

Before starting with data manipulation, it is good practice to use the .dtypes attribute to list the data type of each column. This helps in differentiating between numerical and categorical columns. A good thing to remember: even though all of the string columns are categorical features, not all of the numerical columns are continuous features.

For our dataset,

df.dtypes
age                               int64
gender                           object
security_no                      object
region_category                  object
membership_category              object
joining_date                     object
joined_through_referral          object
referral_id                      object
preferred_offer_types            object
medium_of_operation              object
internet_option                  object
last_visit_time                  object
days_since_last_login             int64
avg_time_spent                  float64
avg_transaction_value           float64
avg_frequency_login_days         object
points_in_wallet                float64
used_special_discount            object
offer_application_preference     object
past_complaint                   object
complaint_status                 object
feedback                         object
churn_risk_score                  int64
dtype: object

We can see there are 6 numerical columns and 17 categorical columns (string columns appear as the object dtype in Pandas). Knowing this, we can decide how to treat each column while performing EDA.

Before we start transforming our data, we need to ask ourselves: which of these columns provide relevant information for our analysis? For this project, we want to build a machine-learning model that predicts which customers are about to churn. To align with the scope of the project, we first remove the columns that represent user IDs, security_no and referral_id. We also remove joining_date, as it is not needed in our analysis, and finally we remove last_visit_time and avg_frequency_login_days, since we can conduct our analysis using days_since_last_login.

df=df.drop(columns=['security_no', 'referral_id','joining_date','last_visit_time','avg_frequency_login_days'])

Tip: Keep checking the shape of the dataset after each data transformation stage. In case of an error, it helps in identifying where the bug might be.

df.shape
(36992, 18)

Handling NaN values

Handling missing or null values, also known as “NaN” values, is an important step in the data-cleaning process. When analyzing a dataset, it is common to encounter missing values, which can have a significant impact on the analysis. There are several ways to handle missing values, such as:

  • Dropping rows or columns with missing values
  • Imputing missing values with a specific value (e.g. mean, median, mode)
  • Using machine learning algorithms that can handle missing values
  • Creating a new category for missing values

It is important to choose the appropriate method of handling missing values based on the specific dataset and the research question. In Python, NaN values can be very easily filtered out using built-in functions from Pandas.

#Checking NaN values in each column
check_nan = df.isnull().sum()
print("The number of NaN values in each column are: \n",check_nan)
The number of NaN values in each column are: 
age                                0
gender                             0
region_category                 5428
membership_category                0
joined_through_referral            0
preferred_offer_types            288
medium_of_operation                0
internet_option                    0
days_since_last_login              0
avg_time_spent                     0
avg_transaction_value              0
points_in_wallet                3443
used_special_discount              0
offer_application_preference       0
past_complaint                     0
complaint_status                   0
feedback                           0
churn_risk_score                   0
dtype: int64

Since our dataset has a large number of rows, and region_category and points_in_wallet are important aspects of our analysis, we will opt to drop the rows with NaN values, which can be done using the command listed below.

#Removing rows with NaN values as the remaining number of rows is sufficient
df=df.dropna()
df.shape
(28373, 18)
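Dropping rows is a reasonable choice here because plenty of data remains. For completeness, had we wanted to keep every row instead, a minimal imputation sketch (an alternative we did not use, applied before the dropna call and covering the three columns that contained NaN values) could look like this:

#Alternative (not used here): impute missing values instead of dropping rows
#Numerical column -> median, categorical columns -> mode
df_imputed = df.copy()
df_imputed['points_in_wallet'] = df_imputed['points_in_wallet'].fillna(df_imputed['points_in_wallet'].median())
df_imputed['region_category'] = df_imputed['region_category'].fillna(df_imputed['region_category'].mode()[0])
df_imputed['preferred_offer_types'] = df_imputed['preferred_offer_types'].fillna(df_imputed['preferred_offer_types'].mode()[0])
print(df_imputed.isnull().sum().sum())   #0 missing values remain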

Data Transformation

Data transformation is the process of modifying the dataset to make it easier to analyze and understand, fix data quality issues, make the data more suitable for modeling, and uncover patterns or relationships that are not immediately apparent in the raw data.

We will split the data transformation stage into two parts:

  1. Categorical
  2. Numerical

1. Categorical:

Handling categorical columns is not a one-size-fits-all approach and it’s important to carefully select the right method based on the characteristics of the dataset and the model being used. For this project, we will check and remove junk values from the categorical variables, then we will encode the data to prepare it for our machine-learning model.

Junk Variables

Junk values can have a negative impact on the performance of machine learning models. They can skew the distribution of the data and create bias in the model. Removing these values can improve the overall quality of the data by removing invalid or incorrect entries.

We now filter out the names of all the categorical columns and visualize them using the libraries Pandas, Seaborn, and Matplotlib.

#Getting the names of categorical columns
import matplotlib.pyplot as plt
import seaborn as sns

category = ['object']
categorical_column=df.select_dtypes(include=category)
categorical_column.columns
len(categorical_column.columns)
12

#Plotting a histogram for each categorical column
fig, axs = plt.subplots(len(categorical_column.columns))
fig.set_figheight(40)
fig.set_figwidth(10)
for i,j in enumerate(categorical_column.columns):
    sns.histplot(ax=axs[i],data=df,x=j,hue=j,multiple="stack",shrink=0.3)

(Apologies for the overlapping category names in some of the plots; please refer to the legend in those cases.)

Using matplotlib's subplots function, we use seaborn to plot a histogram of each categorical column. Through these visualizations, we can observe that gender, joined_through_referral, and medium_of_operation contain junk values, so we remove the rows with those values.

#gender
df = df.loc[df.loc[:,'gender'].isin(['M','F']), :].copy(deep=True)
df.shape
(28330, 18)

#joined through referral
df = df.loc[df.loc[:,'joined_through_referral'].isin(['No','Yes']), :].copy(deep=True)
df.shape
(24166, 18)

#medium of operation
df = df.loc[~df.loc[:,'medium_of_operation'].isin(['?']), :].copy(deep=True)
df.shape
(20694, 18)

df.reset_index(inplace=True,drop=True)

Encoding

Moving on, before we feed these categorical variables as inputs to our machine learning models, we need to change them to numerical categories. Why? Machine learning models are mathematical models that work with numerical values, not strings. String categories are typically encoded as numerical values before they can be used in machine learning models.

There are multiple techniques to transform categorical data, however, each of these differs in operation. So, before applying any of these techniques, it is best to understand them:

  • One-hot Encoding: This technique is used when the features are nominal (they do not have any order). In one-hot encoding, a new binary column is created for every category of a feature. This can result in a large number of columns, which can lead to overfitting and memory issues. A variation of this technique, called dummy encoding, uses one fewer column to represent the same information.
  • Label Encoding: This is a simple technique where each category is assigned a unique integer value. It implies an ordinal relationship between the categories, and the model may treat the integer values as if they were ordered (e.g. 1 is “less than” 2 and so on), which may not be the case.
  • Binary Encoding: This technique assigns a unique binary code to each category. Since it does not imply any ordinal relationship between the categories, no category is weighted more or less heavily simply because of the value it happens to be assigned.
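To make the difference concrete, here is a small, self-contained sketch on a toy column (the values are made up for illustration, not taken from our churn data) showing how one-hot and label encoding behave:

#Toy example (made-up values): one-hot vs. label encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'plan': ['basic', 'premium', 'gold', 'basic']})

#One-hot/dummy encoding: one binary column per category
print(pd.get_dummies(toy['plan'], prefix='plan'))

#Label encoding: each category mapped to an arbitrary integer
print(LabelEncoder().fit_transform(toy['plan']))   #e.g. [0 2 1 0] -- the ordering carries no meaning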

For our project, our categorical variables do not have inherent ordinality, so we will opt for binary encoding.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
for col in categorical_column.columns:
    x=np.array(df[col]).reshape(-1,1)
    y=enc.fit_transform(x).toarray()
    converted_list = [[str(int(v)) for v in sublist] for sublist in y]
    converted_list = [[''.join(sublist)] for sublist in converted_list]
    df[col]=np.array(converted_list)

2. Numerical:

Handling numerical columns is an important step when preparing a dataset for analysis. If the values of a numerical column are not on the same scale, it could cause some variables to dominate others during the training of a model. Also, if there are outliers in a numerical column, it could skew the results of the model.

Handling Outliers

We will now check the numerical variables for any outliers.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
#Getting numeric columns
df_numeric = df.select_dtypes(include=numerics)
numeric_col=df_numeric.columns

#Deciding the shape of the subplot grid (2 rows x 3 columns for our 6 numeric columns)
shape_subplot=int(len(numeric_col)/2)

#Plotting boxplots
figure, axis = plt.subplots(2,shape_subplot)
figure.set_figheight(10)
figure.set_figwidth(20)
for i,j in enumerate(numeric_col):
    if i<shape_subplot:
        axis[0,i].boxplot(df[str(j)])
        axis[0,i].set_title(j)
    else:
        axis[1,i-shape_subplot].boxplot(df[str(j)])
        axis[1,i-shape_subplot].set_title(j)

Since the existence of negative values in our numerical columns is against logic, we will further drop the rows in avg_time_spent, days_since_last_login, and points_in_wallet that have negative values. Why are we not removing all outliers? Our dataset is a depiction of real-world human behavior, which is represented by these outliers, so our model needs to be able to handle such cases. The shape of the dataset after removing rows with negative values is:

df=df[df['days_since_last_login']>0]
print(df.shape)
df=df[df['avg_time_spent']>0]
print(df.shape)
df=df[df['points_in_wallet']>0]
print(df.shape)
(19550, 18)
(18628, 18)
(18562, 18)

Scaling

Scaling is an important step when preparing a dataset for analysis because it adjusts the values of a numerical column so that they are on the same scale. This ensures that all features have a similar impact on the model by bringing their values to the same range. There are several ways to scale numerical columns, the most common methods are:

  • Min-Max Scaling: The values are transformed to a given range, usually between 0 and 1.
  • Standardization: This method scales the values so that they have a zero mean and unit variance.
  • Normalization: It scales the values between -1 and 1.
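For reference, here is a brief sketch of the first two methods using scikit-learn; these are alternatives we do not apply to our dataset below, and the column names are used only as examples:

#Sketch only: min-max scaling and standardization (not applied to our dataset)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaled_minmax = MinMaxScaler().fit_transform(df[['avg_transaction_value']])   #values squeezed into [0, 1]
scaled_standard = StandardScaler().fit_transform(df[['avg_time_spent']])      #mean 0, unit variance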

We will first check whether the numerical data is normally distributed, using sns.displot(df, x=column_name). Although the plots are not shown in this article, they are included in the code file; the values in these columns are not normally distributed. Therefore, normalization would be an appropriate scaling method to use for this dataset.
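As a rough sketch, that distribution check can be done with a short loop (the resulting plots are in the notebook and not reproduced here):

#Plot the distribution of each numerical column
for col in numeric_col:
    sns.displot(df, x=col)
plt.show()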

Using maximum absolute scaling: this method rescales each feature to the range -1 to 1 by dividing every observation by the feature's maximum absolute value. We can apply maximum absolute scaling in Pandas using the .max() and .abs() methods, as shown below.

#Apply normalization (maximum absolute scaling) to each numerical column
for column in numeric_col:
    df[column] = df[column] / df[column].abs().max()

Understanding Correlation

Checking for correlation between variables is an important step in the EDA phase of a machine learning project. Understanding how the variables relate to one another can help in developing hypotheses and in understanding the underlying structure of the data. Furthermore, correlation can be used to identify the most important features in the dataset and to understand how they relate to the target variable.

We will use the seaborn library to plot a correlation heat-map of the numerical columns:

corr = df[numeric_col].corr()
sns.heatmap(corr)

We can observe that the correlations between the variables are weak and mostly negative, which suggests the features carry largely non-redundant information. Hence, tree-based models such as decision trees and random forests are suitable choices for our classifier.

Target Variable

Finally, we will now check if our target variable is balanced or not.

#We check to see if the target variable is balanced or not
#define Seaborn color palette to use
colors = sns.color_palette('pastel')

#create pie chart
pie=df['churn_risk_score'].value_counts()
pie.plot.pie(autopct="%.1f%%")

Since our target variable is fairly balanced, we can move on to the next stage (a minimal sketch of these steps follows the list):

  • Split the dataset into training and testing
  • Apply classification algorithms(s)
  • Evaluate performance
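Below is a minimal sketch of those three steps; RandomForestClassifier and accuracy are illustrative choices rather than the notebook's final model, and for simplicity the sketch uses only the numerical features:

#Minimal sketch of the next stage -- illustrative, not the notebook's final model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

#For simplicity, use only the numerical features in this sketch
X = df[numeric_col].drop(columns=['churn_risk_score'])
y = df['churn_risk_score']

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Fit a classifier and evaluate its performance
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))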

A Python Notebook is available on my GitHub if you would like to follow the code step-by-step and offer insights, recommendations, and code improvements.

