Mastering Exploratory Data Analysis (EDA): Everything You Need To Know

Sze Zhong LIM · Published in Data And Beyond · Apr 6, 2024

The journey from raw data to actionable insights is often paved with challenges and uncertainties. At the heart of this journey lies Exploratory Data Analysis (EDA), a foundational process that serves as a compass, guiding data scientists through the intricate landscape of their datasets. EDA is not merely a preliminary step; it is a profound exploration that unlocks the hidden treasures buried within data, revealing patterns, anomalies, and relationships that form the bedrock of informed decision-making. Let's start with: what exactly is EDA?

Exploratory Data Analysis (EDA) is an analytical approach aimed at uncovering the inherent characteristics of datasets, utilizing statistical and visualization techniques.

Unlike hypothesis-driven analyses driven by prior domain knowledge, EDA is a flexible, open-ended exploration that allows data scientists to delve into the data without preconceived notions. Essentially, EDA serves as a preliminary step to inspire hypothesis generation by unveiling intriguing patterns, trends, and correlations within the data. In practical terms, EDA enables us to formulate hypotheses based on data-driven insights, which can then be tested against hypotheses grounded in domain knowledge, thereby enriching our understanding and validating our findings.

Exploratory Data Analysis (EDA) typically consists of several key components or stages that guide data scientists through the process of understanding and exploring a dataset. These components can vary depending on the specific goals of the analysis and the characteristics of the data, but commonly include:
1) Data Collection
2) Data Cleaning and Preprocessing
3) Descriptive Statistics
4) Univariate Analysis
5) Bivariate Analysis
6) Multivariate Analysis
7) Feature Engineering
8) Visualization

In this post, I will share my own method of EDA once I have settled on a dataset. There will be some differences from project to project, but the system is there so that most of the items are covered. I will split my EDA into 3 parts:
1) EDA Level 0 — Pure Understanding of Original Data
2) EDA Level 1 — Transformation of Original Data
3) EDA Level 2 — Understanding of Transformed Data

I will be using some examples from a proper EDA I did. The purpose of this post is just to share and log the code used, along with some examples of how EDA can be done. There might be parts where the insights found do not make sense, as this is just a part of a bigger EDA.

EDA Level 0 — Pure Understanding of Original Data

I did a basic check on the column datatypes, null counts, and distinct values to get a better understanding of the data. I also created a distinct-values count dictionary, where I got the top 10 counts and their distinct values displayed, so I could roughly gauge how significant the distinct values are in the dataset.

import pandas as pd


def column_summary(df):
    summary_data = []

    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()

        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })

    summary_df = pd.DataFrame(summary_data)
    return summary_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary(df)
display(summary_df)

Below is another code snippet that extracts more info.

import numpy as np

# Gets additional values such as min / median / max etc.
def column_summary_plus(df):
    result_data = []

    # Loop through each column in the DataFrame
    for column in df.columns:
        print(f"Start processing {column} col with {df[column].dtype} dtype")
        # Get column dtype
        col_dtype = df[column].dtype
        # Get distinct values and their counts
        value_counts = df[column].value_counts()
        distinct_values = value_counts.index.tolist()
        # Get number of distinct values
        num_distinct_values = len(distinct_values)
        # Get min and max values
        sorted_values = sorted(distinct_values)
        min_value = sorted_values[0] if sorted_values else None
        max_value = sorted_values[-1] if sorted_values else None

        # Get median value
        non_distinct_val_list = sorted(df[column].dropna().tolist())
        len_non_d_list = len(non_distinct_val_list)
        if len_non_d_list == 0:
            median = None
        else:
            median = non_distinct_val_list[len_non_d_list // 2]

        # Get average values if the column is numeric
        if np.issubdtype(df[column].dtype, np.number):
            if len_non_d_list > 0:
                average = sum(non_distinct_val_list) / len_non_d_list
                non_zero_val_list = [v for v in non_distinct_val_list if v > 0]
                # Average over the non-zero values only
                average_non_zero = (sum(non_zero_val_list) / len(non_zero_val_list)
                                    if non_zero_val_list else None)
            else:
                average = None
                average_non_zero = None
        else:
            average = None
            average_non_zero = None

        # Check if null values are present
        null_present = 1 if df[column].isnull().any() else 0

        # Get number of nulls and non-nulls
        num_nulls = df[column].isnull().sum()
        num_non_nulls = df[column].notnull().sum()

        # distinct_values only keeps the top 10 distinct value counts
        top_10_d_v = value_counts.head(10).index.tolist()
        top_10_c = value_counts.head(10).tolist()
        top_10_d_v_dict = dict(zip(top_10_d_v, top_10_c))

        # Collect the information for this column
        # (DataFrame.append was removed in pandas 2.0, so we gather rows in a list instead)
        result_data.append({'col_name': column, 'col_dtype': col_dtype, 'num_distinct_values': num_distinct_values,
                            'min_value': min_value, 'max_value': max_value,
                            'median_no_na': median, 'average_no_na': average, 'average_non_zero': average_non_zero,
                            'null_present': null_present, 'nulls_num': num_nulls, 'non_nulls_num': num_non_nulls,
                            'distinct_values': top_10_d_v_dict})

    result_df = pd.DataFrame(result_data, columns=['col_name', 'col_dtype', 'num_distinct_values',
                                                   'min_value', 'max_value',
                                                   'median_no_na', 'average_no_na', 'average_non_zero',
                                                   'null_present', 'nulls_num', 'non_nulls_num',
                                                   'distinct_values'])
    return result_df

# Example usage:
# Assuming df is your DataFrame
summary_df = column_summary_plus(df)
display(summary_df)

If there are any errors, they are most likely due to the datatypes of the pandas DataFrame. For those who are converting from one format to another, it is important to keep the datatype info. One example is saving a pandas DataFrame to a CSV file and then loading it back: it is important to save the datatypes and reload the CSV with them.

### To Save Pandas to CSV
import json


def dtype_to_json(pdf, json_file_path: str) -> dict:
    '''
    Parameters
    ----------
    pdf : pandas.DataFrame
        pandas.DataFrame so we can extract the dtype
    json_file_path : str
        the json file path location

    Returns
    -------
    dict
        The dtype dictionary used

    To create a json file which stores the pandas dtype dictionary for
    use when converting back from csv to pandas.DataFrame.
    '''
    dtype_dict = pdf.dtypes.apply(lambda x: str(x)).to_dict()

    with open(json_file_path, 'w') as json_file:
        json.dump(dtype_dict, json_file)

    return dtype_dict

def download_csv_json(df, mainpath):
    csvpath = f"{mainpath}.csv"
    jsonfp = f"{mainpath}_dtype.json"

    dtypedict = dtype_to_json(df, jsonfp)
    df.to_csv(csvpath, index=False)

    return csvpath, jsonfp


# Example usage:
download_csv_json(df, "/home/some_dir/file_1")

### To Load CSV to Pandas
def json_to_dtype(jsonfilepath):
    with open(jsonfilepath, 'r') as json_file:
        loaded_dict = json.load(json_file)
    return loaded_dict

def csv_to_pandas(csvpath, jsonpath):
    dtypedict = json_to_dtype(jsonpath)
    pdf = pd.read_csv(csvpath, dtype=dtypedict)

    return pdf

# Example usage:
csvfp = "/home/some_dir/file_1.csv"
jsonfp = "/home/some_dir/file_1_dtype.json"
df = csv_to_pandas(csvfp, jsonfp)

From this check, one obvious issue stands out: the C_ID column is not a primary key, since its number of distinct values is not equal to its number of non-nulls.
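A quick way to confirm this (a minimal sketch; C_ID is the identifier column referenced above):

# Check whether C_ID can serve as a primary key:
# every non-null value should appear exactly once.
n_distinct = df['C_ID'].nunique()
n_non_null = df['C_ID'].notnull().sum()
print(f"Distinct C_ID values: {n_distinct}, non-null rows: {n_non_null}")
print("C_ID looks like a primary key" if n_distinct == n_non_null else "C_ID has duplicates")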

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df.head())
print(df.describe())
print(df.duplicated().sum())

I will also run the three print statements above to:
1) Get an idea of what the dataset looks like
2) Know the median / mean / rough statistical distribution
3) Check that there are no duplicated rows

I will then run a fast and dirty check on the distributions using the code below. The code should produce histogram charts:

import matplotlib.pyplot as plt
import seaborn as sns

# Identify numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns

# Perform univariate analysis on numerical columns
for column in numerical_columns:
    # For continuous variables
    if len(df[column].unique()) > 10:  # Assuming if unique values > 10, consider it continuous
        plt.figure(figsize=(8, 6))
        sns.histplot(df[column], kde=True)
        plt.title(f'Histogram of {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.show()
    else:  # For discrete or ordinal variables
        plt.figure(figsize=(8, 6))
        ax = sns.countplot(x=column, data=df)
        plt.title(f'Count of {column}')
        plt.xlabel(column)
        plt.ylabel('Count')

        # Annotate each bar with its count
        for p in ax.patches:
            ax.annotate(format(p.get_height(), '.0f'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center',
                        xytext=(0, 5),
                        textcoords='offset points')
        plt.show()

Basically, what we want to do is get a rough idea of what the data is like, so that we can see if there are any inconsistencies and assumptions we might need to consider before attempting to transform the data. After our fast and dirty check, we will proceed towards EDA Level 1.

EDA Level 1 — Transformation of Original Data

Based on the Level 0 information, I decided to transform the dataset before doing a deeper exploration for more insights.
1) I changed the column names to be all lowercase, with spaces changed to underscores. I also changed them to names that I feel are more generic and categorized, for easy interpretation.
2) I filled in the null / NaN values with values I feel make sense. (I will show some examples: dr_cr_ratio, prop_pur_price, cc_out_bal_avg_mth)
3) I changed the datatypes to be more appropriate.
4) I did data validation.
5) I mapped / binned the categorical features.

Changing Column Names

I decided to change the column names to be more readable and standardized.

  • All small letters
  • All spaces changed to underscores
  • Shortforms standardized
### Rename the column names for familiarity
# This is if there is no requirement to use back the same column names.
# This is also only done if there is no pre-existing format, or if the col names don't follow conventional format.
# Normally will follow feature mart / dept format to name columns for easy understanding across board.

df_l1 = df.copy()
df_l1.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)
new_col_dict = {'pc': 'c_pc', 'incm_typ': 'c_incm_typ', 'gn_occ': 'c_occ',
                'num_prd': 'prod_nos', 'casatd_cnt': 'casa_td_nos', 'mthcasa': 'casa_bal_avg_mth',
                'maxcasa': 'casa_bal_max_yr', 'mincasa': 'casa_bal_min_yr', 'drvcr': 'dr_cr_ratio_yr',
                'mthtd': 'td_bal_avg', 'maxtd': 'td_bal_max', 'asset_value': 'asset_tot_val',
                'hl_tag': 'loan_home_tag', 'al_tag': 'loan_auto_tag', 'pur_price_avg': 'prop_pur_price',
                'ut_ave': 'ut_avg', 'maxut': 'ut_max', 'n_funds': 'funds_nos',
                'cc_ave': 'cc_out_bal_avg_mth', 'max_mth_trn_amt': 'cc_txn_amt_max_mth', 'min_mth_trn_amt': 'cc_txn_amt_min_mth',
                'avg_trn_amt': 'cc_txn_amt_avg_mth', 'ann_trn_amt': 'cc_txn_amt_yr', 'ann_n_trx': 'cc_txn_nos_yr'}
df_l1.rename(columns=new_col_dict, inplace=True)

I find renaming columns an important step because it lets me categorize the columns accordingly, and when analyzing features at a later stage, clear column names really help in quickly identifying what each one is. E.g., when I see features with the cc prefix, I know immediately it is a credit card feature. This allows me to identify trends along the way.

Filling Up Nulls / NANs

Deciding how to fill up the null values is one of the crucial parts of ensuring the model predicts accurately. We can only fill up the null values if we have domain understanding of the feature and also an understanding of the figures within the dataset. For me, this part usually takes up the most time, as it really requires understanding the data beforehand.

We can use several methods to figure out how to explore the data and fill up the null values. I will give two examples:

USING A BOXPLOT

sns.set(style="whitegrid")

# Create the boxplot
plt.figure(figsize=(10, 6)) # Set the size of the plot
sns.boxplot(x='c_incm_typ', y='casa_bal_max_yr', data=df_l1)

# Set labels and title
plt.xlabel('Income Type')
plt.ylabel('casa_bal_max_yr')
plt.title('Boxplot of casa_bal_max_yr by Income Type')
plt.yscale('log')

# Show the plot
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()

I was trying to find whether there was any relationship between casa_bal_max_yr and Income Type, such that I could use casa_bal_max_yr to infer a person's Income Type. Since there is no significant result and the overlap among the IQRs is too large, we will treat the missing values as a separate category (0) instead of trying to classify them as 1 to 8.

USING DESCRIBE

The describe method provides a clear view of the basic statistical information. In the case below, I was trying to check whether all home loans have a property purchase price value. By common sense, if a customer took up a home loan, they have a property, and that property has a value.

new_df = df_l1[['prop_pur_price','loan_home_tag']]
null_loan_home = new_df[new_df['loan_home_tag'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['prop_pur_price']].isnull().any(axis=1)].shape[0]
print("Number of rows where 'loan_home_tag' is null, but 'prop_pur_price' is not null:", not_null_count)

new_df = df_l1[['prop_pur_price','loan_home_tag']]
null_loan_home = new_df[new_df['prop_pur_price'].isnull()]
not_null_count = null_loan_home[~null_loan_home[['loan_home_tag']].isnull().any(axis=1)].shape[0]
print("Number of rows where 'prop_pur_price' is null, but 'loan_home_tag' is not null:", not_null_count)

new_df = df_l1[['prop_pur_price','loan_home_tag']]
condition = new_df['loan_home_tag'] == 1
new_df[condition].describe()

We can see that there were 5460 customers who had a property purchase price but did not take up a loan, and that there were 2243 customers who had taken up a loan but had no property purchase price.

We then used the describe method to see the distribution of property purchase prices for those who have taken a home loan, to decide how to impute the data. For this particular case, I decided to use the median to impute the nulls in the property purchase price.
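A minimal sketch of that imputation, assuming df_l1 and the renamed columns above (the exact median used may differ):

# Impute missing property purchase prices with the median price
# observed among customers who did take up a home loan.
median_price = df_l1.loc[df_l1['loan_home_tag'] == 1, 'prop_pur_price'].median()
df_l1['prop_pur_price'] = df_l1['prop_pur_price'].fillna(median_price)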

Changing the Data Type

dtype_mapping = {'c_id': str, 'c_age': int, 'c_pc': int, 'c_incm_typ': int, 'prod_nos': int,
                 'casa_td_nos': int, 'loan_home_tag': int, 'loan_auto_tag': int,
                 'funds_nos': int, 'cc_txn_nos_yr': int, 'u_id': int}

The datatypes were changed. The changes were minor and most likely won't affect the final result, but this was done mainly as a best practice. It is important to understand the data and use the appropriate datatypes, especially when dealing with larger amounts of data.
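A minimal sketch of applying that mapping (note that the integer casts only work once the nulls in those columns have been filled, since astype(int) fails on NaN):

# Cast each column to its intended dtype after the nulls have been filled.
df_l1 = df_l1.astype(dtype_mapping)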

Doing Data Validation

Given the earlier finding that c_id is not unique, let's do some data validation for that column.

In principle, c_id is supposed to be a unique value (representing one customer). But it is possible that c_id is duplicated due to input from multiple sources or snapshots in time, with the ETL process not being entirely clean. There are several ways to handle unclean data, but we must first understand what the unclean data is.

If the unclean data is duplicated data with only one or two column values differing, we can use a rank method to keep the row with the fewest nulls or the one with the latest data, or even combine the information.
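A minimal sketch of such a rank-based deduplication, assuming df_l1 and the c_id column, keeping the most complete row per customer:

# Rank duplicate c_id rows by how many nulls they contain,
# then keep the row with the fewest nulls for each customer.
df_l1['null_count'] = df_l1.isnull().sum(axis=1)
df_dedup = (df_l1.sort_values('null_count')
                 .drop_duplicates(subset='c_id', keep='first')
                 .drop(columns='null_count'))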

However, in serious cases where c_id is wrong, the common protocol would be to go back to the source, check what went wrong there, and only then figure out how to solve it. Let's have a deeper look into c_id to see what the issues are.

Based on the c_id information and a general read of the data, although the duplicated c_id values could contain information from different snapshots, that is unlikely to be the case, as some of the profiles do not conform to common sense.

Take customer 99731, for example: one row shows the person is 52 and has a degree, while another row shows the person is 65 and only has A-Levels.

For this particular case, I will just treat them as different rows, since the rows seem to share no common trend.

Mapping / Binning of Categorical Features

This part is important because the data might not be categorized according to industry convention. For instance, in the context of an ultra-wealthy client list, the difference between HDB 2 ROOM and HDB 3 ROOM might not matter; they might all be categorized under HDB.

If a feature has many values, we can either use one-hot encoding to split the values into multiple binary features, or use label encoding to map the different categories to numbers.

However, I think it is best to discuss with domain experts, come up with relevant categories, and bin them manually first, as sketched below.
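A minimal sketch of manual binning followed by encoding, applied on top of the renamed df_l1 (the non-HDB housing categories and the encoded column name are illustrative assumptions, not the exact mapping used in this project):

# Manually bin housing types into broader, domain-agreed categories
# (category names other than HDB are hypothetical examples).
hse_bin_map = {'HDB 2 ROOM': 'HDB', 'HDB 3 ROOM': 'HDB', 'HDB 4 ROOM': 'HDB',
               'CONDOMINIUM': 'PRIVATE', 'TERRACE': 'LANDED', 'BUNGALOW': 'LANDED'}
df_l1['c_hse_binned'] = df_l1['c_hse'].map(hse_bin_map).fillna('UNKNOWN')

# Option 1: label encoding turns each category into a number.
from sklearn.preprocessing import LabelEncoder
df_l1['c_hse_encoded'] = LabelEncoder().fit_transform(df_l1['c_hse_binned'])

# Option 2 (alternative): one-hot encoding splits the category into binary columns.
df_l1 = pd.get_dummies(df_l1, columns=['c_hse_binned'], prefix='c_hse')

The consolidated Level 1 renaming and null-filling code is below.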

df_l1 = df.copy()
df_l1.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)
new_col_dict = {'pc': 'c_pc', 'incm_typ': 'c_incm_typ', 'gn_occ': 'c_occ',
                'num_prd': 'prod_nos', 'casatd_cnt': 'casa_td_nos', 'mthcasa': 'casa_bal_avg_mth',
                'maxcasa': 'casa_bal_max_yr', 'mincasa': 'casa_bal_min_yr', 'drvcr': 'dr_cr_ratio_yr',
                'mthtd': 'td_bal_avg', 'maxtd': 'td_bal_max', 'asset_value': 'asset_tot_val',
                'hl_tag': 'loan_home_tag', 'al_tag': 'loan_auto_tag', 'pur_price_avg': 'prop_pur_price',
                'ut_ave': 'ut_avg', 'maxut': 'ut_max', 'n_funds': 'funds_nos',
                'cc_ave': 'cc_out_bal_avg_mth', 'max_mth_trn_amt': 'cc_txn_amt_max_mth', 'min_mth_trn_amt': 'cc_txn_amt_min_mth',
                'avg_trn_amt': 'cc_txn_amt_avg_mth', 'ann_trn_amt': 'cc_txn_amt_yr', 'ann_n_trx': 'cc_txn_nos_yr'}
df_l1.rename(columns=new_col_dict, inplace=True)
fill_values = {'c_edu': 'Unknown', 'c_hse': 'UNKNOWN', 'c_pc': 0, 'c_incm_typ': 0,
               'c_occ': 'UNKNOWN',
               'casa_td_nos': 0, 'casa_bal_avg_mth': 0, 'casa_bal_max_yr': 0, 'casa_bal_min_yr': 0,
               'td_bal_avg': 0, 'td_bal_max': 0,
               'loan_home_tag': 0, 'loan_auto_tag': 0,
               'ut_avg': 0, 'ut_max': 0, 'funds_nos': 0,
               'cc_txn_amt_max_mth': 0, 'cc_txn_amt_min_mth': 0, 'cc_txn_amt_avg_mth': 0,
               'cc_txn_amt_yr': 0, 'cc_txn_nos_yr': 0, 'cc_lmt': 0}
df_l1.fillna(fill_values, inplace=True)

Summary for EDA Level 1

At the end of EDA Level 1, there should be a clear table outlining how the nulls were filled, what the new column datatypes are, and whether each column is numerical / categorical / an identifier. This will make the next phase, where we use models / statistical methods to get feature importances and analysis, much easier.

EDA Level 2 — Understanding of Transformed Data

A recap: in EDA Level 0, we explored the raw data. In EDA Level 1, we explored the data more deeply and decided how to transform it.

In EDA Level 2, we will understand the transformed data. We will be using several tools such as:

  • Correlation Analysis
  • IV / WOE Values
  • Feature Importance from Models
  • Statistical Tests
  • Further Data Analysis on Imputed Data

Correlation Analysis

Correlation analysis allows us to see which features are highly correlated. Although removing highly correlated features is not a requirement for tree-based algorithms, since the algorithm will allocate feature importance, it is still good practice to check the correlations and remove features which are highly correlated.

Before we started the correlation analysis, I had already noticed in the Level 1 EDA that cc_txn_amt_avg_mth is actually derived from cc_txn_amt_yr. As such, I expect a very high correlation between these two features.

numerical_cols = ['c_age', 'prod_nos',
                  'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
                  'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
                  'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
                  'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth', 'cc_txn_amt_avg_mth',
                  'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
                    'loan_home_tag', 'loan_auto_tag']

# Assuming df_l2 is your transformed DataFrame
correlation_matrix = df_l2[numerical_cols].corr()

# Create the heatmap
plt.figure(figsize=(20, 16)) # Set the size of the plot
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Set title
plt.title('Correlation Heatmap')

# Show the plot
plt.tight_layout()
plt.show()

# Find the max correlation
upper_triangular = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
max_correlation = upper_triangular.max().max()
print(f"Maximum pairwise correlation: {max_correlation:.2f}")

The output is as below.

To get the individual pairwise correlation, we can use the code snippet below:

def corr_v(df_input, col1, col2):
    correlation_value = df_input[col1].corr(df_input[col2])
    return f"Correlation value between {col1} and {col2} is: {correlation_value}"

print(corr_v(df_l2, 'casa_bal_avg_mth', 'casa_bal_max_yr'))
print(corr_v(df_l2, 'td_bal_avg', 'td_bal_max'))
print(corr_v(df_l2, 'ut_avg', 'ut_max'))
print(corr_v(df_l2, 'cc_txn_amt_max_mth', 'cc_txn_amt_yr'))
print(corr_v(df_l2, 'cc_txn_amt_avg_mth', 'cc_txn_amt_yr'))
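Since the point of this check is to drop highly correlated features, below is a minimal sketch of how that could be done with a threshold (the 0.9 cutoff is an illustrative assumption):

# Drop one feature from every pair whose absolute correlation exceeds the threshold.
threshold = 0.9
upper = correlation_matrix.abs().where(
    np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(f"Features to drop: {to_drop}")
df_l2_reduced = df_l2.drop(columns=to_drop)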

IV / WOE Values

Information Value (IV) quantifies the predictive power of a feature. You may read up more about it here. In short, we are looking for an IV of 0.1 to 0.5.

def iv_woe(data, target, bins=10, show_woe=False):

    # Empty DataFrames
    newDF, woeDF = pd.DataFrame(), pd.DataFrame()

    # Extract column names
    cols = data.columns

    # Run WOE and IV on all the independent variables
    for ivars in cols[~cols.isin([target])]:
        print("Processing variable:", ivars)
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars])) > 10):
            binned_x = pd.qcut(data[ivars], bins, duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})

        # Calculate the number of events in each group (bin)
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ['Cutoff', 'N', 'Events']

        # Calculate % of events in each group.
        d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()

        # Calculate the non-events in each group.
        d['Non-Events'] = d['N'] - d['Events']
        # Calculate % of non-events in each group.
        d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'].sum()

        # Calculate WOE by taking the natural log of % of events divided by % of non-events
        d['WoE'] = np.log(d['% of Events'] / d['% of Non-Events'])
        d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
        d.insert(loc=0, column='Variable', value=ivars)
        print("Information value of " + ivars + " is " + str(round(d['IV'].sum(), 6)))
        temp = pd.DataFrame({"Variable": [ivars], "IV": [d['IV'].sum()]}, columns=["Variable", "IV"])
        newDF = pd.concat([newDF, temp], axis=0)
        woeDF = pd.concat([woeDF, d], axis=0)

        # Show WOE table
        if show_woe == True:
            print(d)
    return newDF, woeDF

numerical_cols = ['c_age', 'prod_nos',
'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth', 'cc_txn_amt_avg_mth',
'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
all_cols = numerical_cols + categorical_cols + dependent_col

IVDF, woeDF = iv_woe(df_l2[all_cols], 'c_seg_encoded', bins=10, show_woe=True)

sorted_IVDF = IVDF.sort_values(by='IV', ascending=False)
display(sorted_IVDF)

Use the code snippet above to get the IV value. The result is as below:

As mentioned, we are looking for an IV of 0.1 to 0.5. However, only two features fall within this range, so the results are not satisfactory. An IV of zero may indicate unusual data distributions or wrong handling of missing data. For the features with an IV of 0, there is a high chance this is due to class imbalance in the data, resulting in a lack of binning. We will keep this in mind during further analysis.

(Figures: example WoE tables when there is no binning vs. when binning enables different cutoffs.)

Undersampling the majority class might be one way to get a more accurate study of this, but it is a case-by-case decision.
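A minimal sketch of such undersampling, assuming c_seg_encoded is the binary target with 0 as the majority class:

# Downsample the majority class to the size of the minority class.
majority = df_l2[df_l2['c_seg_encoded'] == 0]
minority = df_l2[df_l2['c_seg_encoded'] == 1]
majority_down = majority.sample(n=len(minority), random_state=88)
df_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=88)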

Feature Importance from Models

We will get the feature importances from various algorithms and see if we can find any trend from the models. We used Decision Tree Classifier, Random Forest Classifier, XGBoost, and Logistic Regression.

# Base Settings
df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth', 'cc_txn_amt_avg_mth',
'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col
# Settings Train / Test Split. 
# We will not be doing Train / Validation / Test split as this is for feature importance only.
from sklearn.model_selection import train_test_split

# Splitting into Training and Holdout Test Sets
# Ensure stratification for now. We will adjust the ratio only later if required.
X_train, X_test, y_train, y_test = train_test_split(df_l2[independent_col], df_l2[dependent_col],\
stratify=df_l2[dependent_col], test_size=0.2, random_state=88)

# StandardScaler for numerical columns (when necessary), e.g. for Logistic Regression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), numerical_cols)],
    remainder='passthrough')  # Pass through categorical features unchanged

X_train_transformed = preprocessor.fit_transform(X_train)
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=independent_col)
# Use transform (not fit_transform) on the test set so it is scaled with the training statistics
X_test_transformed = preprocessor.transform(X_test)
X_test_transformed_df = pd.DataFrame(X_test_transformed, columns=independent_col)
y_train_transformed = y_train.values.ravel()
y_test_transformed = y_test.values.ravel()
# Function for getting feature importance sorted.
def feature_importance_sorted(classification_model_input, X_train, y_train, feature_importance_input=None):
    if classification_model_input is not None:
        some_model = classification_model_input
        some_model.fit(X_train, y_train)
        feature_importances = some_model.feature_importances_
    else:
        feature_importances = feature_importance_input
    feature_importances_sorted = sorted(zip(X_train.columns, feature_importances), key=lambda x: x[1], reverse=True)
    df_feature_importances = pd.DataFrame(feature_importances_sorted, columns=['Feature', 'Importance'])
    for feature_name, importance in feature_importances_sorted:
        print(f"Feature {feature_name}: {importance}")

    df_feature_importances['rank'] = range(1, len(df_feature_importances) + 1)
    return df_feature_importances

# Decision Tree Classifier Feature Importance
from sklearn.tree import DecisionTreeClassifier
dtc_fi = feature_importance_sorted(DecisionTreeClassifier(), X_train, y_train)

# Random Forest Classifier Feature Importance
from sklearn.ensemble import RandomForestClassifier
rfc_fi = feature_importance_sorted(RandomForestClassifier(), X_train, y_train.values.ravel())

# XGB Feature Importance
import xgboost as xgb
xgb_fi = feature_importance_sorted(xgb.XGBClassifier(), X_train, y_train)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=10000)
# Fit on the scaled features so the coefficients are comparable across features
lr.fit(X_train_transformed_df, y_train_transformed)
feature_importances = lr.coef_[0]  # Assuming binary classification
lr_fi = feature_importance_sorted(None, X_train_transformed_df, y_train_transformed, feature_importances)

From the above code, we can get the individual feature importances from the models. However, let's rank them for easier reference.

dtc_fi = dtc_fi.rename(columns={'Importance': 'imp_dtc', 'rank': 'rank_dtc'})
rfc_fi = rfc_fi.rename(columns={'Importance': 'imp_rfc', 'rank': 'rank_rfc'})
xgb_fi = xgb_fi.rename(columns={'Importance': 'imp_xgb', 'rank': 'rank_xgb'})
lr_fi = lr_fi.rename(columns={'Importance': 'imp_lr', 'rank': 'rank_lr'})

merged_df = dtc_fi.merge(rfc_fi, on='Feature', how='left')\
.merge(xgb_fi, on='Feature', how='left')\
.merge(lr_fi, on='Feature', how='left')

merged_df

After we rank them, it is much easier to interpret which features are important overall and which are not.

I colored the features which ranked 20th or worse for the different algorithms. We can see that there is major overlap between the tree-based / ensemble-based models, whereas logistic regression gives a very different result.
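A minimal sketch of that highlighting using the pandas Styler (the rank column names follow the renaming above, and the cutoff of 20 is the one mentioned in the text):

# Highlight cells where a feature ranks 20th or worse for that algorithm.
rank_cols = ['rank_dtc', 'rank_rfc', 'rank_xgb', 'rank_lr']

def highlight_low_rank(rank):
    return 'background-color: yellow' if rank >= 20 else ''

merged_df.style.applymap(highlight_low_rank, subset=rank_cols)  # renders in a Jupyter notebook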

Most commonly, I see the credit card features being pointed out as less significant.

Statistical Tests

We can run individual t-tests to check the difference in distribution of individual features between affluent and normal customers. I used a significance level of 0.05 and found the credit card features to be insignificant.

aff_df = df_l2[df_l2['c_seg_encoded'] == 1]
norm_df = df_l2[df_l2['c_seg_encoded'] == 0]
norm_df_2 = norm_df.sample(frac=0.2, random_state=88)
# Using a smaller sample of norm_df, since the original norm_df is 5x bigger.
# Don't anticipate much change but just trying.

from scipy.stats import ttest_ind

def individual_t_test(df_1, df_2, listoffeatures, alpha_val):
    '''
    For continuous-variable individual t-tests
    '''
    newlist = []
    for feature in listoffeatures:
        fea_1 = df_1[feature]
        fea_2 = df_2[feature]

        t_stat, p_val = ttest_ind(fea_1, fea_2, equal_var=False)
        t_stat1 = f'{t_stat:.3f}'
        p_val1 = f'{p_val:.3f}'

        if p_val < alpha_val:
            sig = 'Significant'
        else:
            sig = 'Insignificant'

        newdict = {'feature': feature, 't_stat': t_stat1,
                   'p_value': p_val1, 'significance': sig}
        newlist.append(newdict)

    df_result = pd.DataFrame(newlist)
    return df_result

individual_t_test(aff_df, norm_df, numerical_cols, 0.05)

individual_t_test(aff_df, norm_df_2, numerical_cols, 0.05)

Example: for cc_txn_nos_yr, the difference between the distribution of yearly credit card transaction counts for affluent customers and that for normal customers was found to be not statistically significant.

Data Analysis on Imputed Data

For this, we again use a fast and dirty method of producing boxplots to compare the features across the two categorical labels.

df_l2 = df_l1.copy()
numerical_cols = ['c_age', 'prod_nos',
'casa_td_nos', 'casa_bal_avg_mth', 'casa_bal_max_yr', 'casa_bal_min_yr',
'dr_cr_ratio_yr', 'td_bal_avg', 'td_bal_max', 'asset_tot_val',
'prop_pur_price', 'ut_avg', 'ut_max', 'funds_nos',
'cc_out_bal_avg_mth', 'cc_txn_amt_max_mth', 'cc_txn_amt_min_mth', 'cc_txn_amt_avg_mth',
'cc_txn_amt_yr', 'cc_txn_nos_yr', 'cc_lmt']
categorical_cols = ['c_edu_encoded', 'c_hse_encoded', 'c_pc', 'c_incm_typ', 'c_occ_encoded',
'loan_home_tag', 'loan_auto_tag']
dependent_col = ['c_seg_encoded']
independent_col = numerical_cols + categorical_cols
all_cols = numerical_cols + categorical_cols + dependent_col

for feature in numerical_cols:
    plt.figure(figsize=(8, 6))
    boxplot = sns.boxplot(x='c_seg_encoded', y=feature, data=df_l2)
    plt.title(f'Box Plot of {feature} by AFFLUENT / NORMAL')

    # Use a log scale if values are greater than 1000
    if df_l2[feature].max() > 1000:
        boxplot.set_yscale('log')

    plt.xlabel('Customer Type')
    plt.ylabel(feature)
    plt.show()

The casa_bal features showed similar differences. asset_tot_val was also significantly different for AFFLUENT customers, which is common sense.

Summary for EDA Level 2

We should use a confluence of all the points obtained to come up with insights or a deeper understanding of the features and data.

Based on the:
1) Information Value
2) Feature Importance from Multiple Algos
3) Statistical Test
4) Further Data Analysis on Imputed Values

I found that the credit card related features ranked as the least influential, whereas total assets and casa balance were consistently ranked as important features. Although this information is based on imputed data (which might contain some error), the finding that assets and casa balance determine whether a person is affluent seems rather logical.

— — — — — — — — — — — — — — — — — — — — — —

You may find the Jupyter Notebooks as per below:
