Exploring Pyspark.ml for Machine Learning: Crafting Optimal Feature Selection Strategies

Sze Zhong LIM · Published in Data And Beyond · Oct 28, 2023

What is Feature Selection?

In the realm of data science and machine learning, the process of selecting the right features for your model is akin to selecting the finest ingredients for a culinary masterpiece. Each choice affects the outcome, and in the world of machine learning, the art of feature selection can make or break your model. In this journey through PySpark, we will explore a variety of feature selection techniques and strategies that will empower you to create models that stand out!

Image from MDPI

It is important to note that feature importance is not a one-size-fits-all solution. Its effectiveness varies depending on the type of model being used and the inherent characteristics of the dataset. For instance, linear models may provide clear and interpretable feature importance values, assuming a linear relationship between features and target. On the other hand, non-linear models like neural networks may not yield straightforward feature importance interpretation due to their inherent complexity.

Understanding the strengths and limitations of feature importance methods is vital. While it provides valuable insights into the model’s inner workings and simplifies the model by focusing on significant features, it is critical to interpret the results cautiously. High feature importance does not imply causation, but rather a strong association or correlation with the target variable. Hence, interpreting feature importance should always be accompanied by domain knowledge and a nuanced understanding of the problem at hand.

In the industry, leveraging feature importance is a common practice, especially in ensemble models like Random Forests and Gradient Boosting Machines. However, it’s crucial to guard against misinterpretations. A common pitfall is assuming that high feature importance indicates a causal relationship, which is not always the case. Proper consideration of the context and a comprehensive understanding of the domain are essential to derive meaningful insights from feature importance.

This article will focus on the following points:

  1. IV (Information Value) and WOE (Weight of Evidence)
  2. Correlation Heatmap
  3. Feature Importance

We will focus on implementing these three methods in PySpark code, to fully utilize the advantage that PySpark has over Pandas on large datasets. In reality, there are other methods and combinations, and the right choice also depends on the type of data and the business objective. As such, this serves as an example of what can be done rather than an exhaustive list.

1. IV and WOE

In the realm of credit risk assessment and related domains, Information Value (IV) and Weight of Evidence (WOE) are fundamental tools. IV quantifies the predictive power of a feature, acting as a compass to guide feature selection. WOE, on the other hand, transforms features into a more predictive continuous space, providing a clear path for analysis.

Typically, these tools are invaluable in the finance industry where understanding and predicting credit risk is paramount. The ability to gauge the predictive strength of features and transform them into a more meaningful representation, aiding in better risk assessment and decision-making, is indispensable.

The advantages of IV and WOE are numerous. They are particularly effective in credit scoring due to their ability to handle missing values and outliers robustly. The clear transformation of features into predictive continuous representations allows for simpler and more effective use in predictive models. However, it’s important to remember that these tools assume a linear relationship between features and the log odds of the target, limiting their application to domains where this assumption holds.

In practice, implementing IV and WOE involves calculating these metrics for each feature. This usually includes binning continuous features and computing the WOE of each bin as the natural logarithm of the ratio between the proportion of events and the proportion of non-events in that bin. The IV of the feature is then obtained by summing, over all bins, the WOE multiplied by the difference between those two proportions.
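To make the arithmetic concrete, here is a minimal sketch for a single bin (the bin counts are made up purely for illustration):

import numpy as np

# Hypothetical bin: 80 of 400 total events fall in this bin, and 20 of 600 total non-events
pct_events = 80 / 400        # 0.20
pct_non_events = 20 / 600    # ~0.0333

woe = np.log(pct_events / pct_non_events)               # ~1.79
iv_contribution = (pct_events - pct_non_events) * woe   # ~0.30

print(woe, iv_contribution)
# The feature's total IV is the sum of iv_contribution over all of its bins.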

I found a clear and concise explanation of the calculation of WOE / IV values here.

Sample WOE / IV calculation. Credits: ListenData
IV interpretation. Credits: ListenData

If you're using Python / pandas, the code below can be used to get the WOE / IV. Credits to Ailurophile.

import numpy as np
import pandas as pd

def iv_woe(data, target, bins=10, show_woe=False):

    # Empty DataFrames
    newDF, woeDF = pd.DataFrame(), pd.DataFrame()

    # Extract column names
    cols = data.columns

    # Run WOE and IV on all the independent variables
    for ivars in cols[~cols.isin([target])]:
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars])) > 10):
            # Bin numeric variables with more than 10 distinct values into quantiles
            binned_x = pd.qcut(data[ivars], bins, duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})

        # Calculate the number of events in each group (bin)
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ['Cutoff', 'N', 'Events']

        # Calculate % of events in each group (the 0.5 floor avoids division by zero and log of zero)
        d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()

        # Calculate the non-events in each group
        d['Non-Events'] = d['N'] - d['Events']
        # Calculate % of non-events in each group
        d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'].sum()

        # Calculate WOE as the natural log of (% of events / % of non-events), then the IV contribution
        d['WoE'] = np.log(d['% of Events'] / d['% of Non-Events'])
        d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
        d.insert(loc=0, column='Variable', value=ivars)
        print("Information value of " + ivars + " is " + str(round(d['IV'].sum(), 6)))
        temp = pd.DataFrame({"Variable": [ivars], "IV": [d['IV'].sum()]}, columns=["Variable", "IV"])
        newDF = pd.concat([newDF, temp], axis=0)
        woeDF = pd.concat([woeDF, d], axis=0)

        # Show WOE table
        if show_woe:
            print(d)

    return newDF, woeDF
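A minimal usage sketch, assuming `df` is a pandas DataFrame that contains a binary target column named 'label' (these names are illustrative, not from the original code):

# Hypothetical call: 'df' and 'label' are placeholder names for your own data
iv_table, woe_table = iv_woe(df, target='label', bins=10, show_woe=False)

# iv_table holds one IV value per feature; woe_table holds the per-bin WOE / IV details
print(iv_table.sort_values('IV', ascending=False))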

If your pyspark.sql.DataFrame is too big to convert to a pandas DataFrame, you can use the PySpark code below instead.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.ml.feature import QuantileDiscretizer

def iv_woe_spark(data, target, bins=10, show_woe=False):
    # Empty DataFrames
    newDF = pd.DataFrame(columns=["Variable", "IV"])
    woeDF = pd.DataFrame(columns=["Variable", "Cutoff", "N", "Events", "% of Events", "Non-Events", "% of Non-Events", "WoE", "IV"])

    # Extract column names
    cols = [col for col in data.columns if col != target]

    for ivars in cols:
        # typeName() gives the Spark type as a string; extend the list if you also have long / float columns
        if data.schema[ivars].dataType.typeName() in ['integer', 'double'] and data.select(ivars).distinct().count() > 10:
            # Bin the data for continuous variables using QuantileDiscretizer
            discretizer = QuantileDiscretizer(numBuckets=bins, inputCol=ivars, outputCol=ivars + "_bin")
            datatemp = data.select(ivars, target)
            datatemp2 = discretizer.fit(datatemp).transform(datatemp)
            d0 = datatemp2.select(ivars + "_bin", target)
        else:
            d0 = data.select(ivars, target)
            d0 = d0.withColumnRenamed(ivars, ivars + "_bin")

        # Calculate the number of events in each group (bin)
        d = d0.groupBy(ivars + "_bin").agg(
            F.sum(target).alias('Events'),
            F.count(target).alias('N')
        )

        # Events
        total_events = d.select(F.sum('Events')).collect()[0][0]
        d = d.withColumn('% of Events', F.greatest(F.col('Events'), F.lit(0.5)) / total_events)

        # Non-Events
        d = d.withColumn('Non-Events', F.col('N') - F.col('Events'))
        total_non_events = d.select(F.sum('Non-Events')).collect()[0][0]
        d = d.withColumn('% of Non-Events', F.greatest(F.col('Non-Events'), F.lit(0.5)) / total_non_events)

        # WOE / IV
        d = d.withColumn('WoE', F.log(F.col('% of Events') / F.col('% of Non-Events')))
        d = d.withColumn('IV', (F.col('% of Events') - F.col('% of Non-Events')) * F.col('WoE'))
        d = d.withColumn('Variable', F.lit(ivars))

        # Compute IV for this variable and print it
        iv_value = d.agg(F.sum('IV')).collect()[0][0]
        print("Information value of " + ivars + " is " + str(round(iv_value, 6)))

        # Append WoE data to woeDF
        temp_woe = d.toPandas()
        woeDF = pd.concat([woeDF, temp_woe], axis=0, ignore_index=True)

        # Append IV to newDF
        temp_iv = pd.DataFrame([(ivars, iv_value)], columns=["Variable", "IV"])
        newDF = pd.concat([newDF, temp_iv], axis=0, ignore_index=True)

        # Show WoE table
        if show_woe:
            print(temp_woe)

    return newDF, woeDF

# newDF, woeDF = iv_woe_spark(data, target, bins=10, show_woe=True)

In the financial sector, particularly in credit scoring, IV and WOE are the go-to techniques for understanding feature predictiveness. However, they are sometimes misused or inappropriately applied in domains where the assumptions of linearity do not hold. Understanding when and where to apply IV and WOE is crucial to ensure their effective use and to draw meaningful insights for decision-making. Their significance cannot be overstated, especially in industries where risk assessment is a critical component of the business landscape.
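Once the IV table is available, a common credit-scoring convention (the same interpretation table referenced from ListenData above) is to keep features in the useful IV range and to treat extremely high values with suspicion. A minimal sketch of such a filter, assuming `newDF` is the IV table returned by `iv_woe_spark` and using illustrative cutoffs:

# Rule of thumb: IV below ~0.02 is not predictive, above ~0.5 is suspiciously strong.
# The exact cutoffs are a judgment call and should be adapted to your data and domain.
iv_selected = newDF[(newDF["IV"] >= 0.02) & (newDF["IV"] <= 0.5)]
selected_features = iv_selected.sort_values("IV", ascending=False)["Variable"].tolist()
print(selected_features)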

2. Correlation Heatmap

Sample of a Correlation Heatmap from StackOverflow

Correlation heatmaps find extensive application in exploratory data analysis (EDA) and preprocessing. During the initial stages of a project, understanding how features relate to each other can guide data preprocessing steps, such as identifying and handling multicollinearity, which is crucial for model stability and interpretability. The heatmap, being a visual tool, is particularly effective in conveying complex relationships to stakeholders, making it a valuable asset in the data scientist’s toolkit.

However, it’s essential to note that correlation, as shown by the heatmap, is limited to capturing linear relationships between features. Non-linear relationships may not be accurately represented, which is a crucial drawback. Also, the correlation coefficient may indicate a relationship, but it doesn’t imply causation.
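To see the linearity limitation concretely, here is a small illustration on purely synthetic data, where a perfect but non-linear relationship produces a Pearson correlation near zero:

import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)
df_demo = pd.DataFrame({"x": x, "y": x ** 2})   # y is fully determined by x, but not linearly

print(df_demo.corr())   # the Pearson correlation between x and y is ~0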

In pandas, getting the correlation is as simple as below.

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming `df` is a pandas DataFrame
correlation_matrix = df.corr()

# Use visualization libraries to plot a heatmap
plt.figure(figsize=(50, 50))
sns.heatmap(correlation_matrix,
            xticklabels=correlation_matrix.columns.values,
            yticklabels=correlation_matrix.columns.values,
            cmap='coolwarm',
            annot=True)
plt.savefig('correlation_matrix.png')

It is not as simple in PySpark. To implement a correlation heatmap in PySpark, we first assemble the features into a single vector column and calculate the correlation matrix over that column. The matrix is then visualized using suitable visualization libraries. The code below can be used.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

def correlation_matrix(df, listofcolumn):

    # Convert to a vector column first
    df1 = df.select(listofcolumn)
    vector_col = "corr_vector"
    assembler = VectorAssembler(inputCols=listofcolumn,
                                outputCol=vector_col)
    df_vector = assembler.transform(df1).select(vector_col)

    # Get the correlation matrix (collect once to avoid recomputing the Spark job)
    rawmatrixtemp = Correlation.corr(df_vector, vector_col).collect()[0]

    ## Correlation matrix - raw values
    rawmatrix = rawmatrixtemp["pearson({})".format(vector_col)].values

    ## Correlation matrix - pandas DataFrame for visualization
    corr_matrix = rawmatrixtemp[0].toArray().tolist()
    corr_matrix_df = pd.DataFrame(data=corr_matrix,
                                  columns=listofcolumn,
                                  index=listofcolumn)

    return rawmatrix, corr_matrix_df


# Assume train_df is the pyspark.sql.DataFrame training data.
# Assume collist is the list of numeric column names within train_df.
raw_matrix, viz_matrix = correlation_matrix(train_df, collist)
plt.figure(figsize=(50, 50))
sns.heatmap(viz_matrix,
            xticklabels=viz_matrix.columns.values,
            yticklabels=viz_matrix.columns.values,
            cmap='coolwarm',
            annot=True)
plt.savefig('correlation_matrix.png')

In various domains, especially where feature relationships are intricate and understanding them is crucial, correlation heatmaps prove to be invaluable. However, it’s essential to interpret the heatmap within its limitations, considering the linearity assumption and potential non-linear relationships. Utilizing this technique judiciously can significantly enhance feature selection, model interpretability, and the overall success of the machine learning project.
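Beyond eyeballing the heatmap, redundancy can also be flagged programmatically. A minimal sketch, assuming `viz_matrix` is the pandas correlation matrix returned by `correlation_matrix` above and using an illustrative 0.9 threshold:

import numpy as np

threshold = 0.9   # illustrative cutoff; tune it to your own tolerance for multicollinearity

# Keep only the upper triangle so each feature pair is considered once
upper = viz_matrix.where(np.triu(np.ones(viz_matrix.shape, dtype=bool), k=1))

# Columns highly correlated with an earlier column are candidates for removal
redundant = [col for col in upper.columns if (upper[col].abs() > threshold).any()]
print("Candidate redundant features:", redundant)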

3. Feature Importance

Feature importance is a fundamental concept in machine learning that helps us grasp the significance of each feature (or variable) in influencing the model’s predictions. In essence, it quantifies the contribution of each feature towards the model’s performance, aiding in informed decision-making during the feature selection process.

When constructing a predictive model, we typically consider multiple features that could influence the outcome. However, not all features are equally important. Some features might have a more pronounced impact on the predictions, while others may be less influential or even irrelevant. Determining the relative importance of these features allows us to focus our efforts on the most critical ones, potentially improving model accuracy, interpretability, and efficiency.

Feature importance is often computed after a model is trained. Various algorithms, such as decision trees, random forests, and gradient boosting machines, provide built-in mechanisms to calculate feature importance. These mechanisms evaluate how much each feature contributes to minimizing prediction errors. The importance of a feature is usually measured based on how often the feature is used for splitting nodes in the trees or how much it reduces the impurity in the splits.

When using pandas, getting the feature importances using scikit learn is as simple as the code below:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Prepare features (X) and target variable (y)
X = df.drop('label', axis=1)
y = df['label']

# Train a RandomForestClassifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X, y)

# Get feature importances
feature_importances = rf_classifier.feature_importances_

# Print feature importances
print("Feature Importances:")
for feature_idx in range(len(feature_importances)):
    print(f"Feature {X.columns[feature_idx]}: {feature_importances[feature_idx]}")

# Function that will return sorted feature importances in a DataFrame
def feature_importance_sorted(classification_model_input, X_train, y_train):
    some_model = classification_model_input
    some_model.fit(X_train, y_train)
    feature_importances = some_model.feature_importances_
    feature_importances_sorted = sorted(zip(X_train.columns, feature_importances), key=lambda x: x[1], reverse=True)
    df_feature_importances = pd.DataFrame(feature_importances_sorted, columns=['Feature', 'Importance'])
    for feature_name, importance in feature_importances_sorted:
        print(f"Feature {feature_name}: {importance}")
    return df_feature_importances

## Usage
## from sklearn.tree import DecisionTreeClassifier
## feature_importance_sorted(DecisionTreeClassifier(), X_train, y_train)

You may also find some sample code from scikit learn here.

PySpark.ml, a versatile machine learning library, offers native support for calculating feature importance. This built-in feature is a valuable asset for data scientists, enabling them to assess the contribution of each feature to the model's predictions. Tree-based models in PySpark.ml (decision trees, random forests, and gradient-boosted trees) expose a consistent featureImportances attribute, while linear models expose coefficients instead, so the same general workflow applies across model types with only minor changes.

When should one utilize PySpark.ml’s feature importance method? Whenever there’s a need to understand how each feature contributes to the model’s predictions, this method is a go-to choice. It is particularly beneficial when dealing with large-scale datasets and complex models, where gaining insights into feature importance is vital for effective decision-making.

To obtain feature importance using PySpark.ml, you typically train a model and then extract the feature importance information from the trained model. The specific method for accessing feature importance might vary based on the machine learning algorithm used.

# Import necessary PySpark modules
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assume traindf is the pyspark.sql.DataFrame training data.
# Assume collist is the list of column names within traindf that we want to train on.
# Assume 'label' is the actual label column name that we are trying to learn.
def trainRFmodel(traindf, collist):
    # Vector-assemble all features into one vector column
    assembler = VectorAssembler(inputCols=collist,
                                outputCol='model_feature')
    newdf = assembler.transform(traindf)

    # Initiate a Random Forest classifier
    rf = RandomForestClassifier(featuresCol='model_feature',
                                labelCol='label')

    rf_model = rf.fit(newdf)

    return rf_model

model = trainRFmodel(traindf, collist)

# featureImportances is a SparseVector; toArray() makes it easy to pair with the column names
dictfi = {}
for feature, importance in zip(collist, model.featureImportances.toArray()):
    dictfi[feature] = importance
    print(f"{feature}: {importance}")

Synergizing Feature Selection Strategies: IV/WOE, Correlation Heatmap, and Feature Importance

In the ever-evolving landscape of data science, combining feature selection methodologies can unlock a wealth of insights and optimize predictive models. When we synergize Information Value (IV) and Weight of Evidence (WOE), correlation heatmaps, and feature importance, we create a powerful amalgamation that refines feature selection for machine learning models.

Harnessing IV/WOE for Initial Feature Assessment

Information Value (IV) and Weight of Evidence (WOE) are stalwart tools, particularly in credit risk assessment and related domains. IV quantifies the predictive power of a feature, providing a foundational ranking. Features with high IV are considered more predictive for modeling. Subsequently, applying the WOE transformation on these features helps express their predictive power in a continuous and more intuitive manner.

By starting with IV and WOE, we get an initial assessment of the features’ predictive capabilities. This serves as a solid foundation for subsequent feature selection steps.

Identifying Redundancy with Correlation Heatmap

The correlation heatmap is a visual representation of the relationships between features. It uses color gradients to indicate the strength and direction of correlations. Highly correlated features might indicate redundancy, potentially hampering model performance or interpretability.

Integrating the correlation heatmap after the IV/WOE assessment helps us identify relationships among the initially chosen features. Detecting and removing redundant features based on correlations ensures that the features selected for model training are diverse and complementary.

Refinement with Feature Importance

Feature importance, often calculated using techniques like decision trees, random forests, or gradient boosting machines, quantifies the contribution of each feature to the model’s predictions. It helps us understand which features are the most influential.

By leveraging feature importance after IV/WOE and the correlation heatmap, we refine the feature selection further. The focus shifts to the features identified as important by both their predictive power (IV/WOE) and their impact on the model (feature importance). This two-step validation enhances the robustness of feature selection.

The Synergy: A Comprehensive Approach

The synergy lies in a comprehensive approach where we start with IV/WOE to rank and select features based on their predictive power. The correlation heatmap is then employed to identify and potentially remove redundant features. Finally, feature importance is used to refine the selection, ensuring that the chosen features are not only predictive but also impactful for the model.

By synergizing these methods, we create a cohesive and robust feature selection strategy. It enables us to extract the most informative and influential features, enhancing the model’s predictive performance and interpretability. The result is a finely tuned set of features that optimally represents the underlying patterns in the data, setting the stage for successful machine learning endeavors.
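Putting the three steps together, a high-level sketch of the combined workflow might look like the following. It reuses the functions defined earlier in this article, and the thresholds, the 'label' target name, and the top-20 cut are all illustrative choices rather than fixed rules:

import numpy as np

# Step 1: rank features by predictive power with IV / WOE (keep the conventionally useful range)
iv_table, _ = iv_woe_spark(train_df, target='label', bins=10)
step1 = iv_table[(iv_table['IV'] >= 0.02) & (iv_table['IV'] <= 0.5)]['Variable'].tolist()

# Step 2: drop redundant features using the correlation matrix (illustrative 0.9 threshold)
_, corr_df = correlation_matrix(train_df, step1)
upper = corr_df.where(np.triu(np.ones(corr_df.shape, dtype=bool), k=1))
step2 = [c for c in step1 if not (upper[c].abs() > 0.9).any()]

# Step 3: refine with model-based feature importance and keep the strongest contributors
rf_model = trainRFmodel(train_df, step2)
importances = dict(zip(step2, rf_model.featureImportances.toArray()))
final_features = sorted(importances, key=importances.get, reverse=True)[:20]
print(final_features)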
