A Comparative Study of Supervised Learning Techniques for Churn Prediction: Logistic Regression, Decision Tree, and Random Forest

Kartavi Naik
Jun 21, 2024


Project Overview:

The objective of this project is to analyze customer churn within a company in order to derive insights that can aid in improving customer retention strategies. Customer churn, which refers to the rate at which customers stop doing business with a company, is a critical metric for businesses aiming to maintain a stable customer base and sustainable growth. By understanding the factors that contribute to churn, we can develop proactive measures to retain customers and enhance overall customer satisfaction.

Data source

This case study is based on a Kaggle dataset, which can be downloaded from https://www.kaggle.com/code/abhashrai/customer-retention-analysis-prediction

Data Description

The dataset used for this analysis consists of customer information collected over a period of time. It comprises 64,374 entries and 12 columns capturing various aspects of customer behavior and interaction with the services provided. Below is a brief overview of the variables included in the dataset:

  1. CustomerID: Unique identifier for each customer.
  2. Age: Age of the customer.
  3. Gender: Gender of the customer (e.g., Male, Female).
  4. Tenure: Number of months the customer has been with the company.
  5. Usage Frequency: Frequency of service or product usage by the customer.
  6. Support Calls: Number of customer support calls made by the customer.
  7. Payment Delay: Average delay in payment or transaction by the customer.
  8. Subscription Type: Type of subscription or service plan (e.g., Basic, Premium).
  9. Contract Length: Length of the contract or agreement (e.g., Month-to-month, One year).
  10. Total Spend: Total spending or transaction amount by the customer.
  11. Last Interaction: Number of days since the customer’s last interaction or engagement with the company.
  12. Churn: Binary variable indicating whether the customer churned (1) or not (0).

This dataset provides a comprehensive view of customer attributes and behaviors that can be leveraged to predict and mitigate churn. The analysis aims to uncover patterns and insights that can inform strategic decisions aimed at reducing churn rates and increasing customer loyalty.

Importing and Loading Data for Customer Churn Analysis

We will import two essential Python libraries, Pandas and Seaborn, to analyze a dataset on customer churn. Pandas provides powerful tools for data manipulation and analysis, while Seaborn enhances data visualization with insightful plots.

We begin by importing Pandas as pd and Seaborn as sns. Pandas will help us load and manage our dataset in a structured manner. Meanwhile, Seaborn will enable us to create clear and informative visualizations that reveal patterns and trends in the data.
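
A minimal sketch of those imports and the initial load (the file name matches the Kaggle file referenced later in this post; the local path is an assumption about where the download is saved):

```python
import pandas as pd
import seaborn as sns

# Load the churn dataset; adjust the path to wherever the Kaggle
# download is saved locally
df = pd.read_csv("customer_churn_dataset-testing-master.csv")

# Quick sanity checks: expect 64,374 rows and 12 columns
print(df.shape)
print(df.head())
```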

By leveraging these libraries, we aim to gain insights into customer behavior and factors influencing churn. This exploration is crucial for businesses looking to enhance customer retention strategies and improve overall customer satisfaction.

EXPLORATORY DATA ANALYSIS (EDA)

EDA is crucial because it helps you understand data patterns, detect anomalies, validate assumptions, guide preprocessing, formulate hypotheses, and communicate insights effectively. It ensures data quality and supports informed decision-making in analysis and modeling.

Churn:

The dataset contains roughly 34,000 customers who did not churn and 30,000 who did.
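
These counts can be reproduced with a quick value_counts and countplot, assuming the DataFrame df loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Class balance of the target: counts of churned (1) vs. retained (0)
print(df["Churn"].value_counts())

sns.countplot(data=df, x="Churn")
plt.title("Churn distribution")
plt.show()
```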

Age:


The average age of customers is approximately 42 years, with a standard deviation of about 14 years. Ages range from 18 to 65.

[Figure: Distribution of customer Age]

The age distribution shows that customers are spread fairly evenly across the 18 to 65 range.
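
A short sketch of the summary statistics and histogram behind these observations (the bin count is an arbitrary choice):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Summary statistics: mean ~42, std ~14, min 18, max 65
print(df["Age"].describe())

sns.histplot(data=df, x="Age", bins=30)
plt.title("Distribution of customer Age")
plt.show()
```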

Gender:

There are 34,353 female customers and 30,021 male customers. The dataset is slightly skewed towards female customers.

The churn rate is higher for female customers than for male customers.
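
One way to verify these figures, assuming the column names from the data description above:

```python
# Customer counts by gender
print(df["Gender"].value_counts())

# Churn rate by gender: the mean of a 0/1 flag is the churn rate
print(df.groupby("Gender")["Churn"].mean())
```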

Tenure:

The tenure distribution shows that a substantial number of customers have a tenure of 25 to 45 months.

Support Calls:

The distribution of support calls is irregular, with several peaks indicating higher frequencies of specific call counts.

The boxplot shows that customers who churn tend to have a higher range of support calls (between 5 and 9 calls) compared to non-churned customers (between 2 and 7 calls).

This indicates that higher support call frequency might be associated with a higher likelihood of churn.
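
A sketch of the boxplot comparison described above; the same pattern applies to Payment Delay in the next section:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Compare support-call counts for churned vs. retained customers
sns.boxplot(data=df, x="Churn", y="Support Calls")
plt.title("Support Calls by Churn")
plt.show()
```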

Payment Delay:

The distribution of payment delays is fairly flat up to 15 days, with slight fluctuations between 15 and 30 days.

The boxplot shows that customers who churn have a higher range of payment delays (between 18 and 27 days) compared to non-churned customers (between 6 and 19 days).

This suggests that longer payment delays are associated with a higher likelihood of churn.

Subscription Type:

The dataset is fairly balanced among the three subscription types.

Correlation Heatmap:

The heatmap shows a surprisingly high correlation between CustomerID and Churn. Since CustomerID is an arbitrary identifier, this reflects how the records happen to be ordered rather than any real relationship, and it is a reason to exclude CustomerID from the model features rather than evidence of its importance.

Support Calls and Payment Delay have a notable positive correlation with Churn, indicating that customers who churn tend to have higher support call counts and payment delays.

Usage Frequency shows a negative correlation with Churn, suggesting that higher usage frequency is associated with lower churn rates.
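
A sketch of how such a heatmap can be produced. Restricting to numeric columns is a deliberate choice here, since Pearson correlation is not defined for the raw categorical columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlations over the numeric columns only
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```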

FEATURE ENGINEERING:

In the process of feature engineering for our customer churn analysis, one crucial step involves encoding categorical variables. This is achieved using Pandas’ get_dummies function, which transforms categorical variables into numerical representations.

Categorical variables, such as gender, subscription type, and contract length, are non-numeric and cannot be directly used in machine learning models. Encoding them into numerical format allows us to include these important features in our predictive models effectively.

  • Binary Encoding: Pandas creates binary dummy variables for each category, dropping the first category to avoid multicollinearity issues. For instance, if Gender has the categories 'Female' and 'Male', encoding with drop_first leaves a single column such as 'Gender_Male' (1 or 0), with the dropped category serving as the baseline.
  • Maintaining Information: This transformation preserves the information from categorical variables while making it usable for algorithms that expect numerical input.
  • Improving Model Performance: Machine learning models typically perform better when trained on numerical data. By encoding categorical variables, we ensure that our models can utilize these features to make accurate predictions or classifications.

By incorporating encoded categorical variables into our feature set, we enhance the predictive power of our models, enabling them to better capture underlying patterns and dependencies in the data related to customer churn.
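
A minimal sketch of this encoding step. The column list is inferred from the data description above, and which dummy level gets dropped depends on category order:

```python
import pandas as pd

# One-hot encode the categorical columns; drop_first removes one level
# per variable to avoid multicollinearity (e.g. 'Female' is dropped if
# it sorts before 'Male')
df_encoded = pd.get_dummies(
    df,
    columns=["Gender", "Subscription Type", "Contract Length"],
    drop_first=True,
)
print(df_encoded.columns.tolist())
```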

DATA MODELLING: LOGISTIC REGRESSION, DECISION TREE AND RANDOM FOREST

After completing feature engineering by encoding categorical variables, we proceed to data modeling using three distinct algorithms: Logistic Regression, Decision Tree, and Random Forest. Each of these models offers unique strengths in predicting customer churn based on the dataset features we have prepared.

Set the target variable and features

Before beginning the modeling process, it’s crucial to define the target variable (y) and the features (X). In our case, the target variable is Churn, which indicates whether a customer has churned or not.
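
One plausible setup, with two assumptions worth flagging: CustomerID is dropped (it is an identifier, not a predictor, as noted in the heatmap discussion), and an 80/20 train-test split is used, which matches the roughly 12,875 test rows in the confusion matrices reported below:

```python
from sklearn.model_selection import train_test_split

# Target is the binary Churn flag; everything else is a feature.
# CustomerID is dropped because it is an identifier, not a predictor.
X = df_encoded.drop(columns=["Churn", "CustomerID"])
y = df_encoded["Churn"]

# 80/20 split; the seed is fixed only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```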

Standardize the features

To ensure fair comparison and optimal performance across different models, it’s beneficial to standardize numerical features. This process scales features to have a mean of 0 and a standard deviation of 1.
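
A minimal sketch using scikit-learn’s StandardScaler. Fitting on the training set only is deliberate, so no test-set information leaks into the scaling:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data, then apply the same
# transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```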

DATA MODELLING

We start with importing all the important libraries.
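
A plausible set of imports for the three models and the evaluation utilities used below:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
```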

LOGISTIC REGRESSION

Description: Logistic Regression is a statistical model used for binary classification tasks, where the outcome variable is categorical with two possible values (usually 0 and 1).

Working Principle: It models the probability of the positive class (usually labeled 1) using a logistic function, which maps any real-valued input to a probability between 0 and 1.

Key Features:

  • Provides interpretable coefficients that represent the impact of each feature on the probability of the outcome.
  • Works well when the relationship between the features and the target variable is linear or can be transformed to linear.
  • Often used as a baseline model for classification tasks due to its simplicity and interpretability.
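
A minimal sketch of training and evaluating the model, reusing the scaled splits from above. The variable names and max_iter setting are assumptions, since the post does not show its exact code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Train on the standardized features, evaluate on the held-out set
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

y_pred = log_reg.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```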

The Logistic Regression model shows strong performance with an overall accuracy of 87%. It achieves balanced precision and recall scores of 87% for class 0 and 86% for class 1, indicating it is effective at predicting both classes. The F1-score, which balances precision and recall, is also high at 87%, demonstrating robust performance across the dataset. In summary, the model shows reliable predictive ability based on the provided metrics.

The confusion matrix reveals that the model correctly predicted 5936 instances of class 0 (true negatives) and 5203 instances of class 1 (true positives). It incorrectly predicted 857 instances of class 0 as class 1 (false positives) and 879 instances of class 1 as class 0 (false negatives).

The model shows a relatively higher number of True Positives and True Negatives compared to False Positives and False Negatives, indicating overall good performance in terms of correctly predicting both classes.

DECISION TREE

Description: A Decision Tree is a non-parametric supervised learning technique used for classification and regression tasks. It recursively splits the dataset into subsets based on the most significant feature at each node.

Working Principle: It partitions the data based on feature values to maximize information gain (or minimize impurity) at each split, resulting in a tree-like structure where each internal node represents a decision based on a feature, and each leaf node represents a class label or regression value.

Key Features:

  • Easily interpretable and visualizable, making it useful for understanding feature importance and the decision-making process.
  • Can handle both numerical and categorical data.
  • Prone to overfitting on noisy datasets, which can be mitigated using techniques like pruning.
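
A corresponding sketch for the Decision Tree, under the same assumptions as before:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Tree models do not require scaling, but the scaled features are
# reused here to keep the pipeline consistent across all three models
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_scaled, y_train)

y_pred_tree = tree.predict(X_test_scaled)
print(classification_report(y_test, y_pred_tree))
print(confusion_matrix(y_test, y_pred_tree))
```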

The Decision Tree model achieves near-perfect precision, recall, and F1-scores for both classes. The large diagonal values of the confusion matrix (6773 and 6062) show that very few instances are misclassified as either false positives or false negatives. Scores this close to perfect on held-out data are unusual and worth scrutinizing for data leakage, for example if an identifier-like feature such as CustomerID (which correlated strongly with Churn) was left in the feature set.

RANDOM FOREST

Description: Random Forest is an ensemble learning method that constructs a multitude of decision trees during training and outputs the mode (classification) or average prediction (regression) of the individual trees.

Working Principle: It builds multiple decision trees using random subsets of the features and samples (bootstrap samples) from the training dataset. Each tree is trained independently to predict the target variable, and the final prediction is made by averaging (for regression) or voting (for classification) the predictions of all trees.

Key Features:

  • Reduces overfitting compared to a single decision tree by averaging predictions across multiple trees.
  • Handles high-dimensional datasets well and captures complex relationships between features and target variables.
  • Generally provides higher accuracy compared to individual decision trees, making it a popular choice for various machine learning tasks.
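
And the equivalent sketch for the Random Forest (n_estimators=100 is scikit-learn’s default, made explicit here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# An ensemble of 100 trees, each trained on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train_scaled, y_train)

y_pred_forest = forest.predict(X_test_scaled)
print(classification_report(y_test, y_pred_forest))
print(confusion_matrix(y_test, y_pred_forest))
```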

The Random Forest model achieves exceptional performance, with precision and recall scores for both classes rounding to 100%. The confusion matrix confirms this, showing only a handful of misclassifications (1 false positive and 42 false negatives) and highlighting the model’s robustness in correctly identifying instances from both classes.

CONCLUSION

· Logistic Regression is straightforward and interpretable, suitable for linear relationships.

· Decision Tree is intuitive, interpretable, and prone to overfitting but can capture complex interactions.

· Random Forest improves upon Decision Trees by reducing overfitting and boosting accuracy through ensemble learning, making it versatile for a wide range of applications.

All three techniques — Logistic Regression, Decision Tree, and Random Forest — perform well on this dataset. While Logistic Regression shows strong performance with an accuracy of 87%, both Decision Tree and Random Forest achieve near-perfect accuracy, indicating a superior ability to classify instances correctly. The choice between these techniques depends on the specific requirements of the problem, considering factors such as interpretability (Logistic Regression), robustness (Decision Tree), and ensemble learning advantages (Random Forest).

SUMMARY

Conducting this customer churn analysis case study has been an insightful journey into understanding the factors influencing customer retention. By leveraging Python libraries such as Pandas, Seaborn, and scikit-learn, we’ve explored essential steps from data exploration to modeling.

Throughout this study, we’ve accomplished:

  • Data Exploration: We began by loading and examining our dataset (customer_churn_dataset-testing-master.csv). Exploratory Data Analysis (EDA) helped us grasp the dataset's structure, identify key features, and prepare them for modeling.
  • Feature Engineering: To enhance model performance, we engineered features, including encoding categorical variables using pd.get_dummies(). This step ensured our data was suitable for training predictive models.
  • Modeling: We implemented three powerful machine learning algorithms — Logistic Regression, Decision Tree, and Random Forest. Each model was trained on our preprocessed data to predict customer churn based on various input features.
  • Evaluation and Insights: Metrics such as accuracy, precision, recall, and F1-score were used to evaluate model performance. These insights provided a quantitative understanding of how well our models predicted churn and highlighted areas for improvement.

While there is potential for further enhancing the model’s performance through feature selection, hyperparameter tuning, and exploring more advanced algorithms, time constraints led us to conclude our analysis here.

For future endeavors, consider exploring additional techniques such as:

  • Ensemble Methods: Combining predictions from multiple models to improve overall performance.
  • Feature Importance: Identifying which features have the most significant impact on churn prediction.
  • Time-Series Analysis: Incorporating temporal aspects to capture evolving customer behaviors.

I hope this blog serves as a valuable resource for tackling similar challenges in customer analytics. For more insights, feel free to follow me on Medium or connect with me on LinkedIn.

Thank you for accompanying me on this analytical journey, and I look forward to sharing more insights with you in the future!
