Logistic Regression for Beginners: Predict Your Customer Churn Patterns.

Learn how to implement logistic regression to make accurate and insightful customer churn predictions.

Asish Biswas
AnalyticSoul
5 min read · Jun 11, 2024

Welcome back, data enthusiasts! In our previous lessons, we meticulously prepared our dataset through exploration, one-hot encoding, and removal of redundant features. Now we’re ready to predict customer churn with logistic regression.

Logistic regression

Logistic regression is a supervised machine learning algorithm that predicts binary response variables. It models the probability of an event occurring. For example, predicting the likelihood of being infected with COVID-19 (yes or no) or estimating whether a customer will continue to use a service (yes or no).

Logistic regression models the logarithm of the odds (the log-odds, or logit), where the odds are the probability of the event occurring divided by the probability of the event not occurring. The logit function ranges from −∞ to +∞. For example, if the probability of churn is 75%, then the probability of no churn is 25%. The odds are therefore 75% divided by 25%, which equals 3, and the natural logarithm of 3 is around 1.10. A positive log-odds value means the event is more likely than not, so here churn is likely but not certain.
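
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the tutorial's notebook. Note that the logit uses the natural logarithm, for which log(3) is about 1.10, while the base-10 logarithm of 3 is about 0.48.

```python
import math

p_churn = 0.75                        # probability of churn from the example
odds = p_churn / (1 - p_churn)        # 0.75 / 0.25 = 3.0
log_odds = math.log(odds)             # natural log of the odds, about 1.0986
log10_odds = math.log10(odds)         # base-10 log, about 0.4771 (not used by the logit)

# Inverting the logit recovers the original probability
p_back = 1 / (1 + math.exp(-log_odds))

print("odds:", odds)
print("log-odds (natural log):", round(log_odds, 4))
print("log10 of odds:", round(log10_odds, 4))
print("probability recovered from log-odds:", round(p_back, 2))
```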

Here’s the formula for logistic regression.

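$$\log\left(\frac{p}{1-p}\right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_k x_k$$

(logistic regression)

Parameters:

  • p: Probability that the event occurs.
  • x_k: The k-th independent variable (feature).
  • θ_k: Weight or coefficient of the k-th feature; θ_0 is the intercept.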

Logistic regression transforms its output using the S-shaped logistic sigmoid function to return a probability value between 0 and 1, which can then be mapped to two classes.

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

(sigmoid function, where z is the weighted sum of the features)
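
As a rough illustration, the sigmoid can be written in a couple of lines with NumPy. The function name `sigmoid`, the sample scores, and the 0.5 cut-off for mapping probabilities to classes are assumptions made for this sketch, not code from the tutorial.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# A few example scores (log-odds), from strongly negative to strongly positive
scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(scores)

# Map probabilities to two classes with a 0.5 cut-off
labels = (probs >= 0.5).astype(int)

print(np.round(probs, 3))
print(labels)
```

A score of 0 corresponds to a probability of exactly 0.5, which is why the sign of the score decides the class under this cut-off.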

Logistic regression is particularly well-suited for our task (churn prediction) because it:

  • Outputs Probabilities: Logistic regression provides probabilities for each class, giving us insight into the confidence of predictions.
  • Interpretable: The coefficients of logistic regression are easy to interpret, making it clear how each feature affects the prediction.
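
Both properties can be seen on a small example. The sketch below uses synthetic data from scikit-learn's make_classification as a stand-in (it is not the Telco churn dataset), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical. It reads the class probabilities with predict_proba() and turns the coefficients into odds ratios by exponentiating them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the churn dataset used in this series)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Probabilities: one row per sample, columns are P(class 0) and P(class 1)
proba = demo_model.predict_proba(X_demo[:3])
print(np.round(proba, 3))

# Interpretation: exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = np.exp(demo_model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature {i}: odds ratio {ratio:.3f}")
```

An odds ratio above 1 means the feature pushes the prediction toward the positive class; below 1, toward the negative class.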

Implementing logistic regression

Once you have your dataset prepared, follow these 5 steps to build a logistic regression model:

  • Step 1: Split the data into train and test datasets.
  • Step 2: Initialize the logistic regression model.
  • Step 3: Train the model with the training data.
  • Step 4: Predict with the test data.
  • Step 5: Measure the model’s performance.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 0: Prepare the dataset
# load dataset
df_telco_encoded = pd.read_csv('data/telco_customer_churn_encoded.csv', header=0)

# Step 1: split testing and training data
y = df_telco_encoded['Churn']
X = df_telco_encoded.drop(columns='Churn')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# Step 2: init logistic regression model
logreg = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=1000)

# Step 3: training the model
logreg.fit(X_train, y_train)

# Step 4: predicting
y_pred = logreg.predict(X_test)

# print sample predictions
print(y_pred[5:15])

  • First, we split the encoded features and labels into training and testing datasets using scikit-learn’s train_test_split() function. With test_size=0.3, 30% of the rows are held out for testing.
  • Then, we instantiate the logistic regression model with the solver='lbfgs', penalty='l2', and max_iter=1000 parameters.
    solver: This determines the optimization algorithm used to find the optimal model parameters. We use the default lbfgs solver because it requires less memory and is suitable for small to medium-sized datasets.
    penalty: This specifies the type of regularization used in the model, which helps prevent overfitting. We use the default l2 penalty, which the lbfgs solver supports.
    max_iter: This defines the maximum number of iterations the solver is allowed to take to converge (max_iter=1000).
  • Now, we train the model with the training features and labels.
  • Finally, we predict churn with the testing features.
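
One way to check whether max_iter was large enough is to look at the fitted model's n_iter_ attribute, which reports how many iterations the solver actually used. The sketch below runs on synthetic stand-in data (an assumption, not the churn dataset), and the names `X_demo`, `y_demo`, and `demo_logreg` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=1)

demo_logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
demo_logreg.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver ran
iterations_used = int(demo_logreg.n_iter_[0])
print("iterations used:", iterations_used, "of max_iter =", demo_logreg.max_iter)
```

If the number of iterations used equals max_iter, the solver most likely stopped before converging, and scikit-learn typically emits a ConvergenceWarning in that case.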

Evaluation metrics

Once our model is trained with the training data, we want to evaluate its performance on the test data. A common way to evaluate a classification model is by using a confusion matrix. A confusion matrix helps us compare the model’s predictions against the actual test values.

There are four key facts in a confusion matrix.

  1. True positive: When the actual value is positive (1) and the model also predicts it as positive (1).
  2. False positive: When the actual value is negative (0), but the model falsely predicts it as positive (1).
  3. False negative: When the actual value is positive (1), but the model falsely predicts it as negative (0).
  4. True negative: When the actual value is negative (0) and the model also predicts it as negative (0).

Using these four key facts, we can compute the three main model performance indicators in percentage numbers.

  • Accuracy: It is the percentage of all predictions that are correct (both true positives and true negatives).
  • Precision: It measures the share of positive predictions that were correct. For customer churn, it is the share of customers predicted to churn who actually did churn.
  • Recall: It measures the correct positive predictions against the actual total number of positives. In our use case, it is the share of all churned customers that were correctly classified.
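
To make the definitions concrete, the sketch below computes all three indicators by hand from the four confusion-matrix counts and compares them with scikit-learn's helpers. The ten labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not taken from the churn dataset.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels for illustration (1 = churn, 0 = no churn)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct churn predictions / all churn predictions
recall = tp / (tp + fn)                      # correct churn predictions / all actual churners

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)

# The hand-computed values agree with scikit-learn's functions
assert abs(accuracy - accuracy_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
```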

In the customer churn use case, failing to identify churning customers (false negatives) would harm the business more than a few incorrect churn predictions (false positives). That’s why recall is the more vital metric here.
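
If recall matters most, one possible lever is the decision threshold: predict() uses a 0.5 cut-off for binary problems, and lowering it flags more customers as churners. The sketch below illustrates this on synthetic, imbalanced stand-in data (an assumption, not the churn dataset); the 0.3 threshold and all variable names are hypothetical choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 25% positives (assumption: NOT the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_proba = model.predict_proba(X_te)[:, 1]   # probability of the positive class

recalls = {}
for threshold in (0.5, 0.3):
    y_hat = (churn_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recalls[threshold]:.3f}")
```

A lower threshold can only keep recall the same or raise it, usually at the cost of precision, so the right cut-off depends on how expensive a missed churner is compared with a false alarm.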

Now let’s calculate all three KPIs.

# step 1: split testing and training data
y = df_telco_encoded['Churn']
X = df_telco_encoded.drop(columns='Churn')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# step 2: init logistic regression model
logreg = LogisticRegression(solver='lbfgs', penalty='l2', max_iter=1000)

# step 3: training the model
logreg.fit(X_train, y_train)

# step 4: predicting
y_pred = logreg.predict(X_test)

# step 5: model performance evaluation (accuracy, precision, recall)
from sklearn.metrics import precision_score, recall_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print('Accuracy:', round(accuracy, 4))
print('Precision:', round(precision, 4))
print('Recall:', round(recall, 4))
# Output:
Accuracy: 0.796
Precision: 0.6667
Recall: 0.539

The accuracy of the model is 79.6%, meaning approximately 80% of the predictions made by the model on the test data are correct.

Precision is 66.7%, indicating that when the model predicts a customer will churn, it is correct 66.7% of the time.

Recall is 53.9%, showing that the model correctly identifies 53.9% of all actual churners.

You will find more ways to evaluate and interpret the results in the accompanying Jupyter notebook. Practice along with your own data and use case. Happy coding :-)

