Simplified Logistic Regression: Classification With Categorical Variables in Python

Rowan Curry
7 min read · Oct 25, 2021


Logistic Regression is an algorithm that performs binary classification by modeling a dependent variable (Y) in terms of one or more independent variables (X). In other words, it’s a generalized linear model that predicts the probability that an event will occur. Specifically, Logistic Regression fits a linear model to the ‘logit’ (log-odds) function:
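logit(p) = ln(p / (1 − p)) = β₀ + β₁X₁ + … + βₖXₖ

Solving for p gives the probability of the event:

p = 1 / (1 + e^−(β₀ + β₁X₁ + … + βₖXₖ))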

This function returns output between 0 and 1 for all possible values of X. A visual representation of the Logistic (Sigmoid) Function follows:

The Sigmoid Function

In order to implement Logistic Regression, the “LogisticRegression” class from scikit-learn will be used.

Evaluating Model Performance

After building the Logistic Regression model, we’ll evaluate the model’s performance on a test dataset using a Confusion Matrix. This is a helpful performance measurement for when model output can be two or more classes.

Confusion Matrix

TP stands for True Positive. This occurs when you predict a positive result and your prediction is correct. TN stands for True Negative, which occurs when you predict a negative result and your prediction is correct.

FN stands for False Negative. This occurs when you predict a negative result and the actual value is positive. FP stands for False Positive, which occurs when you predict a positive result but the actual value is negative.

This Confusion Matrix will be used to evaluate our Logistic Regression model by explaining the model in terms of Recall (also called Sensitivity), Precision, f1-score, and Support.

Recall describes how many of the actual positive samples we predicted correctly. It’s calculated using the equation TP/(TP + FN).

Precision describes how many samples were actually positive out of all of the samples we predicted as positive. It can be found using the formula TP/(TP + FP).

The f1-score describes the balance between Recall and Precision: it’s their harmonic mean, calculated using the equation 2*(Precision*Recall)/(Precision + Recall). Because it stays low unless both Precision and Recall are high, this metric will flag potential imbalances in model performance.

Support is the number of samples of the true response that lie in each class. Support won’t change between models; rather than measuring performance, it describes the composition of the test set.

In order to evaluate the model’s performance using a Confusion Matrix, the “classification_report” function from sklearn will be used.
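As a quick illustration with made-up labels (a toy example, not the bank data), scikit-learn computes all of these metrics at once:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```

Here TP = 3, FP = 1, and FN = 1, so Precision and Recall for the positive class both work out to 3/4 = 0.75.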

Dataset & Variable Descriptions

The Portuguese Bank Marketing Data Set from the UCI Machine Learning Repository will be used to build the Logistic Regression Model. This data describes the results of direct marketing campaigns for a Portuguese banking institution. For this model, the bank-full.csv file will be used.

There are sixteen input variables: examples include age, job, marital status, month, etc. There is one output variable which describes whether or not the customer agreed to subscribe to a term deposit. This binary output variable has two possible values: “yes” and “no”. More information about each variable can be found through the link above.

Preliminary Data Visualizations & Data Cleaning

The project begins by importing all libraries needed for data preprocessing, data visualization, and for building and evaluating the model:
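The original import cell isn’t shown here, but a minimal set that covers the steps below would look something like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
```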

We’ll start by opening the file and checking our data frame.
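A sketch of this step. Note that the UCI file is semicolon-delimited, so the sep argument matters; the path assumes the file sits in the working directory:

```python
df = pd.read_csv('bank-full.csv', sep=';')
df.head()
```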

Next we’ll check our dataset for null values using seaborn:
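One common approach (a sketch, since the original code isn’t shown) is a heatmap of df.isnull(), where missing values would appear as contrasting cells:

```python
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()

# A quick numeric confirmation
print(df.isnull().sum())
```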

Luckily for us, there are no null values. This makes the data cleaning process very simple.

Next, we’ll visually explore the data before building our model. It’s always a good idea to play around with data before using it to build a model. This can provide important information about the data that will become essential later when you begin attempting to improve the classification rate of your model.

We’ll start with a visualization of our output variable. Remember, this variable describes how many people agreed to subscribe to a term deposit.
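A countplot of the y column makes the class balance obvious (a sketch of the likely call):

```python
sns.countplot(x='y', data=df)
plt.title('Term deposit subscriptions')
plt.show()
```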

We can see that the number of people who refused far outweighs the number of people who said yes. This is important to remember as we start thinking about how to build our model.

We’ll do a couple more visualizations:
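Sketches of the three plots discussed below; the exact plot types are assumptions based on the descriptions:

```python
# Outcome split by housing status
sns.countplot(x='y', hue='housing', data=df)
plt.show()

# Age distribution for each outcome
sns.boxplot(x='y', y='age', data=df)
plt.show()

# Outcome split by whether the customer has credit in default
sns.countplot(x='y', hue='default', data=df)
plt.show()
```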

We can see from the countplot of the output variable and the housing status of customers that whether or not a customer has a housing loan may have an impact on the likelihood of a yes. From the next plot, we can make an educated guess that age will not have a significant impact on a customer’s decision. And from the countplot of the output variable according to whether or not a customer has credit in default, we can see that there are too few customers who did default for this feature to help the model predict a customer’s decision.

These are just a few of the visualizations that can (and should) be done to explore the data.

Data Preparation: Converting Categorical Features

At this stage, we need to change the categorical variables to a format that our Logistic Regression model will understand. We can do this by converting categorical features to dummy variables. First, we’ll double-check which features are categorical:
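df.info() lists every column’s dtype; the categorical features are the ones reported as non-null object:

```python
df.info()

# Or, more directly:
print(df.select_dtypes(include='object').columns)
```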

Now, we’ll convert all “non-null object” columns to dummy variables using the following process:
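A sketch of the conversion. The list of multi-category columns is an assumption based on the dataset’s documentation, apart from 'default', which we confirm dropping below:

```python
cat_cols = ['job', 'marital', 'education', 'contact', 'month', 'poutcome']

# drop_first=True avoids perfectly collinear dummy columns
dummies = pd.get_dummies(df[cat_cols], drop_first=True)

# Drop the original categorical columns, plus 'default', which the
# preliminary visualizations suggested carries too little signal
df = df.drop(cat_cols + ['default'], axis=1)
df = pd.concat([df, dummies], axis=1)
```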

Now, we can see the new columns:
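For example:

```python
print(df.columns)
```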

Finally, we need to convert our labels and drop any nonsense/unhelpful columns. Notice that in the df.drop() stage of the code above we dropped the ‘default’ column since we concluded during preliminary visualizations that it did not provide enough information to be helpful to the model.
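A sketch of the label conversion, mapping the remaining binary yes/no columns to 1/0 (treating housing and loan the same way as the label is an assumption):

```python
for col in ['housing', 'loan', 'y']:
    df[col] = df[col].map({'yes': 1, 'no': 0})
```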

Data Preparation: Train, Test, Split

We will now split our data into training data sets and test data sets:
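A sketch using train_test_split; the split ratio and random seed are assumptions:

```python
X = df.drop('y', axis=1)
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)
```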

Building the Model

Now that we have our training and test data sets, we can train our model and make predictions:
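A sketch of the fit-and-predict step. The class_weight argument is confirmed below; max_iter is an assumption to make sure the solver converges on this many features:

```python
logmodel = LogisticRegression(class_weight='balanced', max_iter=1000)
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
```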

As you can see, the argument “class_weight = ‘balanced’” was passed to the model. This is due to what we noticed in our very first data visualization: the number of people who refused to subscribe to a term deposit was much larger than the number of people who agreed.

Model Evaluation
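Printing the confusion matrix and classification report for the test set (variable names follow the sketches above):

```python
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```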

We can see that our model has 97% precision for customer refusals but only 39% precision for customer agreements. This is no doubt a result of our data imbalance.

To go into a deeper analysis of this classification report, reference the definitions in the “Evaluating Model Performance” section of the introduction.

Feature Importance

To better understand the performance of our model, we can investigate each individual feature. To do so, we’ll start by getting each individual feature’s coefficient score:
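For a fitted scikit-learn model, the learned coefficients live in the coef_ attribute; pairing them with the column names gives a readable Series (a sketch, using the variable names assumed above):

```python
coefficients = pd.Series(logmodel.coef_[0], index=X.columns)
print(coefficients.sort_values())
```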

And then plot each feature’s score:
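For example, as a horizontal bar chart:

```python
coefficients.sort_values().plot(kind='barh', figsize=(8, 12))
plt.xlabel('Coefficient score')
plt.tight_layout()
plt.show()
```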

Scores with a zero coefficient, or a very-near-zero coefficient, indicate that the model found those features unimportant and essentially removed them from the model. Positive scores indicate a feature that predicts class 1 (“yes”). Negative scores indicate a feature that predicts class 0 (“no”).

We can see that the features age, balance, day, and pdays have been marked as unimportant by the model.

This article outlines a simplified Logistic Regression classification algorithm. If you’re interested in fine-tuning the model in order to get a higher classification rate, I encourage you to play around with function parameters, as well as with the data preparation process.
