Predicting the Success of Bank Marketing Campaigns using Logistic Regression

Claudia Quintana Wong
Published in The Startup · 3 min read · Jun 5, 2020
Taken from https://www.cienciadedatos.net/documentos/27_regresion_logistica_simple_y_multiple

In this post, we are going to predict the success of bank telemarketing based on data collected from bank marketing campaigns in Portugal. We will be using the Bank Marketing Campaign Dataset that is publicly available for research.

Our goal is to predict whether a client will subscribe to a bank term deposit. This makes it a binary classification problem, which can be solved using logistic regression.

Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable, although many more complex extensions exist. Input values x are combined linearly using weights (coefficients), and the logistic function maps the result to a probability between 0 and 1, which is used to predict the output value y. A key difference from linear regression is that the output being modeled is a discrete value rather than a continuous one.
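To make this concrete, here is a minimal sketch of the idea, not the model used later in the post. The feature values, weights, and bias are made-up numbers chosen only for illustration.

```python
import torch

def logistic(z: torch.Tensor) -> torch.Tensor:
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + torch.exp(-z))

# Hypothetical example: 3 input features with made-up weights and bias.
x = torch.tensor([0.5, -1.2, 3.0])
w = torch.tensor([0.8, 0.1, -0.4])
b = torch.tensor(0.2)

z = torch.dot(w, x) + b   # linear combination of the inputs
p = logistic(z)           # estimated probability of the positive class
y_hat = int(p >= 0.5)     # discrete prediction: 1 ("yes") or 0 ("no")

print(f"z = {z.item():.2f}, p = {p.item():.3f}, prediction = {y_hat}")
```

The 0.5 threshold is a common default, assumed here; it is the step that turns the continuous probability into a discrete class.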

We will build a simple logistic regression model in PyTorch. PyTorch is an open-source machine learning library developed by Facebook's artificial intelligence research group. It can be considered one of the major deep learning frameworks today.

Let's have some fun.

Step 1: Downloading & exploring the data

After downloading the dataset at Bank Marketing Campaign Dataset and unzipping it, let's do some exploration. A detailed explanation of the dataset can also be found on Kaggle.

The dataset contains 41,188 instances described by 20 attributes. The campaigns were based mostly on direct phone calls offering bank clients a term deposit. If, after all the marketing efforts, the client agreed to place a deposit, the target variable (known in the dataset as y) is marked 'yes'; otherwise it is marked 'no'.

It is also a good idea to analyze the distribution of the output variable, since it shows how many clients fall into each class.
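One way to inspect that distribution with pandas is sketched below. The small DataFrame is a made-up stand-in for the real dataset; only the target column name y comes from the dataset description, and the other columns and values are assumptions.

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset.
# Only the column name "y" is taken from the text; the rest is illustrative.
df = pd.DataFrame({
    "age": [30, 45, 52, 28, 39],
    "job": ["admin.", "technician", "retired", "student", "admin."],
    "y":   ["no", "no", "yes", "no", "no"],
})

counts = df["y"].value_counts()                # absolute count per class
shares = df["y"].value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

On the real data, the same two calls show whether one class dominates, which is useful context when interpreting accuracy later.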

Step 2: Preparing the data for training

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. To do this, the first step is to convert it to NumPy arrays. In the same step, categorical values are translated into numeric values.
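A minimal sketch of such a conversion follows. The helper name dataframe_to_arrays, the sample columns, and the use of pandas category codes for the encoding are assumptions made for illustration; the original article may encode categories differently.

```python
import numpy as np
import pandas as pd
import torch

def dataframe_to_arrays(df, input_cols, categorical_cols, target_col):
    """Hypothetical helper: encode categorical columns and return NumPy arrays."""
    df1 = df.copy(deep=True)
    for col in categorical_cols:
        # Each distinct category becomes an integer code (alphabetical order).
        df1[col] = df1[col].astype("category").cat.codes
    inputs = df1[input_cols].to_numpy(dtype=np.float32)
    targets = df1[target_col].to_numpy(dtype=np.int64)
    return inputs, targets

# Made-up sample; only the target column name "y" comes from the text.
sample = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "job": ["admin.", "technician", "retired", "admin."],
    "y":   ["no", "no", "yes", "no"],
})

inputs_np, targets_np = dataframe_to_arrays(
    sample, input_cols=["age", "job"], categorical_cols=["job", "y"], target_col="y"
)

# NumPy arrays convert to tensors without copying the data.
inputs_t = torch.from_numpy(inputs_np)
targets_t = torch.from_numpy(targets_np)
```

Integer codes impose an arbitrary order on the categories; one-hot encoding is a common alternative when that order is undesirable.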

The tensors should be stored in a TensorDataset for later manipulation, and the dataset then needs to be split into training and validation sets using the function random_split. After that, we are ready to build our DataLoaders.
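These steps could look roughly like the sketch below. Random tensors stand in for the real data, and the validation size and batch size are arbitrary choices, not the values used in the article.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

torch.manual_seed(0)

# Random stand-in data: 100 rows, 20 features, binary targets (assumed shapes).
inputs = torch.randn(100, 20)
targets = torch.randint(0, 2, (100,))

# Pair each row of inputs with its target.
dataset = TensorDataset(inputs, targets)

# Hold out part of the data for validation (the sizes are an assumption).
val_size = 20
train_size = len(dataset) - val_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

# DataLoaders serve the data in batches; only the training data is shuffled.
batch_size = 16
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size)

xb, yb = next(iter(train_loader))  # one batch of inputs and targets
```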

Now, we are ready to start implementing our classification model.

Step 3: Creating a model using PyTorch

Our logistic regression model is defined as a class that inherits from nn.Module, as all PyTorch models should. Its forward function describes the internal behavior of the model, that is, how an input batch is turned into an output.

To instantiate the model, we specify the input and output sizes.
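A minimal sketch of such a model and its instantiation is shown below. The class name, the choice of 20 inputs (one per attribute), and the use of two output units (one score per class, to be paired with a cross-entropy loss) are assumptions; a single output unit followed by a sigmoid would be an equally valid formulation of binary logistic regression.

```python
import torch
import torch.nn as nn

class LogisticRegressionModel(nn.Module):
    """Hypothetical model: a single linear layer producing one score per class."""

    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        # Returns raw scores (logits); the loss function applies the softmax.
        return self.linear(xb)

input_size = 20   # assumed: one input per attribute
output_size = 2   # assumed: one score each for "no" and "yes"
model = LogisticRegressionModel(input_size, output_size)

# Quick shape check with a random batch of 5 examples.
out = model(torch.randn(5, input_size))
```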

Step 4: Training the model to fit the data

To train and evaluate our model, we need to define the corresponding functions.
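The sketch below shows one possible shape for these functions. The names evaluate and fit, the use of plain SGD with a cross-entropy loss, and the synthetic data are all assumptions for illustration, not the article's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

@torch.no_grad()
def evaluate(model, loader):
    """Average loss and accuracy of the model over all batches in loader."""
    model.eval()
    total_loss, total_correct, n = 0.0, 0, 0
    for xb, yb in loader:
        out = model(xb)
        total_loss += F.cross_entropy(out, yb, reduction="sum").item()
        total_correct += (out.argmax(dim=1) == yb).sum().item()
        n += len(yb)
    return {"val_loss": total_loss / n, "val_acc": total_correct / n}

def fit(epochs, lr, model, train_loader, val_loader):
    """Train with SGD and record validation metrics after every epoch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        model.train()
        for xb, yb in train_loader:
            loss = F.cross_entropy(model(xb), yb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        history.append(evaluate(model, val_loader))
    return history

# Synthetic, linearly separable stand-in data (not the bank dataset).
torch.manual_seed(0)
X = torch.randn(200, 4)
y = (X[:, 0] > 0).long()
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

model = nn.Linear(4, 2)
before = evaluate(model, val_loader)              # metrics before training
history = fit(20, 0.1, model, train_loader, val_loader)
after = history[-1]                               # metrics after training
```

Evaluating once before training gives a baseline, so the effect of training can be read directly by comparing before and after.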

Let's evaluate the model before training:

Models should be trained with different hyperparameter values. In this example, since the model is simple, we just need to find an appropriate number of epochs and learning rate. After some tests, we settled on 100 epochs and a learning rate of 0.01.

After training, the model accuracy has increased. Once training is completed, we check whether the model we defined produces correct results. This setting reached an accuracy of 90%, which means that 90% of the examples in the validation set have been classified correctly.

The choice of hyperparameters directly influences the performance of the model. To find an appropriate learning rate, it is recommended to vary it by powers of 10 (e.g. 1e-2, 1e-3, 1e-4, 1e-5, 1e-6) to figure out what works.
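Such a sweep can be sketched as follows. The synthetic data, the helper name train_and_score, the fixed number of epochs, and the use of full-batch gradient descent are simplifying assumptions; on the real dataset the training and evaluation functions of the article would be used instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Synthetic stand-in data (not the bank dataset).
torch.manual_seed(0)
X = torch.randn(300, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()
X_train, y_train = X[:240], y[:240]
X_val, y_val = X[240:], y[240:]

def train_and_score(lr, epochs=50):
    """Hypothetical helper: train from scratch, return validation accuracy."""
    torch.manual_seed(0)  # same initial weights for every learning rate
    model = nn.Linear(5, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(model(X_train), y_train)  # full-batch step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        preds = model(X_val).argmax(dim=1)
    return (preds == y_val).float().mean().item()

# Try learning rates spaced by powers of 10 and keep the best one.
learning_rates = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
results = {lr: train_and_score(lr) for lr in learning_rates}
best_lr = max(results, key=results.get)

for lr, acc in results.items():
    print(f"lr={lr:g}  val_acc={acc:.3f}")
```

Resetting the seed inside the helper keeps the comparison fair: every run starts from the same weights, so differences in accuracy come from the learning rate alone.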

Step 5: Making predictions using the model

After evaluating the model, we need a new function that uses it to predict the class of new instances.
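A prediction function could look like the sketch below. The name predict, the two-output model, and the untrained linear layer used as a stand-in are assumptions; with a trained model the call would be the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def predict(model, x):
    """Return (label, probability) for a single feature vector x."""
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))       # add a batch dimension
        probs = F.softmax(logits, dim=1)[0]  # class probabilities
    label = int(torch.argmax(probs))
    return label, probs[label].item()

torch.manual_seed(0)
model = nn.Linear(20, 2)   # untrained stand-in for the trained model
x_new = torch.randn(20)    # made-up new instance with 20 features

label, prob = predict(model, x_new)
print(f"predicted class: {label} (probability {prob:.2f})")
```

Returning the probability alongside the label makes it possible to see how confident the model is in each prediction.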

It is important to highlight that not all machine learning models reach similar performance: it depends on the model, but also strongly on the complexity of the task. For example, a logistic regression model would not be enough to solve an image classification task or a machine translation problem.

For your reference, you can find the entire code of this article here

Claudia Quintana Wong
The Startup

Computer Scientist | Professor at University of Havana