How to Target a Marketing Campaign to Clients Most Likely to Say “Yes” Using a Logistic Regression Model

Business analysts: try this simple machine learning tutorial with TableTorch add-on for Google Sheets

Andrejs Skļarevskis
Google Sheets Geeks
15 min read · Mar 26, 2023


Article’s most pressing question: who is likely to subscribe, a partial features table and a demonstration of a logistic regression model’s likelihood estimates

Business analysts sometimes regard machine learning as too complex, something best left to data scientists and programming enthusiasts. However, this isn’t necessarily the case.

In this article, we’ll first look into a sample of the bank marketing campaign dataset [Moro et al., 2014]. Then, without any programming, right inside Google Sheets, we’ll build a model estimating the result of a phone call to a particular customer.

As is demonstrated in section 4.5, the produced model can help to reduce the number of calls by two thirds.

  • In sections 1 and 2, we’ll start with an analysis of the task at hand and the original dataset, specifically focusing on how to convert its categorical text information into numeric data usable for statistical analysis.
  • In section 3, we’ll then compute and research a correlation matrix.
  • Finally, section 4 describes how to fit a logistic regression model and apply it for business purposes.

The results can be seen in this spreadsheet (ODF, best viewed in Google Sheets).

The described techniques are quite flexible and can be applied to various tasks, such as:

  • Choosing the best location for the next shop of a restaurant chain
  • Estimating factory equipment’s optimal maintenance schedule
  • Selecting the best time of day to contact a customer, etc.

In fact, you can apply them to any question for which you can collect a training dataset with one dependent variable of yes/no nature (1 or 0) and a few or many independent variables (features) that are correlated with the dependent variable.

1. Task overview

Let’s start by looking at the columns present in the dataset.

The “y” column (the last one) determines whether a client subscribed to a term deposit, the campaign’s goal.

If the bank had any clue as to which clients would likely react positively to the campaign, it could save time and money by not contacting uninterested people. Hence “y” is the label, or dependent variable: we need to fit a model that predicts the value of “y” based on the other inputs provided in the dataset.

Once such a model exists, the bank can focus further marketing efforts on clients with the highest predicted chance of subscribing.

Some of the columns contain numeric values and thus are suitable for statistical analysis right away. However, “job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “day_of_week”, “poutcome”, as well as the label column itself, “y”, contain categorical text data that needs additional processing before you can do a regression.

Note: Column descriptions are provided on the dataset’s home page.

2. Feature engineering

2.1. “y” column

Its values are either “yes” or “no”. We need to convert these to 1 and 0, respectively, in order to use the column as the label for logistic regression. Fortunately, the TableTorch add-on for Google Sheets has a function that does such a conversion with a single click.

(1) Start TableTorch by clicking its icon on the right-side panel of Google Sheets (if it isn’t installed yet, install it first from the Google Workspace Marketplace).

TableTorch, once installed, has a start button on the right-side panel of Google Sheets

(2) Select a range containing all the rows of the “y” column and click the “Text categories encoding” button on the main menu of TableTorch.

Text categories encoding button of the TableTorch add-on for Google Sheets

(3) Check only the “yes” phrase on the next menu. Then, click the “Binary columns” button.

(4) A new column will be inserted, containing 1 and 0 in place of “yes” and “no” in the original column. Let’s name it “Observed result”.

TableTorch successfully encoded “yes” category into a new column with numeric ones and zeroes.

Note: for more information, look at the detailed documentation of the “Text categories encoding” function of the TableTorch add-on or watch this video (2 minutes). The technique we apply in this article is also known as one-hot encoding. However, TableTorch’s “text categories encoding” is not strictly the same as one-hot encoding because, for longer texts, it could produce a set of binary columns with several ones per row, i.e. the columns are not necessarily mutually exclusive, as they would be in strict one-hot encoding.
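Outside of Sheets, the same single-category encoding can be sketched in a few lines of pandas; the column values below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's "y" column.
df = pd.DataFrame({"y": ["no", "yes", "no", "yes", "no"]})

# Encode a single category as a binary column: 1 where the value is "yes",
# 0 otherwise -- the equivalent of checking only the "yes" phrase in TableTorch.
df["observed_result"] = (df["y"] == "yes").astype(int)

print(df["observed_result"].tolist())  # [0, 1, 0, 1, 0]
```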

2.2. “marital” and “contact” columns

Let’s apply the binary variant of the text categories encoding function to these columns as well.

  • For “marital”, select the “married” and “divorced” phrases.
  • For “contact”, select only the “cellular” phrase.

It is unnecessary, and sometimes even harmful, to add a binary column for every possible category: a complete set of mutually exclusive binary columns is strongly negatively correlated, which might “confuse” the regression.

The dataset will now look as follows:

Marketing dataset with “marital” and “contact” columns converted into categorical binary columns

2.3. “default”, “housing”, “loan”, and “poutcome” columns

These columns contain categorical data as well and are thus well-suited for the binary variant of the text categories encoding function of the TableTorch add-on.

Select these phrases to create the appropriate binary columns:

  • “default”: “no”
  • “housing”: “yes”
  • “loan”: “yes”, “unknown”
  • “poutcome”: “failure”, “success”

Once all the processing is done, we can give some of the new columns more appropriate names, hide the original columns and see the result:

“default”, “housing”, “loan” and “poutcome” categorical text columns converted to numeric binary columns.

It looks great: we have almost completed the numeric data extraction from the categorical text columns!

There are just a few more columns to talk about.

2.4. “job”, “education”, “month” and “day_of_week” columns

These columns have many more categories than the previous ones.

We could use binary columns extraction through the text categories encoding function again but it would produce dozens of extra columns and complicate further analysis.

Let’s use the target average encoding technique instead.

We’ll devise a Google Sheets formula to convert a category into a number representing the average success rate of the marketing campaign for such a category.

First of all, let’s create the new columns:

  • “job-avg-result” (C, whereas “job” is at B)
  • “edu-avg-result” (H, right after “education” at G)
  • “month-avg-result” (S, for “month” at R)
  • “day-avg-result” (U, after “day_of_week” at T).

The “observed result” column we created earlier now has index AI, and it’s time to enter the respective formulas.

For the new “job-avg-result” (C) column:

=AVERAGEIF($B$2:$B$4120, B2, $AI$2:$AI$4120)

For “edu-avg-result” (H):

=AVERAGEIF($G$2:$G$4120, G2, $AI$2:$AI$4120)

For “month-avg-result” (S):

=AVERAGEIF($R$2:$R$4120, R2, $AI$2:$AI$4120)

For “day-avg-result” (U):

=AVERAGEIF($T$2:$T$4120, T2, $AI$2:$AI$4120)

The formulas are provided for the first data row only. They then need to be copied to the other rows of the same column.
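For readers who prefer code, the AVERAGEIF pattern above corresponds to target average encoding via a grouped mean. A minimal pandas sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows: a "job" category and the 0/1 observed result.
df = pd.DataFrame({
    "job": ["admin", "blue-collar", "admin", "services", "blue-collar", "admin"],
    "observed_result": [1, 0, 0, 1, 0, 1],
})

# Per-category mean of the label -- the pandas analogue of
# =AVERAGEIF(job_range, job_cell, result_range).
df["job_avg_result"] = df.groupby("job")["observed_result"].transform("mean")

print(df["job_avg_result"].round(2).tolist())  # [0.67, 0.0, 0.67, 1.0, 0.0, 0.67]
```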

Optionally, you can format the new columns as percentages and apply conditional background formatting to make them visually clearer, as demonstrated in the picture below. These values represent the share of rows in each category that had a positive outcome of the marketing campaign.

There are significant differences between the various categories that are likely to provide the necessary numeric signal for the regression.

Very well, now we are done with the categorical columns, but there is one more column that needs our attention before we can continue.

2.5. “pdays” of 999

The “pdays” column tells us how many days have passed since the client was contacted the last time.

However, as is often the case with data collection, there’s an issue of partial data, e.g. a client who has never been contacted. As the dataset’s manual describes, the column is assigned a special value of 999 in that case.

A separate explicit variable providing that information would likely be an important signal for the regression, helping it “understand” that the “pdays” value is actually irrelevant.

Let’s insert a new column “pnever” (Y) right after “pdays” (X) with the following formula for the first row:

=IF(X2=999, 1, 0)

3. Looking at the correlation matrix

Before we can proceed to the regression, we need to look at the correlation coefficients of the produced features to identify whether we need to exclude some of them from the analysis, and also if we need to use percentile rank scaling to account for non-linear dependencies.

Having 26 numeric columns means that if we used Google Sheets’ own CORREL() function, we would need to enter the formula 325 times (26 × 25 / 2) to compute all the correlations.

Instead, let’s use TableTorch and compute the matrix with only one click.

(1) As was shown in section 2.1, once you have installed TableTorch via Google Workspace Marketplace, you can start it by clicking on its icon on the right-side panel of Google Sheets.

(2) Select the whole range of the dataset and click the Correlation Matrix button on the TableTorch main menu.

Correlation Matrix button on TableTorch main menu

(3) Click the Compute button.

Compute button on a correlation matrix panel of TableTorch add-on for Google Sheets

A separate sheet with the computed correlation matrix will appear shortly.

Note: the documentation for the Correlation Matrix function of TableTorch is available here, as well as in this video.

Correlation matrix computed by Gaujasoft TableTorch add-on for Google Sheets

The matrix shows that most of the correlations with the “observed result” column are quite faint, so we shall not expect the regression to produce a very precise model.

Furthermore, the column that could have provided the best signal, “duration”, needs to be excluded from the regression. Why? Its value is obtained only once a phone call is made, whereas we want to filter the clients’ list before making any phone calls.

The new “pnever” column we’ve engineered also warrants attention: it has rather high correlation coefficients with “pdays” (from which it is derived) and “prev_success” (which is expected). The only way to see whether that causes any problem for the regression, though, is to run it.

Note: TableTorch can also compute a correlation matrix with a different coefficient, Spearman’s rank. It helps you find column pairs with non-linear relationships that thus require rank scaling before being used in a regression. No such correlations could be found for the dataset we are currently researching, though. See also Correlation matrix page on the TableTorch website for more information about the Spearman’s rank coefficient.
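For comparison, computing both kinds of correlation matrix takes one call each in pandas; the toy columns below are illustrative, not the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# "b" is linearly related to "a"; "c" is a monotonic but non-linear function of "a".
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": np.exp(x),
})

pearson = df.corr()                    # one call instead of 325 CORREL() formulas
spearman = df.corr(method="spearman")  # flags monotonic non-linear pairs

# Spearman rates any strictly monotonic relationship as perfect,
# while Pearson underestimates the non-linear "a"-"c" pair.
print(round(spearman.loc["a", "c"], 3), pearson.loc["a", "c"] < 1.0)
```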

4. Binary classification with a logistic regression model

4.1. Fitting the model

Now that we are done with the preparations, let’s get back to what we wanted to do in the first place.

The idea was to train a model on the collected dataset so as to use it on the new data, presumably representing a bank’s clients that have not been contacted yet, focusing on the clients with the highest chance of a positive response first.

In machine learning, this task is called binary classification. We want to assign one of two classes to each record, either “likely to subscribe” or “likely to reject”. One of the most frequently used methods to fit such a model is logistic regression.

(1) Let’s begin by starting TableTorch: click on its icon on the right-side panel of Google Sheets. Links: install, more information.

(2) Select the full dataset’s range and click the Logistic regression button.

Logistic regression button on the main menu of the TableTorch add-on for Google Sheets.

(3) In the presented features list, deselect “duration”, as it cannot be obtained for new data (i.e. data that doesn’t have an observed result yet).

(4) Expand the Sampling options section. Select 5-Fold Cross Validation in the Sample splitting menu and enable the Sample with replacement option. These settings should help to produce a more reliable and precise model. See this article for more information about the sampling options supported by TableTorch.

(5) The following collage shows the entirety of the logistic regression settings (note that in the actual TableTorch, you may need to scroll down the panel to see all of these options):

TableTorch add-on for Google Sheets: logistic regression panel with the features selector and controls for learning and sampling options

(6) Click the Fit Model button. The Learning summary panel will show up shortly.
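As a point of reference, the same recipe (standard scaling, logistic regression, 5-fold cross-validation) can be sketched with scikit-learn. The synthetic data and its class imbalance below are invented stand-ins for the prepared table, and the “sample with replacement” option is omitted for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: roughly 89% negative class, like the campaign data.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.89], random_state=0)

# Standard scaling + logistic regression, evaluated with 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(len(scores), scores.mean() > 0.8)
```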

4.2. Learning summary panel

The summary demonstrates the most commonly used metrics to gauge the quality of a binary classification model. Let’s briefly discuss what these metrics represent and what these particular values mean.

Logistic regression can only be trained on labels between 0 and 1. Once trained, the model produces an estimate of the label when applied to new data. This estimate always lies in the range [0; 1] due to the nature of the logistic function. Statistically, it can also be interpreted as the probability of a row belonging to a certain class, assuming no factors exist other than those present in the dataset.

To use logistic regression as a binary classifier, an important parameter needs to be chosen: the threshold.

If the regression’s estimate is below the threshold, the classifier assigns class zero to the row; otherwise, class one. For example, if we choose a threshold of 50% (0.5) and the model’s estimate for some row stands at 10%, class zero is assigned. In our research, that means the row represents a person who is unlikely to subscribe to a term deposit.

On the other hand, if the model outputs 70%, class one is assigned; hence the row represents someone who is likely to react positively to the marketing campaign.

The learning summary panel shows quality metrics computed for the threshold of 0.5 (50%). However, a click on the “Add summary sheet” button creates a separate sheet with a detailed summary of quality metrics at multiple other levels of threshold, which we’ll look at later.

(1) Accuracy is the percentage of rows for which the assigned class equals the observed label. 76.6% means this model is 26.6 pp. better than simply flipping a coin.

(2) Precision for class X is the share of rows assigned class X by the model that actually belong to class X. In the example shown below, if 12 rows are observed to be of class X but the model assigned class X to 14 rows, of which 10 were correct and 4 erroneous, the precision is 10 / (10 + 4) ≈ 71%.

Demonstration of the relationship between model’s estimations, threshold, and the respective precision and recall quality metrics for binary classifiers.

In our case, precision for class 0 stands at an impressive 95.6%. This means that when the model assigns class 0 (the “won’t subscribe” class) to a row, it is quite likely to be correct.

At the same time, the 27.4% precision for class 1 shows that the model struggles with that class and often assigns it erroneously. This should not deter us from applying the model in business, because it can still significantly reduce the number of unnecessary phone calls, as we will discuss in section 4.5.

(3) Recall for class X is the percentage of observed class X rows to which the model also assigned that class. Taking the example shown in the picture above: if the model assigned class X to only 10 of the 12 rows that actually belong to it, the recall for class X is 10 / 12 ≈ 83%.

The recall for class 1 shown in the learning summary is 70%. This means that at a threshold of 0.5, the model correctly classified 70% of the clients who would agree to open a term deposit in an actual call.

(4) AUC-ROC is an often-used metric that can be interpreted as the probability that the model ranks a randomly chosen class 1 row above a randomly chosen class 0 row.
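The worked precision and recall numbers above can be reproduced with scikit-learn’s metric functions; the label vectors below simply reconstruct that example (10 true positives, 4 false positives, 2 false negatives, plus some true negatives):

```python
from sklearn.metrics import precision_score, recall_score

# 12 actual class-1 rows; the model flags 14 rows, 10 of them correctly.
y_true = [1] * 10 + [0] * 4 + [1] * 2 + [0] * 20   # 10 TP, 4 FP, 2 FN, 20 TN
y_pred = [1] * 10 + [1] * 4 + [0] * 2 + [0] * 20

print(round(precision_score(y_true, y_pred), 2))  # 10 / 14 -> 0.71
print(round(recall_score(y_true, y_pred), 2))     # 10 / 12 -> 0.83
```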

4.3. Finding the appropriate threshold in the learning summary sheet

Click the Add summary sheet button on the learning summary panel shown in section 4.2. A separate sheet will appear with many more details about the produced model than a panel could accommodate.

TableTorch add-on for Google Sheets: detailed learning summary sheet for a logistic regression model

The learning summary sheet lists the precision, recall, and accuracy metrics at various threshold levels.

It allows you to choose a threshold level that is most appropriate for a given task.

It also shows the model’s coefficients and the standard scaling parameters for each feature (mean and standard deviation). This helps to quickly evaluate which features turned out to be the most significant.

Let’s imagine that a decision was made to use this model to filter the list of clients who have not been contacted yet in order to reach out only to customers who are likely to respond positively to the campaign.

Considering also that no model is perfect, suppose the business is prepared to lose no more than 25% of the hypothetical future subscriptions. This means we need a binary classifier with a recall of at least 75%. As seen on the learning summary sheet, the highest threshold that still qualifies is 30%.
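The threshold search itself is a one-liner once the summary rows are available; the (threshold, recall) pairs below are hypothetical stand-ins for the summary sheet values:

```python
# Hypothetical (threshold, class-1 recall) rows like those in the summary sheet.
rows = [(0.1, 0.92), (0.2, 0.84), (0.3, 0.76), (0.4, 0.72), (0.5, 0.70)]

# Pick the highest threshold whose recall still meets the 75% requirement.
best = max(t for t, r in rows if r >= 0.75)
print(best)  # 0.3
```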

4.4. Creating a binary classifier

Let’s build the classifier and then analyze its impact on the continuation of the marketing campaign.

The first thing to do is to click the “Insert prediction column” button on the Learning summary panel shown in section 4.2. A column will be inserted right after the range the model was trained on.

An estimation column containing a formula with standard scaling and logistic regression model.

The column’s formula

  • Applies standard scaling to the input features
  • Multiplies variables by their respective coefficients
  • And processes the result with the logistic function to output the final estimate in the range [0; 1], roughly representing the chance that the row belongs to class 1.

This formula does not depend on the “observed result” column, which is the label, i.e. the variable the model was fitted to estimate. Thus, it can be copy-pasted to new data rows, i.e. rows that don’t have a value for “observed result” at all.

Let’s add a binary classifier column, “make a call” (AL) with the following formula:

=IF(AK2 > 0.3, 1, 0)

A binary classifier column applies a formula that compares the logistic regression model’s estimation to an established threshold.
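What the inserted formula does can be mirrored in a few lines of Python. Every number below is made up for illustration; the real coefficients and scaling parameters come from the summary sheet:

```python
import math

def estimate(features, means, stds, coefs, intercept):
    """Standard-scale the inputs, apply the linear model, then the logistic function."""
    z = intercept + sum(c * (x - m) / s
                        for x, m, s, c in zip(features, means, stds, coefs))
    return 1.0 / (1.0 + math.exp(-z))  # always within (0, 1)

# Hypothetical parameters for a two-feature model.
p = estimate([30.0, 2.0], means=[40.0, 2.5], stds=[10.0, 1.5],
             coefs=[0.8, -0.4], intercept=-1.0)
make_a_call = 1 if p > 0.3 else 0  # same comparison as the =IF(AK2 > 0.3, 1, 0) column
```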

4.5. Calculating the model’s impact on a marketing campaign

As we have successfully produced the model and used it to create a binary classifier to find the clients who are most likely to react positively to the campaign, let’s evaluate a few metrics to understand how this model can help with the campaign’s continuation.

(1) Observed number of calls: how many calls were actually made?

=COUNT(AJ2:AJ4120)

(2) Observed positive results: how many calls resulted in a subscription?

=COUNTIF(AJ2:AJ4120, 1)

(3) Filtered number of calls: how many calls would be made if the model were used?

=COUNTIF(AL2:AL4120, 1)

(4) Filtered positive results: how many positive results would there be if the model were used?

=SUMIF(AL2:AL4120, 1, AJ2:AJ4120)

These formulas yield the following values:

  • Observed number of calls: 4119
  • Observed positive results: 446
  • Filtered number of calls: 1459
  • Filtered positive results: 335

This means that using this model could help reduce the number of calls by almost two thirds and still successfully reach 75% of the clients who decided to subscribe to a term deposit.
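These percentages can be double-checked with a couple of divisions over the four counts above:

```python
# Recompute the campaign impact from the reported counts.
observed_calls, observed_positives = 4119, 446
filtered_calls, filtered_positives = 1459, 335

call_reduction = 1 - filtered_calls / observed_calls   # ~0.646: almost two thirds fewer calls
recall_kept = filtered_positives / observed_positives  # ~0.751: above the 75% recall target
print(round(call_reduction, 3), round(recall_kept, 3))
```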

And this, in turn, could result in various favorable effects, such as:

  • Savings on the calls
  • Lower negative impact on customer satisfaction due to unsolicited calls
  • Faster campaign completion and more quickly achieved results

This all leaves more time for other marketing efforts.

5. Conclusion

Fitting a regression model usually involves:

  • Programming in Python or R
  • Data export
  • And some kind of external tooling to run a script.

In this article, we’ve demonstrated how to do a regression without any of that, right inside a spreadsheet. We also discussed how to apply it in business and the tangible impact that such application can make.

Download this spreadsheet to have a look at the results of the feature engineering and logistic regression described in this article. It is in the Open Document Format: you can drop it into your Google Drive and open it in Google Sheets.

TableTorch add-on for Google Sheets has been used throughout this article for feature engineering, correlation matrix calculation, as well as the logistic regression itself. I am a software engineer (LinkedIn profile) working on the add-on and I’d be happy to discuss it or anything else regarding this research in the comments below.

Thank you!

Google, Google Sheets, Google Drive, Google Workspace and YouTube are trademarks of Google LLC. Gaujasoft TableTorch is not endorsed by or affiliated with Google in any way.

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22–31, June 2014
