Using Machine Learning to classify hard bounce e-mails — Part 2

Thiago Cordon · Published in Data Arena · Dec 23, 2019 · 5 min read

The objective of this article series is to identify hard bounce e-mails using machine learning techniques. Part 1 covered feature engineering and exploratory analysis. In part 2, we will see how to train an Extreme Gradient Boosting (XGBoost) algorithm to identify hard bounce e-mails.

Photo by Thanhy Nguyen on Unsplash

The Dataset

In this article, I will work with a dataset created from the feature engineering explained in the previous article. These are the variables it contains:

  • emailDomain_cat: e-mail domain. E.g.: hotmail.com, gmail.com;
  • emailDomainPiece1: first piece of the e-mail domain suffix. E.g.: for the e-mail test@test.com.br -> com; for the e-mail test@test.au -> au;
  • emailDomainPiece2: second piece of the e-mail domain suffix. E.g.: for the e-mail test@test.com.br -> br;
  • regDate_n: registration date, used to calculate the monthsSinceRegDate variable;
  • birthDate_n: birth date, used to calculate the age variable;
  • monthsSinceRegDate: number of months since the registration date;
  • age: age of the e-mail owner;
  • percNumbersInEmailUser: percentage of numbers in the e-mail user (the part before the @);
  • hasNumberInEmailUser: dummy indicating whether the e-mail user contains numbers;
  • emailUserCharQty: number of characters in the e-mail user;
  • flgHardBounce_n: dummy indicating whether the observation is a hard bounce (1 = hard bounce). This is the variable to predict.

It is important to notice that this dataset is imbalanced, as we can see in the code snippet below: the variable flgHardBounce_n, which we will try to predict, has only 21% hard bounce observations.
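The original snippet is embedded as an image on Medium; here is a minimal sketch of the same check, assuming the part 1 features were saved to a CSV file (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; the dataset is the one built in part 1.
df = pd.read_csv("hard_bounce_features.csv")

# Relative frequency of each class of the target variable.
print(df["flgHardBounce_n"].value_counts(normalize=True))
```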

Only 21% of the observations are hard bounce — Image provided by the author.

Dealing with imbalanced datasets

To avoid bias in our model, we have to treat this imbalance in our dataset. Although more advanced techniques exist to equalize the dataset, such as cost-sensitive learning and recognition-based learning, I will use a simpler but effective approach called oversampling.

I chose oversampling instead of undersampling to preserve all the characteristics of the non-hard bounce observations (our majority category).

Two common approaches to solve the imbalance in datasets — Image provided by the author.

In the oversampling method, new samples of the minority category are created to equalize the observations. In this article, I am using SMOTE (Synthetic Minority Oversampling Technique) to create the new samples.

The SMOTE algorithm synthesizes new minority instances between existing minority instances. Imagine that the SMOTE algorithm draws lines between existing minority instances, as in the following image.

SMOTE example — Connecting minority class dots. Image provided by the author.

After synthesizing new minority instances, the imbalance shrinks from 4 red versus 13 green to 12 red versus 13 green.

SMOTE example — new dots between existing data. Image provided by the author.
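At its core, each synthetic sample is a random point on the segment between a minority instance and one of its nearest minority neighbors. A minimal sketch of that interpolation step (the points and seed are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(42)

x_i = np.array([1.0, 2.0])         # an existing minority sample
x_neighbor = np.array([2.0, 3.0])  # one of its nearest minority neighbors
lam = rng.uniform(0, 1)            # random interpolation factor in [0, 1]

# The synthetic sample lies somewhere on the line between the two.
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)
```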

The following code snippet shows how to apply SMOTE to the hard bounce dataset.
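The original snippet is an embedded image; here is a sketch of the same step using imbalanced-learn, assuming df is the DataFrame loaded above, all features are already numerically encoded, and a 70/30 train/test split (the split ratio and seeds are assumptions):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = df.drop(columns=["flgHardBounce_n"])
y = df["flgHardBounce_n"]

# Hold out a test set before oversampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# SMOTE synthesizes minority samples until both classes are balanced.
smote = SMOTE(random_state=42)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
print(y_train_over.value_counts(normalize=True))

# The evaluation below also uses an oversampled test set, built the same way.
X_test_over, y_test_over = smote.fit_resample(X_test, y_test)
```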

Percentage of each y variable category after SMOTE. Image provided by the author.

Extreme Gradient Boost (XGBoost)

Extreme Gradient Boosting (XGBoost) is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. As the following image shows, XGBoost is one of the most advanced tree-based algorithms.

Tree-based algorithm evolution. Image provided by the author.

One of the most used algorithms in Kaggle competitions, XGBoost trains decision trees sequentially, with each iteration trying to correct the errors of the previous one.

XGBoost process illustration. Image provided by the author.

Training XGBoost algorithm

Now it’s time to train the XGBoost algorithm using the snippets below. Note that I changed the default values of the “max_depth” (maximum depth of a tree) and “objective” (learning objective) parameters.

For the “max_depth” parameter, I chose 5 based on earlier tests with other values. For the “objective” parameter, I used logistic regression because our target variable is binary.

I’ve trained two versions of XGBoost algorithm to compare results between a model trained applying oversample and a model trained without oversampling.

Training the XGBoost with the oversampled dataset:
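The original snippet is an embedded image; here is a sketch using the xgboost scikit-learn wrapper, reusing the names from the SMOTE sketch above:

```python
from xgboost import XGBClassifier

# max_depth=5 and objective="binary:logistic" follow the article;
# everything else stays at the library defaults.
model_over = XGBClassifier(max_depth=5, objective="binary:logistic")
model_over.fit(X_train_over, y_train_over)
```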

Training the XGBoost with the imbalanced dataset:
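And the comparison model, with the same hyperparameters, trained on the original imbalanced split:

```python
# Same configuration, trained without oversampling for comparison.
model_imb = XGBClassifier(max_depth=5, objective="binary:logistic")
model_imb.fit(X_train, y_train)
```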

Evaluating XGBoost algorithm

First, I evaluated each trained algorithm against the corresponding test dataset: for the algorithm trained with oversampling, I used the oversampled test dataset, and for the algorithm trained with the imbalanced dataset, I used the imbalanced test dataset. I used the AUC metric in this evaluation.

AUC comparison — Image provided by the author.

I got an AUC of 0.9043 for the XGBoost trained with the oversampled dataset. Very good! The algorithm seems to be a good predictor.

On the other hand, the XGBoost trained with the imbalanced dataset performed worse, with an AUC of 0.8603.

Let’s see how both algorithms handled true and false positives in the confusion matrix.

Confusion Matrix comparison — Image provided by the author.

The algorithm trained with the oversampled dataset correctly classified 89% of non-hard bounces and 75% of hard bounces :). Although the algorithm trained with the imbalanced dataset correctly classified 88% of hard bounce occurrences, it performed worse on non-hard bounces.

The code snippets used in this evaluation follow.
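A sketch of what those snippets compute, assuming the models and splits defined above; normalize="true" gives the per-class percentages shown in the matrices:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# AUC of each model on its corresponding test set.
proba_over = model_over.predict_proba(X_test_over)[:, 1]
print("AUC, oversampled:", roc_auc_score(y_test_over, proba_over))

proba_imb = model_imb.predict_proba(X_test)[:, 1]
print("AUC, imbalanced:", roc_auc_score(y_test, proba_imb))

# Row-normalized confusion matrices: the diagonal holds the
# per-class accuracies discussed above.
print(confusion_matrix(y_test_over, model_over.predict(X_test_over), normalize="true"))
print(confusion_matrix(y_test, model_imb.predict(X_test), normalize="true"))
```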

The model trained with the oversampled dataset performed well on a balanced test dataset, but would it perform well on the imbalanced dataset?

To check this behavior, I applied the XGBoost algorithm trained with oversampling to the imbalanced dataset and got the following results.

ROC Curve and AUC indicator for imbalanced dataset — Image provided by the author.

As we can see, the AUC is worse than on the oversampled test dataset, but 0.857 can be acceptable for this kind of classifier.

Confusion Matrix for imbalanced dataset — Image provided by the author.

Observing the confusion matrix, the algorithm classified more hard bounce occurrences correctly than before, but the percentage of correctly classified non-hard bounces decreased.

The code snippet used to evaluate the algorithm on the imbalanced dataset is available below.
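A sketch of that evaluation, scoring the oversample-trained model on the original imbalanced test split:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Oversample-trained model scored on the imbalanced test set.
proba = model_over.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))

# ROC curve against a random-guess baseline.
fpr, tpr, _ = roc_curve(y_test, proba)
plt.plot(fpr, tpr, label="XGBoost (oversampled training)")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# Row-normalized confusion matrix.
print(confusion_matrix(y_test, model_over.predict(X_test), normalize="true"))
```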

The project on GitHub can be found here.

I hope this article has brought you relevant insights. Feel free to comment on it and share your feedback :)
