Using Machine Learning to classify hard bounce e-mails — Part 1

Thiago Cordon · Published in Data Arena · Dec 1, 2019 · 4 min read

This article series aims to show how to identify hard bounce e-mails using machine learning techniques. Part 1 covers feature engineering and exploratory analysis.

Photo by Tiffany Tertipes on Unsplash

What is e-mail hard bounce?

This terminology is widely used in marketing and refers to bounced e-mail messages, which occur when a message is rejected by the recipient's e-mail server.

There are two main types of bounce:

  • Hard bounce: indicates a permanent non-delivery reason, e.g. the recipient's e-mail address does not exist, the domain name does not exist, or the e-mail server blocked the delivery.
  • Soft bounce: indicates a temporary non-delivery reason, e.g. the mailbox is full, the recipient's e-mail server is down, or the e-mail message is too large.

Hard bounces can be a big pain for companies that need to maintain a good e-mail delivery reputation, because they hurt the delivery rate. In more critical scenarios, they can get the sender's IP blocked.

The objective of this series of articles is to identify hard bounce e-mails using machine learning techniques.

Dataset

The dataset contains the following variables:

  • email: e-mail recipient
  • flgHardBounce: a flag indicating whether the recipient hard bounced (1 = hard bounce)
  • regDate: date of user registration. It tells us how old the registration is
  • birthDate: birthdate of recipient

It can be difficult to build a good classifier with these initial variables, so we have to do some feature engineering to extract more information.

Feature Engineering

First, I built a bunch of auxiliary functions that help me reuse code throughout this article series. They can be found below.
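The original helper code is not reproduced here, but a minimal sketch of the kind of auxiliary functions I have in mind could look like this (the names missing_pct and count_plot are my own, hypothetical choices):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def missing_pct(df: pd.DataFrame, column: str) -> float:
    """Return the percentage of missing values in a column."""
    return df[column].isna().mean() * 100


def count_plot(df: pd.DataFrame, column: str, top_n: int = 10) -> None:
    """Plot the counts of the top_n most frequent values of a column."""
    counts = df[column].value_counts().head(top_n)
    sns.barplot(x=counts.values, y=counts.index)
    plt.title(f"Count by {column}")
    plt.show()
```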

After importing the dataset, I realized that I had some missing values in my target variable, so I decided to remove those rows.

3.32% of missing rows in flgHardBounce.
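A sketch of that step, assuming the dataset sits in a CSV file (the file name here is hypothetical):

```python
import pandas as pd

df = pd.read_csv("hard_bounce_dataset.csv")  # hypothetical file name

# Percentage of missing values in the target variable
missing = df["flgHardBounce"].isna().mean() * 100
print(f"{missing:.2f}% of missing rows in flgHardBounce.")

# Keep only the rows where the target is present
df = df.dropna(subset=["flgHardBounce"])
```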

Moving on, I had to convert some columns to guarantee the correct datatype. This piece of code creates new dataframe columns with the converted datatypes.
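The original snippet is not shown here; a sketch of those conversions, assuming the dates arrive as strings and the flag as a float (the _dt suffix is my own convention):

```python
# Parse the date columns into proper datetimes, keeping invalid values as NaT
df["regDate_dt"] = pd.to_datetime(df["regDate"], errors="coerce")
df["birthDate_dt"] = pd.to_datetime(df["birthDate"], errors="coerce")

# Cast the target flag to integer now that missing values are gone
df["flgHardBounce"] = df["flgHardBounce"].astype(int)
```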

In the code snippet below, I am creating the variable monthsSinceRegDate, which will store the number of months since the registration date.
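One possible implementation, assuming a fixed reference date (the reference used in the original code is not shown):

```python
reference_date = pd.Timestamp("2019-12-01")  # assumption: snapshot date of the dataset

# Whole months elapsed between the registration date and the reference date
df["monthsSinceRegDate"] = (
    (reference_date.year - df["regDate_dt"].dt.year) * 12
    + (reference_date.month - df["regDate_dt"].dt.month)
)
```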

The column age (below) will store the age calculated using the birthDate variable.
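A sketch using the same reference date as above:

```python
# Approximate age in years at the reference date (kept as float to preserve missing values)
df["age"] = ((reference_date - df["birthDate_dt"]).dt.days / 365.25).round()
```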

Now it’s time to extract some information from the e-mail. First, let’s extract the domain and pieces of the domain (like .com and .br).
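A sketch of the extraction, assuming the e-mail column is a plain lowercase string; the piece column names and the "top domains plus others" grouping behind emailDomain_cat are my own assumptions:

```python
# The domain is everything after the "@", e.g. "user@provider.com.br" -> "provider.com.br"
df["emailDomain"] = df["email"].str.split("@").str[1]

# Pieces of the domain, e.g. ".com" and ".br"
domain_parts = df["emailDomain"].str.split(".")
df["emailDomainPiece1"] = "." + domain_parts.str[1]
df["emailDomainPiece2"] = "." + domain_parts.str[2]  # missing when the domain has a single piece

# Group rare domains into "others" to build a categorical version of the domain
top_domains = df["emailDomain"].value_counts().head(5).index
df["emailDomain_cat"] = df["emailDomain"].where(df["emailDomain"].isin(top_domains), "others")
```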

Count by e-mail domain after transformation — Image by the author.
Count by e-mail domain piece after transformation — Image by the author.

Let's build some variables using the e-mail username (the part before the @). In the snippet after this list, I create the following variables:

  • percNumbersInEmailUser: percentage of digits in the e-mail username.
  • hasNumberInEmailUser: dummy indicating whether the username contains digits.
  • emailUserCharQty: number of characters in the e-mail username.
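A possible implementation, assuming the username is everything before the "@":

```python
email_user = df["email"].str.split("@").str[0]

# Share of characters in the username that are digits
df["percNumbersInEmailUser"] = email_user.apply(
    lambda user: sum(ch.isdigit() for ch in user) / len(user) if isinstance(user, str) and user else 0.0
)

# Dummy: does the username contain at least one digit?
df["hasNumberInEmailUser"] = (df["percNumbersInEmailUser"] > 0).astype(int)

# Number of characters in the username
df["emailUserCharQty"] = email_user.str.len()
```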

Finally, I persist the dataframe with the new variables in a CSV file so I can reuse it in the next section.
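For example (the output file name is hypothetical):

```python
# Persist the enriched dataframe for the next steps of the series
df.to_csv("hard_bounce_features.csv", index=False)
```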

Exploratory Analysis

Now that we have some new features, let's see how useful they are for our purpose, which is to classify e-mails as hard bounce or not.
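The original charts are not reproduced here, but a minimal sketch of how such comparisons could be drawn (the plot choices are my own):

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_features = ["monthsSinceRegDate", "age", "percNumbersInEmailUser", "emailUserCharQty"]

# Compare the distribution of each numeric feature between the two classes
for feature in numeric_features:
    sns.boxplot(x="flgHardBounce", y=feature, data=df)
    plt.title(f"{feature} by flgHardBounce")
    plt.show()

# Hard bounce rate per e-mail domain category
print(df.groupby("emailDomain_cat")["flgHardBounce"].mean().sort_values(ascending=False))
```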

Next, a summary of what the exploratory analysis is telling us.

  • monthsSinceRegDate: hard bounce observations have been registered for longer than non-hard-bounce observations;
  • age: the distribution seems to be the same for hard bounce and non-hard-bounce observations, with a lot of observations around 40 years old;
  • percNumbersInEmailUser: the distributions look similar, but only hard bounce observations include usernames composed entirely of digits;
  • emailDomain_cat: gmail.com has fewer hard bounces compared with the other categories; the others category holds most of the hard bounce observations;
  • emailUserCharQty: the distribution seems to be the same for hard bounce and non-hard-bounce observations.

In the next article, we will train a machine learning algorithm based on these variables. See you :)
