Improving Subject Lines: An endless quest using Machine Learning (pt. 2/2)

Published in

Moosend Engineering & Data Science

6 min read · Oct 13, 2017

The second part of this series for my loyal followers. Hi Mum. Hi Dad.

In the previous part I built a model which was able to predict the open rate of an email campaign based on the morphology of a subject line and external characteristics of a campaign.

Awesome, right? I know, Ma’.

Today I’m taking the dose of action up a notch. Yep, things are about to get real crazy.

This time, I’ll try to predict the open rate by building a model that i) considers words, ii) measures the impact of each word, and iii) measures how each word performs in the subject line.

Get coffee.

Part 2: A linguistic approach to the subject line

Successor to “Non-linguistic approach to subject lines”

Machine learning aims to understand learning principles through the computational process.

By “computational process”, I mean the representation of features on a word-less level, a purely mathematical level in order to unearth hidden patterns or optimize pattern combinations.

But how do we make sense out of this computational process?

The answer is through human theories. By the few, hand-picked fellas who see stories where there’s numbers.

Theories are what make a model effective and productive. They are like a spark that starts a chain reaction: good theories and a mathematical representation in an n-dimensional feature space unlock the model’s knowledge as soon as data flows in.

*fist bump*

Diving into the subject line

Here at Moosend, we spend a lot of time constructing theories, and exploring ideas that can make the impossible lose its prefix.

Our mission is to serve a greater purpose for the email marketing community: break the subject line code!

The goal of our latest model implementation was to break down elements and distinguish them into discrete categories.

Think of the subject line as a sequence of placeholders:

We represent every element (word or symbol) as a placeholder and we measure the impact of each element.

In this way, we measure not only the impact of a word, but also the impact of a pattern or a pattern combination in it.

This is how we constructed our model.

“The subject line is a sequence of placeholders. We can remove, replace or even find the best combination of elements in order to improve it.”

Subject Line Pre-processing

Before I started processing the text and removing elements around the subject line, I ran an entity recognition.

In doing so, I extracted specific entities from each subject line, which would help me identify patterns and pattern combinations, as well as how their relationships influence the open rate.

Some of these entities are brands, currency, percentages, emojis, duration patterns, and so on.
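To make the idea concrete, here is a minimal sketch of rule-based entity extraction using regular expressions. The post does not say which tool was used for entity recognition, so the patterns, the `ENTITY_PATTERNS` name and the `extract_entities` helper are all assumptions for illustration. Brands are left out because they would need a lookup list.

```python
import re

# Hypothetical patterns: illustrative only, not the ones used in the original model.
ENTITY_PATTERNS = {
    "currency": re.compile(r"[$€£]\s?\d+(?:[.,]\d+)?"),
    "percentage": re.compile(r"\d+(?:\.\d+)?\s?%"),
    "duration": re.compile(r"\b\d+\s?(?:hours?|days?|weeks?|months?)\b", re.IGNORECASE),
    # A rough emoji range; real emoji detection covers more code points.
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def extract_entities(subject_line):
    """Return every match of each entity pattern found in the subject line."""
    return {
        name: pattern.findall(subject_line)
        for name, pattern in ENTITY_PATTERNS.items()
    }


entities = extract_entities("Save 20% or $15 in the next 48 hours 🎉")
```

Running the extraction before any text cleaning matters, since the next step removes the very digits and symbols these patterns rely on.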

Our first step in text processing is to remove numbers and punctuation.

Then, we tokenize every subject line, remove stopwords and stem each word.

Stemming takes place in order to find the common root of words. As a case in point, the words “claims”, “claiming”, and “claimed” have the same impact in a sentence, as they are derivatives of the same root.

Our goal here is to focus on the root of a word, not on the prefix or suffix. This way, we reduce the wordlist and have a better understanding of the impact of a word in a sentence.
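The pre-processing steps above can be sketched in a few lines. This is a simplified stand-in, not the original pipeline: the stopword list is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper that only imitates what a real stemmer (for example a Porter stemmer) does. Both names are assumptions made for this example.

```python
import re

# A tiny illustrative stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for", "you", "your"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a proper stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(subject_line):
    """Lowercase, drop numbers and punctuation, tokenize, remove stopwords, stem."""
    letters_only = re.sub(r"[^a-z\s]", " ", subject_line.lower())
    tokens = letters_only.split()
    return [naive_stem(token) for token in tokens if token not in STOPWORDS]


tokens = preprocess("Claiming 50% off: the deals you claimed!")
```

Here “Claiming” and “claimed” both collapse to the same root, which is exactly the wordlist reduction described above.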

Feature Construction Area

In part 1 of this series, we went over the importance of sender performance history. We’ll be using this feature in this model implementation, too.

We must also consider the number of recipients of a campaign because it’s the denominator of the open rate.

After subject line pre-processing, we create our wordlist and score each word.

To complete the scoring process, we establish 2 measurements for each word: word score and position score.

The word score:

For every subject line s_i, we score the word w_ij, where j is the position of the word in s_i, with the sum of the open rates of the campaigns that the word is found in, divided by the number N of campaigns.

Figure 1: Relationship of Word Score and Open Rate
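One way to read the word score is as the average open rate of the campaigns a word appears in. The sketch below follows that reading and assumes N is the number of campaigns containing the word. The `word_scores` function and the `(tokens, open_rate)` input format are hypothetical, and the sample open rates are made up.

```python
from collections import defaultdict


def word_scores(campaigns):
    """Average open rate per word.

    `campaigns` is a list of (tokens, open_rate) pairs, where tokens is the
    pre-processed subject line. Assumption: N counts only the campaigns
    that contain the word.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, open_rate in campaigns:
        for word in set(tokens):  # count each word once per campaign
            totals[word] += open_rate
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Made-up sample data for illustration.
sample = [(["sale", "today"], 0.20), (["sale", "end"], 0.10)]
scores = word_scores(sample)
```

With this sample, “sale” appears in both campaigns, so its score is the mean of the two open rates.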

The position score:

The second measurement is based on the theory that the score is influenced by the number of words in the subject line and also the position of each word in it.

Essentially, the words placed at the beginning of the subject line are considered stronger words, while those at the end are considered weaker.

Thus, our measurement scores the word position as follows: for each subject line the word is found in, we divide the open rate of the subject line by the position of the word in s_i, then we sum these values and divide by the N subject lines that the word is found in:

Figure 2: Relationship of Word Position and Open Rate
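The exact formula is not reproduced in the text, so the following is only one plausible reading of the position score: each open rate is discounted by the 1-based position of the word, and the results are averaged over the subject lines containing it. The `position_scores` name, the input format, the use of the first occurrence only, and the sample data are all assumptions. The number of words in the subject line is not modelled separately here.

```python
from collections import defaultdict


def position_scores(campaigns):
    """Average of open_rate / position per word (assumed reading of the formula).

    `campaigns` is a list of (tokens, open_rate) pairs. Positions are
    1-based, and only the first occurrence of a word in a subject line counts.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, open_rate in campaigns:
        seen = set()
        for position, word in enumerate(tokens, start=1):
            if word in seen:
                continue
            seen.add(word)
            totals[word] += open_rate / position
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Made-up sample data for illustration.
sample = [(["sale", "today"], 0.20), (["big", "sale"], 0.10)]
pos_scores = position_scores(sample)
```

Under this reading, a word that tends to open the subject line keeps most of the campaign's open rate, while a word near the end gets only a fraction of it.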

Model Selection (Yet another forest)

As in the maiden post of this series, I went for an ensemble method to approach this problem. Only this time I chose a Gradient Boosting Regressor (GBR), as it performed better than the other ensemble algorithms without overfitting.

GBR is more flexible in parameter tuning than Random Forest: some core parameters that enhance my model are the number of estimators, the learning rate, the depth and the loss function.

Figure 3: Model progress through training steps

A large number of estimators can boost our accuracy, but it can also lead to overfitting. So how can we prevent this from happening? The depth parameter limits how deep each tree can grow, and therefore the number of nodes in it, which keeps each tree in the model from overfitting. The learning rate is how fast our model changes “beliefs”, that is, how fast it learns new things on top of old ones. There is a trade-off between the learning rate, the number of estimators, and the depth, so these parameters must be tuned together to find the combination that prevents overfitting.
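As a rough illustration of these four parameters, here is a minimal sketch using scikit-learn's `GradientBoostingRegressor`. The post does not name the library or the values that were used, so the library choice, the parameter values and the synthetic data below are all assumptions. The data is random noise around a simple linear signal, not campaign data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 made-up features and a noisy target.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] + 0.02 * rng.standard_normal(500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative values only; in practice these are tuned together.
model = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages (trees)
    learning_rate=0.05,  # how much each new tree corrects the previous ones
    max_depth=3,         # limits the size of each individual tree
    loss="huber",        # loss function to optimize
    random_state=0,
)
model.fit(X_train, y_train)

# Held-out error after each boosting stage, to spot where overfitting starts.
test_mse = [
    mean_squared_error(y_test, prediction)
    for prediction in model.staged_predict(X_test)
]
best_n = int(np.argmin(test_mse)) + 1
```

Tracking the held-out error stage by stage is one simple way to see the trade-off: if `best_n` lands well below `n_estimators`, the extra trees are no longer helping on unseen data.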

In Figure 3, we see how the loss of the model (Train Loss) is changing in association with the number of estimators (Iter):

Evaluation

To evaluate our results, we use the same measurements as in the previous model (R-Squared, MAE, MSE, Accuracy) and we will compare and contrast them so we can see how this method describes the whole problem:

Simply put:

This model implementation boosted model accuracy by 10%, dropping the Mean Absolute Error (MAE) by around 43% compared with the Random Forest implementation.

We can see clearly that the value of R-Squared increased, which means our independent variables (features) of this model describe the dependent variable (Open Rate) in a better way than the previous model.
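For readers who want to check such numbers themselves, the three regression metrics can be computed in plain Python as sketched below. The function names and the sample values are made up for illustration. “Accuracy” is left out, since the post does not define how it is computed for a regression model.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up open rates for illustration.
actual = [0.2, 0.4, 0.6]
predicted = [0.25, 0.35, 0.6]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

An R-Squared closer to 1 means the features explain more of the variation in the open rate, which is the sense in which the comparison above is made.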

This means that we can use this model as a baseline tool for a whole system of marketing “assistants”. These marketing “assistants” could improve a subject line not only on a pattern level, but also on a word or syntax level.

The verdict?

Succe- -Ma, please, no, don’t call aunt Gertrude. No, it’s not Nobel-worthy. But it’s getting there.