Improving Subject Lines: An endless quest using Machine Learning (pt. 1/2)

I open up about open rate prediction in email campaigns. Glam is a tough job, but someone’s got to do it.

Part 1: Non-linguistic approach to subject lines

Precursor to “Linguistic approach to subject lines”

Today’s article picks up on email open rate prediction.

Which emails do you open? What influences your decision? Is it the subject line? Is it the timing? Is it a lot more?

I’m setting out on a quest to discover what influences email open rates.

I will be sharing with you whatever this road map brings back. I’d love to hear your comments, reply to your questions, and help out. I’m the outgoing type of data scientist, yes.

*Guitar intro solo*

I am fascinated by modern marketing tools:

built on data-driven machine learning algorithms, they encompass all things Future; the potential is enormous and mostly untapped. What a great time to be alive, right?

I tend to see modern martech tools as “doors” to radically new perspectives that lead to solutions and opportunities for businesses and marketers. Ultimately, we are talking about safer, calculated decisions, even when the user has little or no experience in the field.

Amazing how data can reveal hidden patterns, right? It measures things that humans generally ignore, or that the human mind simply can’t grasp.

In the past few weeks I’ve been trying to predict the open rate of email campaigns based on the subject line of the email.

By subject line, I mean the subject line as a whole, without considering the words used or their meaning.

I work for Moosend, so I have access to millions of emails, along with insights that suggest directions for further research.

In today’s case, what I did was to look for a pattern in the subject lines of the most opened emails from our database. I completely disregarded the meaning of words.

Disregarded. Ignored.

Instead, I focused on morphology alone: I went looking for the grammar-and-syntax patterns that hold across thousands of subject lines.

So this is why I went for patterns over meaning. At first.

Consider some subject lines with the word “free”. The word “free” has shades of meaning that shift its semantic value depending on the context. Depending on the country, or on a national advertising trend, subject lines with that word could boast a higher open rate, or get flagged as “spam”, either by email clients or by the recipients themselves. Worse, the word could carry different cultural connotations in each English-speaking country, which is a rabbit hole I won’t go down here.

So I focused purely on the morphology of the subject line, namely: the number of characters, the number of words, the influence of punctuation, and other parameters such as list segmentation, total recipients, the industry the sender is in, and the day the campaign was sent.

Open rate as a variable

Open rate is a delicate continuous variable and, as such, it can be influenced by a variety of factors. Can’t think of anything beyond the subject line itself? Think again.

Every time you open your inbox you read certain emails and ignore certain others. Why did you open email A and ignore email B?

I am sure there can be a bunch of reasons why email A’s open rate outperforms email B’s. Maybe you didn’t like the subject line, or thought the sender was a spammer, or the profile picture looked fishy.

These all make sense BUT.

All the aforementioned hypotheses make this regression problem much more complex and much more unstable, because we can’t measure them. In practice this means that, in all probability, we cannot achieve a high-accuracy model.

So we move on to Feature Construction.

The features of our model are the single most important group of variables influencing its results.

Constructing a good feature space is a prerequisite for higher accuracy; better features mean more flexibility and better results, even when the overall approach to the problem is not excellent.

External characteristics of an email campaign (sender industry, day the email was sent, total recipients, segmented vs. non-segmented list) are mostly predefined and don’t require any special transformation, but the subject line has to be transformed into multiple features before we can use it as input for our model.

In order to analyze the morphology of a subject line, we must create a function that takes the subject line and extracts specific features, such as the ones mentioned earlier. This becomes much easier to implement using two core libraries for data manipulation: NumPy and Pandas.
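As a rough illustration, here is a minimal sketch of what such a function might look like. The exact feature set isn’t published, so the fields below are my assumptions based on the parameters listed earlier, and the emoji check is a crude heuristic.

```python
import pandas as pd

def subject_line_features(subject: str) -> dict:
    """Extract morphology-only features from a subject line (illustrative set)."""
    return {
        "char_count": len(subject),
        "word_count": len(subject.split()),
        "exclamation_marks": subject.count("!"),
        "question_marks": subject.count("?"),
        "has_number": any(ch.isdigit() for ch in subject),
        # Crude emoji heuristic: any character beyond the Basic Multilingual Plane.
        "has_emoji": any(ord(ch) > 0xFFFF for ch in subject),
    }

# Turn a column of subject lines into a feature matrix.
campaigns = pd.DataFrame({"subject": ["Get 5% off", "Last chance! 🎉"]})
features = pd.DataFrame([subject_line_features(s) for s in campaigns["subject"]])
print(features)
```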

Figure 1: The effect of emojis on the open rate of email campaigns
Figure 2: The distribution of open rate across days
Figure 3: How the number of words in a subject line affects the open rate
Figure 4: The effect of exclamation marks on the open rate of email campaigns

Sender integrity adds value.

We cannot overstate the importance and value of the email sender. When we receive an email in our inbox, the sender and the subject line are the only clues we have.

To measure the sender, I could either assign a unique ID as a categorical value, or measure their performance history. The latter, as a continuous value, can work as a normalization term for our model. Every sender has established a “public image” with their subscribers, so their open rate rarely changes unexpectedly (good segmentation aside), though it can drift over time.

So my model takes the sender’s performance history as a baseline for the final prediction and measures the distance above or below that baseline.
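Here is one way such a baseline could be computed with Pandas; the sender IDs and open rates below are made up, and in production you would only average campaigns sent before the one being predicted, to avoid leaking the target.

```python
import pandas as pd

# Hypothetical campaign history: one row per sent campaign.
history = pd.DataFrame({
    "sender_id": [1, 1, 1, 2, 2],
    "open_rate": [0.21, 0.25, 0.23, 0.40, 0.38],
})

# Each sender's mean historical open rate becomes the baseline feature;
# the model then effectively learns the deviation from that baseline.
history["sender_baseline"] = history.groupby("sender_id")["open_rate"].transform("mean")
history["deviation"] = history["open_rate"] - history["sender_baseline"]
print(history)
```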

Normalization

Data preprocessing is crucial: the success of the model itself lies in the quality of the preprocessing. Which technique to use depends on the method you apply and on what you want to accomplish. It’s what makes things work, even if it makes them a tad more complicated.

In our feature space, scaling varies: we have binary, multiclass, and continuous variables, so we choose a normalization technique for the preprocessing step.

Normalization is a simple mapping (rescaling) of the data to the range between 0 and 1 using the min-max formula: x' = (x - min(x)) / (max(x) - min(x)).
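In scikit-learn, MinMaxScaler implements exactly this rescaling; a tiny example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# e.g. one column of subject-line character counts (made-up values)
X = np.array([[12.0], [45.0], [30.0]])

# MinMaxScaler applies x' = (x - min(x)) / (max(x) - min(x)) per column.
print(MinMaxScaler().fit_transform(X))  # all values now lie in [0, 1]
```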

Model Selection

To boost accuracy, the model is based on the ensemble method. Ensemble learning is generally used to average the predictions of different models to get a better prediction on a complex problem; it is like combining the predictions of small “expert” models, each covering a different part of the input space. So we chose the Random Forest Regressor as the tool for the job: Forest outperformed all the other ensemble algorithms. It’s in its nature. (I didn’t say I’m a funny guy.)

The next step is hyperparameter tuning, which optimizes the model’s performance for the problem at hand. Python makes this easy with the scikit-learn package: GridSearchCV finds the best parameter combination.
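A sketch of what that tuning step could look like, with a synthetic dataset standing in for the real feature matrix and an illustrative parameter grid (the grid and values actually tuned are not published):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and open rates.
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=42)

# Illustrative grid; in practice you would tune around your own defaults.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # optimize for MAE, one of our metrics
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```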

Evaluation

To evaluate our model I used four measurements: R-squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and accuracy. In regression, R-squared, MSE, and MAE normally give us enough information about a model. But in our case it’s individual campaigns we are talking about, so, to measure the accuracy of the regressor, I counted a prediction as correct when the actual open rate fell within a +/-1% range of the predicted value.
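For illustration, those four measurements could be computed like this, with made-up open rates; the +/-1% threshold check is the only non-standard metric:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.22, 0.31, 0.18, 0.40])  # actual open rates (made up)
y_pred = np.array([0.21, 0.33, 0.19, 0.35])  # model predictions (made up)

print("R^2:", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))

# Threshold accuracy: a campaign counts as predicted correctly when the
# actual open rate lands within +/-1 percentage point of the prediction.
print("Threshold accuracy:", (np.abs(y_true - y_pred) <= 0.01).mean())
```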

Conclusion

Even a simple model like this demonstrates the power of machine learning, since we can predict 7 out of 10 campaigns correctly within our threshold. From the results above, we see clearly the importance of sender performance history, which boosted our accuracy by 17% and decreased the MAE by 0.90.

What this means

This means the model is now considerably more accurate, but there is more to be done.

The main shortcoming right now is that the morphology model I came up with considers form and placement in the sentence, not meaning. Essentially, it can make suggestions about adding emojis or personalization, and estimate the effect they will have on the impact of the subject line, but it cannot guide you to the best alternative for a word.

Let’s consider this example: “Get 5% off”.

  • Personalization
  • Emojis
  • An amount in dollars instead of a percentage
  • Various equivalents to “get”: “claim”, “discover”, “here’s”, and so on

…are a few of the changes that can be incorporated to test the impact on our subject line, as sketched below.
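Continuing the earlier sketches, scoring such variants could look like this; the variant texts are hypothetical, and the prediction lines stay commented out because they need the trained model and feature function from the snippets above.

```python
variants = [
    "Get 5% off",
    "John, get 5% off",  # personalization (hypothetical merge tag)
    "Get 5% off 🎁",     # emoji
    "Get $10 off",       # dollar amount instead of a percentage
    "Claim 5% off",      # alternative verb
]

# Reusing subject_line_features() and the tuned `search` from earlier:
# features = pd.DataFrame([subject_line_features(s) for s in variants])
# scores = search.best_estimator_.predict(features)
# for subject, score in sorted(zip(variants, scores), key=lambda t: -t[1]):
#     print(f"{score:.3f}  {subject}")
```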

I will be coming back with more posts in this series.

What are your thoughts?

What would you like me to work on in a future post? Leave your comments below and I will make sure to tag you on the next post as an Inspiration Contributor!

In the meantime, feature engineering, preprocessing, and parameter tuning, all essential to machine learning, won’t cease to amaze us any time soon. Data, combined with the process above, keeps us intrigued and empowers us to unveil solutions to virtually any data-related challenge.
