How Der Spiegel Uses Machine Learning To Identify Its Most Valuable Potential Subscribers

Alex Held
10 min read · Nov 3, 2022


If we consider the ten percent most active visitors on spiegel.de, less than a fifth of these users have subscribed, leaving us with more than one million highly active users without a subscription. At the same time, acquiring new users is expensive and a core challenge for most online news websites. Reducing customer acquisition costs by even a small amount can result in a large increase in profit. This can be done by better identifying potential customers, showing better-targeted ads, giving discounts at the right time, and so on, all of which are suitable tasks for ML.¹

Photo by h heyerlein on Unsplash

Findings from several audience research projects at Der Spiegel suggest that the decision for a news subscription is not a sudden, instantaneous action, but an informed one based on the reader's experience with our outlet.² This applies at least to visitors who outlast our free trial phase and are not solely interested in the content of a specific paywalled article.

We therefore propose engagement features and show that they are suitable for building a subscription prediction model. We developed a machine learning approach that predicts lasting subscriptions and enables us to segment readers according to their affinity to subscribe. To outline the ML system we have built, this text is organized into (1) Data & Features, (2) Model, (3) Evaluation, (4) ML Operations and (5) Next steps.

Data & Features

As we plan to train a machine learning model to make predictions, we need a data source to learn from. Based on the Adobe Data Feed, the raw website tracking data from spiegel.de, we derive usage features at device level. We currently distinguish four classes of input features (a minimal aggregation sketch follows the list):

  • Engagement: web_visit, app_visit, avg_time_spent_article, avg_time_spent_article_across_sessions, mean_articles_read, mean_visit_duration, number_articles_read, paywall_loads, total_visit_duration, visit_subscription_page, number_of_visits
  • Location: federal states
  • Referrer: aggregated URLs
  • Editorial: department, type (text, audio, video) and format (op-ed, investigative, …)
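To make the feature derivation more concrete, here is a minimal PySpark sketch of how engagement features of this kind could be aggregated at device level. It is not our production code; the input path and column names such as device_id, visit_id, is_article_view and time_spent_article are hypothetical stand-ins for fields in the Adobe Data Feed.

```python
# Illustrative sketch only: aggregate raw tracking hits into per-device
# engagement features. Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("engagement-features").getOrCreate()

# Raw Adobe Data Feed hits, one row per tracking call (hypothetical path).
hits = spark.read.parquet("path/to/adobe_data_feed/")

engagement = (
    hits.groupBy("device_id")
    .agg(
        F.countDistinct("visit_id").alias("number_of_visits"),
        F.sum(F.col("is_article_view").cast("int")).alias("number_articles_read"),
        F.avg("time_spent_article").alias("avg_time_spent_article"),
        F.sum("visit_duration").alias("total_visit_duration"),
        F.sum(F.col("is_paywall_load").cast("int")).alias("paywall_loads"),
    )
)

engagement.write.mode("overwrite").parquet("path/to/features/engagement/")
```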

A long backlog of further potential features already exists, but the first step of development was primarily about the implementation of an end-to-end ML pipeline. To paraphrase the fourth rule of the Google ML Guide: Keep the first model simple and get the infrastructure right. The first model provides the biggest boost to your product, so it doesn’t need to be fancy. But you will run into many more infrastructure issues than you expect.³

Anyhow, building suitable features is only one part of feature engineering. With customer journey data involved, we also set sliding windows within our user journeys. For training data, we set lookback and lookahead windows, whereas data for inference only has the former. Accordingly, the features mentioned above are calculated over the last thirty days per user. For training data records, there is information as to whether a user has taken out a subscription, accompanied by the maximum number of days the subscription was used after conversion. For this purpose, a maximum of forty days from purchase is looked ahead. Together with our lookback window, this gives a total sliding window of seventy days for training. Following this approach, the model is flexible: a retention threshold can be set in such a way that training only considers subscriptions as a target value if users still used their subscription X days after the actual order. This makes it possible to train the model on predicting long-term subscriptions only, where readers stay with us beyond the trial phase. To be precise, we currently train the model on purchases where readers still used their subscription at least thirty-five days after conversion. Journeys that do not meet these criteria are removed from training.
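As a rough illustration of this labeling logic (not our actual implementation), the following pandas sketch keeps a conversion as a positive target only if the subscription was still in use at least thirty-five days after purchase; the column names are hypothetical.

```python
# Illustrative labeling sketch; column names are hypothetical.
import pandas as pd

RETENTION_THRESHOLD_DAYS = 35  # subscription must still be in use X days after purchase

def build_training_labels(journeys: pd.DataFrame) -> pd.DataFrame:
    converted = journeys["converted"] == 1
    retained = journeys["days_used_after_conversion"] >= RETENTION_THRESHOLD_DAYS

    labeled = journeys.copy()
    labeled["target"] = (converted & retained).astype(int)

    # Conversions that did not outlast the retention threshold are dropped
    # entirely instead of being treated as negative examples.
    return labeled[~(converted & ~retained)]
```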

Apart from that, further journeys are discarded on a large scale before the model is trained, as this is necessary to achieve an appropriate predictive quality for the purpose the model was designed for: predicting lasting subscriptions and identifying potential loyal customers. The main reason for removing users from training is that many journeys consist of very short and/or few interactions, including journeys that led to a subscription. The absence of any obligation to log in without a subscription and the low data quality at cookie/device level cause these data quality problems. As a result, at least seventy percent of user journeys that led to a conversion are not included in model training because they do not bring enough quality with them. An even larger number of user journeys that did not lead to a subscription are removed from the training dataset for the same reason. The circumstances described above are one of the biggest challenges of this ML task.
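A simplified sketch of such a quality filter might look as follows; the thresholds and column names are made up for illustration and are not our actual values.

```python
# Hypothetical quality filter: drop journeys with too little activity before
# training. Thresholds and column names are illustrative only.
import pandas as pd

MIN_ARTICLES_READ = 3
MIN_VISITS = 2

def filter_sparse_journeys(journeys: pd.DataFrame) -> pd.DataFrame:
    mask = (
        (journeys["number_articles_read"] >= MIN_ARTICLES_READ)
        & (journeys["number_of_visits"] >= MIN_VISITS)
    )
    return journeys[mask]
```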

Training our model on only a subset of positive cases naturally has an impact on inference, because the same short and incomplete journeys are also present in the classification dataset. This is important to keep in mind if the model is validated on unseen data without subsetting it in the same way as the training data: we end up with what is known as training-serving skew, where the unseen data comes from a different distribution, so the model does not generalize well. For this reason, our model does not perform well in predicting the total number of daily subscriptions, which was expected and is not its goal. On the other hand, it becomes accurate in predicting explicitly lasting subscriptions and, even more importantly, in identifying potential customers.

Finally, it is important to note that the ML task at hand is a highly imbalanced binary classification problem: we have far more users who did not subscribe than journeys that end in a subscription. Therefore, we downsample negative cases, which once more removes a large number of journeys from the training data. We currently see the best performance when we balance the training data one to nine, i.e. for every journey with a subscription we keep nine journeys without one.
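A minimal sketch of this one-to-nine downsampling, assuming a labeled journeys DataFrame as above (again illustrative, not our production code):

```python
# Illustrative 1:9 downsampling of negative cases.
import pandas as pd

def downsample_negatives(journeys: pd.DataFrame, ratio: int = 9, seed: int = 42) -> pd.DataFrame:
    positives = journeys[journeys["target"] == 1]
    negatives = journeys[journeys["target"] == 0]
    n_negatives = min(len(negatives), len(positives) * ratio)
    sampled = negatives.sample(n=n_negatives, random_state=seed)
    # Shuffle so positive and negative examples are interleaved.
    return pd.concat([positives, sampled]).sample(frac=1.0, random_state=seed)
```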

Model

Do we need ML at all to better identify potential customers and predict lasting subscriptions? ML solutions are only useful when there are complex patterns to learn, so the question is the same as for every other ML system: Is the pattern complex and very challenging to specify manually? If so, instead of telling our system how to calculate the affinity to subscribe from a list of characteristics, we let our ML system figure out the pattern itself.⁴ Let's paraphrase Google's ML Guide once more:

A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain (see Rule #16).

For our approach, a random forest model is therefore trained on the features described earlier. The model is then used to give every user a model score between 0 and 100: the higher the value, the higher the user's affinity to subscribe. The training method used is sklearn.ensemble.RandomForestClassifier, which runs with a training time of less than three hours, most of which is taken up by cross-validation and feature selection. We went through a model experimentation phase with the following findings, which led us to discard several alternative approaches due to insufficient performance uplifts (a minimal training sketch follows the list):

  • Random Forest outperforms XGBoost
  • SMOTE with random undersampling of the majority class did not create any uplift for the model
  • RFECV (recursive feature elimination with cross-validation) outperforms other feature selection methods like SelectKBest() or SelectFromModel()
  • The basic RandomForestClassifier() outperforms RandomForestClassifier(class_weight='balanced_subsample') and BalancedRandomForestClassifier()
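The following minimal training sketch goes along these lines, using synthetic data as a stand-in for our engagement features; it is illustrative only and omits our actual hyperparameters and pipeline.

```python
# Minimal sketch: plain RandomForestClassifier with cross-validation on an
# imbalanced binary target. Synthetic data stands in for our engagement features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(
    n_samples=10_000, n_features=11, n_informative=6,
    weights=[0.9, 0.1],  # roughly the 1:9 balance described above
    random_state=42,
)

clf = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
cv_results = cross_validate(clf, X, y, cv=5,
                            scoring=["precision", "recall", "f1", "accuracy"])

clf.fit(X, y)
# Model scores between 0 and 100: the predicted probability of a lasting subscription.
model_scores = (clf.predict_proba(X)[:, 1] * 100).round()
```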

Evaluation

We applied recursive feature elimination in a cross-validation loop to find the optimal number of features and to estimate the importance of individual features. As a result, from the feature list described earlier, only features from the engagement category were considered relevant during this process. None of the location, referrer or editorial features seem to have much predictive power.
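For illustration, a sketch of RFECV-based feature selection of the kind described above; the data and parameters are synthetic and illustrative, not our production setup.

```python
# Sketch of recursive feature elimination with cross-validation (RFECV).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=5_000, n_features=25, n_informative=6, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1),
    step=1,          # remove one feature per iteration
    cv=5,
    scoring="f1",
)
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
```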

With these features selected, the current model in production performed as follows during cross-validation on the training data: precision 0.71, recall 0.41, F1 score 0.54 and accuracy 0.93.

Looking at the confusion matrix, the model can clearly identify most journeys that did not lead to a subscription. Identifying around forty percent of actual conversions (recall) also seems like an acceptable outcome. Nevertheless, it is important to note that we removed journeys with positive labels but few interactions before training, as described in the Data & Features section, so our recall is only assessable under these circumstances. Furthermore, a precision of 71% also seems satisfactory, especially if we consider how useful false positives can be: these users behave like potential subscribers, so they would be highly interesting to target with specific campaigns.

Apart from using a threshold to turn the predicted model scores into a binary classifier, where only journeys with a score > 0.5 are treated as successful subscriptions (as we did for the confusion matrix above), we use the model scores as actual probabilities of how likely it is that a user will subscribe. We found the classifier to be well calibrated by putting the predicted model scores into buckets and comparing them with the proportion of predictions in each bucket that were actual subscriptions. This proportion is higher for higher buckets, so the observed probability is roughly in line with the average prediction for each bucket.⁵ It is important to note that buckets with higher model scores contain significantly fewer users.
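This bucket-based check is essentially what sklearn's calibration_curve computes; a small sketch with synthetic scores:

```python
# Sketch of the bucket-based calibration check with synthetic data:
# group predicted scores into bins and compare each bin's mean prediction
# with the observed subscription rate in that bin.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=10_000)   # stand-in for predicted model scores in [0, 1]
y_true = rng.binomial(1, y_prob)          # outcomes drawn to match the scores

fraction_positive, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, frac in zip(mean_predicted, fraction_positive):
    print(f"bucket mean prediction {pred:.2f} -> observed subscription rate {frac:.2f}")
```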

ML Operations

Our model is retrained on a monthly cadence, with all historical training data being used for retraining. It therefore becomes more accurate over time, because it takes more input data into account over a longer period. Each trained model is versioned, and the training dataset and feature selection are saved with the model. In addition, an evaluation document is created automatically for each model, containing cross- and hold-out validation with the most important key figures such as precision and recall.

We do not use tools like DVC or MLflow for versioning or experiment tracking; instead, the whole project is wrapped up in a Django app. This enables us to use management commands to control and serve the major daily workflows, such as data pre-processing and feature engineering, building the classification dataset and serving classifications via the model, uploading model scores to Adobe, or running monitoring jobs. On a monthly cadence, we also trigger workflows to build the training dataset and train a new model. Furthermore, Django's ability to initialize modules helps us to collect logs and monitoring data at different levels, from raw tracking data to derived features and predictions.
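As an illustration, a management command for one of these workflows could look roughly like this; the module scoring.pipeline and its functions are hypothetical placeholders, not our actual code.

```python
# Hypothetical example of a Django management command serving daily classifications.
# File: scoring/management/commands/run_daily_scoring.py (illustrative path)
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Builds the daily classification dataset and writes fresh model scores."

    def handle(self, *args, **options):
        # Placeholder imports: these pipeline modules stand in for our workflows.
        from scoring.pipeline import build_classification_dataset, score_users

        dataset = build_classification_dataset()
        scores = score_users(dataset)
        self.stdout.write(self.style.SUCCESS(f"Scored {len(scores)} devices"))
```

Such a command would then be scheduled like any other job, e.g. python manage.py run_daily_scoring.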

The model currently in production serves predictions in batch mode: all users who visited our website yesterday get new or updated scores. The scope of this project included making the resulting model scores available in Adobe Analytics on a daily basis for accessible analysis by various stakeholders. From Analytics, we then send segments of users with high scores to Adobe Target every day, so that we can personalize our product based on the model scores. For this purpose, we use the Adobe Data Sources API to attach scores at user-session level.
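A simplified sketch of such a daily batch-scoring job, with hypothetical file and column names and without the actual Adobe upload step:

```python
# Illustrative daily batch scoring: load the persisted model, score yesterday's
# visitors and write a file that can then be uploaded to Adobe Analytics.
# Paths and column names are hypothetical.
from datetime import date, timedelta

import joblib
import pandas as pd

model = joblib.load("models/random_forest_latest.joblib")
features = pd.read_parquet("features/classification_dataset.parquet")

feature_columns = [c for c in features.columns if c != "device_id"]
scores = pd.DataFrame({
    "device_id": features["device_id"],
    "model_score": (model.predict_proba(features[feature_columns])[:, 1] * 100).round(),
})

yesterday = date.today() - timedelta(days=1)
scores.to_csv(f"exports/model_scores_{yesterday:%Y-%m-%d}.csv", index=False)
```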

Finally, all components of the ML system run in Azure. Raw data from the Adobe Data Feed is stored as binary Parquet files in Azure Blob Storage, and data processing tasks are written in PySpark to make use of distributed batch processing. As compute infrastructure we rely on a mid-sized virtual machine, and for code repositories and continuous integration we use Azure DevOps. Development is done in VS Code, as it integrates smoothly with our Azure ecosystem.

Next steps

  • Data distribution shift: News changes, and so do user tastes. What's trendy today might be old news tomorrow.⁶ We have already built features to capture editorial department usage, but they do not show much predictive power yet. In building these features, we might need to account for fast and strong shifts in the training data, which also means rethinking appropriate timings for retraining and feature engineering.
  • Return on investment: Anecdotal reports claim that 90% of ML models don't make it to production; others claim that 85% of ML projects fail to deliver value.⁷ With our model now in production, we are just starting to search for suitable approaches and campaigns to evaluate whether the model was worth the investment.
  • Targeted surveys: Using our model scores to target visitors with specific surveys to find out which motives and attitudes separate them can be very insightful. This is especially relevant if you think about the fraction of variance that cannot be explained by our features on usage behavior, but by more qualitative factors such as the perceived price or added value of the subscription. In another Medium post I wrote about how we combine user research and data science methods at Der Spiegel.
  • Precision-Recall Trade-Off: We see potential in adjusting the probability threshold in favor of precision or recall. In particular, a higher number of false positives could be an interesting segment to work with, as these users naturally behave like visitors who usually subscribe with us.
  • Switch to continuous predictions: Instead of a binary classifier, we can reframe our overall ML task as a regression problem, where we predict the number of days a user is going to keep his or her subscription.
  • Compare ML and RFV models: Only engagement features were selected during the recursive feature elimination process, and none of the location, referrer or editorial features seem to have much predictive power. This led us to the question of what the differences are between bespoke engagement scorings, such as the RFV model at the Financial Times, and machine learning based scores.

Thanks for reading!

I hope you liked it. If so, give it a clap and follow me on Twitter.

References

[1], [4], [6] Designing Machine Learning Systems (by Chip Huyen)

[2] Time-Aware Subscription Prediction Model for User Acquisition in Digital News Media (by H. Davoudi, M. Zihayat & A. An)

[3] Rules of Machine Learning: Best Practices for ML Engineering (by Google)

[5] Are Model Predictions Probabilities? (by Google)

[7] Operationalizing Machine Learning: An Interview Study (by S. Shankar, R. Garcia, J. M. Hellerstein & A. G. Parameswaran)
