How we built a cutting-edge NLP model to generate SEO titles

Hanna Behnke · Published in Axel Springer Tech · 13 min read · Sep 7, 2020

Written by Hanna Behnke and Sarah Lueck

At the end of last year, an idea started to form at our data department. Working for Axel Springer’s digital publishing unit, we’re always looking for ways to leverage data to increase the reach of our national media products and support the editorial teams in their daily work. This time, we cast an eye on a thrilling natural language processing problem: What if we could automatically generate search-optimised titles for news articles? Unlike writing a headline, composing an SEO title is more of a time-consuming routine task for the editors. So why not accelerate and support the process of writing them? We quickly discovered that this wouldn’t be a trivial task: SEO titles need to be grammatically correct, informative, and attention-grabbing, and they have to generate search traffic without being clickbait or “fake news”. On top of that, title generation is a rather complex NLP task, and projects as well as papers on the topic are rare, especially for the German language.

Despite the uncertainty, we decided to give it a try. We teamed up with a group of NLP-experienced data scientists from AWS Professional Services to jointly tackle the challenge. Spoiler alert: We’re happy to say that we managed to build an MVP that produces state-of-the-art results and is currently being tested in the WELT newsroom. But, as always, the most important lessons are drawn from the process, not the outcome. That’s why we wrote this blog post to share our journey with all its obstacles and insights. Mirroring the structure of our project, the article is divided into four parts:

  1. The Groundwork — What problem did we tackle? How did we approach the task? What kind of data did we use?
  2. The Development Process — What kind of machine learning model did we build? How did we evaluate its performance?
  3. The Road to Rollout — How did we take the model into production?
  4. Next Steps — How could we improve the model or use it for related tasks?

The Groundwork

In terms of visible progress, machine learning projects usually have a bit of a slow start. First of all, it is crucial to understand the business problem and what data is needed to solve it. Once that’s done, it takes some time to gather, explore, and clean the data before it even makes sense to train a first model.

The Business Problem

So, why did we choose to generate SEO titles? The SEO title, i.e. the title that is displayed in the Google search results, is one of the factors that influence how high an article ranks. The higher the position in the result list, the higher the click-through rate (CTR). If we found a way to help the editors produce good SEO titles fast, SEO title writing would no longer delay the time to publication. Additionally, the article’s chances for a good Google ranking would rise, leading to more search traffic on the newspaper’s website.

Google search results

Unlike the articles’ headlines, which originate from pure journalistic creativity, SEO titles follow a much stricter pattern. Most importantly, they should be descriptive and intriguing, and contain keywords that are specific to the context as well as frequently searched for. In addition, the titles should not exceed the length that Google displays in the organic search results.

Approaching the Problem

After a couple of meetings with editors and SEO experts, we had a good understanding of what SEO title generation is all about. Based on what we had learned, we decided to split the problem into two separate strands:

1. Generate a title — basically a very short summary of the article text.

This is essentially an extreme form of text summarisation. In recent years, summarisation problems have usually been tackled with neural networks that follow an encoder-decoder architecture. The encoder processes the input sequence, in our case the article text. Broadly speaking, it extracts information on the words and their relation to one another to capture the context. This information is then passed to the decoder, which generates the output sequence, in our case the title. We’ll illustrate our specific model architecture in the next chapter. If you’d like to learn more about encoder-decoders in general, here’s a good article that explains the key concepts.

Encoder-decoder visualisation
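
To make the pattern a little more concrete, here is a deliberately minimal PyTorch sketch of an encoder-decoder. This is a toy GRU model, not our actual architecture (which we describe below); it only illustrates how the encoder’s context conditions the decoder:

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Toy GRU encoder-decoder, for illustration only."""

    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, article_ids: torch.Tensor, title_ids: torch.Tensor):
        # The encoder reads the article and compresses its context
        # into a hidden state ...
        _, context = self.encoder(self.embed(article_ids))
        # ... which conditions the decoder while it generates the title
        dec_out, _ = self.decoder(self.embed(title_ids), context)
        return self.out(dec_out)  # per-step logits over the vocabulary
```

During training, the logits are compared to the next title token (teacher forcing); at inference time, the decoder generates the title one token at a time.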

To train an encoder-decoder model, it was clear that we would need a large amount of article data. Therefore, we extracted 500k historic WELT news articles to familiarise the model with the structure of the texts and titles. The biggest challenge, though, was to derive a set of articles we could be sure had high-quality SEO titles. As several factors influence the Google ranking of an article, the ranking itself was an unreliable indicator of SEO title quality. Our best option was therefore to use articles whose titles we knew had been modified and approved by our SEO expert team. To avoid expensive and time-consuming hand-labelling, we scraped the change-logs of the content management system. This way, we ended up with roughly 17k SEO-approved titles, which we used for training and testing.
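
For illustration, here is a sketch of how such a change-log can be turned into training labels. The export format and all field names are made up; the idea is simply to keep the last SEO-team revision of the title field per article:

```python
import pandas as pd

# Hypothetical CMS change-log export: one row per field revision
log = pd.read_json("cms_changelog.json")

# Keep only revisions of the SEO title made by the SEO expert team
seo_edits = log[(log["field"] == "seo_title") & (log["team"] == "SEO")]

# The last approved value per article becomes a training label
labels = (
    seo_edits.sort_values("edited_at")
    .groupby("article_id")["new_value"]
    .last()
)
print(f"{len(labels)} SEO-approved titles for training and testing")
```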

2. Identify keywords that are important and frequently searched for on Google.

When composing the SEO title, the SEO experts usually come up with context-specific keywords and then pick those with the highest expected search volume. To mimic this workflow, we took a two-step approach: First, we extracted the article’s keywords and their context-specific relevance scores by applying named entity recognition (NER). Next, we used the PyTrends API to retrieve the expected search volume of the keywords. We then trained an XGBoost ranking algorithm on the keyword characteristics (relevance, search volume, etc.) as well as our 17k high-quality SEO titles. In this way, the model learned which factors had the biggest influence on the likelihood of a keyword being part of the SEO title. After quite a bit of tweaking, we had a decent model that could rank the keywords — a great foundation for search-optimising the titles.
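
The following sketch shows the shape of that two-step pipeline. It is not our production code: spaCy’s small German model stands in for the NER component, the Trends lookup is reduced to a single average interest score, and the feature set is heavily simplified:

```python
import spacy
import xgboost as xgb
from pytrends.request import TrendReq

nlp = spacy.load("de_core_news_sm")   # German NER model
trends = TrendReq(hl="de-DE", tz=60)  # Google Trends client

def keyword_candidates(article_text: str) -> list[dict]:
    """Extract entities and attach a search-volume proxy."""
    doc = nlp(article_text)
    candidates = []
    for keyword in set(ent.text for ent in doc.ents):
        trends.build_payload([keyword], timeframe="today 3-m", geo="DE")
        interest = trends.interest_over_time()
        volume = float(interest[keyword].mean()) if not interest.empty else 0.0
        candidates.append({"keyword": keyword, "search_volume": volume})
    return candidates

# X holds one feature row per keyword, y marks whether the keyword
# appeared in the SEO-approved title, and `group` gives the number of
# candidates per article so the ranker compares within one article.
ranker = xgb.XGBRanker(objective="rank:pairwise")
# ranker.fit(X, y, group=group)
```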

Last but not least, we defined our goals and success metrics. If 80% of the generated titles were useful recommendations, we would move on to a proper user test. A few weeks into the test run, a comparison between articles with generated SEO titles and a control group would show whether the tool indeed had a positive effect on the search traffic.

The Development Process

Once we had the basics sorted out, we moved on to generating SEO titles. This is the phase where progress is most apparent: within a couple of weeks, we went from very mediocre results to pretty decent SEO title predictions. In this part, we’ll explain the model’s architecture and how we evaluated its performance.

Transfer Learning — How to NOT start from scratch

During the past few years, numerous pre-trained language models have been published for all sorts of NLP-related tasks. These models already have a basic understanding of human language and can be fine-tuned for specific use cases. This is similar to how we humans learn. For example, imagine a medical student: she already knows how to speak English, but has yet to learn the subject-specific terminology to make sense of her professor’s explanations. After some time, she’ll be able to understand and use all those new terms. The same goes for language models: without some extra training, they won’t deliver the expected results.

One of the most versatile and famous pre-trained language models is BERT (Bidirectional Encoder Representations from Transformers). BERT is an encoder that captures the meaning of words by examining the context in which they appear. However, BERT is not particularly good at spotting the relationships between multiple sentences, which makes its performance on summarisation tasks mediocre. Thankfully, a team of researchers decided to tackle that issue and modified the architecture of BERT to allow it to distinguish between sentences. The team combined their BERT-based encoder with a decoder trained specifically for summarisation tasks, titled the model BertSum, and released it on GitHub. This was a great starting point, but we still had two challenges ahead of us: Firstly, BertSum is only available for the English language. Secondly, it is made for summarisation, not title generation.

Teaching BertSum to output Short German Titles

While we stuck to the BertSum architecture as such, we modified parts of it to fit our needs. First of all, the encoder, which is the part of the model that captures the meaning of the article, had to “learn German”. So we exchanged the English BERT model for a pre-trained German version of BERT. The decoder, on the other hand, is a 6-layer transformer that we trained from scratch on a data set of approx. 500k article-title pairs. This was necessary to teach the model to generate titles instead of longer summaries. To achieve an appropriate title length, we added a length penalty to the decoder. The penalty ensures that the predicted title is neither too short nor too long.
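
To give an impression of the two modifications, here is a sketch. Loading a German BERT checkpoint is shown via the Hugging Face transformers library (the original BertSum code wires its encoder in differently), and the penalty follows the common GNMT-style formulation; the alpha value is illustrative:

```python
from transformers import BertModel, BertTokenizerFast

# Swap the English encoder for a pre-trained German BERT checkpoint
tokenizer = BertTokenizerFast.from_pretrained("bert-base-german-cased")
encoder = BertModel.from_pretrained("bert-base-german-cased")

def length_penalty(length: int, alpha: float = 0.9) -> float:
    """GNMT-style length penalty for beam search.

    Dividing a beam's summed log-probabilities by this term makes
    hypotheses of different lengths comparable and, depending on
    alpha, discourages overly short (or long) titles."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)
```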

Judging the SEO-potential of a Title

Apart from adapting the BertSum model, we had to find a way to turn our title predictions into proper SEO titles. Thanks to the keyword ranking algorithm we had already trained, we knew which keywords were particularly good candidates for the SEO title. The tricky part was to integrate the keyword ranking information into the title generation model. To do so, we used a technique called “Beam Search”. When the decoder produces a title, it predicts one token (word) after another. At each step, there are several possible options that result in slightly different titles. Beam Search allows you to navigate these options and influence the outcome at each step. Based on the keyword ranking, we tweaked Beam Search so that it favours the options which contain the best keywords. The influence of the ranking decreases with each step to ensure that the most important keywords come first in the title.
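
Conceptually, the tweak boils down to adding a bonus to a beam candidate’s score whenever it emits a highly ranked keyword, with the bonus decaying over decoding steps. A minimal sketch (all names and hyperparameters are illustrative, not our actual values):

```python
def rescore(log_prob: float, token: str, step: int,
            keyword_score: dict[str, float],
            beta: float = 2.0, decay: float = 0.7) -> float:
    """Boost beam candidates that emit highly ranked keywords.

    log_prob      - the model's log-probability for this candidate
    keyword_score - token -> score from the XGBoost keyword ranker
    beta, decay   - bonus strength and per-step fade-out
    """
    bonus = beta * keyword_score.get(token, 0.0) * (decay ** step)
    return log_prob + bonus
```

At step 0, a top keyword receives the full bonus; a few steps later it has mostly faded, which is what pushes the most important keywords to the front of the title.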

Evaluation

When you start training a model, you want to check how it performs. This is necessary to track the progress towards your goals and to spot flaws in the model or training data. A common metric for assessing model performance when generating a sequence of text is the so-called ROUGE score. Simply put, it measures the overlap of words between the generated title and the original SEO title. Title predictions, however, can be of good quality even if they contain different words than the original title. Therefore, we developed a new metric — we named it SentenceSim — which measures the likeness of titles independent of the exact wording. Using a word embedding, we calculated the average similarity of related word pairs between the generated title and the reference. This allowed us to quantify whether the original and generated titles contained words with a similar meaning.
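
One plausible reading of that pairing scheme, sketched below with spaCy’s German word vectors, matches every word in one title with its most similar counterpart in the other and averages those similarities:

```python
import spacy

nlp = spacy.load("de_core_news_md")  # medium model ships word vectors

def sentence_sim(generated: str, reference: str) -> float:
    """Average best-match word similarity between two titles."""
    gen = [t for t in nlp(generated) if t.has_vector and not t.is_punct]
    ref = [t for t in nlp(reference) if t.has_vector and not t.is_punct]
    if not gen or not ref:
        return 0.0
    # Pair each word with its most similar counterpart, both ways,
    # so paraphrases score high even with zero literal word overlap
    sims = [max(g.similarity(r) for r in ref) for g in gen]
    sims += [max(r.similarity(g) for g in gen) for r in ref]
    return sum(sims) / len(sims)
```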

ROUGE and SentenceSim were extremely useful for tracking improvements over several training iterations. However, we quickly realised that automated evaluation is not enough. You really need to read the results. Only by regularly checking the generated titles did we find flaws in our training data, e.g. titles that originated from news agencies and were falsely labelled as SEO titles. We also realised that our model is best suited for short article texts, which is partly due to the model architecture and partly due to the composition of the training data. The manual evaluation helped a lot to detect patterns in the predictions and improve the quality of the generated titles. To decide whether the model was good enough to start a user test with the editors, we came up with a scheme that allowed for a controlled human evaluation. So, we didn’t have much choice but to manually assess hundreds of titles. Even though that was definitely not our favourite part of the process, we dug out some real gems:

“Drohende Schwangerschaft: Das sind die größten Warnungen”
- “Pending pregnancy: These are the biggest warnings”

“Wild: Dieser Fisch ist ein Stück vom Hirsch oder Gulasch”
- “Venison: This fish is part of a deer or a goulash”

“Nürnberg: Mann sticht mit 4,88 Euro auf Frau ein”
- “Nuremberg: Man stabs woman with 4,88 Euro”

After a few iterations of manual evaluation and improvements, we managed to surpass our initial goal: Close to 90% of the generated titles were usable as is, meaning that they were grammatically and factually correct and made an acceptable SEO title.

The Road to Rollout

Thanks to the human evaluation process, we now knew that the model was capable of producing fairly decent SEO titles. Surpassing our initial goal meant that we could continue our work and take the next steps towards the rollout: spending some more time on bug fixing and fine-tuning, building a deployment pipeline as well as a user interface, and doing some proper testing.

As with every piece of software, testing reveals bugs. For example, our model predicted some unexpected tokens, which we had to filter out during post-processing. Bit by bit, we got these issues out of the way and automated both the pre-processing and the post-processing using AWS Lambda functions. As all of our technical infrastructure runs on AWS, we chose to deploy the pipeline there. The raw data, cleaned data, and trained models are stored in S3 buckets. While the initial rounds of training were mostly run on EC2 instances to save costs, the pipeline uses Amazon SageMaker. This makes life a lot easier, especially for a team operating a machine learning pipeline for the first time, as in our case. To compensate for the higher cost, we plan to use cheaper spot instances for future training rounds.
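
As an example of the post-processing step, here is a sketch of what such a Lambda handler might look like. The event shape and filter rules are made up, with the regex targeting BERT-style artefacts like `[UNK]` or stray `##` subword markers:

```python
import json
import re

# Tokens the decoder should never surface to an editor (hypothetical)
UNEXPECTED = re.compile(r"\[UNK\]|\[unused\d+\]|##")

def handler(event, context):
    """Hypothetical post-processing Lambda for generated titles."""
    title = UNEXPECTED.sub("", event["generated_title"])
    title = re.sub(r"\s+", " ", title).strip()  # collapse leftover gaps
    return {"statusCode": 200, "body": json.dumps({"seo_title": title})}
```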

For the frontend, we designed a recommendation tool and did not aim for full automation. Given the status quo of NLP research, it’s simply not possible to consistently generate perfect SEO titles, and the editors should always have the final say in what gets published. We decided to go for a browser plugin, which allowed us to work with the browser-based content management system (CMS) that’s used by the WELT journalists. Once the editor clicks into the SEO title text field, the plugin pops up and generates a title prediction as well as a list of ranked keywords that could be included in the SEO title. If the title prediction works well, the editor can insert it into the CMS with one click. If it’s of no use at all (yes, that will definitely be the case from time to time), the editors can use the keyword suggestions as a supporting tool instead. Additionally, we included “thumbs up” and “thumbs down” buttons to make giving feedback as convenient as possible.

The recommendation tool in action

In July, we finally introduced the tool to the WELT journalists and kicked off the test run with a small team of editors. We made sure not to portray the tool as any kind of AI magic. It is not. It is math. And no matter how happy we are with what we achieved from a technological point of view, the users only see the end result, which is far from flawless. We believe in being brutally honest about what the tool cannot do; it’s the only way to avoid rejection and instead establish a constructive feedback loop. We look forward to evaluating the first round of feedback soon — let’s see what we can do to improve the tool!

Next Steps

Of course, we’re not done yet. In addition to evaluating the user feedback, we want to check whether the articles that were modified using the SEO title tool received significantly more search traffic than others. Beyond that, there are plenty of ideas on how to improve the quality of the predictions, widen the scope, or even adapt the model to solve related problems.

Ideas for improvement

  • Train different models for different types of articles. The first version of the SEO title generation tool still has a limited scope and works best with short news articles. Training specialised models would allow us to better address different journalistic styles and types of content.
  • Try new NLP algorithms. Since we started the project and chose to work with BertSum, various new summarisation models have been released, for example ProphetNet and Pegasus. In general, NLP techniques and language models keep improving incredibly fast, as recent results of the GPT-3 model have shown.

Ideas for new projects

  • Adapt the model to different Axel Springer news publications. If our WELT test-run proves to be successful, this is the obvious next step.
  • Generate text summaries. This could be useful to enrich the archive and help journalists re-discover older content, or to serve as a draft version of the teaser text.
  • Think outside the news media box. It’s not only newspapers that publish content. Companies like StepStone, Idealo and Immonet could potentially profit from the model, too.

We had a really great time working on this project and we hope that some of our insights will prove helpful for you, too. If you’d like to know more about the technical and mathematical depths of the model, we’re about to publish a research paper on the project. We’ll put the link here once it’s available online. If any ideas crossed your mind while reading this post, we’d love to know! Just leave a comment below or reach out to us on LinkedIn.

Are you curious to find out more about our work at Axel Springer or seeking a new job opportunity? Check out our career page and have a look at the Axel Springer Tech blog!
