Teaching an AI to summarise news articles: A new dataset for abstractive summarisation

We are open-sourcing 40,000 professionally-written summaries of news articles. Instructions for accessing the dataset can be found in our GitHub repository, along with examples of us using the dataset for fine-tuning.

Henry Dashwood
Curation Corporation
Mar 23, 2020


Background

Automatic text summarisation is one of those research topics that has been around since the early days of computational linguistics and is still receiving a lot of interest from the research community, as it is far from being considered a solved problem.
Constantin Orasan, "Automatic summarisation: 25 years on" (doi:10.1017/S1351324919000524)

Getting a computer to summarise a long document is a problem that dates back to the earliest days of Natural Language Processing (NLP): statistical attempts in the late 1950s gave way to the empiricist approaches of the 1970s, machine learning techniques in the 1990s, and finally the increasingly popular deep learning methods in use today.

Broadly speaking, there are two computational approaches to the problem: extractive and abstractive. The former takes one or more documents and, using an algorithm, extracts what it deems the most relevant sections to produce a summary, without modifying the wording of the original.

In contrast, abstractive summarisation takes the input document and tries to write a coherent, plausible and factually correct summary in its own words.

Abstractive summarisation is one of the hardest problems in natural language processing. An ideal summariser would not only be able to generate coherent text, it would also understand which information can be lifted directly from the source text, and which parts can be paraphrased. State-of-the-art (SOTA) models therefore don’t rely exclusively on translating between the input document and summary, but incorporate copy mechanisms and extractive summarisation tasks in training.
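To give a flavour of what a copy mechanism looks like, here is a minimal sketch in the spirit of pointer-generator networks (See et al. 2017). It is our own illustration of the general idea, not the formulation used by any particular state-of-the-art model: a generation probability p_gen interpolates between the decoder's vocabulary distribution and a copy distribution formed from its attention over the source tokens.

```python
# A minimal sketch of a copy mechanism in the spirit of pointer-generator
# networks (See et al. 2017). Shapes and names are illustrative.
import torch

def combine_distributions(vocab_dist, attn_dist, src_token_ids, p_gen):
    """Mix the decoder's vocabulary distribution with a copy distribution.

    vocab_dist:    (batch, vocab_size) softmax over the output vocabulary
    attn_dist:     (batch, src_len)    attention weights over source positions
    src_token_ids: (batch, src_len)    vocabulary ids of the source tokens
    p_gen:         (batch, 1)          probability of generating rather than copying
    """
    copy_dist = torch.zeros_like(vocab_dist)
    # Scatter attention mass onto the vocabulary ids of the source tokens,
    # so words from the article can be copied straight into the summary.
    copy_dist.scatter_add_(1, src_token_ids, (1 - p_gen) * attn_dist)
    return p_gen * vocab_dist + copy_dist

# Toy example: batch of 1, vocabulary of 10 word types, source of 4 tokens.
vocab_dist = torch.softmax(torch.randn(1, 10), dim=-1)
attn_dist = torch.softmax(torch.randn(1, 4), dim=-1)
src_ids = torch.tensor([[2, 5, 5, 7]])
p_gen = torch.tensor([[0.7]])
print(combine_distributions(vocab_dist, attn_dist, src_ids, p_gen).sum())  # ~1.0
```

In a real model, p_gen and the attention weights are produced by the network itself and trained jointly with the rest of the system.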

Corpora

There are also practical difficulties with building abstractive summarisation models. One is the sheer volume of text that has to be processed. Many NLP tasks operate at the sentence or paragraph level, but summarisation requires the model to take an entire document as input. Such large inputs force us to limit our batch sizes and can make training tedious and expensive.
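As a rough illustration of the scale problem (our own sketch, not part of the dataset release), the snippet below uses a Hugging Face tokenizer to compare the token count of a single sentence with that of a full-length article. Since memory use grows with batch size times sequence length (and full self-attention is quadratic in sequence length), document-sized inputs leave little room for large batches.

```python
# A rough illustration of why document-level inputs squeeze batch sizes:
# token counts per example are an order of magnitude larger than for
# sentence-level tasks.
from transformers import AutoTokenizer  # assumes the Hugging Face transformers library

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "The company reported strong quarterly earnings."
article = " ".join([sentence] * 150)  # stand-in for a full-length news article

print(len(tokenizer(sentence)["input_ids"]))  # a handful of tokens
print(len(tokenizer(article)["input_ids"]))   # over a thousand tokens per example
```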

There are relatively few datasets for abstractive summarisation, let alone good ones, and fewer still are publicly available. When training models for translation between languages we can draw on sources like books or the proceedings of the European parliament. For summarisation though, there has not been much impetus for people to make significant numbers of high quality abstractive summaries available for public consumption, particularly given the expense that goes into producing them.

Historically, datasets have been generated using clever hacks. For instance, we might use the sections of a Wikipedia article below the table of contents as our input, and the section above as our target output. Even the most popular dataset for abstractive summarisation, the CNN/Daily Mail dataset, is only able to use the subtitles of its articles as target outputs. Further, the text in these implied summaries is often noisy due to scraping inaccuracies: around 0.5% of the CNN/Daily Mail dataset has been found to contain such errors, as shown below.

[Figure: examples of noisy reference summaries in the CNN/Daily Mail dataset (Kryscinski et al. 2019)]
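To make the Wikipedia-style heuristic above concrete, here is a toy sketch (our own illustration, not a real dataset-construction pipeline) that splits a plain-text article at its first section heading, treating the lead as the target summary and the remainder as the input document.

```python
# A toy version of the Wikipedia heuristic: treat everything above the first
# section heading as the target summary and the rest of the article as the
# input document. Real pipelines do far more cleaning than this.
def wikipedia_pair(article_text: str) -> dict:
    lines = article_text.splitlines()
    for i, line in enumerate(lines):
        # Plain-text Wikipedia dumps mark section headings with "== Heading ==".
        if line.strip().startswith("=="):
            return {
                "document": "\n".join(lines[i:]).strip(),
                "summary": "\n".join(lines[:i]).strip(),
            }
    # No headings found: nothing below the table of contents to summarise.
    return {"document": article_text.strip(), "summary": ""}
```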

Curation Dataset

For the last five years at Curation, we have been working to provide companies with the information that they most need to see. We keep abreast of the latest news developments across a range of industries and as part of this service we have a team of professional abstractors writing summaries of news articles.

We are open sourcing 40,000 human-written abstracts of news articles and links to the sources under a Creative Commons license for NLP researchers and data scientists to explore and build on.

We believe (and our clients seem to agree) these summaries are of an excellent quality, and we’re excited to see how the NLP community can utilise them. Although we cannot include the original article bodies, we are also releasing a script alongside the dataset that can download and parse the sources for personal use.

What do we mean by excellent quality? Because the CNN/Daily Mail corpus’s targets are the subtitles of various news articles, they often assume that the reader has already seen the piece’s headline. This means that, taken in isolation, the subtitles can feel as if they are missing their first sentence. In our experiments we have found that state-of-the-art approaches replicate this in their predictions.

In contrast, Curation’s abstracts are written by professional copywriters to stand alone as intelligible pieces of content in their own right. Our writing team conforms to a comprehensive internal style guide, and each piece is proofed twice by our editorial and content curation teams to ensure factual and stylistic accuracy. Our abstracts are on average 40 words longer than those in other publicly available datasets and are designed to maximise information density whilst still being a joy to read.

Want to try it out and see for yourself? Click here to download the Curation Abstracts Dataset.

Comparison of summarisation datasets (based on Yang et al. 2019)

Examples of summaries in the dataset

Examples of fine-tuning

Alongside the dataset, we are releasing some blogs in the form of Medium posts and Jupyter notebooks showing how to build your own abstractive summariser. They can be found in the examples section of the dataset’s repository on GitHub. We will add to and improve them over time, and we would love to receive feedback on what’s there!
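As a taste of what those notebooks cover, here is a heavily simplified fine-tuning sketch. It assumes the abstracts have been exported to a CSV with article and summary columns; the file name, model choice and hyperparameters are illustrative only, so please refer to the repository for the real examples.

```python
# A heavily simplified fine-tuning loop. The CSV file name and the "article"
# and "summary" column names are assumptions for this sketch; see the
# repository's notebooks for the real examples.
import pandas as pd
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

df = pd.read_csv("curation_abstracts.csv")  # hypothetical export of the dataset

model.train()
for _, row in df.iterrows():  # batch size of 1, in keeping with long inputs
    inputs = tokenizer(row["article"], truncation=True, max_length=1024,
                       return_tensors="pt")
    labels = tokenizer(row["summary"], truncation=True, max_length=256,
                       return_tensors="pt")["input_ids"]
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```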

Limitations of Evaluation Metrics

It is worth briefly mentioning the limitations of current methods of evaluating models for abstractive text summarisation. Since using humans to evaluate summaries is expensive, attempts have been made to create automatic metrics that require no human input, but nevertheless provide some valuable insight into the quality of a machine summary. Some of the most popular measures are provided by the ROUGE package (Lin 2004).

Given that ROUGE scores (in addition to other metrics such as BLEU) are based on n-gram overlap between ‘gold’ reference summaries and machine generated ones, such scores are limited in their ability to measure semantic overlap if extensive paraphrasing has been used. An abstractive summary that captures the salient points of a document could therefore achieve a very low ROUGE score if the phrasing in the reference summary is substantially different. In addition to this, ROUGE scores have been shown to correlate weakly with human judgement for both abstractive and extractive summarisation tasks (Kryscinski et al. 2019). For these reasons it is worth taking such measures with a pinch of salt.
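This limitation is easy to demonstrate with a small, made-up example using Google's rouge-score package: a faithful paraphrase of the reference scores poorly, while a short verbatim copy that drops half the information scores well.

```python
# A small, made-up illustration using Google's rouge-score package
# (pip install rouge-score); the sentences are invented for this example.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference  = "Profits at the retailer fell sharply after a weak holiday season."
paraphrase = "The chain's earnings dropped steeply following poor Christmas trading."
verbatim   = "Profits at the retailer fell sharply."

# Low score despite capturing the same meaning.
print(scorer.score(reference, paraphrase)["rougeL"].fmeasure)
# High score despite dropping half the information.
print(scorer.score(reference, verbatim)["rougeL"].fmeasure)
```

With stemming enabled, the paraphrase shares little more than a stop word with the reference, so its ROUGE-L F-score is close to zero, while the truncated copy scores around 0.7.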

For example, when we calculate ROUGE for our human-written summaries (which we hold to be the gold standard), we get a ROUGE-L score of just 32.86, compared to 40 for the current state-of-the-art abstractive summariser.

Anecdotally, we believe some SOTA models are over-optimised for ROUGE scores; when evaluating abstractive models, we have found they tend to be perversely incentivised to behave extractively. We are actively investigating other evaluation metrics, such as TER/HTER, which consider the human edits necessary to reach a coherent, factually correct abstract.

The Future

We are really excited to see how people use this resource. We will be updating you on the progress of our abstractive summariser over the coming year, and making more datasets available to the community. We hope others will build on our work as we have built on that of others.

We’re particularly interested in exploring alternative evaluation metrics, ensuring factual accuracy and pushing forward auto-summarisation as a key pillar of our offering: if that sounds interesting to you, we’re hiring.

If you’re interested in commercial use or access to the wider catalogue of Curation data, including a larger set of over 150,000 professionally-written abstracts and a scalable, on-demand content abstraction API (driven by humans or AI), please get in touch.

Henry Dashwood — Machine Learning Engineer, Curation
Tom Jennings — CTO, Curation
