Teaching an AI to summarise news articles: A new dataset for abstractive summarisation

We are open-sourcing 40,000 professionally written summaries of news articles. Instructions for how to access the dataset can be found in our GitHub repository, along with examples of how we use the dataset for fine-tuning.

Henry Dashwood
Curation Corporation



Automatic text summarisation is one of those research topics that has been around since the early days of computational linguistics and is still receiving a lot of interest from the research community, as it is far from being considered a solved problem.
Constantin Orasan, Automatic summarisation: 25 years On (doi:10.1017/S1351324919000524)

Getting a computer to summarise a long document is a problem that dates back to the earliest days of Natural Language Processing (NLP), with statistical attempts in the late 1950s, the empiricist approaches of the 1970s and machine learning techniques in the 1990s, finally leading to the increasingly popular deep learning methods in use at the moment.

Broadly speaking, there are two computational approaches to the problem: extractive and abstractive. The former takes one or more documents and, using an algorithm, extracts the sections it deems most relevant to produce a summary, without modifying the wording of the original.
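To make the extractive idea concrete, here is a minimal sketch of a frequency-based extractive summariser. It is an illustration only, not the method used by any system mentioned in this post: the function names, the naive sentence splitter and the scoring rule (average word frequency) are all assumptions chosen for brevity.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, verbatim and in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document act as a crude relevance signal.
    frequencies = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(frequencies[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


example_text = (
    "The court ruled on organic labels. "
    "The court said stunning reduces suffering. "
    "The weather was mild."
)
example_summary = extractive_summary(example_text, num_sentences=2)
```

Every sentence in `example_summary` is copied unchanged from `example_text`, which is the defining property of an extractive approach.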

In contrast, abstractive summarisation takes the input document and tries to write a coherent, plausible and factually correct summary in its own words.

Abstractive summarisation is one of the hardest problems in natural language processing. An ideal summariser would not only be able to generate coherent text, it would also understand which information can be lifted directly from the source text, and which parts can be paraphrased. State-of-the-art (SOTA) models therefore don’t rely exclusively on translating between the input document and summary, but incorporate copy mechanisms and extractive summarisation tasks in training.


There are practical difficulties with building abstractive summarisation models as well. One is the sheer volume of information that has to be processed. Many NLP tasks work on the sentence or paragraph level. But for summarisation, an entire document is de rigueur. Such large input sizes force us to limit our batch sizes and can make training tedious and expensive.

There are relatively few datasets for abstractive summarisation, let alone good ones, and fewer still are publicly available. When training models for translation between languages we can draw on sources like books or the proceedings of the European Parliament. For summarisation, though, there has not been much impetus for people to make significant numbers of high-quality abstractive summaries available for public consumption, particularly given the expense that goes into producing them.

Historically, datasets have been generated using clever hacks. For instance, we might use the sections of a Wikipedia article below the table of contents as our input, and the section above it as our target output. Even the most popular dataset for abstractive summarisation, the CNN/Daily Mail dataset, is only able to use the subtitles of its articles as target outputs. Furthermore, the text found in these implied summaries is often noisy due to scraping inaccuracies: around 0.5% of the CNN/Daily Mail dataset has been found to contain such errors (Kryscinski et al. 2019).

Curation Dataset

For the last five years at Curation, we have been working to provide companies with the information that they most need to see. We keep abreast of the latest news developments across a range of industries and as part of this service we have a team of professional abstractors writing summaries of news articles.

We are open sourcing 40,000 human-written abstracts of news articles and links to the sources under a Creative Commons license for NLP researchers and data scientists to explore and build on.

We believe (and our clients seem to agree) that these summaries are of excellent quality, and we’re excited to see how the NLP community can utilise them. Although we cannot include the original article bodies, we are also releasing a script alongside the dataset that can download and parse the sources for personal use.

What do we mean by excellent quality? As the CNN/Daily Mail corpus’ targets are the subtitles of various news articles, they often assume that the reader has already seen the piece’s headline. This means that taken in isolation the subtitles can feel like they are missing their first sentence. In our experiments we have found that state-of-the-art approaches replicate this in their predictions.

In contrast, Curation’s abstracts are written by professional copywriters specifically to stand alone as intelligible pieces of content in their own right. Our writing team conforms to a comprehensive internal style guide, and each piece is proofed twice by our editorial and content curation teams to ensure factual and stylistic accuracy. Our abstracts are on average 40 words longer than those in other publicly available datasets and are designed to maximise information density whilst still being a joy to read.

Want to try it out and see for yourself? Click here to download the Curation Abstracts Dataset.

Comparison of summarisation datasets (based on Yang et al. 2019)

Example of summaries in the dataset

Animals must be stunned prior to being killed in order for halal and kosher meat to be marketed with the official European Union (EU) organic label, the European Court of Justice has ruled. The judges argued that organic labelling reflects the highest animal welfare standards, and that stunning is integral to these as it significantly reduces suffering. The UK’s Food Standards Agency estimates that 88% of animals killed under halal methods are stunned. However, a minority of halal – and all kosher – meat is produced without stunning. The case was brought by a French animal welfare association in 2012.
Researchers in Canada have developed an innovative solar cell which uses bacteria to convert light into energy. The team at the University of British Columbia developed a way of genetically engineering E.coli to produce large amounts of lycopene, a natural dye which bacteria use for photosynthesis. By coating the bacteria with a semiconducting substance and incorporating it into a battery cell, they were able to achieve a current density of 0.686 milliamps per square centimetre, which they say is the highest yet achieved by a biogenic solar cell.
Southeast Asia is unprepared for the rapidly-rising threat of extremist attacks, warns think tank The Institute for Policy Analysis of Conflict (IPAC). It suggests the main danger is in the southern Philippines, where a few Islamic extremist groups have sworn allegiance to ISIS; the groups have links to other parts of the region, particularly Malaysia and Indonesia. ISIS has also endorsed a Philippines-based militant leader for Southeast Asia and widened its "extremist recruitment pool" in the region, opening up channels for international funding and communication. IPAC says as ISIS loses territory in Syria and Iraq, it raises the risk of revenge attacks in Southeast Asia.

Examples of fine-tuning

Alongside the dataset, we are releasing some tutorials in the form of Medium posts and Jupyter notebooks to help you build your own abstractive summariser. They can be found in the examples section of the dataset’s repository on GitHub. We will add to and improve them over time. We would also love to receive feedback on what’s there!

Limitations of Evaluation Metrics

It is worth briefly mentioning the limitations of current methods of evaluating models for abstractive text summarisation. Since using humans to evaluate summaries is expensive, attempts have been made to create automatic metrics that require no human input, but nevertheless provide some valuable insight into the quality of a machine summary. Some of the most popular measures are provided by the ROUGE package (Lin 2004).

Given that ROUGE scores (in addition to other metrics such as BLEU) are based on n-gram overlap between ‘gold’ reference summaries and machine generated ones, such scores are limited in their ability to measure semantic overlap if extensive paraphrasing has been used. An abstractive summary that captures the salient points of a document could therefore achieve a very low ROUGE score if the phrasing in the reference summary is substantially different. In addition to this, ROUGE scores have been shown to correlate weakly with human judgement for both abstractive and extractive summarisation tasks (Kryscinski et al. 2019). For these reasons it is worth taking such measures with a pinch of salt.
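The paraphrasing problem is easy to reproduce with a simplified n-gram overlap score. The sketch below is not the official ROUGE package: it assumes whitespace tokenisation and lowercasing, with no stemming or stop-word handling, and the example sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: (precision, recall, f1) from clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the court ruled that animals must be stunned"
copied = "the court ruled that animals must be stunned"
paraphrase = "judges decided livestock has to be made unconscious"

copied_scores = rouge_n(reference, copied, n=1)
paraphrase_scores = rouge_n(reference, paraphrase, n=1)
```

The paraphrase conveys the same ruling but shares only the word "be" with the reference, so its unigram score is far lower than that of the verbatim copy and its bigram overlap is zero.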

For example, calculating ROUGE on our human summaries (which we hold to be the gold standard), we get a ROUGE-L score of just 32.86, compared to 40 for the current state-of-the-art abstractive summariser.
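For readers unfamiliar with ROUGE-L, it is built on the longest common subsequence (LCS) of tokens shared by the reference and the candidate. The following is a simplified sentence-level sketch, assuming whitespace tokenisation and a balanced F1; the official package differs in details such as summary-level aggregation, and scores like those quoted above are conventionally reported on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 in the range 0.0 to 1.0."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "the court ruled that animals must be stunned",
    "the court said animals must be stunned first",
)
```

Because the LCS only rewards tokens that appear in the same order in both texts, a well-written summary that reorders or rephrases the reference is penalised even when it is factually equivalent.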

Anecdotally, we believe some SOTA models are over-optimised for ROUGE scores; when evaluating abstractive models, we have found they tend to be perversely incentivised to perform extractively. We are actively investigating other evaluation metrics, such as TER and hTER, which consider the human edits necessary to reach a coherent, factually correct abstract.
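As a rough illustration of the edit-based idea, the sketch below computes a word-level edit distance normalised by reference length. This is a deliberate simplification and not full TER: real TER also counts the shift of a block of words as a single edit, which is omitted here. For hTER, the reference would be a human post-edited version of the machine output. The function names are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum word insertions, deletions and substitutions turning hypothesis into reference."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete hyp_word
                current[j - 1] + 1,      # insert ref_word
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def simplified_ter(hypothesis, reference):
    """Edits per reference word (no shift operation, unlike full TER)."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)


rate = simplified_ter("the judges ruled today", "the court ruled")
```

Lower is better here: a score of zero means the machine summary needed no edits to match the reference.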

The Future

We are really excited to see how people use this resource. We will be updating you on the progress of our abstractive summariser over the coming year, and making more datasets available to the community. We hope others will build on our work as we have built on that of others.

We’re particularly interested in exploring alternative evaluation metrics, ensuring factual accuracy and pushing forward auto-summarisation as a key pillar of our offering: if that sounds interesting to you, we’re hiring.

If you’re interested in commercial use or access to the wider catalogue of Curation data, including a larger set of over 150,000 professionally-written abstracts and a scalable, on-demand content abstraction API (driven by humans or AI), please get in touch.

Henry Dashwood — Machine Learning Engineer, Curation
Tom Jennings — CTO, Curation