Extractive summarization vs Abstractive summarization

Published in

GoPenAI

7 min read · Mar 26, 2023

Newspapers and social networks are flooded with posts about Large Language Models (LLMs) and ChatGPT, and obviously I can’t stay away. So here I am, trying these powerful tools on some NLP tasks.

Let’s start with what language models are: a language model can be defined as a probability distribution over sequences of words. Given a vocabulary, a language model assigns a probability to each sequence of words, and a common use is to predict the next word in a sequence. A simple example is a Markov process, in which the probability of the next word depends only on the current word.
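To make the Markov idea concrete, here is a minimal sketch of a bigram model in plain Python. The toy corpus and the function names are invented for illustration and are not part of the notebook; probabilities are simple relative frequencies.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Estimate P(next | prev) from relative frequencies."""
    followers = counts.get(prev, {})
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen as a context
    return {word: c / total for word, c in followers.items()}

# Toy corpus, invented for illustration only.
corpus = "the claim must be reported and the claim must be paid".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "be")
print(probs)
```

In this toy corpus, "be" is followed once by "reported" and once by "paid", so each gets half of the probability mass.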

Since Google introduced the Transformer in 2017, we have been hearing about Large Language Models (LLMs) such as BERT, XLNet, GPT-3, and so on.

So what are Large Language Models (LLMs), how are they employed in real life, and why are they useful?

Large Language Models (LLMs) are deep neural networks that are first pre-trained on massive quantities of text data to learn basic language tasks and functions, and are then fine-tuned on new data for specific use cases.

They are used to automate processes, and use cases include sentiment analysis and classification, fraud detection, text summarization, text generation, chatbots and virtual assistants, named entity recognition, machine translation, and so on.

Essentially, they can help save time and improve accuracy in the tasks where they are employed. To better understand how they are useful, I’ve created a notebook focused on the summarization task that compares extractive and abstractive summarization, and I’ve deployed a summarization app on Streamlit Cloud using the last approach analyzed.

Extractive summarization is the easiest way to recap a text. It assigns a score to each sentence, extracts the most important sentences according to those scores, and then combines them into a summary.

Abstractive summarization produces a summary by rewriting sentences from the text provided. It gives an acceptable representation of the text by exploiting the model’s semantic capability, and it’s usually modelled as a sequence-to-sequence task.

Going deeper into this project, the text I’ve used is a claims procedure chapter from Allianz’s car insurance policy, because I was thinking about applications in the insurance field, given the large availability of written documents.

# without titles and subtitles
sample_doc = """
You must report to Us immediately any
accident, injury, loss or damage which
may give rise to a claim under this Policy.
All incidents must be reported to our
Emergency Services phone number:
Republic of Ireland 1890 48 48 48
Northern Ireland or United Kingdom
00353 1 6133666.

In the event of an accident You should
obtain the following information:
1. The names, contact details and
vehicle details of all parties
involved.
2. The insurance details including the
Policy number of all parties
involved.
3. Details of any witnesses to the
incident or members of An Garda
Siochana / Police that attended the
scene of the accident.

1. Not admit responsibility, sign any
statement or negotiate the
settlement of any claim, without the
written agreement of Allianz.
2. Complete any form(s) We may send
You.
3. Give Us all information and
assistance required.
4. Notify Us immediately of any
impending prosecution, inquest or
fatal inquiry, writ or summons.
5. Send Us, as soon as possible, any
writ or summons, letter or other
documents You may receive.
6. The registration and insurance
details of Your Car should be
provided to any other party involved
and also An Garda/Police, if
requested.
7. If any person is injured, the accident
must be reported to An
Garda/Police, whether they attend
the scene of not.
If You do not do so, We reserve the
right not to pay a claim. We are
entitled, at any stage during any claim,
to take over and conduct the defence
or settlement of the claim, and, at our
discretion, to pursue the claim for our
own benefit in the name of any person
insured.
"""

The first approach is an extractive summarization tool built from scratch using the NLTK library. Pre-processing the text (removing stop words and tokenizing the corpus) yields a dictionary of words, which is used to build a frequency table that scores each word. These word scores are then applied to every sentence to find the most important sentences in the text, producing a second dictionary of sentence scores. Sentences scoring above the average are selected and stored in the summary.
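The steps above can be sketched roughly as follows. This is not the notebook’s code: to keep it self-contained it swaps NLTK for a regex tokenizer, a naive sentence splitter, and a tiny hand-written stop-word list, and it assumes each sentence is scored by the average frequency of its words. All names here are hypothetical.

```python
import re
from collections import Counter

# Tiny stop-word list for illustration; the article relies on NLTK's list instead.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "be",
              "it", "for", "on", "any", "must", "our", "us", "you", "we"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text):
    sentences = split_sentences(text)
    freq = Counter(content_words(text))  # word frequency table
    scores = []
    for sentence in sentences:
        toks = content_words(sentence)
        # Assumption: average word frequency, so length alone is not rewarded.
        scores.append(sum(freq[w] for w in toks) / len(toks) if toks else 0.0)
    threshold = sum(scores) / len(scores) if scores else 0.0
    # Keep sentences scoring at or above the average, in their original order.
    return " ".join(s for s, sc in zip(sentences, scores) if sc >= threshold)

toy_text = "Report the accident. Report the claim to the police. Thanks."
print(summarize(toy_text))
```

Because the selected sentences are copied verbatim, the summary can only be as readable as the original sentences; a real document with numbered lists and phone numbers would also need a more careful sentence splitter than this one.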

You must report to Us immediately any
accident, injury, loss or damage which
may give rise to a claim under this Policy. 
All incidents must be reported to our
Emergency Services phone number:
Republic of Ireland 1890 48 48 48
Northern Ireland or United Kingdom
00353 1 6133666. The names, contact details and
vehicle details of all parties
involved. Not admit responsibility, sign any
statement or negotiate the
settlement of any claim, without the
written agreement of Allianz. Notify Us immediately of any
impending prosecution, inquest or
fatal inquiry, writ or summons. Send Us, as soon as possible, any
writ or summons, letter or other
documents You may receive. The registration and insurance
details of Your Car should be
provided to any other party involved
and also An Garda/Police, if
requested. If any person is injured, the accident
must be reported to An
Garda/Police, whether they attend
the scene of not. We are
entitled, at any stage during any claim,
to take over and conduct the defence
or settlement of the claim, and, at our
discretion, to pursue the claim for our
own benefit in the name of any person
insured.

The tool works fine, and the result clearly collects the most relevant sentences in the text without any rewording.

The following approaches use Large Language Models (LLMs) for abstractive summarization. They exploit the transfer learning technique mentioned earlier (pre-training followed by fine-tuning).

At a high level, the first tool used is T5 (Text-to-Text Transfer Transformer), a model released by Google. It’s an encoder-decoder model designed to convert all NLP problems into a text-to-text format, which means always using an input sequence and a corresponding target sequence.
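As a small illustration of the text-to-text idea, every task becomes an (input text, target text) pair, with the task named in a prefix on the input. The helper below is hypothetical; the "summarize" and "translate English to German" prefixes are the ones T5 was trained with, while the summary target is invented for this example.

```python
def to_text_to_text(task_prefix, source, target):
    """Frame any NLP task as an (input text, target text) pair."""
    return {"input": f"{task_prefix}: {source}", "target": target}

examples = [
    # Summarization: the target below is invented for illustration.
    to_text_to_text(
        "summarize",
        "All incidents must be reported to our Emergency Services phone number.",
        "Report all incidents by phone.",
    ),
    # Translation uses exactly the same format, only the prefix changes.
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
]

for ex in examples:
    print(ex["input"], "->", ex["target"])
```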

The last tool used is GPT-3 (Generative Pre-trained Transformer 3), a model released by OpenAI. It’s an autoregressive language model that employs deep learning and is able to generate natural-sounding language sequences. Autoregressive means that the choice of the current word depends on the words generated before it.
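Autoregressive generation can be sketched as a simple loop: predict one token, append it, and feed the longer sequence back in. The `toy_model` lookup table below is a made-up stand-in for a real network, which would instead return a probability distribution over its vocabulary.

```python
def generate(next_token_fn, prompt, max_new_tokens, stop_token="<eos>"):
    """Greedy autoregressive loop: each step sees everything produced so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)  # conditioned on the whole prefix
        if nxt == stop_token:
            break
        tokens.append(nxt)
    return tokens

def toy_model(prefix):
    """Hypothetical stand-in for a language model: a fixed lookup table."""
    table = {
        ("report",): "the",
        ("report", "the"): "accident",
        ("report", "the", "accident"): "<eos>",
    }
    return table.get(tuple(prefix), "<eos>")

generated = generate(toy_model, ["report"], max_new_tokens=10)
print(" ".join(generated))
```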

Both models use the Transformer architecture, which exploits the attention mechanism. I don’t want to spend time on the architecture, because there are papers and useful tutorials for this purpose, but generally speaking, the original Transformer is made up of an encoder and a decoder (T5 keeps both, while GPT-3 uses only the decoder stack). The encoder maps the input sentence to a vector representation. The decoder takes as input both the encoder’s output and its own output from the previous time step, and then generates an output probability distribution. Attention is the interface connecting the encoder and the decoder: it lets the decoder flexibly use the most relevant parts of the input sequence to predict the next word, through a weighted sum of all the encoded input vectors.
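The weighted sum at the heart of attention can be written in a few lines. This is a minimal sketch of scaled dot-product attention for a single query vector, in plain Python with made-up two-dimensional vectors; real models use learned projections, many heads, and batched matrix operations.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: weighted sum of the value vectors.
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy encoded inputs, invented for illustration.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = attend([1.0, 0.0], keys, values)
print(weights, context)
```

The query is most similar to the first key, so the first value vector receives the larger weight in the resulting context vector.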

Here is the result from the second approach using T5:

all incidents must be reported to our Emergency Services phone number. 
details of all parties involved, insurance details including Policy number. 
not admit responsibility, sign any statement or negotiate settlement of claim.
notify us of impending prosecution, inquest or fatal inquiry, writ or summons, 
letter or other documents. if any person is injured, the accident must be 
reported to An Garda/Police, whether they attend the scene of not, if requested.

It is more concise than the previous summary, but not very clear.

The last approach uses GPT-3 through the LangChain wrapper.

In the event of an accident, you must report it to Allianz's 
Emergency Services phone number and obtain the names, contact details, 
vehicle details, and insurance details of all parties involved, 
as well as any witnesses or police that attended the scene. 
You must not admit responsibility, sign any statement, or negotiate the 
settlement of any claim without Allianz's written agreement. 
You must also provide the registration and insurance details of your car 
to any other party involved and the police, if requested. If any person 
is injured, the accident must be reported to the police, even if they 
do not attend the scene. If you do not do so, Allianz reserves the right 
not to pay a claim.

In my opinion this is the best one: a concise summary whose generated sentences give a clear representation of the text.

From this experience, I’ve built a summarization app for creating summaries of large documents. As these examples show, the tool can be applied in customer service and also in actuarial departments, where there are now many documents about the IFRS 17 regulation. Below, I show a summary of an IFRS 17 web page produced by the app, which was developed with a few lines of code and deployed on Streamlit Cloud.

The tool is easy to use: just copy and paste your OpenAI API key, attach the text you want to summarize, and select the max number of tokens for your result…

… and push the “Generate Summary” button!

Enjoy the text summarization app!

Written by Claudio Giorgio Giancaterino