Text Summarization

Brief history of Text Summarization

Prasasthy K B
9 min read · Jul 17, 2021

Text summarization is one of the important applications of Natural Language Processing (NLP). To understand the history of text summarization, it helps to also look at the history of NLP. NLP has its origins in Machine Translation (MT), which emerged in the 1940s, in the aftermath of the Second World War, with the goal of translating Russian into English and vice versa with the help of a computer. Although there were developments in syntactic theory and parsing algorithms in the 1950s, they were not good enough to create an efficient MT system. The 1966 report of ALPAC (the Automatic Language Processing Advisory Committee of the National Academy of Sciences – National Research Council) concluded that MT was not immediately achievable and recommended that it not be funded. There followed a period with little development in the field of NLP. Later, with developments in Artificial Intelligence, NLP again started gaining attention from researchers. The first piece of work to capture attention outside mainstream NLP was Winograd’s SHRDLU thesis at MIT in 1971 (Yorick Wilks). It was during the 1980s and 1990s that NLP gained broad interest as a research topic, and many developments in this field were kickstarted during this time. Along with the development of NLP, related areas like statistical language processing, information extraction, and automatic summarization also gained interest.

In its early days, text summarization was done exclusively using rule-based algorithms. One early system was the “importance evaluator”, which worked by ranking different parts of a text according to their importance. The evaluator used two important knowledge bases: an “importance rule base” made up of IF-THEN rules, and an “encyclopaedia” containing domain-specific world knowledge represented as a network of frames. The importance rule-based method makes use of a concept called the Hierarchical Propositional Network (HPN), where numerical representations are given to conceptual units of extended linear representations (ELR) of sentences to express their importance. A referential structural rule is applied by the goal interpreter to interpret the importance of each sentence. Another method, the “production rule system for summarization”, was used in 1984; it works in three steps: (i) inferencing, (ii) scoring the format rows for their importance, and (iii) selecting the appropriate ones as the summary. (Elaine Marsh)
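
The original rule bases are not reproduced here, but the general idea of an IF-THEN importance rule base can be sketched roughly as follows (the conditions and weights below are purely illustrative, not the rules of the original evaluator):

```python
# A minimal sketch of a rule-based importance evaluator: each IF-THEN rule
# inspects a sentence and, if its condition fires, contributes to the score.
RULES = [
    # (condition, score) pairs; conditions and weights are hypothetical.
    (lambda s: "in conclusion" in s.lower(), 2.0),
    (lambda s: any(w in s.lower() for w in ("goal", "result", "method")), 1.0),
    (lambda s: len(s.split()) < 5, -1.0),   # very short sentences are penalised
]

def importance(sentence):
    return sum(score for condition, score in RULES if condition(sentence))

sentences = [
    "The method ranks text units by importance.",
    "Hello there.",
    "In conclusion, the evaluator selects the highest-ranked parts as the summary.",
]
print(sorted(sentences, key=importance, reverse=True))
```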

As research on summarization techniques progressed, many developments were made in understanding and interpreting the importance of sentences in a textual corpus. One such method is to calculate the relatedness of a piece of text to the other texts in the corpus and then decide its importance by the degree to which it is related to them. For calculating relatedness, the texts are represented in a vector space: each text is represented as a set or vector of terms. A term weight is associated with each term, and higher weights are assigned to more important terms. A term that occurs frequently is given more importance than an infrequent term; the term frequency (TF) value is calculated and used as the term weight. Another major factor, the inverse document frequency (IDF), is combined with the term frequency to calculate the importance. (Automatic Text Browsing Using Vector Space Model)
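
As a rough illustration, the same idea can be used to rank sentences by the total weight of their TF-IDF terms (a minimal sketch using scikit-learn, not the original vector-space system described above):

```python
# TF-IDF sentence ranking: each sentence is treated as a "document" so that
# IDF down-weights terms appearing in almost every sentence.
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(sentences, top_k=2):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)   # shape: (n_sentences, n_terms)
    scores = tfidf.sum(axis=1).A1                 # importance = sum of term weights
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_k]]

sentences = [
    "The vector space model represents texts as weighted term vectors.",
    "Frequent terms receive higher term-frequency weights.",
    "Inverse document frequency penalises terms that occur everywhere.",
]
print(rank_sentences(sentences))
```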

Two different approaches are mainly used: a bottom-up approach, which combines statistical word-distribution information with general or text-specific heuristics, and a top-down approach, which requires at least a basic understanding of the subject matter. Algorithms that follow these approaches include “systems using term statistics” (TF-IDF), which rely purely on term frequency; the “abstract generation method” implemented by Edmundson, which combines term frequency with other features such as the cue method, title method, and location to calculate the weight of a sentence; and the “trainable document summarizer”, which uses features like sentence-length cut-off, fixed phrases, paragraph features, thematic words, and upper-case words along with term frequency to determine the weight. Other methods researched in parallel include “TextTiling”, which identifies coherent passages in a given text by detecting text boundaries from the similarity between nearby chunks of text, as well as Hidden Markov Models and clustering methods. Attempts to build summarization systems with text understanding were also under way; one such system is the BREVIDOC full-text retrieval system, which was based on Rhetorical Structure Theory (RST). It consists of a document structure analyser that performs document organization analysis, sentence analysis, and text structure analysis, extracting rhetorical relations between sentences and then performing semantic role extraction with word indexing.
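
The Edmundson-style feature combination can be sketched as a weighted sum of simple per-sentence features (the cue words and weights below are hypothetical placeholders, not the original system):

```python
# Edmundson-style scoring: combine a cue feature, a title-overlap feature,
# and a location feature with illustrative weights.
def edmundson_score(sentence, position, n_sentences, title_words,
                    cue_words=("significant", "important", "in conclusion"),
                    weights=(1.0, 1.0, 1.0)):
    tokens = set(sentence.lower().split())
    cue = sum(1 for c in cue_words if c in sentence.lower())
    title = len(tokens & {w.lower() for w in title_words})
    location = (n_sentences - position) / n_sentences   # earlier sentences score higher
    w_cue, w_title, w_loc = weights
    return w_cue * cue + w_title * title + w_loc * location

sentences = [
    "Automatic extracting selects significant sentences from a document.",
    "The weather was pleasant that day.",
    "In conclusion, cue, title and location features can be combined.",
]
title = ["Automatic", "Extracting"]
scores = [edmundson_score(s, i, len(sentences), title) for i, s in enumerate(sentences)]
print(sorted(zip(scores, sentences), reverse=True)[:2])
```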

Text summarization using neural networks was an important development in NLP. In this method, a neural network is trained on a corpus of articles and then modified using feature fusion to create a summary from the highest-ranked sentences of an article. During training, the neural network learns which types of sentences should be included in a summary. During feature fusion, the network is pruned and the hidden-layer unit activations are collapsed into discrete values with associated frequencies; the network thereby generalizes the important features that sentences belonging to a summary should have. Finally, the modified neural network selects the sentences that will form the summary by ranking them.
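
The core idea of ranking sentences with a trained feed-forward network can be sketched as follows (the feature vectors, labels, and architecture are hypothetical placeholders, not the specific feature-fusion procedure described above):

```python
# A small feed-forward network trained on per-sentence feature vectors
# labelled as summary / non-summary, then used to rank new sentences.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical features per sentence: [position, length, title_overlap, tf_score]
X_train = np.array([[0.0, 0.8, 0.6, 0.9],
                    [0.5, 0.3, 0.0, 0.2],
                    [0.9, 0.7, 0.5, 0.8],
                    [0.7, 0.2, 0.0, 0.1]])
y_train = np.array([1, 0, 1, 0])   # 1 = sentence belongs in the summary

model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

X_new = np.array([[0.1, 0.9, 0.7, 0.8], [0.6, 0.2, 0.0, 0.3]])
ranking = model.predict_proba(X_new)[:, 1]   # higher = more summary-worthy
print(ranking.argsort()[::-1])               # sentence indices by rank
```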

Many developments were happening in parallel. One such approach is the diversity-based approach to extractive summarization, which measures the diversity of sentences and removes redundant sentences from the final summary. To find diversity, a K-means clustering algorithm extended with the Minimum Description Length Principle was used. On the extractive side, graph-based algorithms, including Google’s PageRank algorithm, became very popular and are used in citation analysis, social networks, and the analysis of the link structure of the World Wide Web. The main idea behind a graph-based algorithm is to decide the importance of a vertex within a graph based on information computed recursively from the entire graph, instead of considering the vertex in isolation. The TextRank method applies the same logic to a lexical or semantic graph derived from natural language documents. Another method of text summarization, based on fuzzy logic, uses a general statistical method as its base and then applies fuzzy logic to decide the importance of text. Fuzzy logic relies on fuzzy rules and membership functions, so the selection of both plays a major role in the performance of a fuzzy logic system. Such a system consists of four components: a fuzzifier, an inference engine, a de-fuzzifier, and the fuzzy knowledge base.
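
The graph-based idea can be sketched in a TextRank-like form (a minimal illustration, not Mihalcea’s exact formulation): sentences are vertices, edge weights are cosine similarities between TF-IDF vectors, and PageRank over this graph yields the sentence importance scores.

```python
# TextRank-style extractive summarization sketch using networkx PageRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)        # sentence-to-sentence similarity graph
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)                  # recursive, whole-graph importance
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranked[:top_k]]

sentences = [
    "Graph-based ranking decides vertex importance from the whole graph.",
    "TextRank builds a graph over sentences and runs PageRank on it.",
    "Redundant sentences can be removed with diversity-based clustering.",
]
print(textrank_summary(sentences))
```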

In 2016, text summarization using sequence-to-sequence (seq2seq) models, in which an attentional encoder-decoder Recurrent Neural Network originally developed for machine translation was applied to summarization, achieved state-of-the-art performance among the models developed at that point in time. The figure below represents one of the functionalities implemented in a sequence-to-sequence RNN model.

Figure: Feature-rich encoder functionality of a seq2seq RNN model

In the feature-rich encoder shown above, one embedding vector is looked up for each POS tag, each NER tag, and the discretized TF and IDF values, and these are concatenated with the word-based embeddings to form the input to the encoder.
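
The input layer of such a feature-rich encoder can be sketched as follows (vocabulary sizes, tag-set sizes, and embedding dimensions are assumed for illustration; this is not the exact model from the paper):

```python
# Word, POS, NER, TF-bucket and IDF-bucket embeddings concatenated per token,
# forming the input fed to the encoder RNN.
import torch
import torch.nn as nn

class FeatureRichEmbedding(nn.Module):
    def __init__(self, vocab=10000, pos_tags=45, ner_tags=10, tf_buckets=10,
                 idf_buckets=10, word_dim=128, feat_dim=16):
        super().__init__()
        self.word = nn.Embedding(vocab, word_dim)
        self.pos = nn.Embedding(pos_tags, feat_dim)
        self.ner = nn.Embedding(ner_tags, feat_dim)
        self.tf = nn.Embedding(tf_buckets, feat_dim)
        self.idf = nn.Embedding(idf_buckets, feat_dim)

    def forward(self, words, pos, ner, tf, idf):
        # Each input is a (batch, seq_len) tensor of integer ids.
        x = torch.cat([self.word(words), self.pos(pos), self.ner(ner),
                       self.tf(tf), self.idf(idf)], dim=-1)
        return x  # (batch, seq_len, word_dim + 4 * feat_dim)

emb = FeatureRichEmbedding()
ids = torch.zeros(2, 5, dtype=torch.long)      # dummy batch of 2 sequences
print(emb(ids, ids, ids, ids, ids).shape)      # torch.Size([2, 5, 192])
```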

Other attention-based models were also introduced in the text summarization area and helped create better abstractive summaries. One such model is based on a standard feed-forward Neural Network Language Model (NNLM), which estimates the contextual probability of the next word (a next-word prediction model). That research first used a bag-of-words encoder as the base model, which does not retain any order or relationship between neighbouring words. To improve performance, an attention-based contextual encoder that constructs representations based on the generation context was used. Using a contextual encoder along with the input embedding improved the overall performance of the model. As shown in the figure below, the encoder learns a latent soft alignment over the input text to create the summary.

The heatmap represents a soft alignment between the input and summary
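
The soft-alignment idea behind such contextual encoders can be sketched as follows (an illustrative computation, not the exact model from that research): attention weights over the input token representations are computed from the current generation context and used to form a weighted sum.

```python
# Soft attention: one alignment weight per input token, as in the heatmap above.
import torch
import torch.nn.functional as F

def soft_attention(input_states, context_state):
    # input_states: (seq_len, dim) encoder-side token representations
    # context_state: (dim,) representation of the summary generated so far
    scores = input_states @ context_state      # (seq_len,) alignment scores
    weights = F.softmax(scores, dim=0)         # latent soft alignment
    return weights @ input_states, weights     # attended context, alignment

states = torch.randn(6, 32)
context = torch.randn(32)
attended, alignment = soft_attention(states, context)
print(alignment)
```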

With the introduction of Bidirectional Encoder Representations from Transformers (BERT), there were wide-ranging advances in Natural Language Processing (NLP) tasks. BERT popularised pretrained language models, which use transfer learning to achieve state-of-the-art performance in NLP applications. There are multiple types of transfer learning; one of the most promising and widely used is sequential transfer learning. It has two stages: a pre-training stage, where a neural network is trained on a huge amount of general data, and a fine-tuning stage, where the model is trained on a domain-specific task. This method helps deep learning models converge faster with a relatively small amount of fine-tuning data. Ideally, the fine-tuning data should be related to the pre-training data for effective transfer learning. This type of training is generally referred to as semi-supervised training: the neural network is first trained as a language model on a general dataset and then trained in a supervised fashion on a labelled dataset, making supervised fine-tuning dependent on unsupervised language modelling. BERT makes use of sequential transfer learning: it is pre-trained with a masked language modelling objective and a “next sentence prediction” task on a corpus of about 3,300M words. With these transfer-learning capabilities, BERT has made a significant mark on text summarization.

Another recent method of abstractive summarization is PEGASUS. It uses gap sentence generation (GSG) and masked language modelling (MLM) objectives simultaneously during pre-training and achieves state-of-the-art results with a smaller set of fine-tuning samples. Google’s recently released T5 (Text-to-Text Transfer Transformer) is reported to outperform other high-end models such as BERT and GPT-2 on NLP tasks like text classification, question answering, and text summarization. In this blog, only some of the significant algorithms in the history of text summarization are covered; many other algorithms are not part of this study.
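
Running a pretrained abstractive summarizer today is straightforward; a minimal sketch using the Hugging Face transformers library is shown below (it assumes the library is installed and that the pretrained PEGASUS checkpoint named here is available).

```python
# Abstractive summarization with a pretrained model via the transformers pipeline.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/pegasus-xsum")

article = (
    "Pretrained language models such as BERT, PEGASUS and T5 are first trained "
    "on large general corpora and then fine-tuned on a domain-specific task. "
    "This sequential transfer learning lets models converge faster with "
    "relatively little labelled fine-tuning data."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```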

References:

There are many references; some are listed below:

Fum, D., Guida, G. and Tasso, C., (1986) Tailoring importance evaluation to reader’s goals. In: Proceedings of the 11th Conference on Computational Linguistics. [online] Morristown, NJ, USA: Association for Computational Linguistics, p.256. Available at: http://portal.acm.org/citation.cfm?doid=991365.991440.

Edmundson, H.P., (1969) New Methods in Automatic Extracting. Journal of the ACM (JACM), 16(2), pp.264–285.

Hearst, M.A. and Plaunt, C., (1993) Subtopic structuring for full-length document access. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.59–68.

Kupiec, J., Pedersen, J. and Chen, F., (1995) A Trainable Document Summarizer. In: SIGIR ’95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.68–73.

Zechner, K., (1997) A literature survey on information extraction and text summarization. Computational Linguistics Program, Carnegie Mellon University, [online] Fall 1996. Available at: http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/zechner/infoextr.pdf.

Kaikhah, K., (2004) Automatic text summarization with neural networks. In: Proceedings of the 2004 2nd International IEEE Conference ‘Intelligent Systems’, June, pp.40–44.

Nomoto, T. and Matsumoto, Y., (2001) A new approach to unsupervised text summarization. SIGIR Forum (ACM Special Interest Group on Information Retrieval), pp.26–34.

Mihalcea, R., (2004) Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. [online] Morristown, NJ, USA: Association for Computational Linguistics, pp.20-es. Available at: http://portal.acm.org/citation.cfm?doid=1219044.1219064.

Malte, A. and Ratadiya, P., (2019) Evolution of transfer learning in natural language processing. [online] Available at: http://arxiv.org/abs/1910.07370.

Zhang, H., Cai, J., Xu, J. and Wang, J., (2019a) Pretraining-Based Natural Language Generation for Text Summarization. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). [online] Stroudsburg, PA, USA: Association for Computational Linguistics, pp.789–797. Available at: https://www.aclweb.org/anthology/K19-1074.

Zhang, J., Zhao, Y., Saleh, M. and Liu, P.J., (2019b) PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. [online] Available at: http://arxiv.org/abs/1912.08777.
