In our previous blog post we gave an overview of extractive summarization methods, which build summaries from subsets of the most important sentences in the input text(s). Now we want to discuss more recent approaches and developments that generate summaries closer to human-written text.
It is believed that, to improve the soundness and readability of automatic summaries, it is necessary to improve topic coverage by paying more attention to the semantics of words, and to experiment with re-phrasing the input sentences in a human-like fashion.
The approach expected to solve these problems is the switch from extractive to abstractive summarization. In contrast to extractive methods, abstractive techniques present the summarized information in a coherent form that is easily readable and grammatically correct.
In the recent past, NLP has seen the rise of deep-learning models that map an input sequence into an output sequence, called sequence-to-sequence models, which have been successful in problems such as machine translation (Bahdanau et al., 2014), speech recognition (Bahdanau et al., 2015) and video captioning (Venugopalan et al., 2015). However, despite the similarities, abstractive summarization is a very different problem from machine translation: in summarization the target text is typically very short and does not depend much on the length of the source. More importantly, in machine translation there is a strong one-to-one word-level alignment between source and target, while in summarization the alignment is far less straightforward.
Inspired by this success in machine translation, a number of deep-learning techniques have emerged to generate abstractive summaries. Depending on their focus, the approaches can be roughly divided into structure-based and semantic-based ones.
Structure-based approaches to abstractive summarization
The core of structure-based techniques is the use of prior knowledge and cognitive schemas, such as templates and extraction rules, as well as alternative structures like trees, ontologies, lead and body phrases, and graphs, to encode the most important information.
Tree-based method

The central idea of this group of methods is to use a dependency tree that represents the text or the contents of a document. The content selection algorithms, however, vary significantly: some rely on theme intersection, others on local alignment across pairs of parsed sentences. The summary is then generated either with the help of a language generator or an algorithm.
An example of such an approach is sentence fusion, an algorithm which processes multiple documents and identifies common information by aligning syntactic trees of the input sentences, incorporating paraphrasing information. It matches subsets of the subtrees through bottom-up local multisequence alignment, combines the fragments by constructing a fusion lattice that encompasses the resulting alignment, and transforms the lattice into a sentence using a language model. The approach therefore combines statistical techniques, such as local multisequence alignment and language modeling, with linguistic representations automatically derived from the input documents.
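The alignment step can be illustrated with a toy sketch. Below, `difflib`'s longest-matching-block alignment over word sequences stands in for the syntactic tree alignment the real algorithm performs, and the fused output simply keeps the spans the two sentences share (all names here are illustrative, not from the original system):

```python
from difflib import SequenceMatcher

def fuse(sent_a, sent_b):
    """Toy sentence fusion: keep the token spans shared by both sentences.

    Real sentence fusion aligns dependency trees and builds a fusion
    lattice; here, word-level longest-matching-block alignment is a
    crude stand-in for that syntactic alignment step.
    """
    a, b = sent_a.split(), sent_b.split()
    shared = []
    for block in SequenceMatcher(None, a, b).get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return " ".join(shared)

fused = fuse("the senate passed the budget bill on friday",
             "the senate passed the controversial budget bill")
# → "the senate passed the budget bill"
```

A real implementation would also consult paraphrase information and a language model to linearize the lattice, which this sketch omits.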
Template-based method

In this method, a full document is represented using a template. Linguistic patterns or extraction rules are matched to spot text snippets that can be mapped into the template slots (to form a database). These text snippets serve as indicators of the summary content. An example of such an approach is GISTEXTER, a summarization system that identifies topic-related information in the input document, translates it into database entries and adds sentences from this database to ad hoc summaries.
Lead and body phrase method
The lead and body phrase method, proposed by Tanaka to summarize broadcast news, involves syntactic analysis of the lead and body chunks of the sentence. Inspired by the sentence fusion technique, this method identifies common phrases in the lead and body chunks and then inserts or substitutes phrases to generate a summary through sentence revision. The operations include syntactic parsing of the lead and body chunks, identification of trigger pairs, and phrase alignment with the help of different similarity and alignment metrics. Finally, insertion, substitution or both are applied to generate a new sentence: if a body phrase is information-rich and has a corresponding lead phrase, substitution occurs, while if a body phrase has no counterpart, insertion takes place.
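A minimal sketch of the insertion/substitution step, under the simplifying assumptions that the phrase alignment has already been computed and that "information-rich" just means "more tokens" (all function names here are hypothetical, not Tanaka's code):

```python
def revise_lead(lead_phrases, body_phrases, alignment):
    """Toy sentence revision for the lead-and-body phrase method.

    `alignment` maps a lead phrase to its aligned body phrase.
    Substitution: an aligned body phrase replaces its lead counterpart
    when it is richer (here simply: longer).  Insertion: body phrases
    with no lead counterpart are appended.
    """
    revised = []
    for phrase in lead_phrases:
        counterpart = alignment.get(phrase)
        if counterpart and len(counterpart.split()) > len(phrase.split()):
            revised.append(counterpart)        # substitution
        else:
            revised.append(phrase)
    for body in body_phrases:
        if body not in alignment.values():
            revised.append(body)               # insertion
    return " ".join(revised)

lead = ["the plant", "shut down"]
body = ["the chemical plant", "after the leak"]
sentence = revise_lead(lead, body, {"the plant": "the chemical plant"})
# → "the chemical plant shut down after the leak"
```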
Rule-based method

The rule-based methods depict input documents in terms of classes and lists of aspects. To generate a sentence, this scheme uses a rule-based information extraction module, content selection heuristics and one or more generation patterns. To create extraction rules, verbs and nouns with similar meaning are identified; several candidate rules are then selected and passed on to the summary generation module. Finally, generation patterns are used to produce the summary sentences. A number of works propose different sets of extraction rules, including rules for finding semantically related noun-verb pairs, discourse rules, syntactic constraints and word graphs, or feature scores with random forest classification. This method can produce very good summaries, but its main drawback is that it is time-consuming, since the rules and patterns are written manually.
Graph-based method

Graph data structures are widely popular in both extractive and abstractive summarization. In this method, every node represents a word unit, while directed edges represent the structure of the sentences. One of the best-known projects applying this technique is Opinosis, a framework that generates compact abstractive summaries of highly redundant opinions. The model generates an abstractive summary by repeatedly searching the Opinosis graph for sub-graphs that encode a valid sentence and have high redundancy scores, in order to find meaningful paths which in turn become candidate summary phrases. All the paths are then ranked in descending order of their scores, and duplicate paths are eliminated with the help of the Jaccard measure to create a short summary. The Opinosis summarizer is considered a “shallow” abstractive summarizer: it uses the original text itself to generate summaries (this makes it shallow), but it can generate phrases that were not seen in the original text because of the way paths are explored (and this makes it abstractive rather than purely extractive).
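The overall pipeline can be sketched with a toy word-adjacency graph. This is a strong simplification — real Opinosis also uses positional information and part-of-speech patterns to check that a path forms a valid sentence — and all identifiers below are illustrative:

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Directed word graph: nodes are words, edge weights count how many
    times two words appear adjacently (their redundancy across opinions)."""
    edges = defaultdict(int)
    starts, ends = set(), set()
    for s in sentences:
        toks = s.split()
        starts.add(toks[0])
        ends.add(toks[-1])
        for a, b in zip(toks, toks[1:]):
            edges[(a, b)] += 1
    return edges, starts, ends

def candidate_paths(edges, starts, ends, max_len=8):
    """Depth-first search for start-to-end paths, scored by the average
    redundancy of their edges (high score = phrase many opinions share)."""
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
    paths = []
    def walk(path, weight):
        node = path[-1]
        if node in ends and len(path) > 2:
            paths.append((weight / (len(path) - 1), " ".join(path)))
        if len(path) >= max_len:
            return
        for nxt in succ[node]:
            if nxt not in path:            # toy version: no cycles
                walk(path + [nxt], weight + edges[(node, nxt)])
    for s in starts:
        walk([s], 0)
    return sorted(paths, reverse=True)

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

reviews = ["the battery life is great",
           "the battery life is amazing",
           "the screen is great"]
edges, starts, ends = build_word_graph(reviews)
summary = []
for _, phrase in candidate_paths(edges, starts, ends):
    # eliminate near-duplicate paths with the Jaccard measure
    if all(jaccard(phrase, kept) <= 0.5 for kept in summary):
        summary.append(phrase)
# → ["the battery life is great", "the screen is great"]
```

Note how the near-duplicate “the battery life is amazing” is pruned by the Jaccard check, while the distinct opinion about the screen survives.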
Ontology-based method

Ontologies are extremely popular across NLP, including both extractive and abstractive summarization, where they are convenient because documents are usually confined to the same topic or domain. Besides, every domain has its own knowledge structure, which can be better represented by an ontology. Though different in their specifics, all ontology-based summarization methods involve reducing sentences by compression and reformulation using both linguistic and NLP techniques. One of the most influential approaches is the “fuzzy ontology” method proposed by Lee, which is used for Chinese news summarization to model uncertain information. This approach features an extensive pre-processing phase, comprising the definition of the domain ontology for news events by domain experts and the extraction of meaningful terms from the news corpus. Afterwards, a term classifier classifies the meaningful terms on the basis of news events. For each fuzzy concept of the fuzzy ontology, the fuzzy inference phase generates membership degrees; a set of membership degrees of each fuzzy concept is associated with the various events of the domain ontology. News summarization is finally performed by a news agent based on the fuzzy ontology.
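As a rough illustration of what a fuzzy inference step produces, membership degrees can be obtained by normalizing term/event co-occurrence counts into the interval [0, 1]. This is a toy stand-in for the idea, not Lee's actual inference procedure:

```python
def membership_degrees(term_event_counts):
    """Toy fuzzy-inference step: normalize each term's co-occurrence
    counts with news events into membership degrees in [0, 1]."""
    degrees = {}
    for term, counts in term_event_counts.items():
        total = sum(counts.values())
        degrees[term] = {event: count / total for event, count in counts.items()}
    return degrees

# e.g. the term "earthquake" co-occurs with two event concepts
degrees = membership_degrees({"earthquake": {"disaster": 8, "politics": 2}})
# degrees["earthquake"] → {"disaster": 0.8, "politics": 0.2}
```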
Semantic-based approaches to abstractive summarization

Semantic-based approaches employ a semantic representation of the document(s) to feed into a natural language generation (NLG) system, with the main focus on identifying noun and verb phrases.
Multimodal semantic model
The multimodal semantic model captures concepts and forms the relations among them, representing both the text and the images contained in multimodal documents. A semantic model is first built using an object-based knowledge representation: nodes represent concepts, and the links between nodes represent the relationships between these concepts. Important concepts are then rated using an information density metric, which takes into account a concept's completeness, its relationships with other concepts and the number of occurrences of the expression. The chosen concepts are finally transformed into sentences to create a summary. An example of such a system is SimpleNLG, which provides interfaces offering direct control over the way phrases are built and combined, inflectional morphological operations, and linearization.
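A hypothetical information density metric might combine the three signals mentioned above — occurrences, relations and slot completeness — with illustrative weights (the weights and field names here are assumptions, not from the original system):

```python
def information_density(concept, relations, occurrences):
    """Toy information-density score for a concept node: its number of
    occurrences, plus its relations to other concepts, plus how complete
    its attribute slots are.  The weights (1, 2, 3) are illustrative."""
    n_relations = len(relations.get(concept["name"], []))
    filled = sum(v is not None for v in concept["slots"].values())
    completeness = filled / len(concept["slots"])
    return occurrences[concept["name"]] + 2 * n_relations + 3 * completeness

concept = {"name": "volcano", "slots": {"location": "Iceland", "height": None}}
score = information_density(concept,
                            relations={"volcano": ["eruption"]},
                            occurrences={"volcano": 4})
# 4 occurrences + 2·1 relation + 3·0.5 completeness = 7.5
```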
Information item based methods
In this methodology, instead of generating the abstract from the sentences of the input documents, it is generated from an abstract representation of them. This abstract representation is an information item: the smallest element of coherent information in a text. The main goal is to identify all text entities, their attributes, the predicates between them, and the predicate characteristics. The framework used in this method was proposed in the context of the Text Analysis Conference (TAC) for multi-document summarization of news. At the initial stage of Information Item (INIT) retrieval, subject-verb-object triples are formed by syntactic analysis of the text with the help of a parser. Most INITs do not give rise to full sentences, so they need to be combined into a sentence structure before a text can be generated. In the sentence generation phase, sentences are produced using a language generator. In the next phase, sentence selection, each sentence is ranked on the basis of its average document frequency score. Finally, the highly ranked sentences are arranged and the abstract is generated with proper planning. This method generates short, coherent, information-rich and less redundant summaries. However, the fact that it rejects much meaningful information reduces the linguistic quality of the created summary.
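The generation and ranking phases can be sketched as follows, assuming the subject-verb-object triples have already been extracted by a parser; the naive string realization below stands in for a proper language generator, and all names are illustrative:

```python
def realize(triple):
    """Naive string realization of a subject-verb-object information item;
    a real system would use a syntactic language generator here."""
    return " ".join(triple).capitalize() + "."

def rank_sentences(triples, doc_freq):
    """Score each candidate by the average document frequency of the
    words in its triple, then return the sentences best-first."""
    scored = []
    for t in triples:
        score = sum(doc_freq.get(w, 0) for w in t) / len(t)
        scored.append((score, realize(t)))
    return [sentence for _, sentence in sorted(scored, reverse=True)]

triples = [("court", "rejected", "appeal"),
           ("lawyer", "made", "statement")]
doc_freq = {"court": 5, "rejected": 4, "appeal": 5,
            "lawyer": 1, "made": 2, "statement": 1}
ranked = rank_sentences(triples, doc_freq)
# ranked[0] → "Court rejected appeal."
```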
Semantic Text Representation Model
This technique aims to analyze the input text using the semantics of words rather than the syntax or structure of the text. To name an example of such a model, Atif et al. suggested using semantic role labeling to extract the predicate-argument structure from each sentence. The document set is split into sentences annotated with document and position numbers, assigned using the SENNA semantic role labeler API. A similarity matrix is then constructed from the semantic graph using semantic similarity scores. After that, a modified graph-based ranking algorithm is used to account for the predicate structure, semantic similarity and the relationships within the document set. Finally, MMR (Maximal Marginal Relevance) is used to reduce redundancy in the summary.
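The final redundancy-reduction step, MMR, greedily balances relevance to the document set against similarity to the sentences already selected. A minimal sketch with precomputed scores (the toy sentences and scores are illustrative):

```python
def mmr(candidates, relevance, similarity, k=2, lam=0.7):
    """Maximal Marginal Relevance: greedily select sentences that are
    relevant but dissimilar to those already picked.  `relevance` maps
    sentence -> relevance score; `similarity` maps a sentence pair ->
    similarity score; `lam` trades relevance against redundancy."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity[(c, s)] for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

relevance = {"s1": 0.9, "s2": 0.8, "s3": 0.5}
similarity = {("s2", "s1"): 0.9, ("s3", "s1"): 0.1,
              ("s2", "s3"): 0.2, ("s3", "s2"): 0.2}
picked = mmr(["s1", "s2", "s3"], relevance, similarity, k=2)
# picks "s1" first, then "s3": "s2" is too similar to "s1"
```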
Semantic Graph Model
This method generates a summary by creating a semantic graph called the rich semantic graph (RSG). The approach consists of three phases. In the first phase, the input document is represented as a rich semantic graph: the verbs and nouns of the input document become graph nodes, and the edges correspond to semantic and topological relations between them. The sentence concepts are inter-linked through the syntactic and semantic relationships produced in the pre-processing module. In the second phase, the original graph is reduced to a more concise graph using heuristic rules. Finally, the abstractive summary is generated from the reduced semantic graph. This method produces less redundant and grammatically correct sentences, yet it is limited to a single document and does not support multiple documents.
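The graph-reduction phase can be illustrated with one toy heuristic rule: merging synonymous nodes into a canonical node and collapsing the duplicate edges this creates. The rule set and representation below are hypothetical, not the RSG authors' actual rules:

```python
def reduce_rsg(nodes, edges, merge_rules):
    """Toy RSG reduction: one heuristic rule family that replaces a node
    with its canonical equivalent (e.g. merging synonyms), collapsing
    any duplicate edges the merge produces."""
    canon = lambda n: merge_rules.get(n, n)
    reduced_nodes = {canon(n) for n in nodes}
    reduced_edges = {(canon(src), rel, canon(dst)) for src, rel, dst in edges}
    return reduced_nodes, reduced_edges

nodes = {"man", "drives", "car", "automobile"}
edges = {("man", "agent-of", "drives"),
         ("drives", "object", "car"),
         ("drives", "object", "automobile")}
reduced_nodes, reduced_edges = reduce_rsg(nodes, edges, {"automobile": "car"})
# the two "object" edges collapse into one: ("drives", "object", "car")
```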
Even though abstractive summarization still shows less stable results compared to extractive methods, this approach is believed to be more promising for generating human-like summaries. We can therefore expect many more approaches to emerge in this field, offering new perspectives from the computational, cognitive and linguistic points of view.