Unfolding a novel recursive autoencoder for extraction-based summarization
With push notifications and article digests gaining more and more traction, the task of generating intelligent and accurate summaries for long pieces of text has become a popular research as well as an industry problem.
There are two fundamental approaches to text summarization: extractive and abstractive. The former extracts words and word phrases from the original text to create a summary. The latter learns an internal language representation to generate more human-like summaries, paraphrasing the intent of the original text.
This blog post tackles the problem of single and multi-document extractive summarization using a novel recursive autoencoder architecture that:
- Learns representations of phrases in an unsupervised fashion
- Uses these representations to extract document features
- Takes into account the context provided by the document
- Generates a condensed form of one or more documents by establishing thematic relevance
Architecture and Algorithmic Paradigms:
Every document contains both content-specific and background terms. Existing models in the extraction-based summarization space work independently of context: indexing weights are assigned solely on a term-by-term basis, and most frameworks rely only on static features such as term length, term position, and term frequency to establish the importance of a sentence in a given passage of text. Consequently, the context in which a term appears is not taken into consideration when the framework renders a summary.
PhrazIt reduces this complete dependence on term significance when building the index and places greater emphasis on the context of the document.
PhrazIt builds on TextRank, an unsupervised graph-based ranking model for text processing. The reason for sticking to an unsupervised algorithm is that supervised text summarization algorithms require a large amount of training data, which in effect translates to requiring many documents with known key phrases. So, although supervised methods are capable of producing interpretable rules for what features characterize a key phrase, their strict training-data requirement was the intuition behind moving to an unsupervised algorithm instead.
Additionally, unsupervised keyphrase extraction frameworks are far more portable across domains because the extraction process is not domain-specific. Instead of learning explicit features that characterize key phrases, such frameworks exploit the structure of the text itself to determine whether the phrases present are central to the text. This works in much the same way that PageRank selects important web pages (making TextRank an algorithm that runs PageRank on a graph designed for a particular NLP task). Further, TextRank does not rely on previous training data, making it possible to run the algorithm on any arbitrary piece of text and produce an output based entirely on the text's intrinsic properties.
Consequently, PhrazIt ranks sentences on the basis of their thematic score, thereby accounting for context by
1. Leveraging a contextualized distributed semantic space
2. Using a weighted model to identify the most thematically relevant sentences in a passage of text
To put this in perspective, PhrazIt still uses a general-purpose graph-based ranking algorithm. In traditional graph-based ranking algorithms, the graph is built using the words of a sentence as the vertices, with edges typically based on a static feature score derived from term position, term length, and term frequency. In PhrazIt, by contrast, the vertices are the contextualized phrasal representations of the sentences that form the passage, and the edges are based on an aggregate score that combines the static features (term length, term position, and term frequency) with the similarity score computed from the contextualized phrasal vectors.
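This edge-weighting scheme can be sketched in a few lines. The blend below is a minimal illustration, not PhrazIt's actual formula: the `alpha` mixing weight, the `build_graph` name, and the precomputed `static_scores` matrix are all assumptions made for the sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two phrasal vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_graph(phrase_vectors, static_scores, alpha=0.5):
    """Weighted sentence graph: vertices are sentences; each edge weight
    blends a static-feature score (term length/position/frequency) with
    the cosine similarity of the contextualized phrasal vectors.
    alpha is an illustrative mixing weight, not from PhrazIt."""
    n = len(phrase_vectors)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = (alpha * static_scores[i, j]
                           + (1 - alpha) * cosine(phrase_vectors[i],
                                                  phrase_vectors[j]))
    return W
```

The resulting weighted adjacency matrix is what the ranking step later consumes.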
What are these contextualized phrasal vectors? How does PhrazIt leverage them?
The contextualized phrasal vectors are created by building on Word2Vec, a distributional semantics framework built on the hypothesis that linguistic terms with similar meanings have similar distributions; that is, similar words appear in similar contexts. However, naive Word2Vec generates vector representations only at the word level, not the phrasal level. And since words can only capture so much, Word2Vec hits limitations when we want to exploit and learn relationships between sentences that provide context to the document(s) under analysis.
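The word-level limitation is easy to demonstrate. In the toy sketch below (the two-dimensional vectors are made up for illustration; real Word2Vec embeddings would be learned), the simplest way to get a sentence vector from word vectors, averaging, assigns identical representations to sentences with very different meanings:

```python
import numpy as np

# Toy word vectors; in practice these would come from a trained Word2Vec model.
word_vecs = {
    "dog":   np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man":   np.array([1.0, 1.0]),
}

def naive_sentence_vec(words):
    """Average the word vectors: the naive word-level route to a sentence vector."""
    return np.mean([word_vecs[w] for w in words], axis=0)

# Word order, and hence context, is lost: both sentences map to the same point.
v1 = naive_sentence_vec(["dog", "bites", "man"])
v2 = naive_sentence_vec(["man", "bites", "dog"])
```

This collapse of distinct sentences onto one point is precisely what motivates moving to phrasal, context-aware representations.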
PhrazIt therefore augments the way Word2Vec functions into a 'Word2Vec++' that permits the framework to generate contextualized phrasal vector representations. This is done by integrating LSA and LDA into the solution. Given n sentences, LSA generates the concepts referenced in those sentences. LDA, on the other hand, works as a generative model: it explains a set of observations using a set of unobserved groups and establishes why some data are more similar than others. Given n sentences, it lists the topics referenced in those sentences. PhrazIt uses LSA and LDA to get the concepts and topics referenced in a sentence. Having reduced every sentence to a series of concepts and topics, the learning problem we consider is as follows: given a set of sentences of variable lengths, we want to construct a fixed n-dimensional representation for each sentence with the desired property that sentences that are closer in this n-dimensional space are more similar semantically. Creating this contextualized distributed semantic space of phrasal vectors allows PhrazIt to learn improved representations of sentences, since we are working with a much richer feature space.
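To make the LSA half concrete, here is a minimal sketch of extracting per-sentence "concept" coordinates via a truncated SVD of a term-sentence count matrix. The toy sentences, the choice of `k = 2` concepts, and the plain bag-of-words counts are all illustrative assumptions; the LDA topic side, which PhrazIt also uses, is omitted here.

```python
import numpy as np

sentences = [
    "neural networks learn representations",
    "autoencoders learn compressed representations",
    "stock markets fell sharply today",
]

# Term-by-sentence count matrix (bag of words).
vocab = sorted({w for s in sentences for w in s.split()})
A = np.array([[s.split().count(w) for s in sentences] for w in vocab],
             dtype=float)

# LSA: truncated SVD; the right singular vectors give each sentence
# coordinates in a low-dimensional concept space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
sentence_concepts = Vt[:k].T   # one k-dimensional concept vector per sentence

def concept_similarity(i, j):
    u, v = sentence_concepts[i], sentence_concepts[j]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

The two machine-learning sentences end up close in concept space while the finance sentence stays far away, which is exactly the property the semantic space needs.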
Creating Contextualized Phrasal Vectors
Using a dictionary of size d, we represent a sentence of m words as a d x m matrix, M, where the i-th column is a d-dimensional vector with the entry corresponding to the dictionary index of the i-th word set to 1, and 0 elsewhere. Using this index matrix, we assign each word its continuous feature representation by simply multiplying the two matrices, X=LM, where the i-th column in L is the representation of the i-th word in the dictionary, M is the index matrix, and X is the n x m matrix representing the sentence. This X will be the final input to our model. Note that sentences of different lengths, m and m’, would have matrices of different dimensions, n x m and n x m’. Our model needs to construct a continuous n-dimensional representation from an n x m matrix for any positive-valued m.
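The indexing step above is a pair of matrix shapes and one multiplication, sketched below with a random embedding matrix standing in for the learned L (the specific sizes d = 5, n = 3 and the word indices are arbitrary illustrations):

```python
import numpy as np

d, n = 5, 3                        # dictionary size, embedding dimension
rng = np.random.default_rng(0)
L = rng.normal(size=(n, d))        # column i = representation of word i

def sentence_matrix(word_indices, d):
    """d x m one-hot index matrix M for a sentence of m words."""
    m = len(word_indices)
    M = np.zeros((d, m))
    for col, idx in enumerate(word_indices):
        M[idx, col] = 1.0
    return M

words = [2, 0, 4]                  # a 3-word sentence as dictionary indices
M = sentence_matrix(words, d)
X = L @ M                          # n x m continuous sentence representation
```

Each column of X is simply the embedding of the corresponding word, which is what makes the one-hot formulation equivalent to a lookup.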
The model essentially functions as a recursive autoencoder, since it automatically extracts features in an unsupervised manner. In other words, PhrazIt's contextualized distributed semantic space works off the paradigm of learning an identity function to reproduce its inputs.
The autoencoder collapses the sentence by taking two neighboring words (concepts or topics) and concatenating their two n-dimensional vector representations to form one 2n-dimensional input vector to the autoencoder. After applying the encoding process, the activations at the hidden layer form an n-dimensional vector representing the two words jointly. We then replace the two words (concepts or topics) with this joint representation and repeat until there is only one representation for the entire sentence. The manner of this collapse (the conversion of a sentence into its contextualized phrasal vector representation) is illustrated in the diagram below.
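The collapse can be sketched as follows. The encoder weights here are random stand-ins (in the real model they would be trained to minimize reconstruction error), and the greedy left-to-right merge order is an illustrative simplification; a trained recursive autoencoder would pick which neighbors to merge based on reconstruction error.

```python
import numpy as np

n = 4                                   # embedding dimension
rng = np.random.default_rng(1)
W_enc = rng.normal(scale=0.1, size=(n, 2 * n))  # untrained encoder weights
b_enc = np.zeros(n)

def encode(left, right):
    """Merge two n-dim child vectors into one n-dim parent vector."""
    child = np.concatenate([left, right])        # 2n-dim input
    return np.tanh(W_enc @ child + b_enc)        # n-dim hidden activation

def collapse(vectors):
    """Repeatedly merge neighbors until one vector represents the sentence."""
    vecs = list(vectors)
    while len(vecs) > 1:
        merged = encode(vecs[0], vecs[1])
        vecs = [merged] + vecs[2:]
    return vecs[0]

sentence = [rng.normal(size=n) for _ in range(5)]   # 5 word/concept vectors
phrase_vector = collapse(sentence)   # one n-dim vector for the whole sentence
```

Whatever the sentence length m, the loop runs m - 1 merges and always yields a single fixed-size vector, which is the property the variable-length input demands.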
Thus, we have now effectively rendered a phrasal representation for every sentence by leveraging Word2Vec, LSA, and LDA through an artificial neural network. For PhrazIt, we additionally generate a vector representation of the entire document itself. It is this document vector that we compare with all of the phrasal vector representations.
As illustrated earlier, PhrazIt constructs a graph using the contextualized phrasal representations of the sentences as the vertices, while the edges are based on aggregate scores that combine the static features of term length, term position, and term frequency with the similarity score computed from the contextualized phrasal vectors (that is, how close the phrasal vectors are in the n-dimensional space).
Once this graph is constructed, it is used to form a stochastic (Markov) matrix, which is combined with a damping factor (just as PageRank works in the random-surfer model). The ranking over the vertices, in this case the sentences, is obtained by finding the eigenvector corresponding to the eigenvalue 1. It is this ranking that establishes the thematic relevance of the sentences in the passage of text. In other words, the sentence that has the highest aggregated similarity score with the document itself is touted as the most thematically relevant given the document's context.
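The ranking step above is standard power iteration, sketched here. The damping factor of 0.85 is the conventional PageRank default, assumed rather than taken from PhrazIt, and the `rank_sentences` name is illustrative.

```python
import numpy as np

def rank_sentences(W, damping=0.85, tol=1e-10):
    """TextRank-style ranking: normalize the weighted adjacency matrix W
    into a row-stochastic matrix, apply the damping factor, and find the
    stationary distribution (the eigenvector for eigenvalue 1) by power
    iteration."""
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)       # row-stochastic transitions
    G = damping * P + (1 - damping) / n        # damped matrix: irreducible,
                                               # so the fixed point is unique
    r = np.full(n, 1.0 / n)
    while True:
        r_next = r @ G
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
```

The returned scores sum to 1, and sorting sentences by score gives the thematic-relevance ranking.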
While PhrazIt builds on TextRank for single-document summarization, it builds on LexRank for multi-document summarization. The TextRank framework is extrapolated to LexRank to also allow for lexical association. LexRank conventionally uses just cosine similarity; within the PhrazIt framework, however, we again use the contextualized phrasal vectors to assess phrasal similarity (as elucidated in the previous section). Additionally, we apply a heuristic post-processing step that not only adds sentences in ranked order but also discards sentences that are paraphrases of each other, so that the rendered summary biases away from picking sentences that are paraphrases or repetitions of one another.
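One simple form such a paraphrase filter could take is sketched below: walk sentences in rank order and keep one only if it is not too similar to anything already selected. The 0.9 similarity threshold and the `select_summary` interface are illustrative assumptions, not PhrazIt's actual heuristic.

```python
import numpy as np

def select_summary(ranked_ids, phrase_vectors, k, max_sim=0.9):
    """Greedy redundancy filter: take sentences in rank order, skipping
    any whose phrasal vector is too similar to an already-chosen one
    (a likely paraphrase); stop after k sentences."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    chosen = []
    for i in ranked_ids:
        if all(cos(phrase_vectors[i], phrase_vectors[j]) < max_sim
               for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

Because the walk respects rank order, the filter only ever drops the lower-ranked member of a paraphrase pair.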
Customer Adoption:
Industries spanning several domains recognize the usefulness of a progressive text summarizer. Consequently, PhrazIt has seen a series of successful adoptions by several clients.
Use case #1: Summarizing long emails
The client (a popular corporate learning solutions provider) wished to condense a set of verbose emails (~3,500 words per email) into a few sentences (~50 words) that conveyed the most important points, capturing the essence of each email. The most important requirement was that the summary not miss the calls to action in the emails. PhrazIt's ability to render summaries in accordance with the context of the document (here, the email) helped tremendously: it generated a contextualized summary that is a precise representation of the email and also ranks the action items (the important pieces of the email) according to their thematic relevance.
Use case #2: Integrating into a partner workflow
The client (a popular Q&A site) wanted to highlight and pick thematically relevant text snippets from wordy answers. PhrazIt was successfully integrated into the client's workflow, with PhrazIt and Watson's Retrieve and Rank offering together providing a comprehensive solution that let users quickly skim the answers to their queries.
Future Work:
Evaluating summaries, whether automatically or manually, continues to be a difficult task. The main problem in evaluation is the impossibility of building a standard against which a system's results can be compared, given that the nature of a summary is so subjective. Content selection is not a settled problem: different readers would likely select completely different sentences. As a result, it is imperative that we continue to develop frameworks focused on bringing order to texts.