Summarization by Adjacent Document: a New Way to Extract Insights
Today’s search engines are very good at surfacing relevant information, but finding that information is really just step one: distilling it, synthesizing it, and applying it to build more informed beliefs make up a large part of the knowledge acquisition journey.
How do we best distill the world’s knowledge?
This monumental question really sits at the foundation of Considdr’s mission. It’s about figuring out what’s most important in any source of information and expressing it in a concise format. Here are a few other ways to frame this question for any kind of information: What are the key takeaways? What’s worth considering in forming my belief? What’s the TLDR? What’s the summary?
Existing Approaches to Summarization
Generally speaking, there are two common automated summarization approaches: extractive and abstractive. For a more comprehensive look at extractive summarization approaches, see this great article by Sciforce on the subject. I’ll borrow the author’s definitions of each form of summarization here:
Extractive summarization means identifying important sections of the text and generating them verbatim producing a subset of the sentences from the original text; while abstractive summarization reproduces important material in a new way after interpretation and examination of the text using advanced natural language techniques to generate a new shorter text that conveys the most critical information from the original one.
Obviously, abstractive summarization is more advanced and closer to human-like interpretation. Though it has more potential (and is generally more interesting for researchers and developers), so far the more traditional methods have proved to yield better results.
So in general, extractive approaches are about finding the best representative sentences in a document and returning those sentences directly. Abstractive approaches try to get algorithms to actually write new sentences to summarize a document.
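To make the extractive side concrete, here is a minimal sketch of one classic technique: score each sentence by the average document-wide frequency of its words and return the top scorers verbatim. The function name and the frequency-based scoring are illustrative choices, not taken from any particular system described above:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences of `text`, verbatim and in
    original order, where a sentence's score is the mean document-wide
    frequency of its words."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original document order among the selected sentences
    return [s for s in sentences if s in top]
```

Note that the output is a subset of the input sentences, unchanged; an abstractive system would instead generate new sentences, which is why it requires far more advanced language modeling.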
An Imagined Ideal for Summarization
It’s an extremely helpful exercise to think about what the absolute ideal for knowledge distillation might look like. Say we want to distill an entire book down to its most important insights. In fact, there are very cool companies that hire teams of humans to do this work. Blinkist is one prominent example.
Imagine you’re an employee at Blinkist. What would enable you to write the best summary of key takeaways from the book The Lean Startup by Eric Ries?
- Gain expertise: Well, you could spend a lifetime reading the extant literature on startups, getting lots of direct experience working with startups, and building deep expertise in entrepreneurship. (Remember, we’re imagining the absolute ideal.)
- Identify what’s most important: Great. Now that you’ve put in decades of hard work gaining domain expertise — go ahead and read the entire book carefully. Next, identify what’s unique about the work; what’s most important; and what’s most essential to consider.
- Write a concise summary: Finally, you’re ready to write up a set of carefully worded sentences that paraphrase the key takeaways from the book.
After spending your entire life dedicated to generating the best summary you possibly can for The Lean Startup, you can now happily retire and look back on a life well spent. :)
Summarization by Adjacent Document
Obviously, existing extractive and abstractive approaches to summarization do not get anywhere close to the ideal. After reading through the above example, you might think — well, of course that’s a totally unrealistic expectation for summarization — no one is going to spend their life preparing to write a good summary!
But, what if I told you that our approach — Summarization by Adjacent Document — is just like having an expert:
- spend a lifetime gaining context in a given discipline
- read through a full text document or any source of information and identify the most important bits
- write a concise abstractive summary statement reflecting the key insights
Pretty great, right? Well, our model isn’t that expert, but it does identify the output of the work that an expert (or many) has already done. Think about it: when writers cite someone else, they (ideally) have a lot of contextual understanding of the field they’re in; they’ve meticulously read the document in question and many of the relevant related documents; they’ve done the work of figuring out what is most important in that document; and then they write a nice, concise summary of a key insight in the sentences around the citation in their own work.
Summarization by Adjacent Document applies to both academic and non-academic writing. Though the explicit citation structure of academic documents makes applying our approach pretty straightforward, at Considdr we actually focused on applying it primarily to non-academic material.
Another way of thinking about Summarization by Adjacent Document is that it attempts to extract abstractive sentences from adjacent documents. At Considdr, we called this new approach to insight extraction “Summarization by Adjacent Document” because we don’t look at Document A to generate a summary for Document A, but instead we look at Documents B, C, and D, which cite Document A, to find one or many insight sentences (places where those authors have abstracted the key points from Document A).
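Under simplifying assumptions (plain-text citing documents and a known citation marker for Document A), the core extraction step can be sketched as follows. The function name and the one-sentence window on either side of the citation are hypothetical illustrations, not Considdr’s actual pipeline:

```python
import re

def insights_from_citing_docs(citing_docs, citation_marker):
    """Collect candidate insight sentences for Document A by scanning the
    *adjacent* documents (B, C, D, ...) that cite it: the sentence containing
    the citation, plus its immediate neighbors."""
    insights = []
    for doc in citing_docs:
        sentences = re.split(r'(?<=[.!?])\s+', doc.strip())
        for i, sent in enumerate(sentences):
            if citation_marker in sent:
                # The citing sentence and its neighbors form the candidate
                # window where the author likely abstracted Document A.
                insights.extend(sentences[max(0, i - 1): i + 2])
    return insights
```

A real system would then rank these candidates (e.g., by how “abstractive” each sentence is) rather than returning the raw window, but the key inversion is visible here: Document A is never read; only the documents around it are.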
At the end of the day, Summarization by Adjacent Document really just leverages the knowledge distillation work that has already been done (in many cases for centuries!) but is fragmented across many documents. If you’d like to test it out, we’ve released our insight model as a Python package. Our insight_extractor returns the “abstractive value” of a given input sentence and is very simple to integrate into any project that wants to use our Summarization by Adjacent Document approach.
Extensions of Summarization by Adjacent Document: Clustering Insight
There are many interesting extensions of our summarization approach. I’ll just cover one big one here.
Because we can leverage many documents at once, we can more easily identify unusual variation in citations and understand which insights are most important or most cited. We built a second model at Considdr to do just that. It clusters sentences to figure out when multiple authors are citing the same insight. To do this, we fine-tuned the state-of-the-art model for the Quora Question Pairs matching task (see: https://www.kaggle.com/c/quora-question-pairs).
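Our production model was a tuned neural pair-matcher, but the clustering idea itself can be illustrated with a much simpler stand-in: bag-of-words cosine similarity plus greedy grouping. Everything here (function names, the similarity threshold) is an illustrative assumption, not the model we actually shipped:

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cluster_insights(sentences, threshold=0.5):
    """Greedy clustering: a sentence joins the first cluster whose seed
    sentence it resembles above `threshold`; otherwise it seeds a new one.
    Clusters with many members suggest an insight cited by many authors."""
    vecs = [Counter(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    clusters = []  # each cluster is a list of sentence indices
    for i, v in enumerate(vecs):
        for cluster in clusters:
            if cosine(v, vecs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return [[sentences[i] for i in c] for c in clusters]
```

A pair-matching model like the Quora Question Pairs one plays the role of `cosine` here, with a learned notion of “these two sentences say the same thing” replacing raw word overlap.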
Unfortunately, this model relied on Considdr’s massive graph database of more than 3 million extracted insights, which we could no longer afford to maintain after the company closed, so it’s not included in the package referenced above.