Useful Document Summary Stats

Linda Gorman
Published in fathominfo · 4 min read · Jul 31, 2019

It only takes a couple of sentences to get a feel for the character of a document. Take these examples:

“We tackle the problem of counting the number of k-cliques in large-scale graphs, for any constant k ≥ 3.”

“Now, I’m guessing we won’t agree on health care anytime soon. A little applause right there. Just a guess. But there should be other ways parties can work together to improve economic security.”

You can probably guess from the terminology that the first sample comes from an academic paper in the math/science domain. Even without the references to applause and political parties, the contractions and casual tone suggest that the second is a transcript of a speech.

For a recent project analyzing large document collections, we wanted to see whether we could make some of the same inferences using code.

The goal was to write a program that would encode key characteristics of a document into a set of numbers that could be plotted and analyzed. Ideally, those metrics could be used to clearly distinguish between very different document formats (e.g. song lyrics vs a legal brief). They could also be used to look at variation and identify outliers within a set of similar documents (e.g. a cache of diplomatic cables).

We started by coming up with a set of axes that might be used to characterize a document. Is it accessible or esoteric? Does it dive deeply into one topic or cover many topics? Is it free form or does it follow a set format? Here are a few more:

  • breadth ← → depth
  • short ← → long
  • simple ← → complex
  • objective ← → subjective
  • casual ← → formal

Some of these characteristics were easy to quantify, and others much less so. We particularly liked the following five metrics, both because they felt useful and meaningful and because the methodology behind them was easy to understand and communicate.

  1. Word Count: You know this one from high school (I’ve been checking it while writing this blog post). Word count is a quick way to get a sense of whether you’re looking at a news brief, a tweet, or an in-depth report.
  2. Formality: The F-score is derived from the ratio of context-dependent to context-independent words. In casual modes of communication like speech or email, people use lots of context-dependent words: “My mom went to her favorite store last week.” Formal texts are typically much more precise: “Mary Johnson went to Ace Hardware on June 9th.”
  3. Adjective Density: The count of adjectives in a document, proportional to the overall word count. It can be used as a measure of formality; we don’t tend to use many adjectives when we’re talking or texting. Adjective density can also give a feel for how descriptive or subjective the text is.
  4. Numeric Density: The count of numbers in a document relative to the overall word count. Numeric density is a good indicator of how quantitative a document is. Scientific journals have high numerical densities, as do receipts and contracts.
  5. Reading Level: Used both in education and government, the Flesch-Kincaid Grade Level formula calculates a reading level based on sentence length and the number of syllables in words. Many states use a variation on this formula to regulate the readability of insurance policies.
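As a sketch of how a formality score along these lines might be computed (following the Heylighen–Dewaele F-score formulation; the tag names below, and the assumption that part-of-speech counts have already been produced by a tagger, are mine):

```python
def f_score(pos_counts):
    """Formality (F) score from part-of-speech counts.

    Nouns, adjectives, prepositions, and articles count as
    context-independent; pronouns, verbs, adverbs, and interjections
    count as context-dependent. Frequencies are percentages of all
    tagged words, so the score ranges from 0 (maximally context-
    dependent) to 100 (maximally context-independent).
    """
    total = sum(pos_counts.values())
    if total == 0:
        return 50.0  # no evidence either way
    freq = lambda tag: 100.0 * pos_counts.get(tag, 0) / total
    independent = (freq("noun") + freq("adjective")
                   + freq("preposition") + freq("article"))
    dependent = (freq("pronoun") + freq("verb")
                 + freq("adverb") + freq("interjection"))
    return (independent - dependent + 100) / 2
```

A text tagged as all nouns scores 100; one that is all pronouns and verbs scores 0.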
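Numeric density is simple enough to sketch directly. Here a token counts as numeric if it contains a digit, which is a simplification of my own; adjective density would look similar but additionally needs a part-of-speech tagger (e.g. spaCy or NLTK):

```python
import re

def numeric_density(text):
    """Fraction of whitespace-separated tokens containing a digit,
    so amounts like '$40' and dates like 'June 9' both count."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for tok in tokens if re.search(r"\d", tok))
    return numeric / len(tokens)
```

A receipt-like line such as “paid $40 on June 9” scores 0.4, while a casual sentence typically scores near zero.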
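The Flesch-Kincaid grade level itself is a published formula over average sentence length and syllables per word. A minimal sketch, using a rough vowel-group heuristic for syllable counting (production implementations use pronunciation dictionaries or better heuristics):

```python
import re

def count_syllables(word):
    """Rough syllable count: runs of vowels, with a silent-e adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # e.g. 'store' has one spoken syllable, not two
    return max(count, 1)

def fk_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)
```

Short, monosyllabic sentences land at or below an early grade level, while long sentences full of polysyllabic words push the score well into double digits.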

Taken together, these numbers can be used to create a sort of “report card” for a document. To show these tools in action, I’ve created a simple interactive game.

These metrics are far from perfect. The shorter the document, the more likely these formulas are to misfire and produce a misleading result (a short document with long sentences might show a reading level of 50+). They also don’t capture the nuances that human readers can pick up on from a young age — sarcasm, irony, politeness. Formality in a legal brief means something different from formality in a wedding invitation.

And it’s worth noting that these stats are focused on the medium rather than the message; they won’t tell you about the people, places, and things that come up in the documents (though we’re working on that problem as well, stay tuned!).

For the reasons described above and others, we see these stats as a starting point rather than a final destination. The numbers alone don’t tell the whole story, but they can help surface unusual data points or groupings. We can see applications for journalists, historians, research administrators, and anyone else faced with a stack of documents too daunting to skim.

For more on how we’re thinking about document collections, check out fathom.info/text and fathom.info/tools.


Linda Gorman is a CS master’s student at the University of Utah.