Understanding Text Complexity with Readability Formulas.

Calculating the “ease to understand” factor of any given text.

Suhas Dattatreya
Analytics Vidhya

Readability is generally measured by two key components — the typography and the content.
Typography covers the presentation of text — attributes like font size, line height, kerning, etc. are accounted for. Content takes a more statistical and lexical approach to the words in the text. [1]

What does a readability score tell us?

Readability scores are approximations. One shortcoming of readability scores is that they do not account for grammar, spelling, voice, or any other qualitative aspect of a text. A text riddled with grammatical issues might receive the same readability score as one with perfect grammar.

Readability tests: the score generally helps us place our text on a scale from ‘easy to understand’ to ‘complex to understand’. In this context, the ability to understand a text correlates with its readability.
This article walks through how to calculate readability: the math, the formulae, and Python examples, so we can calculate the complexity of our text. [2]
You can find the libraries used and the references at the end of this post.

Note: I’m not an expert in typography. All my opinions and content in this article on readability formulas will strictly be on the statistical or lexical approach.

I’ve personally used readability formulas to understand the complexity of text. I’ve found that each readability formula works best within its own use cases (see the use cases under each formula). Here’s a list of the most used readability formulas:

Most used readability formulas

Note: For this article, we’re going to focus on these formulas only

Working our way from the bottom up, we have:

  1. Flesch Reading Ease: The general idea with Flesch Reading Ease is that the average sentence length and the average number of syllables per word are used to calculate the reading ease.
    The median score for readable text is 60–70, on a scale where a higher score means easier text.
Flesch Reading Ease

Where do we use Flesch Reading Ease? Its effect on search ranking is unproven, but it’s a good factor to consider for SEO. Usually, for holistic SEO, readability is key. [3]
If your text is very complicated, you may scare off your audience and make them search elsewhere for information.
Refer Flesch Reading Ease

Formula to calculate Flesch Reading Ease
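The commonly published form of Flesch Reading Ease can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names are my own, sentences are split naively on ., ! and ?, and syllables are estimated by counting vowel groups, which is an assumption that will miscount some words (silent "e", for example).

```python
import re


def count_syllables(word):
    """Rough estimate (assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease; a higher score means easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


print(flesch_reading_ease("The cat sat on the mat. It was warm."))
```

Very short, simple sentences can score above 100, so the scale is not strictly bounded.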

2. Gunning Fog: Unlike Flesch Reading Ease, Gunning Fog does not score text on an ease scale. Instead, it estimates an index that represents the years of formal education a person needs to understand the text on the first reading. The Gunning Fog index ranges from 0–20.
If a piece of text has a Gunning Fog index of 7, anyone who is educated to the 7th grade (roughly 12–13-year-olds) should find it easily readable.
Median Score for Gunning Fog is around 8.

The formula for Gunning Fog

Refer Gunning Fog
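The Gunning Fog index is usually written as 0.4 times the sum of the average sentence length and the percentage of complex words. Here is a small sketch that works from counts you have already computed. The function and parameter names are my own, and "complex word" is assumed to mean a word of three or more syllables, which is the usual definition.

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning Fog index from pre-computed counts.

    complex_words: number of words with three or more syllables
    (assumed definition; stricter variants also exclude proper nouns
    and words that only reach three syllables through common suffixes).
    """
    if total_words == 0 or total_sentences == 0:
        raise ValueError("need at least one word and one sentence")
    average_sentence_length = total_words / total_sentences
    percent_complex = 100 * complex_words / total_words
    return 0.4 * (average_sentence_length + percent_complex)


# Example: a 100-word passage with 5 sentences and 10 complex words.
print(gunning_fog(100, 5, 10))
```

Keeping the counting separate from the arithmetic makes it easy to swap in a better syllable counter later.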

3. Automated Readability Index (ARI): ARI is very straightforward to calculate. It takes into account characters, words, and sentences, leaving out debatable metrics like “complex words”. This formula was developed for the military. ARI was originally calculated by a small piece of equipment attached to a typewriter, which tabulated the number of keystrokes, words, and sentences in any passage being typed.
It’s still a popular formula and is particularly useful for technical writing.

Formula for ARI

Refer ARI here
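Because ARI only needs characters, words, and sentences, it can be computed without any syllable counting. The sketch below uses the commonly published coefficients. The function name and the tokenizing choices (split on whitespace, count only letters and digits as characters, split sentences on ., ! and ?) are assumptions of mine.

```python
import re


def automated_readability_index(text):
    """ARI from raw text, using naive tokenization (assumption)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        raise ValueError("text needs at least one word and one sentence")
    # Count letters and digits only; punctuation and spaces are ignored.
    characters = sum(1 for ch in text if ch.isalnum())
    characters_per_word = characters / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 4.71 * characters_per_word + 0.5 * words_per_sentence - 21.43


print(automated_readability_index("The cat sat on the mat. It was warm."))
```

Very simple text can produce a negative value, which just means it sits below the first grade level on this scale.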

5. SMOG Index: SMOG is considered the “gold standard” for measuring the readability of medical writing in the healthcare industry. SMOG stands for ‘Simple Measure of Gobbledygook’. The formula takes polysyllabic words into consideration (in SMOG, polysyllabic means a word of three or more syllables).

Refer SMOG here
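The commonly published SMOG formula normalizes the polysyllable count to a 30-sentence sample and takes a square root. Here is a minimal sketch from pre-computed counts; the function and parameter names are hypothetical, and how you count syllables is left to you.

```python
import math


def smog_index(polysyllable_count, sentence_count):
    """SMOG grade from pre-computed counts.

    polysyllable_count: number of words with three or more syllables.
    sentence_count: number of sentences those words were counted in.
    """
    if sentence_count == 0:
        raise ValueError("need at least one sentence")
    # Scale the polysyllable count to a 30-sentence sample.
    normalized = polysyllable_count * 30 / sentence_count
    return 1.0430 * math.sqrt(normalized) + 3.1291


# Example: 30 polysyllabic words found in a 30-sentence sample.
print(smog_index(30, 30))
```

The formula was designed around samples of 30 sentences, so treat results on much shorter passages as rough.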

6. Flesch-Kincaid: IMHO, perhaps the most used formula, especially in content writing on the internet. My team was responsible for developing a content management system for an e-commerce website that also had a blog section. We decided to add a metric where the content writer would choose a geographic location and our backend would calculate the average readability score of the users in that location.
That score would later be compared with the work-in-progress article to give the writer insight into whether the draft would appeal to the target audience.
Essentially, words that contain a lot of syllables are harder to read than words that use fewer syllables.

Flesch Kincaid readability formula

Refer Flesch-Kincaid here
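The Flesch-Kincaid grade level uses the same two ratios as Flesch Reading Ease but maps them to a school grade. This is a sketch from pre-computed counts using the commonly published coefficients; the function and parameter names are my own.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from pre-computed counts."""
    if total_words == 0 or total_sentences == 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Example: 100 words, 5 sentences, 150 syllables.
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

The syllable term carries the larger weight, which matches the point above: more syllables per word pushes the grade level up quickly.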

7. Coleman-Liau: The problem with most readability formulas was that they required syllable counting, which was an expensive process. Coleman-Liau essentially holds that word length in letters is a better predictor of readability than word length in syllables.

After testing, the Coleman-Liau index was accepted for general use cases along with educational and technical writing.

CLI (Coleman-Liau Index) = 0.0588L - 0.296S - 15.8
where L is the average number of letters per 100 words
and S is the average number of sentences per 100 words

Refer Coleman-Liau here
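Since Coleman-Liau only needs letters, words, and sentences, it is easy to compute straight from text. A minimal sketch follows, assuming whitespace-separated words and sentences that end in ., ! or ?; the function name is hypothetical.

```python
import re


def coleman_liau_index(text):
    """Coleman-Liau index from raw text, using naive tokenization."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text needs at least one word and one sentence")
    letters = sum(1 for ch in text if ch.isalpha())
    letters_per_100_words = letters / len(words) * 100      # L
    sentences_per_100_words = len(sentences) / len(words) * 100  # S
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


print(coleman_liau_index("The cat sat on the mat. It was warm."))
```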

8. Dale-Chall Readability: The Dale-Chall readability score measures a text against a list of words considered familiar to fourth-graders. According to its scale, the more unfamiliar words a text uses, the higher its reading level will be.

Dale-Chall compares the text with a known repository of words to decide whether each word in the text is complex or not. The words are compiled from 4th and 5th-grade textbooks. You can see the list of words here or here (alphabetically categorized)

Dale-Chall Readability
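The commonly published Dale-Chall formula combines the percentage of unfamiliar ("difficult") words with the average sentence length, and adds a fixed adjustment when more than 5% of the words are difficult. In the sketch below, FAMILIAR_WORDS is a tiny hypothetical stand-in for the real list of familiar words, and the exact-match lookup is a simplification (the real procedure also accepts plurals and other inflected forms).

```python
import re

# Hypothetical stand-in; the real Dale-Chall list is far larger.
FAMILIAR_WORDS = {"the", "cat", "sat", "on", "mat", "it", "was", "warm"}


def dale_chall_score(text, familiar_words=FAMILIAR_WORDS):
    """Dale-Chall score using exact lowercase matches against a word set."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text needs at least one word and one sentence")
    difficult = [w for w in words if w not in familiar_words]
    percent_difficult = 100 * len(difficult) / len(words)
    score = 0.1579 * percent_difficult + 0.0496 * len(words) / len(sentences)
    # Adjustment applied when more than 5% of the words are unfamiliar.
    if percent_difficult > 5:
        score += 3.6365
    return score


print(dale_chall_score("The cat sat on the mat. It was warm."))
print(dale_chall_score("The cat sat on the ottoman."))
```

The second sentence scores higher only because "ottoman" is missing from the stand-in set, which shows how much the result depends on the word list you load.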

Thank you for reading.
References:

[1] https://en.wikipedia.org/wiki/Readability
    https://www.ahrq.gov/talkingquality/resources/writing/tip6.html
    http://www.ncpublicschools.org/docs/superintendents/memos/2014/01/readability.pdf

[2] https://en.wikipedia.org/wiki/Readability#Text_leveling

[3] http://www.readabilityformulas.com/articles/why-use-readability-formulas.php
    https://www.thoughtco.com/readability-formula-1691895
    http://www.impact-information.com/impactinfo/Limitations.pdf
