Written Comm Analyzer — Vocabulary Range

Sajid Rahman
Published in Prod.IO
Oct 2, 2018

The range of vocabulary used in written communication is a direct indication of the writer's command of the language, so we considered a measure of vocabulary range an important aspect for analysis. We explored the state-of-the-art techniques used to measure and score vocabulary range and came up with the following algorithm:

  1. Create a long list of words (80K words in our case) sorted by frequency of use, i.e., commonly used words like ‘the’ will sit in the top 5, whereas rarely used words will be towards the end of the list. We will call this list ‘A’.
  2. Process the text and get a list of its unique words. We will call this list ‘B’.
  3. Arrange the words in B according to their position in list A, from most common to rarest. We will call this list ‘sorted B’.
  4. Take the 75th, 85th and 95th percentile words out of ‘sorted B’, and look them up in A to get their positions.
  5. The positions of these three words are an indirect indication of the vocabulary range. The higher the positions (that is, the rarer the words), the wider the range of vocabulary used in the text.
  6. Calculate a Z-score for a better understanding of the vocabulary range.
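
Steps 1 to 5 above can be sketched in Python roughly as follows. This is a minimal illustration, not our production code: the function name `percentile_ranks`, the regex tokenizer, the nearest-rank percentile rule, and the choice to skip words that are missing from list A are all assumptions made for the example.

```python
import math
import re


def percentile_ranks(text, frequency_list, percentiles=(75, 85, 95)):
    """Return the list-A position of the word at each percentile of 'sorted B'."""
    # List A: map every word to its 1-based position (1 = most frequent).
    rank = {}
    for position, word in enumerate(frequency_list, start=1):
        rank.setdefault(word.lower(), position)

    # List B: the unique words of the text (simple tokenizer, an assumption).
    unique_words = set(re.findall(r"[a-z']+", text.lower()))

    # Sorted B, kept as positions in A. Words not found in A are skipped,
    # which is an assumption; the steps above do not say how to treat them.
    sorted_ranks = sorted(rank[w] for w in unique_words if w in rank)
    if not sorted_ranks:
        return {}

    result = {}
    for p in percentiles:
        # Nearest-rank percentile: smallest index covering p% of the words.
        index = max(math.ceil(p * len(sorted_ranks) / 100) - 1, 0)
        result[p] = sorted_ranks[index]
    return result


# Toy stand-in for the 80K-word list A, most frequent word first.
toy_list_a = ["the", "a", "cat", "sat", "on", "mat", "quixotic", "zephyr"]
ranks = percentile_ranks("The cat sat on the mat", toy_list_a)
```

With a real frequency list, a text that uses rarer words would produce larger positions at all three percentiles.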

This technique is used in a lot of tools that calculate the range of vocabulary. In our experiments, we found that it represented the vocabulary range well when compared with our impression from reading the texts manually. The three positions, along with the Z-score, can be combined in a weighted formula to generate a score.
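
As a rough sketch of how a Z-score and a weighted formula might fit together, the Python below standardizes each percentile position against positions measured on a set of reference texts, then takes a weighted sum. Everything specific here is an assumption for illustration: the post does not define what the Z-score is computed over, and the names `z_score` and `vocabulary_score`, the reference values, and the weights are hypothetical.

```python
from statistics import mean, pstdev


def z_score(value, reference_values):
    """Standard score of `value` relative to a reference sample."""
    mu = mean(reference_values)
    sigma = pstdev(reference_values)  # population standard deviation
    if sigma == 0:
        return 0.0  # no spread in the reference sample, so no signal
    return (value - mu) / sigma


def vocabulary_score(ranks, reference_ranks, weights):
    """Weighted sum of per-percentile Z-scores (hypothetical formula)."""
    return sum(
        weights[p] * z_score(ranks[p], reference_ranks[p]) for p in weights
    )


# Hypothetical inputs: positions for one text, positions observed on
# reference texts, and illustrative weights that sum to 1.
text_ranks = {75: 9, 85: 9, 95: 9}
reference = {p: [2, 4, 4, 4, 5, 5, 7, 9] for p in (75, 85, 95)}
weights = {75: 0.2, 85: 0.3, 95: 0.5}

score = vocabulary_score(text_ranks, reference, weights)
```

A positive score would mean the text uses rarer words than the reference texts on average; the weights would need to be tuned against manual judgments.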
