How to find Term Frequency with Python?

Frank Oprel
Published in Voice Tech Podcast · 5 min read · Sep 16, 2019

Term frequency can be an important indicator of a term’s importance to a text. This simple metric has many useful applications and often appears in natural language processing methods such as Term Frequency-Inverse Document Frequency (TF-IDF).

1. The basics

Counting how many times a word appears in a text can be achieved with one of Python’s built-in data types: the dictionary.

Consider the following string:

string = "apple banana apple banana apple pear peach"

The string consists of four unique words (i.e. apple, banana, pear and peach) that appear one or more times. To count the frequency of each word in the string, we first have to turn the string into a list of words. This allows us to loop over each word later on. Let’s split the string by spaces and put each word into a list:

words = string.split(" ")
print(words)
['apple', 'banana', 'apple', 'banana', 'apple', 'pear', 'peach']

We can now loop over each word in the list words:

for word in words:
    print(word)
apple
banana
apple
banana
apple
pear
peach

However, we want to count how many times each unique word appears in the list words. To do so, we first create an empty Python dictionary, which stores key-value pairs. Then we slightly extend the for-loop to perform the following operations:

  1. If the word does not exist as a key in the dictionary, we will add it and set its value to 1.
  2. If the word does exist as a key in the dictionary, we will increase its value by 1.

It will look like this:

word_count = {}
for word in words:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1
print(word_count)
{'apple': 3, 'banana': 2, 'pear': 1, 'peach': 1}
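As an aside, Python’s standard library can do this count in one step with collections.Counter. This is an alternative to the manual loop above, shown here only for comparison:

```python
from collections import Counter

string = "apple banana apple banana apple pear peach"
words = string.split(" ")

# Counter builds the same word -> count mapping in a single call
word_count = Counter(words)
print(word_count)
# Counter({'apple': 3, 'banana': 2, 'pear': 1, 'peach': 1})
```

Counter behaves like a dictionary, so the rest of the article’s code works with it unchanged.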

2. Preprocessing

This method works very well on the simple string above. Unfortunately, it becomes less reliable when applied to realistic texts such as news articles, blog posts, Facebook posts and tweets. That’s because real texts contain more characters than just letters and spaces. split(" ") will not yield the same results on strings that contain punctuation. Preprocessing is necessary to find term frequencies in realistic texts.


Examine this excerpt from a CNN news article:

string = """
Washington (CNN)John Bolton had to go -- because he wanted to cancel President Donald Trump's worldwide reality show.\nFor a time the now ex-national security adviser, who first caught Trump's eye with his tough talk on Fox News, was useful to the President -- sharing his desire to shake up the globe.\nBut like everyone else in Trump's dysfunctional foreign policy team, Bolton wore out his welcome, standing in the way of his boss' impetuous instincts and seeking a share of the spotlight.\nOnly in the bizarre Trump orbit could the exit of a national security adviser seen as an ideologue and aggressive hawk also be perceived in some ways as the removal of a stabilizing force. But he did have a view of American interests and the use of US power that while hardline was predictable and logical and positioned within the historic boundaries of US diplomacy.\nLike everything in Trump's foreign policy, there is a political explanation for the latest storm that rocked the White House.
"""

There are several things that we have to consider when counting term frequencies in these types of text:

  • Capitalization: “And” and “and” are counted separately
  • Special characters: “policy” and “policy,” are counted separately
  • Line breaks: “\n” is counted as a separate term
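A quick check shows all three problems at once. Splitting a short, simplified fragment of the excerpt on spaces alone glues line-broken words together and leaves punctuation attached to tokens:

```python
# shortened fragment for illustration; split(" ") does not split on "\n"
string = "Bolton wore out his welcome,\nstanding in the way of his boss' instincts."
words = string.split(" ")
print(words)
# ['Bolton', 'wore', 'out', 'his', 'welcome,\nstanding', 'in', 'the',
#  'way', 'of', "his", "boss'", 'instincts.']
```

Note the token 'welcome,\nstanding': the comma and the line break fuse two words into one term.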

Python’s Regular Expression module provides a simple method to substitute characters in a string. This allows you to substitute special characters and line breaks for another character, like a space.

import re

# Replace all line breaks with a space
string = re.sub('\n', ' ', string)
# Replace all special characters with a space
string = re.sub('[^A-Za-z0-9]+', ' ', string)

However, this creates another challenge: replacing line breaks and special characters with spaces leaves leading, trailing and double spaces behind. It also strands single characters, such as the “s” left over from “Trump’s”. These leftovers are easy to clean up:

# Replace all single characters with a space
string = re.sub(r'\b[a-zA-Z]\b', ' ', string)
# Replace all double spaces with one space
string = re.sub(' +', ' ', string)
# Remove leading and trailing spaces
string = string.strip()

Finally, we only have to make the text lower case:

# Make all text lower case
string = string.lower()
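The steps above can be collected into a single helper function. This is one way to sketch it; the function name preprocess is my own, not from any library:

```python
import re

def preprocess(text):
    """Normalize raw text for term counting, following the steps above."""
    text = re.sub('\n', ' ', text)              # line breaks -> spaces
    text = re.sub('[^A-Za-z0-9]+', ' ', text)   # special characters -> spaces
    text = re.sub(r'\b[a-zA-Z]\b', ' ', text)   # drop stranded single characters
    text = re.sub(' +', ' ', text)              # collapse repeated spaces
    return text.strip().lower()                 # trim and lower-case

print(preprocess("John Bolton had to go -- because he wanted to\ncancel the show."))
# john bolton had to go because he wanted to cancel the show
```

The order of the substitutions matters: single characters and repeated spaces only appear after the first two replacements, so those cleanup steps must come last.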

If we now split the text based on spaces and place it into a list, counting term frequencies will yield clean results:

words = string.split(" ")
word_count = {}
for word in words:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1
print(word_count)
{'washington': 1, 'cnn': 1, 'john': 1, 'bolton': 2, 'had': 1, 'to': 4, 'go': 1, 'because': 1, 'he': 2, 'wanted': 1, 'cancel': 1, 'president': 2, 'donald': 1, 'trump': 5, 'worldwide': 1, 'reality': 1, 'show': 1, 'for': 2, 'time': 1, 'the': 12, 'now': 1, 'ex': 1, 'national': 2, 'security': 2, 'adviser': 2, 'who': 1, 'first': 1, 'caught': 1, 'eye': 1, 'with': 1, 'his': 4, 'tough': 1, 'talk': 1, 'on': 1, 'fox': 1, 'news': 1, 'was': 2, 'useful': 1, 'sharing': 1, 'desire': 1, 'shake': 1, 'up': 1, 'globe': 1, 'but': 2, 'like': 2, 'everyone': 1, 'else': 1, 'in': 5, 'dysfunctional': 1, 'foreign': 2, 'policy': 2, 'team': 1, 'wore': 1, 'out': 1, 'welcome': 1, 'standing': 1, 'way': 1, 'of': 7, 'boss': 1, 'impetuous': 1, 'instincts': 1, 'and': 5, 'seeking': 1, 'share': 1, 'spotlight': 1, 'only': 1, 'bizarre': 1, 'orbit': 1, 'could': 1, 'exit': 1, 'seen': 1, 'as': 2, 'an': 1, 'ideologue': 1, 'aggressive': 1, 'hawk': 1, 'also': 1, 'be': 1, 'perceived': 1, 'some': 1, 'ways': 1, 'removal': 1, 'stabilizing': 1, 'force': 1, 'did': 1, 'have': 1, 'view': 1, 'american': 1, 'interests': 1, 'use': 1, 'us': 2, 'power': 1, 'that': 2, 'while': 1, 'hardline': 1, 'predictable': 1, 'logical': 1, 'positioned': 1, 'within': 1, 'historic': 1, 'boundaries': 1, 'diplomacy': 1, 'everything': 1, 'there': 1, 'is': 1, 'political': 1, 'explanation': 1, 'latest': 1, 'storm': 1, 'rocked': 1, 'white': 1, 'house': 1}
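With the counts in hand, the dictionary can be sorted to surface the most frequent terms, or divided by the total word count to get relative term frequencies (the usual definition of term frequency). The small dictionary below is just an excerpt of the full result above, used to keep the example short:

```python
# excerpt of the word_count result above
word_count = {'the': 12, 'of': 7, 'trump': 5, 'in': 5, 'and': 5}

# most frequent terms first
top_terms = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
print(top_terms[:3])
# [('the', 12), ('of', 7), ('trump', 5)]

# relative term frequency: raw count divided by total number of terms
total = sum(word_count.values())
tf = {word: count / total for word, count in word_count.items()}
print(round(tf['the'], 2))
# 0.35
```

Unsurprisingly, function words like “the” and “of” dominate the top of the list, which is one reason TF-IDF weights raw counts against how common a term is across many documents.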

Frank Oprel
Marketing Automation Specialist by day, Python hobbyist by night.
