Top K Frequent Words

Ethan Davis
Data Structures and Algorithms DSA
3 min readApr 3, 2024
Data Structures and Algorithms

Statement

Given a string array words, and an integer k, return the k most frequent words. Sort the frequencies from highest to lowest, and words with the same frequency should be sorted by their lexicographical order.

Constraints

  • 1 ≤ words.length() ≤ 500
  • 1 ≤ words[i].length() ≤ 10
  • words[i] consists of lowercase English characters
  • 1 ≤ k ≤ number of unique words

Solution

"""
production algorithm
"""

from collections import Counter
from data_structures.heap.heap import Heap


class Solution:
def top_k_frequent(self, words, k):
"""
time complexity O(nlogk)
space complexity O(k)
"""
smallest = Heap()
counter = Counter(words)

for word, frequency in counter.items():
smallest.push(MinHeapData(frequency, word))
if k < smallest.size():
smallest.pop()

return [smallest.pop().word for i in range(k)][::-1]


class MinHeapData:
def __init__(self, frequency, word):
self.frequency = frequency
self.word = word

def __lt__(self, other):
if self.frequency == other.frequency:
return other.word < self.word
return self.frequency < other.frequency
"""
unit tests
"""

from unittest import TestCase
from algorithms.top_k_elements.top_k_frequent_words.solution import Solution


class SolutionTestCase(TestCase):
def test_no_duplicate_frequencies(self):
# Given
words = ["example",
"harmony", "harmony",
"freedom", "freedom", "freedom",
"journey", "journey", "journey", "journey",
"tranquil", "tranquil", "tranquil", "tranquil", "tranquil",
"library", "library", "library", "library", "library", "library",
"sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine",
"blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom",
"adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure"]
k = 3
solution = Solution()

# When
actual = solution.top_k_frequent(words, k)

# Then
expected = ["adventure", "blossom", "sunshine"]
self.assertEqual(actual, expected)

def test_some_duplicate_frequencies(self):
# Given
words = ["example",
"harmony", "harmony", "harmony",
"freedom", "freedom", "freedom",
"journey", "journey", "journey", "journey", "journey", "journey",
"tranquil", "tranquil", "tranquil", "tranquil", "tranquil", "tranquil",
"library", "library", "library", "library", "library", "library",
"sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine",
"blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom",
"adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure"]
k = 3
solution = Solution()

# When
actual = solution.top_k_frequent(words, k)

# Then
expected = ["adventure", "blossom", "sunshine"]
self.assertEqual(actual, expected)

Strategy

Top K Elements.

Explanation

First, count the frequency of each word. Then, iterate each unique word and its frequency. Push each unique word and its frequency to a min heap, such that words are sorted by frequency and then by lexicographical order. When the size of the heap reaches k+1, then pop the top of the heap, so that the maximum size of the heap remains k. The heap holds the top k frequent words sorted by lexicographical order.

Time Complexity

There’s a string array of words with size n, m unique words, and a heap with size proportional to k. Iteration of all words to count frequencies has time complexity O(n). Then, there’s iteration of all unique words, with heap operations for each iteration of a unique word, which has time complexity O(mlogk).

When mlogk << n, then the time complexity of the algorithm is O(n). However, when n << mlogk, then the time complexity of the algorithm is O(mlogk). Therefore, the time complexity of the algorithm is O(n + mlogk).

Space Complexity

The auxiliary space complexity depends on the number of unique words m and the size of the heap k. Therefore, the auxiliary space complexity of the algorithm is O(mlogk).

Links

--

--