Top K Frequent Words
Statement
Given a string array words
, and an integer k
, return the k
most frequent words. Sort the frequencies from highest to lowest, and words with the same frequency should be sorted by their lexicographical order.
Constraints
- 1 ≤
words.length()
≤ 500 - 1 ≤
words[i].length()
≤ 10 words[i]
consists of lowercase English characters- 1 ≤
k
≤ number of unique words
Solution
"""
production algorithm
"""
from collections import Counter
from data_structures.heap.heap import Heap
class Solution:
def top_k_frequent(self, words, k):
"""
time complexity O(nlogk)
space complexity O(k)
"""
smallest = Heap()
counter = Counter(words)
for word, frequency in counter.items():
smallest.push(MinHeapData(frequency, word))
if k < smallest.size():
smallest.pop()
return [smallest.pop().word for i in range(k)][::-1]
class MinHeapData:
def __init__(self, frequency, word):
self.frequency = frequency
self.word = word
def __lt__(self, other):
if self.frequency == other.frequency:
return other.word < self.word
return self.frequency < other.frequency
"""
unit tests
"""
from unittest import TestCase
from algorithms.top_k_elements.top_k_frequent_words.solution import Solution
class SolutionTestCase(TestCase):
def test_no_duplicate_frequencies(self):
# Given
words = ["example",
"harmony", "harmony",
"freedom", "freedom", "freedom",
"journey", "journey", "journey", "journey",
"tranquil", "tranquil", "tranquil", "tranquil", "tranquil",
"library", "library", "library", "library", "library", "library",
"sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine",
"blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom",
"adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure"]
k = 3
solution = Solution()
# When
actual = solution.top_k_frequent(words, k)
# Then
expected = ["adventure", "blossom", "sunshine"]
self.assertEqual(actual, expected)
def test_some_duplicate_frequencies(self):
# Given
words = ["example",
"harmony", "harmony", "harmony",
"freedom", "freedom", "freedom",
"journey", "journey", "journey", "journey", "journey", "journey",
"tranquil", "tranquil", "tranquil", "tranquil", "tranquil", "tranquil",
"library", "library", "library", "library", "library", "library",
"sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine",
"blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom",
"adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure"]
k = 3
solution = Solution()
# When
actual = solution.top_k_frequent(words, k)
# Then
expected = ["adventure", "blossom", "sunshine"]
self.assertEqual(actual, expected)
Strategy
Top K Elements.
Explanation
First, count the frequency of each word. Then, iterate each unique word and its frequency. Push each unique word and its frequency to a min heap, such that words are sorted by frequency and then by lexicographical order. When the size of the heap reaches k+1
, then pop the top of the heap, so that the maximum size of the heap remains k
. The heap holds the top k
frequent words sorted by lexicographical order.
Time Complexity
There’s a string array of words with size n
, m
unique words, and a heap with size proportional to k
. Iteration of all words to count frequencies has time complexity O(n)
. Then, there’s iteration of all unique words, with heap operations for each iteration of a unique word, which has time complexity O(mlogk)
.
When mlogk << n
, then the time complexity of the algorithm is O(n)
. However, when n << mlogk
, then the time complexity of the algorithm is O(mlogk)
. Therefore, the time complexity of the algorithm is O(n + mlogk)
.
Space Complexity
The auxiliary space complexity depends on the number of unique words m
and the size of the heap k
. Therefore, the auxiliary space complexity of the algorithm is O(mlogk)
.