Top K Frequent Words

Published in

Data Structures and Algorithms DSA

3 min readApr 3, 2024

Statement

Given a string array words, and an integer k, return the k most frequent words. Sort the frequencies from highest to lowest, and words with the same frequency should be sorted by their lexicographical order.

Constraints

1 ≤ words.length() ≤ 500
1 ≤ words[i].length() ≤ 10
words[i] consists of lowercase English characters
1 ≤ k ≤ number of unique words

Solution

"""
production algorithm
"""

from collections import Counter
from data_structures.heap.heap import Heap


class Solution:
    def top_k_frequent(self, words, k):
        """
        time complexity O(nlogk)
        space complexity O(k)
        """
        smallest = Heap()
        counter = Counter(words)

        for word, frequency in counter.items():
            smallest.push(MinHeapData(frequency, word))
            if k < smallest.size():
                smallest.pop()

        return [smallest.pop().word for i in range(k)][::-1]


class MinHeapData:
    def __init__(self, frequency, word):
        self.frequency = frequency
        self.word = word

    def __lt__(self, other):
        if self.frequency == other.frequency:
            return other.word < self.word
        return self.frequency < other.frequency

"""
unit tests
"""

from unittest import TestCase
from algorithms.top_k_elements.top_k_frequent_words.solution import Solution


class SolutionTestCase(TestCase):
    def test_no_duplicate_frequencies(self):
        # Given
        words = ["example",
                 "harmony", "harmony",
                 "freedom", "freedom", "freedom",
                 "journey", "journey", "journey", "journey",
                 "tranquil", "tranquil", "tranquil", "tranquil", "tranquil",
                 "library", "library", "library", "library", "library", "library",
                 "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine",
                 "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom",
                 "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure"]
        k = 3
        solution = Solution()

        # When
        actual = solution.top_k_frequent(words, k)

        # Then
        expected = ["adventure", "blossom", "sunshine"]
        self.assertEqual(actual, expected)

    def test_some_duplicate_frequencies(self):
        # Given
        words = ["example",
                 "harmony", "harmony", "harmony",
                 "freedom", "freedom", "freedom",
                 "journey", "journey", "journey", "journey", "journey", "journey",
                 "tranquil", "tranquil", "tranquil", "tranquil", "tranquil", "tranquil",
                 "library", "library", "library", "library", "library", "library",
                 "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine", "sunshine",
                 "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom", "blossom",
                 "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure", "adventure"]
        k = 3
        solution = Solution()

        # When
        actual = solution.top_k_frequent(words, k)

        # Then
        expected = ["adventure", "blossom", "sunshine"]
        self.assertEqual(actual, expected)

Strategy

Top K Elements.

Explanation

First, count the frequency of each word. Then, iterate each unique word and its frequency. Push each unique word and its frequency to a min heap, such that words are sorted by frequency and then by lexicographical order. When the size of the heap reaches k+1, then pop the top of the heap, so that the maximum size of the heap remains k. The heap holds the top k frequent words sorted by lexicographical order.

Time Complexity

There’s a string array of words with size n, m unique words, and a heap with size proportional to k. Iteration of all words to count frequencies has time complexity O(n). Then, there’s iteration of all unique words, with heap operations for each iteration of a unique word, which has time complexity O(mlogk).

When mlogk << n, then the time complexity of the algorithm is O(n). However, when n << mlogk, then the time complexity of the algorithm is O(mlogk). Therefore, the time complexity of the algorithm is O(n + mlogk).

Space Complexity

The auxiliary space complexity depends on the number of unique words m and the size of the heap k. Therefore, the auxiliary space complexity of the algorithm is O(mlogk).

Top K Frequent Words

Statement

Constraints

Solution

Strategy

Explanation

Time Complexity

Space Complexity

Links

data_structures_algorithms/algorithms/top_k_elements/top_k_frequent_words at main ·…

Contribute to davisethan/data_structures_algorithms development by creating an account on GitHub.

Written by Ethan Davis