CodeFights Solves It: findSubstrings

Published in

CodeFights

11 min readApr 27, 2017

The Interview Practice Task of the Week was findSubstrings. This is an Uber interview question, and their technical interviews are notoriously lengthy and difficult. And the solution implements tries, a data structure that you’ll see a lot in interview questions. It’s a double whammy!

You know the drill by now: If you haven’t solved this challenge yet, hop on over to CodeFights and do it! Then come back here and we’ll walk through solving it together. We’ll start with a naive implementation and then work on optimizing it.

…Done? Okay!

Technical interview question

The goal of this Uber interview question is to find the longest substring that appears in a given list parts in a set of words. You'll return the list of words, with the substrings enclosed in square brackets. If there is a tie for the longest substring, you'll mark the one that appears earliest in the input string.

On a word-by-word basis, if our list is parts = ["An","a","men", "ing", "ie","mel","me"] then we would indicate substrings in the following way:

“word” becomes “word” (there is no substring of “word” in parts)
“anticipating” becomes “anticipat[ing]” (the substrings “a” and “ing” both appear, but “ing” is longer)
“interest” becomes “interest”
“metro” becomes “[me]tro”
“melrose” becomes “[mel]rose”
“melting” becomes “[mel]ting” (“mel” and “ing” are the same length, but “mel” appears first in the word)
“ingoltsmelt” becomes “[ing]oltsmelt”

Our function, findSubstrings, should take a list of strings words and another list of strings parts, and return the list of words with the longest substring indicated. For example, with words = ["word", "anticipating", "ingolt", "melting"], and parts defined as before we would get:

['word', 'anticipat[ing]', '[ing]olt', '[mel]ting']

Naive solution

Let’s try approaching this problem directly. This would mean going through our list of words, one at a time, and then checking each substring in parts to see if it appears in our word. We will keep track of the longest part that occurs earliest in the string.

def findSubstrings(words, parts):
    # We will keep all the final strings here
    output = []
    for w in words:
        # for each word, reset the longest fragment found, and what the match was
        longest = 0
        match   = ""
        position = len(w)

        # look at each part
        for p in parts:
            # is the part in there?
            if p in w:

                # make sure this part is strictly longer than the part already found,
                # OR if it is a tie, make sure that it appears earlier in the word w
                if len(p) > longest or (len(p) == longest and w.index(p) < position):
                    longest = len(p)
                    match   = p
                    position = w.index(p)

        # if we found something, put in the square brackets around the match, and
        # copy the word into our output list. If there is no match, copy the
        # original word to the output list
        if longest > 0:
            loc = w.index(match)
            output.append( w[:loc] + "[" + match + "]" + w[loc+len(match):])
        else:
            output.append(w)
    return output

We can try running this function on our example, and it works. So we have a running implementation.

But is it good enough to pass the CodeFights time tests, or will it take too long to execute? We submit it and… it passes!

(I’m going to add an important caveat here and say that it passes at the time when I wrote this. Our engineers add new tests to challenges fairly frequently, based on user feedback and challenge results, so there’s a chance that when you read this my naive solution might not pass anymore. But the explanation below of its run time is still really useful, so don’t just skip to the optimized solution!)

What’s the run time?

We should take a moment to think about the running time of this code. If we call W the number of words, and P the number of parts, then we can tell that the running time is at least O(WP) from our nested loop. With a high-level language like Python, we also want to look for hidden loops in function calls. Inside the p loop, we call if p in w and w.index(p) multiple times. These functions are scanning the word w to look for p, and will have a running time of O(len(W)). We could speed up the function a little bit by calling w.index(p) once and saving the result, but we still have a running time that looks like O(WPN), where N is the length of a typical word in w.

So how bad is O(WPN)? From the constraints page:
* We have at most 5000 words, so in the worst-case scenario W = 5000;
* Each word is at most 30 characters, so the worst case is N = 30;
* We have at most 5000 substrings in parts, so the worst-case is P = 5000.

Putting this together, this gives us 750 million loops. This means each loop would have to run in about 5 picoseconds to be done in 4 seconds! Put another way, we would have to have PetaHz processors for the worst case scenario.

Can we argue that we solved the challenge anyway, and that the CodeFights test showed us that the cases we will be seeing aren’t the worst possible cases? Yes, you can absolutely argue that! When companies are hiring, they want people who can write code quickly to solve today’s problems. A working solution today is better than a perfect solution in a month!

But you should make sure you demonstrate to the interviewer that you know how to think about algorithmic complexity. Microsoft famously had an exponential algorithm for determining which software patches were needed for Windows XP, but it was working fast enough for 10 years(!), and no one realized that the algorithm was exponential.

So you can tell the interviewer that you know this is an algorithm with three multiplicative factors (length of a typical word, number of words, and number of substrings). You should explain that you pass the tests, but that your algorithm won’t handle the worst case in reasonable time. Go on to explain that this isn’t an exponential algorithm, so getting more powerful computers or waiting a little longer isn’t unreasonable.

An interviewer may ask you to write a faster algorithm anyway, so if you really want to impress her you can preemptively mention that you think there are better algorithms that you would use if there was a good business case for spending the extra time on this. Remember that premature optimization is the root of all evil. (Maybe that’s a little over-dramatic… but it’s still bad!)

Giving it our best Trie

Instead of the naive implementation above, we will use a special type of tree data structure called a trie to store the substrings in parts. Like all trees in programming, a trie has a root node. Every other node keeps track of a letter, and getting to this point in the trie completes a substring from parts. Taking our example of parts = ["An","a","men", "ing", "ie","mel","me"] from earlier, we can think about the trie pictorially as:

Note there are 7 red letters in the picture of the trie, which correspond to the 7 words we have in parts. Every leaf is a red letter, but the e below the m is also red to indicate that me is in parts.

To check if a string s is in parts, we start at the root and move to the first level in the trie that matches the first letter of s. If we are trying to match me, for example, we would start by moving from the root to m, and then from m to e. Since we end up on a red node, we know that our word is on the list.

To see if in is a member of parts, we would start at the root and follow the path to i, then to n. Since n is not red, we haven't completed a string from parts, so we would conclude that in is not a member of parts.

If we want to see whether dog was in parts, we would see there is no path from the root to a d, so we could report right away that dog isn't in parts. We will be scanning a word character-by-character, and instead of having to search through all of parts, we only have to search through the subset of parts that matches what we have already seen.

You have seen this data structure in action before — in fact, every time you see auto-complete working every time you use a browser or text!

Implementing it in code

Now we have to translate our picture into code. We will make a class TrieNode to store all the information in a node:

class TrieNode:
    def __init__(self, letter):
        self.letter = letter
        self.terminal = False
        # here we store the children nodes of this node
        # the letter of the children are the key, and the
        # TrieNode object is the value
        self.children = {}

To create a trie with the words “cat”, “cam”, and “cha” and “chat”, we could write the following (tedious) code:

# This is terrible coding style. 
# Definitely don't do this in your interview!
root = TrieNode("")
# level 1 nodes
c = TrieNode("c")
root.children['c'] = c
# level 2 nodes
a1 = TrieNode("a")
c.children['a'] = a1

h = TrieNode("h")
c.children['h'] = c

# level 3 nodes
t1 = TrieNode("t")  #
m = TrieNode("m")   # these words are terminal!
a2 = TrieNode("a")  #

t1.terminal = True
m.terminal = True
a2.terminal = True

a1.children['t'] = t1
m.children['m'] = m
h.children['a'] = a2

# level 4 node
t2 = TrieNode("t") # also a terminal node
t2.terminal = True
a2.children['t'] = t2

or, pictorially, we have

trie data structure uber interview question

Let’s make a function that adds a word to the trie so that we don’t have to do it manually. Our method will be to start at the root, reusing existing nodes along our path (or making new nodes as needed). The code for this is:

def addFragmentToTrie(root, fragment):
    currentNode = root
    for letter in fragment:
        if letter not in currentNode.children:
            # create a new node
            currentNode.children[letter] = TrieNode(letter)
        currentNode = currentNode.children[letter]

    # we are at the final node. 
    # Mark it as terminal
    currentNode.terminal = True

Using this function, we can condense the code to create our trie to:

# This is much better, and produces the same trie
root = TrieNode("")
for p in ["cat","cam","cha","chat"]:
    addFragmentToTrie(root, p)

Using the trie

I am going to move most of the hard work — that is, finding the longest substring and enclosing it in square brackets — to a function findSubstringInWord(word, root). The function findSubstrings(words, parts) will be responsible for making the trie and collecting all the words together at the end.

Here is our strategy for findSubstringInWord:

Initialize our lenLongSubstr to 0.
Go through word one character at a time, using that character as a starting point for matching substrings in parts.
For each character, use the trie to find the longest substring from parts starting from this character. If it is longer than the longest one we've seen so far, record the current position in the string and the current length.
Use the length of the longestSeenSubstring and its starting position to put the square brackets in the right place.

Note how going through the string in order, and only updating the position when we see a strictly longer string, automatically encodes our tie-breaking condition (yay!). The trickiest part is finding the the longest substring from parts starting at a given location, because we need to know the last terminal node we passed though before failing to find a match on the trie. It would be a lot easier if we just had to keep track of the last node we were on before we failed to match.

def findSubstringInWord(w, root):
    lenLongSubstr, longestPos = 0,0

    for start_pos in range(len(w)):
        # reset to the beginning of the trie
        currNode = root

        for position in range(start_pos, len(w)):
            letter = w[position]
            if letter not in currNode.children:
                # we have run out of branches in trie,
                # so no use looking further down the string
                # from this starting position. Go back to
                # the previous loop
                break
            currNode = currNode.children[letter]
            length = position - start_pos + 1
            if currNode.terminal and length > lenLongSubstr:
                lenLongSubstr = length
                longestPos   = start_pos

    # now we have found where the longest segment starts (longestPos)
    # and how long it is (longestSeenSubstring). We now have to place
    # the square brackets
    if lenLongSubstr == 0:
        return w
    end = longestPos + lenLongSubstr
    return w[:longestPos] + "[" + w[longestPos: end] + "]" + w[end:]

The function findSubstrings is simple by comparison! Note that it uses our earlier method of constructing the trie directly from parts.

# Not much to this function - it just ties everything else together
def findSubstrings(words, parts):
    # build the trie
    root = TrieNode('')
    for p in parts:
        addFragmentToTrie(root, p)
    return [findSubstringInWord(w, root) for w in words]

Running time for the trie

Our algorithm still looped through each word. The constraint that we haven’t used yet is the length of the each element in parts. The worst case scenario is having to check 5 levels at each position at each word. In the worst case, we have:

The number of words, W = 5000;
The number of letters in a word, N = 30;
The number of levels to check in the trie, P = 5.

This is smaller than the naive implementation by a factor of 1000.

The final code!

Our code was spread over a few functions. Collecting it all together, we have:

class TrieNode:
    def __init__(self, letter):
        self.letter = letter
        self.terminal = False
        # here we store the children nodes of this node
        # the letter of the children are the key, and the
        # TrieNode object is the value
        self.children = {}

def addFragmentToTrie(root, fragment):
    currentNode = root
    for letter in fragment:
        if letter not in currentNode.children:
            # create a new node
            currentNode.children[letter] = TrieNode(letter)
        currentNode = currentNode.children[letter]

    # we are at the final node. 
    # Mark it as terminal
    currentNode.terminal = True

def findSubstringInWord(w, root):
    lenLongSubstr, longestPos = 0,0

    for start_pos in range(len(w)):
        # reset to the beginning of the trie
        currNode = root

        for position in range(start_pos, len(w)):
            letter = w[position]
            if letter not in currNode.children:
                # we have run out of branches in trie,
                # so no use looking further down the string
                # from this starting position. Go back to
                # previous loop
                break
            currNode = currNode.children[letter]
            length = position - start_pos + 1
            if currNode.terminal and length > lenLongSubstr:
                lenLongSubstr = length
                longestPos   = start_pos

    # now we have found where the longest segment starts (longestPos)
    # and how long it is (longestSeenSubstring). 
    # We now have to place the square brackets
    if lenLongSubstr == 0:
        return w
    end = longestPos + lenLongSubstr
    return w[:longestPos] + "[" + w[longestPos: end] + "]" + w[end:]

# Not much to this function 
# It just ties everything else together
def findSubstrings(words, parts):
    # build the trie
    root = TrieNode('')
    for p in parts:
        addFragmentToTrie(root, p)
    return [findSubstringInWord(w, root) for w in words]

We did it! High fives for everyone!

Tell us…

How did you solve this Uber interview question? Have you ever seen one like it in a coding interview? Let us know on the CodeFights user forum!