CodeFights Solves It: findSubstrings
The Interview Practice Task of the Week was findSubstrings. This is an Uber interview question, and their technical interviews are notoriously lengthy and difficult. And the solution implements tries, a data structure that you’ll see a lot in interview questions. It’s a double whammy!
You know the drill by now: If you haven’t solved this challenge yet, hop on over to CodeFights and do it! Then come back here and we’ll walk through solving it together. We’ll start with a naive implementation and then work on optimizing it.
…Done? Okay!
Technical interview question
The goal of this Uber interview question is to find the longest substring that appears in a given list parts
in a set of words
. You'll return the list of words, with the substrings enclosed in square brackets. If there is a tie for the longest substring, you'll mark the one that appears earliest in the input string.
On a word-by-word basis, if our list is parts = ["An","a","men", "ing", "ie","mel","me"]
then we would indicate substrings in the following way:
- “word” becomes “word” (there is no substring of “word” in
parts
) - “anticipating” becomes “anticipat[ing]” (the substrings “a” and “ing” both appear, but “ing” is longer)
- “interest” becomes “interest”
- “metro” becomes “[me]tro”
- “melrose” becomes “[mel]rose”
- “melting” becomes “[mel]ting” (“mel” and “ing” are the same length, but “mel” appears first in the word)
- “ingoltsmelt” becomes “[ing]oltsmelt”
Our function, findSubstrings
, should take a list of strings words
and another list of strings parts
, and return the list of words with the longest substring indicated. For example, with words = ["word", "anticipating", "ingolt", "melting"]
, and parts
defined as before we would get:
['word', 'anticipat[ing]', '[ing]olt', '[mel]ting']
Naive solution
Let’s try approaching this problem directly. This would mean going through our list of words, one at a time, and then checking each substring in parts
to see if it appears in our word. We will keep track of the longest part that occurs earliest in the string.
def findSubstrings(words, parts):
# We will keep all the final strings here
output = []
for w in words:
# for each word, reset the longest fragment found, and what the match was
longest = 0
match = ""
position = len(w)
# look at each part
for p in parts:
# is the part in there?
if p in w:
# make sure this part is strictly longer than the part already found,
# OR if it is a tie, make sure that it appears earlier in the word w
if len(p) > longest or (len(p) == longest and w.index(p) < position):
longest = len(p)
match = p
position = w.index(p)
# if we found something, put in the square brackets around the match, and
# copy the word into our output list. If there is no match, copy the
# original word to the output list
if longest > 0:
loc = w.index(match)
output.append( w[:loc] + "[" + match + "]" + w[loc+len(match):])
else:
output.append(w)
return output
We can try running this function on our example, and it works. So we have a running implementation.
But is it good enough to pass the CodeFights time tests, or will it take too long to execute? We submit it and… it passes!
(I’m going to add an important caveat here and say that it passes at the time when I wrote this. Our engineers add new tests to challenges fairly frequently, based on user feedback and challenge results, so there’s a chance that when you read this my naive solution might not pass anymore. But the explanation below of its run time is still really useful, so don’t just skip to the optimized solution!)
What’s the run time?
We should take a moment to think about the running time of this code. If we call W the number of words, and P the number of parts, then we can tell that the running time is at least O(WP) from our nested loop. With a high-level language like Python, we also want to look for hidden loops in function calls. Inside the p
loop, we call if p in w
and w.index(p)
multiple times. These functions are scanning the word w
to look for p
, and will have a running time of O(len(W)). We could speed up the function a little bit by calling w.index(p)
once and saving the result, but we still have a running time that looks like O(WPN), where N is the length of a typical word in w
.
So how bad is O(WPN)? From the constraints page:
* We have at most 5000 words, so in the worst-case scenario W = 5000
;
* Each word is at most 30 characters, so the worst case is N = 30
;
* We have at most 5000 substrings in parts
, so the worst-case is P = 5000
.
Putting this together, this gives us 750 million loops. This means each loop would have to run in about 5 picoseconds to be done in 4 seconds! Put another way, we would have to have PetaHz processors for the worst case scenario.
Can we argue that we solved the challenge anyway, and that the CodeFights test showed us that the cases we will be seeing aren’t the worst possible cases? Yes, you can absolutely argue that! When companies are hiring, they want people who can write code quickly to solve today’s problems. A working solution today is better than a perfect solution in a month!
But you should make sure you demonstrate to the interviewer that you know how to think about algorithmic complexity. Microsoft famously had an exponential algorithm for determining which software patches were needed for Windows XP, but it was working fast enough for 10 years(!), and no one realized that the algorithm was exponential.
So you can tell the interviewer that you know this is an algorithm with three multiplicative factors (length of a typical word, number of words, and number of substrings). You should explain that you pass the tests, but that your algorithm won’t handle the worst case in reasonable time. Go on to explain that this isn’t an exponential algorithm, so getting more powerful computers or waiting a little longer isn’t unreasonable.
An interviewer may ask you to write a faster algorithm anyway, so if you really want to impress her you can preemptively mention that you think there are better algorithms that you would use if there was a good business case for spending the extra time on this. Remember that premature optimization is the root of all evil. (Maybe that’s a little over-dramatic… but it’s still bad!)
Giving it our best Trie
Instead of the naive implementation above, we will use a special type of tree data structure called a trie to store the substrings in parts
. Like all trees in programming, a trie has a root node. Every other node keeps track of a letter, and getting to this point in the trie completes a substring from parts
. Taking our example of parts = ["An","a","men", "ing", "ie","mel","me"]
from earlier, we can think about the trie pictorially as:
Note there are 7 red letters in the picture of the trie, which correspond to the 7 words we have in parts
. Every leaf is a red letter, but the e
below the m
is also red to indicate that me
is in parts
.
To check if a string s
is in parts
, we start at the root and move to the first level in the trie that matches the first letter of s
. If we are trying to match me
, for example, we would start by moving from the root to m
, and then from m
to e
. Since we end up on a red node, we know that our word is on the list.
To see if in
is a member of parts
, we would start at the root and follow the path to i
, then to n
. Since n
is not red, we haven't completed a string from parts
, so we would conclude that in
is not a member of parts
.
If we want to see whether dog
was in parts
, we would see there is no path from the root to a d
, so we could report right away that dog
isn't in parts
. We will be scanning a word character-by-character, and instead of having to search through all of parts
, we only have to search through the subset of parts
that matches what we have already seen.
You have seen this data structure in action before — in fact, every time you see auto-complete working every time you use a browser or text!
Implementing it in code
Now we have to translate our picture into code. We will make a class TrieNode
to store all the information in a node:
class TrieNode:
def __init__(self, letter):
self.letter = letter
self.terminal = False
# here we store the children nodes of this node
# the letter of the children are the key, and the
# TrieNode object is the value
self.children = {}
To create a trie with the words “cat”, “cam”, and “cha” and “chat”, we could write the following (tedious) code:
# This is terrible coding style.
# Definitely don't do this in your interview!
root = TrieNode("")
# level 1 nodes
c = TrieNode("c")
root.children['c'] = c
# level 2 nodes
a1 = TrieNode("a")
c.children['a'] = a1
h = TrieNode("h")
c.children['h'] = c
# level 3 nodes
t1 = TrieNode("t") #
m = TrieNode("m") # these words are terminal!
a2 = TrieNode("a") #
t1.terminal = True
m.terminal = True
a2.terminal = True
a1.children['t'] = t1
m.children['m'] = m
h.children['a'] = a2
# level 4 node
t2 = TrieNode("t") # also a terminal node
t2.terminal = True
a2.children['t'] = t2
or, pictorially, we have
Let’s make a function that adds a word to the trie so that we don’t have to do it manually. Our method will be to start at the root, reusing existing nodes along our path (or making new nodes as needed). The code for this is:
def addFragmentToTrie(root, fragment):
currentNode = root
for letter in fragment:
if letter not in currentNode.children:
# create a new node
currentNode.children[letter] = TrieNode(letter)
currentNode = currentNode.children[letter]
# we are at the final node.
# Mark it as terminal
currentNode.terminal = True
Using this function, we can condense the code to create our trie to:
# This is much better, and produces the same trie
root = TrieNode("")
for p in ["cat","cam","cha","chat"]:
addFragmentToTrie(root, p)
Using the trie
I am going to move most of the hard work — that is, finding the longest substring and enclosing it in square brackets — to a function findSubstringInWord(word, root)
. The function findSubstrings(words, parts)
will be responsible for making the trie and collecting all the words together at the end.
Here is our strategy for findSubstringInWord
:
- Initialize our
lenLongSubstr
to0
. - Go through
word
one character at a time, using that character as a starting point for matching substrings inparts
. - For each character, use the trie to find the longest substring from
parts
starting from this character. If it is longer than the longest one we've seen so far, record the current position in the string and the current length. - Use the length of the
longestSeenSubstring
and its starting position to put the square brackets in the right place.
Note how going through the string in order, and only updating the position when we see a strictly longer string, automatically encodes our tie-breaking condition (yay!). The trickiest part is finding the the longest substring from parts
starting at a given location, because we need to know the last terminal node we passed though before failing to find a match on the trie. It would be a lot easier if we just had to keep track of the last node we were on before we failed to match.
def findSubstringInWord(w, root):
lenLongSubstr, longestPos = 0,0
for start_pos in range(len(w)):
# reset to the beginning of the trie
currNode = root
for position in range(start_pos, len(w)):
letter = w[position]
if letter not in currNode.children:
# we have run out of branches in trie,
# so no use looking further down the string
# from this starting position. Go back to
# the previous loop
break
currNode = currNode.children[letter]
length = position - start_pos + 1
if currNode.terminal and length > lenLongSubstr:
lenLongSubstr = length
longestPos = start_pos
# now we have found where the longest segment starts (longestPos)
# and how long it is (longestSeenSubstring). We now have to place
# the square brackets
if lenLongSubstr == 0:
return w
end = longestPos + lenLongSubstr
return w[:longestPos] + "[" + w[longestPos: end] + "]" + w[end:]
The function findSubstrings
is simple by comparison! Note that it uses our earlier method of constructing the trie directly from parts
.
# Not much to this function - it just ties everything else together
def findSubstrings(words, parts):
# build the trie
root = TrieNode('')
for p in parts:
addFragmentToTrie(root, p)
return [findSubstringInWord(w, root) for w in words]
Running time for the trie
Our algorithm still looped through each word. The constraint that we haven’t used yet is the length of the each element in parts
. The worst case scenario is having to check 5 levels at each position at each word. In the worst case, we have:
- The number of words,
W = 5000
; - The number of letters in a word,
N = 30
; - The number of levels to check in the trie,
P = 5
.
This is smaller than the naive implementation by a factor of 1000.
The final code!
Our code was spread over a few functions. Collecting it all together, we have:
class TrieNode:
def __init__(self, letter):
self.letter = letter
self.terminal = False
# here we store the children nodes of this node
# the letter of the children are the key, and the
# TrieNode object is the value
self.children = {}
def addFragmentToTrie(root, fragment):
currentNode = root
for letter in fragment:
if letter not in currentNode.children:
# create a new node
currentNode.children[letter] = TrieNode(letter)
currentNode = currentNode.children[letter]
# we are at the final node.
# Mark it as terminal
currentNode.terminal = True
def findSubstringInWord(w, root):
lenLongSubstr, longestPos = 0,0
for start_pos in range(len(w)):
# reset to the beginning of the trie
currNode = root
for position in range(start_pos, len(w)):
letter = w[position]
if letter not in currNode.children:
# we have run out of branches in trie,
# so no use looking further down the string
# from this starting position. Go back to
# previous loop
break
currNode = currNode.children[letter]
length = position - start_pos + 1
if currNode.terminal and length > lenLongSubstr:
lenLongSubstr = length
longestPos = start_pos
# now we have found where the longest segment starts (longestPos)
# and how long it is (longestSeenSubstring).
# We now have to place the square brackets
if lenLongSubstr == 0:
return w
end = longestPos + lenLongSubstr
return w[:longestPos] + "[" + w[longestPos: end] + "]" + w[end:]
# Not much to this function
# It just ties everything else together
def findSubstrings(words, parts):
# build the trie
root = TrieNode('')
for p in parts:
addFragmentToTrie(root, p)
return [findSubstringInWord(w, root) for w in words]
We did it! High fives for everyone!
Tell us…
How did you solve this Uber interview question? Have you ever seen one like it in a coding interview? Let us know on the CodeFights user forum!