CodeFights Explainer: Trees

Published in

CodeFights

14 min readApr 18, 2017

Trees. They’re considered to be a computer science basic, but if it’s been a while since you’ve dealt with them — or if you’ve never encountered them before at all — they can be daunting. But tree questions are extremely common in technical interviews, so they’ll come up while you’re finding a job. And then on the job, they’re a wildly useful data structure. So let’s get you up to speed on trees!

A tree is a data structure that is particularly useful for representing data that’s sorted or in hierarchical order, and where we want to be able to retrieve data quickly. I’m going to dive into the idea of trees by introducing one of the most common use cases for them: the binary search tree.

Example: An address book app

Suppose we’re designing an address book application. You type in the name of the person you’re looking for, and if they are in your address book the program retrieves some information about them, like their address, email, and telephone number. If they aren’t in your contact book, you’ll get a “User not Found” message. For the moment, let’s just look at the list of names of people in our sorted address book:

contacts = ["Amanda", "Amy", "Colin", "Flaka", "Gustav", "Madison", "Naomi"]

Suppose you want to know if David is in your contact list. You could scan through the names, one by one, and then keep track of whether or not you have seen David yet. If you haven’t seen it by the time you reach the end of the list, then you can confidently report he isn’t one of your contacts.

Better yet, since you started with a sorted list, you can stop as soon as you see any name that would appear after David in an alphabetically ordered list. In this particular example, once you reach the 4th element in the list, Flaka, you know that David isn’t on this list.

Even with the improvement of the “stop once you reach an element bigger than the one you are looking for” approach, this algorithm is still O(N), where N is the number of items in our list. If you were checking to see if Zoe was on your list, then you would have to check Every. Single. User. Not cool.

A better approach would be to start in the middle, with Flaka. If you are looking for a name that appears earlier in the alphabet than Flaka (like David) then you know it is in the left half of the list (or it is not there at all). If you are looking for Zoe, you know that it appears on the right half of the list. For either name, you only have to search one half of the list. Let’s focus on the steps you would use to search for David:

Start halfway through the list. The list of contacts has 7 items, so start at the 4th item, Flaka.
Since David appears earlier in the alphabet than Flaka (i.e. as strings, David < Flaka), then you know that if David is in the list at all he is first, second, or third. Pick the middle element again, in this case Amy.
Since Amy < David, if David is in this list he has to be the third element. But the third element is Colin, so David isn't in this list.

list vs binary search — Visualizing the search: Look at 3 nodes, two jumps.

For David, we see that it takes 3 searches to find that he is not on the list, rather than 4. For Zoe, the difference is more impressive, because we can also exclude her name in 3 searches rather than 7! Here’s what searching for Zoe in your contact list would look like:

At each step, we are halving the number of names you have to search through, so the number of searches you have to do to determine if a name is on the list is O(log N), which is a vast improvement over our original O(N)!

Using a tree

Instead of thinking about the list of contacts as a flat list, let’s try to visualize it the way that our sorting algorithm goes through it:

names in a tree — Visualizing the contacts as a tree.

Regardless of what name n we search for, we will always start at Flaka. If n is Flaka, then we're done! If we are looking for a name that appears earlier in the alphabet, we will look at indices less than Flaka (to the left). Otherwise, we'll move to the right. We keep moving down the diagram, looking for n, using the convention that if n is smaller than the current word we move to the left, and if it is larger we move to the right. If we reach the names at the bottom of the diagram and still haven't found the name n, we can conclude that it isn't in our diagram.

A tree is a way of implementing this data structure directly, rather than imagining this data structure on a simple list.

Each box in the diagram is called a node. Nodes keep track of the following information:

Properties of a node

Children. The child of a node is a pointer to other nodes. We show that all children of a node are connected to this node via arrows. The node Flaka in our example has two children: Amy and Madison.
Value. This is the data that is actually stored in the node. We are just storing the names at the moment, but a more useful address book would contain a reference to a person object so that we could get the email address, phone number, etc.

Continuing the family-tree analogy, we also have the notion of a parent node. Node X is a parent of node Y if node Y is a child of node X. We've already stated that the Flaka node had two children: Amy and Madison. This means that the parent node of Amy is Flaka.

There are also special types of nodes in a tree:

The root node is the only node on the tree that doesn’t have a parent, so in this tree the root is Flaka.
Then there are leaf nodes, which are the nodes that don’t have any children. In this case, the leaf nodes are Amanda, Colin, Gustav and Naomi.

Each node also a level, which is how far this node is from the root node. So Flaka (the root) is at level 0, Amy and Madison are at level 1, and the leaf nodes are all at level 3 in our example.

For a collection of nodes to be a non-empty tree (as used in computer programming), they need to satisfy 4 conditions:

There needs to be exactly one root node (i.e. one node with no parent).
Each node that is not the root node has only one parent node (i.e. no node is a child of two or more nodes)
You need to be able to reach each node by starting at the root and following a sequence of “child” pointers.
A node at level x cannot have a child at a level less than x.
(In fact, all children of a node at level x have a level of x+1, because children of a node at level x are one extra step from the root.)

When we draw the nodes, we typically draw level 0 (the root) at the top of the page, and increase levels as we move down to the leaves. For some reason, computer scientists have decided to draw all of their trees “upside down”, with the root at top and the leaves at the bottom! Also note that we are jumping between two metaphors when labeling things for trees. We use a lot of terms inspired by family trees such as “parent” and “child”, as well as a lot of terms inspired by botanical trees such as “root” and “leaves”.

To get an idea of what some of the restrictions mean, let’s take a moment to look at two versions of our address book that are not trees. The problem with the data structure on the top is that Colin is a child of Amy and Madison, which breaks rule 2. The problem with the graph on the bottom is that Madison (level 2) is a child of Amanda (level 3).

These rules guarantee that we cannot have cycles while moving along the arrows, because either we will have no children nodes to visit (i.e. we are at a leaf), or we are guaranteed to be going to a higher level. Since we are moving along children links, all the nodes we have seen already are at a lower level than the current level.

All of this is pretty cool, but it seems like we’ve gone to a lot of work and just reproduced the O(log N) search of my contacts that I could have done with a flat (ordered) list! But one of the big advantages that the tree has over a flat list is that it is more flexible. Continuing on with the address book example, suppose I contact Amanda and Gustav a lot more than I call any of the other people. That means that every time I open my address book, my phone is usually doing three searches in the tree. We can rearrange the tree a little to get the following:

unbalanced tree — A tree with the nodes rearranged to make the average search quicker

In the longest searches, my app needs to do four steps now instead of three. But the most frequently searched-for names are now close to the top with the goal of reducing the average search time. Of course, we can also design a (bad) tree that is just a linked list:

a tree pretending to be a linked list — A tree that’s acting like a linked list.

The process of finding the best organization of the tree for the expected use of the data is called balancing the tree.

Other uses of trees

We have used trees to order data by value (i.e. sort it) as a way of replacing a list. A different way of using trees is to manage objects that are clustered or grouped by some attribute. Consider the organization of CodeFights Cafe. We have an owner, a head of accounting (Alice), and a store manager (Katherine). We also have some employees that belong either to the accounting department, or are managed by Katherine. We might structure a tree for CodeFights Cafe in the following way:

tree organizational structure — The organizational structure of CodeFights Cafe.

This tree arranges the nodes by responsibility, which means that any person in this chart manages their descendants. So Katherine can tell her assistant manager Fred what to do, as well give instructions to Claire and Jim. But she can’t give instructions to Bob, for example, because even though he is at a lower level he’s part of a different department.

It’s clear how this structure is a useful way for us to visualize how the Cafe is organized. It also makes certain types of operations easy to implement recursively. Suppose we wanted to send a message to the entire accounting department. We might have a function tell( node, msg) that will use the information in node to get the contact information for the person and email them the message msg. Suppose each node n has its children stored in a list n.children. We could send a message to n and all of its descendants (those nodes that n can reach that are at a lower level than it) with the following function:

def tellAllDescendants( node, msg ):
    # Tell this node
    tell(node, msg)

    # Now pass the message along to all children, with the
    # instruction to pass it to their children, etc.
    for child in node.children:
        tellAllDescendants( child, msg )

If we want to message the entire company “Happy New Year!”, we can simply call tellAllDescendants( owner_node, "Happy New Year"). If we want to tell the service side that we need to bake more cookies, we can call tellAllDescendants( katherine_node, "Pls bake more cookies") and the message will be relayed to Katherine, Fred, Claire, and Jim.

This same organizational structure can answer other questions for us too, such as quickly finding out what the labor costs are in each department. Suppose we wanted to know what the total cost of the salaries in the service side of the business are (so Katherine’s salary, plus that of all the employees under her). If each node n stored that person's salary as n.salary, we could write this function:

def sumSalaries( node ):
    # start by counting the salary of this node
    totalSalary = node.salary

    for child in node.children:
        # ask each child for the sum of its salary, plus the salary
        # of each of its dependents
        totalSalary += sumSalaries(child)
    return totalSalary

Note that us just asking for the sum of the node.salary and the sum of salaries of all of nodes children doesn't work. When applied to the node for Katherine, this would add her salary to Fred's, but it would miss the salaries of Claire and Jim. To find the salary for the service side of the business, we could now call sumSalaries(katherine_node). As a bonus, if we wanted to determine the entire salary cost for the business, we could call sumSalaries(owner_node). The fact that every employee node has only one parent ensures that each node is counted exactly once. (You can see that if Claire worked in the accounting department as well, then her salary would be counted twice by this code).

Our sumSalaries code is an example of a depth-first recursive algorithm. When calling sumSalaries(katherine_node), here is what our algorithm is doing:

Start processing sumSalaries on the node for Katherine
Before returning, we call sumSalaries on the node for Fred (Katherine's child)
Before returning, we call sumSalaries on the node for Claire (Fred's child)
Claire’s node has no children, so now it returns.
Before returning, we call sumSalaries on the node for Jim (Fred's child)
Jim’s node has no children, so now it returns.
We have now summed the salaries of all Fred’s node’s children, so the call to sumSalaries on node for Fred returns.
We have now finished summing the salaries of all Katherine’s node’s children, so call to SumSalaries on Katherine returns.

We had to reach a leaf node before we started having functions returning values. When using depth-first recursive algorithms, we have to be careful that we don’t run out of room on the stack! In this case, we had three function calls “stacked” until we finally started returning answers.

We can also use depth-first recursion to search for a particular node (but we’ll discuss a better approach to this problem below). So far, I have been taking it for granted that I have a variable katherine_node lying around. What if I only have the root, owner_node, and I want to find the total salary of the service side of the business? Suppose nodes stored the department's name as node.department. One way we could do this is to do a depth-first search, stopping the first time we find a department with the right name:


def DFS_dept( node , dept_name ):
    # Does a depth first search for a node with dept name. 
    # Returns node if found, returns None if not found
    if node.department == dept_name:
        return node
    for child in node.children:
        ans = DFS_dept(child, dept_name)
        if ans != None:
            return ans
    # if we get here, we did not find dept_name in any of our children
    return None

# Find the salary of the entire service department
highest_service = DFS_dept(owner_node, "service")
service_salary  = sumSalaries(highest_service)
all_salaries    = sumSalaries(owner_node)

Using depth-first search (DFS) doesn’t make much sense in this example. If we start searching by moving from the owner node to the node for Alice, DFS will keep search through the entire accounting department before giving up. Since we are looking for the highest-ranking person in the service department, it would make more sense to search the tree level by level. This is called breadth first search (BFS), and is implemented below:

from collections import deque 
# python's queue data type 
# the first object in should be the first object out

def BFS_dept( root_node, dept_name ):
    # Does a breadth-first search for a node with dept name. 
    # Returns node if found, returns None if not found.
    # Non-recursive.

    # create a queue with only the root node in it
    queue = deque([root_node])

    while len(queue) > 0:
        # take the node that has been in the queue the longest
        current = queue.popleft()

        # is this the node we are looking for?
        if current.department == dept_name:
            return current

        # add all the children of current to the back of the queue.
        # This means we have to process all the elements already in the queue
        # before going to the next level.

        queue.extend( current.children )

    # if we get here, we have searched all the nodes, and haven't found any nodes
    # of department name
    return None

The question of which type of search strategy, DFS or BFS, is better depends on the structure of the tree, and what information you are looking for. (I’m going to cover this issue in depth in a future article — stay tuned!)

Summary

We’ve seen that when we have ordered data, trees can give us an O(log N) way of searching the tree for the data. We’ve also seen that trees are more flexible than using simple arrays because we can balance the tree.

Trees are particularly useful when:

You have data that is sorted in some way, and you want to do a lot of searches on it.
You have data that is “grouped” in some way, such as the organizational tree for CodeFights Cafe. File systems are a common hierarchy, where the nodes are either files or directories, and node X is a child of node Y if node X is contained in Y. In this example, all ordinary files are leaves, every parent is a directory, and / is the root. A web-based example is HTML's document object model, which is tree-based. Front end developers even talk about "child elements" and "parent elements"!

We’ve seen that with hierarchical data, we can still search through it. We can iterate through the entire tree to get information from each node (commonly by going either depth-first or breadth-first), while ensuring that we visit each node exactly once. This process is called tree traversal. We looked closely at a related topic, which was searching in a hierarchical tree. This is like tree traversal, except we can stop whenever we find what we are looking for, so we no longer guarantee that every node will be looked at.

Ready to tackle some trees?

As I mentioned earlier, tree questions are extremely common in technical interviews. You’ll definitely want to practice them while you’re looking for jobs and preparing for interviews! But you don’t have to hunt for good questions — CodeFights Interview Practice has you covered!

When you’ve solved all of these questions, take a look through the solutions that other CodeFighters have posted. It’s always really instructive to see how other users have solved interview questions — you’ll definitely learn something new!

This article was originally posted on CodeFight On!, the official blog ofCodeFights.

See what else we’re up to by following us on Twitter and Facebook!