Getting students talking with data structure metaphors
“How can we look up an element in a sorted array?” I ask. Only a few hands go up. They know the answer, but may not trust that they know the answer; the vocabulary is technical and they worry they might be missing something.
If, however, I ask, “How can we look up a name in a phonebook?”, pretty much everyone is willing to talk. Placing technical ideas into a familiar, real-world setting reduces the fear associated with classroom discussion.
In teaching Data Structures & Algorithms these past few years, I’ve found that talking about efficiency with metaphors is a fantastic tool for engaging all students in difficult, technical discussions. I thought I’d share some of my favorite extended analogies, and the discussions that have flowed from them — I’m curious to hear yours!
The Queue Data Structure. We call it a queue for a reason: a literal queue, e.g. a line at a supermarket, serves as a fine metaphor for the data structure. The abstract operations on a supermarket queue are the same as those on a computer science queue: a customer can join the line at the end (enqueue), or the cashier can process someone at the front (dequeue), removing them from the line.
Things get interesting when we try to use the metaphor to analyze the space- and time-efficiency of the data structure. How much space does a supermarket queue take up? An amount proportional to the maximum number of customers that will ever be in line at once. (Even when the line is not at capacity, that space is considered used: it is not open for the shelving of new items.) How long does adding a customer to the end of the line take? Not long at all: however long the line is, we can simply place her in the empty space behind the current last customer.
But what about processing a customer at the front? Once a customer has been helped, she leaves the line, leaving an empty space by the cashier. The customer behind takes a step forward, then the customer behind her takes a step forward, and so on. It takes time proportional to the number of people in the line to bring everyone into their rightful places (that is, O(n)). At a supermarket, the numbers are small and this cost is negligible, but as anyone who has gone to the airport on Thanksgiving can attest, these delays add up when n is large.
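If we want to show students this cost in code, a minimal Python sketch (the class name and sample data are illustrative) makes the “everyone steps forward” problem concrete — a list where the front of the line is index 0:

```python
# A naive queue: a Python list whose front is index 0.
# enqueue is an O(1) append, but dequeue is list.pop(0), which
# shifts every remaining element forward one slot -- the
# "everyone steps forward" cost described above.

class NaiveQueue:
    def __init__(self):
        self._items = []

    def enqueue(self, item):
        # Place the new customer in the space behind the last one: O(1).
        self._items.append(item)

    def dequeue(self):
        # Removing index 0 forces Python to copy every later element
        # down one position, so this costs O(n).
        return self._items.pop(0)

q = NaiveQueue()
for name in ["Ada", "Grace", "Barbara"]:
    q.enqueue(name)
print(q.dequeue())  # prints Ada -- first in, first out
```

At supermarket scale the `pop(0)` cost is invisible, but timing this loop with a few million elements makes the O(n) shuffle very visible to students.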
We might wonder if there’s a better way. Can students think of one? Here are some of the answers I’ve gotten when I’ve posed this question:
Every customer can step forward at once. This is, of course, true — in the real world. In our computational model, however, there is no way to implement this. Bytes cannot “step forward”; they must be copied into new locations by a processor, and this cannot happen “all at once.”
Customers don’t line up at all; cashiers move about the store, helping customers in the order they arrived. This is how the Apple Store works. There are many benefits to this system. The capacity of the queue is the capacity of the store: no space needs to be allocated for a “line” up front. And customers don’t need to be shuffled about. The problem is that we really have just kicked the can down the road; somehow, the Apple Store Geniuses need to keep track of which customer to help next, and to do this, they need… a queue. Of names, not people, but still: a queue. So really, the question becomes, how should that queue be implemented?
- On paper — add and cross off names as customers arrive and are helped. This clipboard approach is commonly seen at restaurants, to keep track of who should be seated next. It works, but uses a lot of paper. Even if the line at the restaurant is only ever a few people long, over the course of the day, countless sheets of paper might be used. This is because once space has been used to store a name, we never try to reuse it, even after that person has been seated. This is equivalent to a supermarket in which the line just keeps growing backward, with no one ever stepping forward, even after customers have been helped.
- On paper — but when we get to the bottom of the page, we go back up to the top. To save paper, we could use a system whereby when you reach the bottom of the page, you start over at the top, erasing the crossed-out names. (Equivalently, you could erase each name as you cross it off.) This is a very good idea; we call this implementation a ring buffer. The maximum capacity is however many names you can fit on a page.
- Trust each customer to know who arrived after him. I’ve been in barbershops that work this way. In this model, when you’re done helping someone, you ask him who arrived immediately after. In order for this to work, when a new person arrives, the previous person must make a mental note of this fact. This is a linked memory implementation of a queue, and is quite efficient, if you can trust your customers. Luckily, as programmers, we can trust our customers; linked list nodes tend not to lie about the pointers they store.
Customers line up, but don’t move: the cashier moves down the line, and when the line is full, new customers “wrap back around” to the front. This might seem a bit ridiculous, but it is just the real-life version of the “ring buffer” solution proposed for use on paper above. This would work — if it didn’t confuse the customers!
By the end of the discussion, we have discovered the two main implementations of a queue: using contiguous memory and a ring buffer, or using a linked list.
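Both implementations the discussion arrives at can be sketched in Python (classroom-style sketches; the class names and the fixed capacity are illustrative). In each, both enqueue and dequeue run in O(1):

```python
class RingBufferQueue:
    """Fixed-size 'page of names': indices wrap around instead of shifting."""
    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._front = 0   # index of the next customer to help
        self._size = 0

    def enqueue(self, item):
        if self._size == len(self._slots):
            raise OverflowError("the page is full")
        back = (self._front + self._size) % len(self._slots)
        self._slots[back] = item  # wrap back around to the front if needed
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("empty queue")
        item = self._slots[self._front]
        self._slots[self._front] = None                      # erase the name
        self._front = (self._front + 1) % len(self._slots)   # cashier moves down the line
        self._size -= 1
        return item


class _Node:
    def __init__(self, value):
        self.value = value
        self.next = None  # "who arrived after me?"


class LinkedQueue:
    """Barbershop model: each customer remembers who arrived next."""
    def __init__(self):
        self._head = None  # next to be helped
        self._tail = None  # most recent arrival

    def enqueue(self, item):
        node = _Node(item)
        if self._tail is None:
            self._head = node
        else:
            self._tail.next = node  # the previous arrival makes a mental note
        self._tail = node

    def dequeue(self):
        if self._head is None:
            raise IndexError("empty queue")
        node = self._head
        self._head = node.next
        if self._head is None:
            self._tail = None
        return node.value
```

The trade-off mirrors the metaphors: the ring buffer pre-allocates its whole “page” and has a hard capacity, while the linked queue grows and shrinks freely but spends a pointer per customer.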
The Dictionary Data Structure. A dictionary is a data structure for retrieval. For instance, I might have the name of a book and wish to locate the book itself. How do we handle this kind of problem in the real world? With books, we typically use alphabetized collections, on which we can perform a sort of manual binary search. It is possible, and not a terrible idea, to implement dictionaries as sorted arrays. Maintaining them can be quite costly, however; each time we wish to add a new book, we must move many other books over, in a process analogous to the customers-all-moving-forward situation we discussed in the queue metaphor.
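The sorted-array dictionary can be sketched with Python’s `bisect` module (the book titles here are illustrative): lookup is an O(log n) binary search, but insertion pays the O(n) shifting cost, just like the supermarket customers stepping forward.

```python
import bisect

# A sorted "shelf" of book titles, kept in alphabetical order.
shelf = ["A Tale of Two Cities", "Dune", "Moby-Dick"]

def contains(titles, title):
    # Binary search: O(log n) comparisons.
    i = bisect.bisect_left(titles, title)
    return i < len(titles) and titles[i] == title

def add(titles, title):
    # insort finds the right spot by binary search, but then every
    # later title must shift over to make room: O(n) overall.
    bisect.insort(titles, title)

add(shelf, "Emma")
print(shelf)  # prints ['A Tale of Two Cities', 'Dune', 'Emma', 'Moby-Dick']
```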
This becomes even more ridiculous if we consider a different use case: retrieving any type of item by name. Clearly, we have methods of doing this. In my apartment, I can quickly locate salt, a belt, the piano, my favorite shirt, and so on. But — also clearly — those methods do not involve maintaining a sorted list for binary-searching. (“OK, I need to find my toothbrush. I’ll begin halfway through the house, where I find middle-of-the-alphabet items like my mantle and my mezuzah. T comes after M, so I move further down the hall, and land at my Star Market shopping bag.”)
How do we retrieve items in our house? We have a mental system: based on the properties of an object, we store it in a special location. When we want to retrieve it, we ask ourselves: “Where would I have put a toothbrush?” This is the idea behind a hash table. A hash function plays the role of our mental system, consuming a key and outputting a location for storage.
Several questions immediately present themselves:
What if our mental system maps two objects to the same place? I might, for instance, want to store a salt shaker in my “condiments” drawer, only to discover that my pepper shaker is already there, and no room remains for the salt. We can imagine solving this problem in a couple different ways:
- Get bigger drawers. But this is wasteful. There is plenty of unused room in my house; just not in this particular drawer. Expanding all the drawers in my home would lead to lots of unused space.
- Use a backup location. Perhaps I always have some way of coming up with another sensible location for an object. In this instance, I might place the salt in my white-powders drawer, which is currently empty (I am a law-abiding citizen). On retrieval, I would still ask myself, “Where would I put a salt shaker?”, and I would still check the condiments drawer first. But when I see that pepper is already in the drawer, I can then ask myself, “If the condiments drawer were full, where would I have tried to store the salt next?” Following this line of reasoning, I will eventually either come to an empty drawer (indicating that I am out of salt), or find the salt shaker. In the worst case, this could lead me all over the house in search of an object — O(n). But typically, if my house is big enough and I haven’t stuffed it to the brim, it shouldn’t take too long to find something. And if I really have stuffed every nook and cranny with things, I should probably move to a bigger house. (In a precise way, these moves — in which every object needs to be re-sorted into a new home — are infrequent enough that they do not affect the efficiency analysis.) This strategy is called probing or open addressing.
- Put the salt in the yard, and leave a sticky note in the condiments drawer describing its location. If I need to add yet another condiment, I’ll put it somewhere else in the yard, and add a note by the salt explaining where to find it. The process of retrieval is similar to the one above — I follow a chain of condiments until I find the one I want — but this version (a) uses more space (in the yard), and (b) does not require that I have a good mental algorithm for determining back-up locations. This strategy involves storing linked lists in each drawer, and is called separate chaining.
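The separate-chaining strategy can be sketched in Python, with plain lists playing the role of the sticky-note chains (the drawer count is illustrative, and `hash()` stands in for our mental “where would I put this?” system):

```python
class ChainedTable:
    """Separate chaining: each 'drawer' holds a list of (key, value) pairs."""
    def __init__(self, drawers=8):
        self._drawers = [[] for _ in range(drawers)]

    def _drawer(self, key):
        # The hash function maps a key to a drawer.
        return self._drawers[hash(key) % len(self._drawers)]

    def put(self, key, value):
        chain = self._drawer(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already here: replace its value
                return
        chain.append((key, value))       # collision? just extend the chain

    def get(self, key):
        # Follow the chain of condiments until we find the one we want.
        for k, v in self._drawer(key):
            if k == key:
                return v
        raise KeyError(key)
```

Forcing every key into a single drawer (one chain) degrades lookup to a linear scan, which is a nice way to show students why a good hash function matters.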
What happens when we remove an object? It depends on which collision resolution strategy from above we are using.
- Separate chaining. In separate chaining, when something is removed, I simply remove it from the linked list. If I were to remove the pepper from my condiments drawer, I would check the sticky note and see that there is salt out in the yard. I would grab the salt and bring it back to the condiments drawer — along with the sticky note affixed to it, which still directs me correctly to the third condiment (now the second one), still sitting in the yard.
- Probing. If I throw out my pepper, I run into a problem. When I am looking for salt, I check the condiments drawer, find it empty, and think to myself: “I must not have salt, because if I did have salt, I would have put it here.” There is a simple solution. When I remove the pepper, I put a “dummy object” into the drawer. Now, when I want to find salt, I first see the dummy object, and think: “Something was removed from this location recently. It’s possible it was still here when I put the salt away, and that I went to a backup location to store the salt.” The dummy does not stay in the condiments drawer forever: the next time I want to add a condiment, I can throw the dummy object away, and use the condiments drawer for its intended purpose.
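The probing strategy, including the dummy-object trick for deletion, can be sketched like so (a simplified classroom sketch: linear probing, a fixed table size, and no resizing, so the “move to a bigger house” step is omitted):

```python
_EMPTY = object()      # drawer has never held anything
_TOMBSTONE = object()  # the "dummy object" left behind by a removal

class ProbingTable:
    """Open addressing with linear probing and tombstones on delete."""
    def __init__(self, capacity=8):
        self._keys = [_EMPTY] * capacity
        self._vals = [None] * capacity

    def _probe(self, key):
        # Yield drawer indices in the order we would check them.
        i = hash(key) % len(self._keys)
        for _ in range(len(self._keys)):
            yield i
            i = (i + 1) % len(self._keys)  # "the next sensible location"

    def put(self, key, value):
        target = None
        for i in self._probe(key):
            if self._keys[i] is _EMPTY:
                if target is None:
                    target = i
                break
            if self._keys[i] is _TOMBSTONE:
                if target is None:
                    target = i       # a dummy's drawer can be reused
            elif self._keys[i] == key:
                target = i           # key already stored: overwrite it
                break
        if target is None:
            raise OverflowError("time to move to a bigger house")
        self._keys[target] = key
        self._vals[target] = value

    def get(self, key):
        for i in self._probe(key):
            if self._keys[i] is _EMPTY:
                break  # a truly empty drawer: the key was never stored
            if self._keys[i] is not _TOMBSTONE and self._keys[i] == key:
                return self._vals[i]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            if self._keys[i] is _EMPTY:
                break
            if self._keys[i] is not _TOMBSTONE and self._keys[i] == key:
                self._keys[i] = _TOMBSTONE  # leave the dummy object behind
                return
        raise KeyError(key)
```

The crucial detail is in `get`: a tombstone does not stop the search the way a truly empty drawer does, so items stored “past” a removed item remain findable.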
How can I loop through all the objects? Suppose I want to take inventory of every item in my house. I have no other option than to go through every possible storage space and see if there’s something there. This also means that the order in which I encounter objects is somewhat random; it is not the same as the insertion order.
Interestingly, Python 3.6 does, by default, loop through dictionary items in the order that they were inserted. How is this accomplished? In an effort to save space, Python 3.6 dictionaries work like an especially space-conscious homeowner, who has moved into a much smaller house, but keeps a warehouse out back. In the warehouse, items are stored in the order they were acquired; when a new item arrives, it is shelved just after the last one that arrived. The house itself still acts like a hash table, but rather than store the objects themselves in our drawers, we store slips of paper with “warehouse numbers,” telling us where a thing is stored in the warehouse. (This means all of our cupboards can be very, very small.) When we want to loop through the items we own, there is no need to search through cupboards: we can simply go out to the warehouse and look at our items, stored compactly in the order they were added. An interesting exercise for students: how should we handle deletion in this model?
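A toy version of this compact layout can make the idea concrete for students. This is a sketch of the scheme, not CPython’s actual implementation: a dense `entries` list plays the warehouse, and a small index table plays the drawers full of warehouse numbers. (Deletion is deliberately left out, matching the exercise above.)

```python
class CompactDict:
    """Toy insertion-ordered dict: dense entries + sparse index (sketch only)."""
    def __init__(self, slots=8):
        self._index = [None] * slots   # drawers: each holds a warehouse number or None
        self._entries = []             # warehouse: (key, value) pairs in arrival order

    def _slot_for(self, key):
        # Linear probing over the small index table.
        i = hash(key) % len(self._index)
        while True:
            entry_no = self._index[i]
            if entry_no is None or self._entries[entry_no][0] == key:
                return i
            i = (i + 1) % len(self._index)

    def put(self, key, value):
        i = self._slot_for(key)
        if self._index[i] is None:
            # Shelve the new item just after the last one that arrived.
            self._index[i] = len(self._entries)
            self._entries.append((key, value))
        else:
            self._entries[self._index[i]] = (key, value)

    def get(self, key):
        i = self._slot_for(key)
        if self._index[i] is None:
            raise KeyError(key)
        return self._entries[self._index[i]][1]

    def items(self):
        # No need to search through cupboards: just walk the warehouse.
        return list(self._entries)
```

Because `items()` reads straight from the dense list, iteration is both fast and in insertion order, and the sparse index table only ever stores small integers.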
A good metaphor for this kind of teaching, I think, is one that can be extended, and that can be used to frame and think about a host of questions about the data structure in question. Do you have extended metaphors you use in your teaching? I’d love to hear about them!