Choosing Data Structures: A guide for the confused
This past Friday I was teaching a workshop at Coalition for Queens to prepare alumni of their code school for interviews as software developers. One attendee asked about resources for learning how to choose data structures in practice, because she had never been able to find such a guide. I did some searching, and there isn’t anything that quite answers that description. I hope to update this guide with more examples and improvements as I find them. Please leave your feedback!
Before going into my attempt to explain this topic, here are the two best documents I could find online. The first is a set of slides for what I bet was a great talk; reading the slides alone, you’ll need to do a lot of research to understand what they’re talking about, but that research will certainly be very profitable. The second crams a lot of theoretical information into a small space.
What I hope to present here is not an introduction to the theory, but rather a guide to what theory you should know, how to use it to make decisions about picking data structures in your programmes, and a tour of the most commonly used structures in practice.
How to shop for data structures
As you build experience programming, you’ll get to know the structures you always use. It’s worth knowing about the other data structures easily available to you, and having some idea of their characteristics as outlined below, for when you realise that the everyday data structures don’t quite fit the bill.
Know what operations you need to be able to perform (like traversal, retrieval, deletion, insertion, updating), what operations you’ll perform frequently (insertion?), what operations you won’t need at all (deletion?). You’ll need this so you can (a) identify if a data structure meets your needs at all; (b) identify if you can combine it with another to meet your needs; and (c) know if it’s going to be fast or slow for your use case.
Know the asymptotic complexity of those operations. This is about the question “which data structure will still be efficient when it’s really big?”, or rather “which operations will still be efficient when the structure is really big?”. Wikipedia has good writeups here and here.
The very very condensed version is this:
If an operation (like insertion into a dict) always takes the same amount of time, no matter how big the structure is, we call it “constant time”, and denote it with O(1).
If it scales linearly with the number of entries in the structure (i.e. if the size of the collection triples, the time taken to do the operation triples) we call it “linear time”, and denote that with O(N).
If it scales as a polynomial of the input (i.e. if the time is proportional to the square or the cube of the input) we call it polynomial time, and denote that with O(N^2) or O(N^3) (or whatever the highest order term is in the polynomial). This is generally bad. You want to avoid these cases if possible.
Similarly for even faster-growing functions, such as exponentials — O(2^N). That’s an operation which takes time proportional to 2 to the power of the number of items in the collection.
Also interesting are operations that take O(log N) time. These grow as the logarithm of the number of items in the collection, which means they grow much more slowly than linear-time (O(N)) operations. Accordingly, where constant-time operations are not available, operations that grow as log(N) of the collection size are still considered fast, and will need a lot of items before they get slow.
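To make the difference concrete, here is a sketch in Python (chosen because this article already talks in terms of dicts and lists) comparing an O(N) linear scan with an O(log N) binary search over the same sorted data. The function names are my own:

```python
from bisect import bisect_left

def linear_search(items, target):
    """O(N): may inspect every element before finding the target."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

def binary_search(sorted_items, target):
    """O(log N): halves the search space each step (requires sorted input)."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = list(range(1_000_000))
assert linear_search(data, 765_432) == 765_432   # ~765,000 comparisons
assert binary_search(data, 765_432) == 765_432   # ~20 comparisons
```

On a million items, the logarithmic version does roughly 20 comparisons where the linear one may do close to a million: that is the practical meaning of the growth rates above.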
Two data structures to rule them all
Hash table-backed dictionary
Use this when you want to look data up by keys. That is, when you know that each datum you want to look up has a “name” of some kind which you can pass around and use to retrieve the full entry. If you’ve ever wanted to dynamically create variables, this is the tool you’re looking for (don’t dynamically create variables, even if you can).
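For instance, a minimal Python sketch (the `users` dict and its field names are invented for illustration):

```python
# Instead of trying to create variables like user_alice, user_bob dynamically,
# store the records in a dict keyed by name.
users = {}
users["alice"] = {"email": "alice@example.com", "age": 34}
users["bob"] = {"email": "bob@example.com", "age": 29}

# The "name" arrives at runtime, e.g. from user input or a file,
# and is used as the key to retrieve the full entry.
name = "alice"
print(users[name]["email"])  # alice@example.com
```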
By nesting dicts within dicts you get an n-ary tree, though your code can get complex if you don’t structure the nesting carefully. Dicts are a particularly good choice for sparse collections — when only some integers will map to values, when describing a grid which is mostly empty, and the like. This is because space consumption is linear in the number of actual entries, and lookup time should not grow with the number of entries.
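As a sketch of the sparse-grid case in Python (the grid contents here are made up):

```python
# A mostly-empty 1,000,000 x 1,000,000 grid: a dense 2D array would need
# ~10^12 cells, but a dict keyed by (row, col) tuples only stores the
# cells that are actually set.
grid = {}
grid[(5, 7)] = "treasure"
grid[(999_999, 0)] = "monster"

def cell(row, col):
    # Empty cells read back as None without ever being stored.
    return grid.get((row, col), None)

assert cell(5, 7) == "treasure"
assert cell(0, 0) is None
assert len(grid) == 2  # space grows with entries, not with grid size
```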
Retrieval of a specific element, insertion, deletion, all take place in constant time (O(1)) or amortised constant time (n operations in O(N) time — i.e. constant time on average), depending on the exact implementation. Traversal is linear (O(N)).
What’s not to like? First of all, traversal is usually in arbitrary order (to see why, read up on how they are implemented), so if you want to traverse a dict in a specific order, you’ll need to do some additional book-keeping. If you want to traverse them in the order of insertion, there are implementations that do that. If you want to traverse in some other order, you’ll need something else either in addition to, or instead of, the dict.
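One simple piece of book-keeping, sketched in Python. (Note that CPython dicts have guaranteed insertion-order traversal since Python 3.7, and `collections.OrderedDict` exists too, so in that language you may not need this by hand; the technique is shown because it works for any hash table.)

```python
# Keep an explicit list of keys alongside the dict so you control
# traversal order (here, insertion order).
entries = {}
order = []

def put(key, value):
    if key not in entries:
        order.append(key)
    entries[key] = value

put("banana", 3)
put("apple", 5)
put("cherry", 1)

# Traverse in insertion order, driven by the key list:
assert [(k, entries[k]) for k in order] == [
    ("banana", 3), ("apple", 5), ("cherry", 1)
]
# For a different order, sort the key list instead:
assert sorted(order) == ["apple", "banana", "cherry"]
```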
Dynamic array-backed list
Use this when you want to model a sequence of things that can change. That means, essentially, that for every integer in the range (0…n) there is a defined entry (even if that entry is some kind of placeholder that says “this entry is undefined”). It’s good for cases where you start with no entries, and you build up the collection. It’s good (or at least acceptable) for cases where you might want to append to either end. It’s good for cases where you want to look things up by their index in the sequence; it’s good for cases where you want to overwrite entries by index; it’s good for cases where you want to traverse in index order.
Retrieval of a specific element is typically O(1) (constant time); update of an existing index will have the same time complexity. Traversal in index order (or reverse index order) will be linear time (O(N)). Insertion of new elements in the middle will usually be O(N). Insertion at the high-index end will usually be amortised constant time (n insertions in O(N) time), and depending on the implementation, the same may be true at the 0-index end.
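These costs can be seen in a short Python sketch using the built-in list, which is backed by a dynamic array:

```python
nums = []
for i in range(5):
    nums.append(i * 10)       # amortised O(1) at the high-index end

assert nums[3] == 30          # O(1) retrieval by index
nums[3] = 99                  # O(1) update of an existing index
nums.insert(0, -1)            # O(N): every element shifts one slot right
assert nums == [-1, 0, 10, 20, 99, 40]
```

Python’s list does not give constant-time operations at the 0-index end; `collections.deque` is the standard-library structure that offers O(1) appends and pops at both ends.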
What it’s not good for: keeping things in sorted order, doing lots of random insertions in the middle of the sequence (which is actually something you rarely want to do).
Things neither of these are good for
Neither type of collection is especially well adapted for finding elements which have a particular relationship to the overall collection (minimum, maximum, median). Frequently, though, you can keep track of the minimum, maximum, and mean on insertion, so in practice you will relatively rarely need that functionality.
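For example, a running minimum, maximum, and mean can each be maintained in O(1) per insertion, sketched here in Python:

```python
# Track summary statistics as values arrive, instead of scanning
# the whole collection each time you need them.
count, total = 0, 0
lo, hi = float("inf"), float("-inf")

for x in [7, 2, 9, 4]:
    count += 1
    total += x
    lo = min(lo, x)
    hi = max(hi, x)

assert (lo, hi) == (2, 9)
assert total / count == 5.5   # running mean
```

The median is the awkward one: it cannot be maintained this cheaply, which is one reason heaps (below) are worth knowing about.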
You can also combine these structures to achieve new functionality. If you want to iterate over a dict in the order you inserted things, you could append the keys to a list as you insert them, and iterate over the list to drive iteration over the dict. Similarly, if you really want a sequence with fast testing of whether an item is in it, you could use a list as the main structure, but use a dict (or set) to check for membership.
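A Python sketch of that second combination (the helper name `add` is my own):

```python
# A sequence that also answers "is x in here?" in O(1):
# the list keeps the order, the set answers membership queries.
items = []
seen = set()

def add(x):
    if x not in seen:      # O(1) check, vs O(N) for `x in items`
        seen.add(x)
        items.append(x)

for x in ["a", "b", "a", "c", "b"]:
    add(x)

assert items == ["a", "b", "c"]   # order preserved, duplicates dropped
assert "c" in seen                # O(1) membership
```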
Other relatively common and useful structures
Hash table-backed set
Just like the hash table-backed dictionary, but without the mapping functionality: it stores keys with no associated values. A fairly common use case.
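A quick Python illustration (the values are invented):

```python
# Good for membership tests and de-duplication, with the same
# O(1) lookup as a dict.
visited = set()
visited.add("/home")
visited.add("/about")
visited.add("/home")          # duplicates are silently ignored

assert "/about" in visited    # O(1) membership test
assert len(visited) == 2
# Sets also support set algebra: union, intersection, difference.
assert visited & {"/home", "/contact"} == {"/home"}
```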
Heap queue (or priority queue)
Read the Wikipedia entry here. This is good when you want to push items into your collection as you find them, but retrieve or iterate over them in sorted order. Bounded priority queues are good for scanning a whole collection while keeping only the top n items, which you can then get back in sorted order.
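In Python, the standard-library `heapq` module provides exactly this; a brief sketch:

```python
import heapq

# Push items as you find them; pop always returns the smallest so far.
heap = []
for x in [5, 1, 4, 2, 3]:
    heapq.heappush(heap, x)       # O(log N) per push

assert heapq.heappop(heap) == 1   # O(log N) per pop, in sorted order

# Bounded "top n": nlargest/nsmallest use a small heap internally,
# so you never need to sort the whole collection.
assert heapq.nlargest(2, [5, 1, 4, 2, 3]) == [5, 4]
```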
Like this and want others to see it? Click the green heart or share buttons! Have any comments, criticisms, or suggestions for improvements? Leave a comment, I would really appreciate it, and will update the article to make it better.