Data Structures Part 1: Essentials
A data structure is, at its core, a technique for storing and organizing data. Each has its own advantages over the others; as you would imagine, it is a matter of trade-offs. There are many data structures that can be handy in various situations, yet there are some that have proven themselves in countless battles. I believe every programmer must know how they work, and their pros and cons, by heart. Without further ado, behold, the Gods of the Arena:
Linked List
A Linked List consists of what are called nodes. Each node has a pointer to the node next to it (and another pointer to the previous node if the list is a doubly linked list). Aside from the pointer(s), a node of course holds the desired data, such as an integer.
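To make the idea concrete, here is a minimal node sketch in Python (the names `Node`, `data` and `next` are just illustrative choices, not a prescribed API):

```python
# A minimal sketch of a singly linked list node; a doubly linked list
# would add a `prev` pointer next to `next`.
class Node:
    def __init__(self, data):
        self.data = data   # the payload, e.g. an integer
        self.next = None   # pointer to the next node (None at the end of the list)
```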
Linked List vs Array
Linked lists seem similar to arrays, though they have crucial differences. Unlike arrays, linked lists do not have a fixed size; they grow as new nodes are added. There is no random access support in a linked list, hence you cannot access, say, the 3rd index of a linked list by subscript. Arrays take less memory, since they do not need to allocate extra memory for pointers. Linked lists provide constant time insertion and deletion, which can also make them a weapon of choice.
Insertion
Adding to the front of the list happens intuitively in constant time [O(1)], as long as a pointer to the head is always kept up-to-date: assign the new node's next pointer to the current head, then update the head to the new node. Insertion in the middle is a constant time operation iff (if and only if) we already have access to the node that will come right before the new node (often, this is not the case). I know this sounds confusing, so let's go over an example. Say we want to insert the value 41 after the node with the value 99. Hold a temporary pointer to the next node of 99. Assign 99's next pointer to 41. Assign 41's next to the node saved in the temporary pointer. Voila, insertion complete (see the sketch below). Other than this scenario, insertion at an index takes linear time [O(n)], since we must first traverse to that position.
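A rough sketch of both constant time scenarios, building on the `Node` class from the sketch above (the function names are hypothetical):

```python
# Both O(1) insertion scenarios, assuming the Node class from the sketch above.

def prepend(head, value):
    """Add a new node at the front of the list; returns the new head. O(1)."""
    new_node = Node(value)
    new_node.next = head   # the new node points at the old head
    return new_node        # the new node becomes the head

def insert_after(node, value):
    """Insert a new node right after `node` (e.g. 41 after 99). O(1)."""
    new_node = Node(value)
    new_node.next = node.next  # 41's next points at whatever followed 99
    node.next = new_node       # 99 now points at 41
```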
Deletion
When we have access to the node to be deleted, deletion is a constant time [O(1)] operation. If the node in question is at an edge, all we need to do is detach it by updating the head (or, in a doubly linked list, the tail) pointer. It is a little bit trickier when the node is in the middle: hold the next node in a temporary pointer, copy the next node's data into the current node, assign the current node's next pointer to the temporary node's next, and delete the temporary node. If we have the index only, we have to traverse the list, which takes linear time [O(n)].
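Here is a small sketch of the copy-the-next-node trick described above, again assuming the earlier `Node` class (note it only works when the node is not the tail):

```python
# Deleting a middle node in O(1) when we only hold a pointer to that node.
# Assumes the Node class from above; the node must not be the tail.

def delete_node(node):
    temp = node.next        # the node after the one being "deleted"
    node.data = temp.data   # copy its data into the current node
    node.next = temp.next   # bypass it; `temp` becomes unreachable
```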
Find/Search
So the good old find operation has linear [O(n)] complexity, due to the fact that there is no random access. Starting from the head, the list must be traversed node by node until the comparison returns true (or we run out of nodes).
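A minimal linear search sketch, assuming the same `Node` class as before:

```python
# Linear-time search: walk the list from the head until the value matches.

def find(head, target):
    current = head
    while current is not None:
        if current.data == target:
            return current      # found the node
        current = current.next
    return None                 # reached the end: not in the list
```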
Use Cases
Most languages support dynamically sized arrays, so if you need a linked list only for its ability to grow, think again :] Linked lists are powerful when you need those constant time insertions and deletions. LRU (Least Recently Used) caching is a good match: an LRU cache will often require you to remove the just-accessed node from the middle of the list and append it to the end. You can use LRU caching for images, say, to store images from a news feed in a social media mobile application. Instead of waiting for a request every time a recently shown image is to appear again, you can pull it up from the cache without any delay whatsoever.
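As a rough illustration (not a production-grade cache), Python's `OrderedDict` bundles a hash map with a doubly linked list internally, which is exactly the pairing an LRU cache needs; the `LRUCache` name and `capacity` parameter below are just illustrative:

```python
from collections import OrderedDict

# Rough LRU cache sketch. OrderedDict pairs a hash map with a doubly linked
# list internally, which is exactly the combination described above.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)         # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
```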
Binary Search Tree
Going to a developer interview without knowing how a BST works is like being in the situation below, but instead of a sword, you are holding a spaghetti noodle. A half-cooked spaghetti noodle.
A tree structure consists of nodes, each with a parent node and child node(s). The parenthetical "(s)" in the latter is crucial: each node can have one parent and one parent only (except the root, which has none), though it can have many children depending on the tree structure. In a Binary Search Tree, as the name suggests, each node can have up to two children.
BSTs, like many other trees, provide fast look-up; logarithmic complexity is the best scenario after constant look-up. BSTs have logarithmic [O(log n)] complexity for search, insertion and deletion, on average. Though many other structures provide constant insertion and deletion time, this is the price we have to pay for an ordered structure with [O(log n)] look-up. As can be seen in the previous image, each node has its lower valued descendants in its left subtree and higher ones in its right subtree (equal values may go left or right depending on the implementation, as long as it is consistent, though most tree implementations only allow unique values).
Search
Let's say we want to retrieve the node with the value 7. In all tree structures, we have access to the root node. Starting from the root node, we compare the current value with the one in hand. If the value we are searching for is less than the current value, we go to the left subtree; if higher, we continue to the right subtree. We do this recursively (or iteratively, for that matter) until we find the node we are searching for or until we reach some leaf node (a leaf node is one of the bottommost nodes, or in other terms, a node without any children). So: 7 is less than 8 [GO LEFT]. 7 is more than 3 [GO RIGHT]. 7 is more than 6 [GO RIGHT]. 7 is equal to 7, EUREKA! Now think about what happens when we continue to a subtree: at each downward step in the hierarchy, a subtree is discarded, i.e. we do not care about any element in the right subtree of 8, since every value in that subtree must be higher than 7. When we dump half of the remaining inputs at each step, we get logarithmic complexity [O(log n)].

To elaborate further, imagine we have a perfect tree with 128 – 1 = 127 nodes (2⁷ – 1): a tree with 7 levels, where every internal node has 2 children (in the previous image the height is 3). We subtract one because each level doubles the node count and the root is a lone ranger without any siblings. Starting from the root, when we move to the left (or right, obviously) subtree, we discard half of the rest: 63 nodes (63 in the left subtree, 63 in the right and 1 root). By going 1 level deeper, we discard 31 more nodes and we are now left with 31 nodes to search. Then 15 nodes, and so on and so forth. Hence the [O(log n)] complexity (in Computer Science, the default base for log is 2, not 10 as in mathematics).
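A minimal sketch of a BST node and the recursive search walk described above (`TreeNode` and `search` are illustrative names, not a standard API):

```python
# Minimal BST node plus the recursive search walk described above.

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None    # subtree holding smaller values
        self.right = None   # subtree holding larger values

def search(node, target):
    if node is None:
        return None                        # fell off a leaf: not in the tree
    if target == node.value:
        return node                        # EUREKA
    if target < node.value:
        return search(node.left, target)   # GO LEFT
    return search(node.right, target)      # GO RIGHT
```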
Insertion
The only deal here is to find the corresponding spot for our value, and we do this using exactly the same searching mechanism described above. Using the same tree in the image, let's insert the value 9 into the tree. Starting from the root, we walk down until we reach the node with value 10 and find an empty spot on its left; we then assign the left pointer of the node with value 10 to our new node. If we wanted to insert 11, we would keep walking: 11 is greater than 10 [GO RIGHT], 11 is less than 14 [GO LEFT], so 11 becomes the left child of 14. New values are always attached as leaves, so no existing node has to move. Finding the spot is [O(log n)], linking the new node is [O(1)]. Result: [O(log n)] in total.
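A matching insertion sketch, reusing the `TreeNode` class from the search example; it walks down exactly like `search` and attaches the new value as a leaf:

```python
# Insertion sketch, reusing TreeNode: walk down as in search and
# attach the new value as a leaf once an empty spot is found.

def insert(node, value):
    if node is None:
        return TreeNode(value)                  # empty spot: create the leaf here
    if value < node.value:
        node.left = insert(node.left, value)    # smaller values go left
    else:
        node.right = insert(node.right, value)  # larger (or equal) values go right
    return node
```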
Deletion
Finding the node to be deleted is again just the same, though the deletion itself is a little more complex. There are 3 different cases: the node to be deleted has no children, has one child, or has two children. Each of these cases requires different treatment, but in all of them the operation again runs in [O(log n)]. I will not reinvent the wheel here; please refer to this clear explanation from Wikipedia for all three cases.
Recap
Search, insertion and deletion work in [O(log n)] for a BST on average, and a BST is great when we need a sorted structure with a fast search/find operation. Yet it degrades to [O(n)] in the worst case. The reason is the following: consider the case where every value we insert is greater than the previous one. Insert 1, then 3 as its right child, then 17 as 3's right child, and so on. We end up with what is effectively a linked list, which takes linear time to traverse. To solve this problem, the league of data structures introduces balanced search trees: the AVL Tree and the Red-Black Tree (I highly recommend learning at least one balanced tree; I will discuss the AVL Tree in the next post). Balanced trees keep the tree "balanced" by rearranging nodes after each insertion and deletion, thus preventing the tree from degenerating into a linked list. There are many other tree structures aimed at different problems, such as tries and heaps. However, now that you are familiar with the BST, you can learn the rest much more easily. Yay!
Stack
The structure that, when it overflows, results in a remarkable thing, a home for developers from all around the world (and also the occasional crash in programs). Introducing: the Stack.
This photograph illustrates exactly how a stack works: you can only add new elements (in this case, stones) to the top, and you can only remove from the top. Unless you are some kind of ninja, you cannot get the pebble in the middle without making some of them fall. So you only have access to the top element, and you can only append to the top and remove from the top. In technical terms, this is called LIFO (last-in-first-out) order.
A stack must support push, pop and peek operations, all working in constant time [O(1)]. Pop removes the element at the top of the stack and returns its value. Push appends a new item to the top of the stack. And finally, peek retrieves the value of the element at the top of the stack (basically a pop without removal). Along with these three operations, isEmpty and isFull are nice-to-haves of a stack implementation. Stacks can be nicely implemented using Linked Lists (we just found another use case for our good friend the Linked List). The stack is the backbone of memory management (more on stack & heap memory). Apart from that, any time you need to retrieve the most recently processed image, integer, item, whatever, you will (or should) probably use a stack. They are as common as cheesy scenes from the Twilight trilogy (or tetralogy? pentalogy? *insert dodge face*).
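A minimal sketch of a stack backed by a Python list (a linked-list-backed version would work just as well); the class and method names are illustrative:

```python
# Stack sketch backed by a Python list; push, pop and peek are all O(1).

class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)       # add to the top

    def pop(self):
        return self._items.pop()       # remove and return the top element

    def peek(self):
        return self._items[-1]         # look at the top without removing it

    def is_empty(self):
        return len(self._items) == 0
```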
Queue
The thing we all hate to be in: the queueueueueue. This guy is doomed to be introduced after the Stack. If you were to go to a planet in the Andromeda Galaxy and attend the data structures lesson of some alien life form, you would hear, in some weird alienish language: Stack and Queue, not Queue and Stack.
Similar to stacks, queues provide push and pop operations, along with first and last properties (or methods, depending on the language and implementation). Queues, like real life queues, work in FIFO (first-in-first-out) order: push prepends the value (adds it to the beginning) and pop removes from the other end, so the element that has waited longest leaves first. We typically have access to the first and last elements of the queue. Queues can also be implemented using a Linked List. Again, the Queue is a highly common, core data structure; you will most probably need to use one in your programming life. For instance, Breadth First Search traversal makes use of a queue.
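A rough queue sketch following the description above, backed by `collections.deque` so both ends are O(1); the class and method names are illustrative:

```python
from collections import deque

# Queue sketch following the description above: push adds to one end,
# pop removes from the other, so the element that arrived first leaves first.

class Queue:
    def __init__(self):
        self._items = deque()

    def push(self, item):
        self._items.appendleft(item)   # prepend the new value

    def pop(self):
        return self._items.pop()       # remove from the other end (FIFO)

    def first(self):
        return self._items[-1]         # next element in line to leave

    def last(self):
        return self._items[0]          # most recently added element
```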
Hash Table
I will venture to say that what spikes are to the balloon fish, Hash Table is to programming. Without the spikes, you are just a fat potato drifting in the ocean.
A Hash Table stores data in pairs: a key and a corresponding value. It is a type of unordered associative array. Unordered means that there are no indexes and no particular placement: you give a key to the HT, it uses its internal hash function (more on that later) to compute the index for that key, and uses that index to find the entry. A Hash Table provides, on average, constant [O(1)] search {BAM!}, constant insertion {WHAM!} and constant deletion {DAMN!}.
Insertion
The real deal is to understand how insertion works. When we want to insert a value into the table, we need to assign a key to it so that we can retrieve the value later on. Think of a scenario where you need to create a shopping list for the user. For each item (key) that is entered into the system, we want to know its quantity (value). Our drunken user wants to add 3 melons, 2 tooth brushes and 9 bottles of cheap wine to her shopping list. The hash table goes over the following steps: initializing the hash table creates an array of buckets (these buckets may also be linked lists). Then the designated hash function is called with the key, in this case "melon". The result of the hash function is reduced to a valid index with the modulo operator [hash % array_size]. The array size is predetermined and is increased as new inputs are added, once the [number of entries in the table] divided by the [number of buckets] (called the load factor) passes a threshold.

The calculated index will ideally point to a unique slot. But say the array's size is 10 and we just inserted 3 values, each of which can hash to the same index. When that happens (it is called a collision), we must resolve the issue by using one of the collision resolution methods. There are two main techniques: Separate Chaining and Open Addressing. Separate Chaining typically holds a pointer to a linked list in each bucket; every time a collision occurs, it appends the new value to the linked list at the calculated index. Going back to the last example: say "melon" and "tooth brush" hash to index 6. At index 6, we now have a linked list with two entries: melon as key and 3 as value, and tooth brush as key and 2 as value. The other method, Open Addressing, to me seems a little bit smarter: each bucket directly holds the key-value pair (instead of a linked list), and if the calculated index is already occupied, we just need to find an empty slot using one of three probing methods: Linear Probing, Quadratic Probing and Double Hashing. To summarize them as quickly as possible: linear probing increments the index until an empty slot is found, quadratic probing increases the offset quadratically (1, 4, 9, 16… DUH), and double hashing, as the name suggests, uses one more hash function. In all three versions, when the array is full (or, usually, when it passes the load factor threshold), the array size is increased (typically doubled).

Insertion complexity is equal to the hash function calculation time + the probing or chaining time. The hash function has constant time [O(1)] complexity, of course. If the collision resolution and the hash function are implemented carefully, with a sensible load factor, the whole operation almost always takes constant time [O(1)], while a bad implementation can degrade to linear time [O(n)].
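To tie the pieces together, here is a toy separate chaining sketch (no resizing, with Python lists standing in for the per-bucket linked lists); everything about it, from the class name to the default size, is illustrative:

```python
# Toy separate chaining sketch: each bucket is a Python list standing in for
# the per-bucket linked list; a real table would also resize past the load factor.

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size              # hash, then reduce by modulo

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                          # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))               # empty slot or collision: chain it

    def get(self, key):
        bucket = self.buckets[self._index(key)]
        for k, v in bucket:
            if k == key:
                return v
        return None

# groceries = HashTable()
# groceries.put("melon", 3); groceries.put("tooth brush", 2); groceries.put("wine", 9)
```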
Deletion
As I mentioned in the beginning, once you know how insertion works, the rest is a piece of cake. Using the key, the corresponding index is calculated and then the key-value pair is removed from the array. The same rules apply here: good implementation => [O(1)], bad implementation => [O(n)].
Search
I think you got the deal.
Use Cases
This is like asking "what are some use cases for duct tape?". We are talking about constant time look-up, insertion and deletion. I mean, come on… However, as mentioned multiple times, be very careful about your hash function, collision resolution method and load factor. If collisions start piling up, your program's performance goes downhill with them. The silver lining is that most programming languages already have some version of a hash table implemented in the standard library.
What’s Next?
I will talk about balanced trees, graphs, heaps and possibly one or two other data structures in Part 2. Cheers!