Self-Organizing Sorted Linked List in C++

How would Frankenstein implement a doubly linked list?

We are going to create a monster, a doubly linked list, working as a self-organizing list and having additional pointers to interpret its data in a sorted order. All with fancy illustrations, so you would never need another tutorial on how does a linked list work. If you are new to linked lists, this is gonna take some effort to dive in. Prepare for the worst. It’s worth it. No doubt.

If you are familiar with pointers and know how they work, skip to part “Hodog”, otherwise start with “No hodog”.


No hodog

With great power comes great bills for electricity. With great abstraction comes great misunderstanding. Programmers nowadays suffer from these “abstractions” and “misunderstandings”. There are tons of layers of abstractions in programming, which make it sort of “redundant” to understand the internals or to look under the hood of programs, which we will try to do now. Just a little bit, with the help of abstractions. So, there are two main components interesting at this point, it’s the CPU (Central Processing Unit) and the RAM (Random Access Memory).

CPU interacts with RAM via buses

We will concentrate on the yellow thing, the RAM. Whenever we declare a variable, it will take some specified place in the RAM (well, an easy abstraction). We’ll assume that each box on the illustration above represents 1 byte memory, so if we declare a variable that takes 4 bytes, it means 4 boxes should be occupied by it.

Variable declaration and the memory

You might see illustrations like the previous one in other tutorials, but with opposite address ordering, starting from the bottom. In this context there is no any difference between illustration approaches.

We assume that int takes 4 bytes and char takes 1 byte, so again, to get more illustrative:

The same, with more painting

We wrote hexadecimal (same decimal, with extra hexas) numbers near each box, because we wanted to emphasize that each box has its unique address. Memory consists of boxes, each of which has its unique address. Each box is able to hold 1 byte data. That’s all we need to know to move on to...

Pointers. Now that we see that a variable has its type (which is needed to calculate how many boxes will be needed to store the data) and an address, we could come up with a question. What about the name? No, not that question. Can we store the address of a variable? Yes, we can. There are specialized data types called pointers, allowing to store a variable’s address and access to that variable via that address. You already knew that. But seriously, what about the name? Mm, turns out it’s just a convenient way for programmers to operate with data, CPU operates with memory boxes by their relative addresses. Relative to the beginning. It takes constant time to access any box if you know how far is it from the first box. “How far from the first box” is called offset. That’s how arrays work. Contiguous data blocks of the same type.

int arr[4] = {44, 55, 66, 88};

Here’s how arr could be illustrated:

Making this illustration more compact (we use different colors throughout this tutorial to give each color a chance to express itself), let’s assume each box contains 4 bytes of data, but before that, a small remark on so-called “endianness”, it’s about the byte ordering in memory boxes. No worries on byte ordering if you declare a ‘char’ variable, as it takes just 1 byte (one box), things get confusing when we have a variable taking more than a byte. We can place those bytes in two opposite orders, first byte starting first or last byte starting first. The following illustration from Wikipedia is more than enough to understand the difference between little-endian and big-endian byte ordering:

From Wikipedia

There are systems supporting little-endian byte ordering, and systems supporting big-endian ordering. What’s the point of having two different approaches? Well, it’s easy to answer, [in programming] whenever there are two or more choices, there will be two or more standards. This is called “Standardizato ad Quantum Absurdum”. I made it up, there is no such thing, but you can call anything that could be done in one way, but instead is done in all possible “different” ways simultaneously, a “Standardizato ad Quantum Absurdum”.

Back to our array declaration, let’s make a simpler illustration, let’s assume that each memory box represents 4 bytes. So int arr[4] = {44, 55, 66, 88} could be illustrated like this (my painting will be better soon, I promise):

4 byte memory boxes

Pay attention to addresses, arr[0] resides at addresses 0x5000 through 0x5003 bytes (4 consecutive bytes), arr[1] at 0x5004 through 0x5007 and so on. The illustration above is done just to illuminate one deceptively simple idea of contiguous memory locations. Any array element could be accessed in constant time, and by constant time we mean that element access is done in k operations (starting address + number of bytes to jump to target element’s address), and each such element access would take up to k operations each time. Just because array elements are of the same type, all we need to know is its first element’s address, e.g. where does the array start at all. Having the starting address and having the type of an element, we know how many bytes should be read to fetch one element. To access some j-th element, we may just “jump” to that particular location by adding j to the starting address of the array (with respect to element type size, shown a couple of illustrations ago). This could be a crucial point, because sooner we will discuss linked lists, where unfortunately such luxury isn’t an option. And this is one of the most asked interview question: “What is the difference between a vector and a linked list?”.

We got the picture, now get back to pointers. What happens when we declare a pointer.

int* ptr;

Same thing that would happen to a declared variable, int var;

Pointer is just another variable

Turns out pointer is just a disguised variable. Its only difference is the declaration style, using * (asterisk) after the type name, which means that this variable is kind of “special”, it is able to store addresses, actual addresses of actual variables. You are able to get the address of any variable in C++ using a special operator for that, & (ampersand). Put it in front of any variable, and it will return the address at which variable resides.

Assigning an address to a pointer.

So what’s the point of having the address of a variable stored in another variable. Well, the actual usefulness of pointers comes in really handy when operating with dynamic memory (we’ll discuss later in Hodog section), now, we could use this pointer to access that variable without mentioning its name. For instance, if you want to assign that variable a new value, say 15, you just write *ptr = 15; . Or just print its value, cout << *ptr; .

Accessing a variable via some pointer

Take a good look at the illustration above. It discusses 3 cases,

  1. Printing ptr , which does simply print ptr’s value, in this example, 0x5000 .
  2. Printing &ptr, which prints ptr’s address, e.g. at which address does ptr reside in memory.
  3. Printing *ptr , which takes ptr’s value, which is just an address, finds a memory box which resides at that very address and returns that box’s content. This case is harder to understand if you are totally new to pointers, so, take a look at the following illustration:
Dereferencing a pointer

Easy peasy. Pointers are really easy to understand if you imagine them in memory boxes. It’s easier than drawing a Patrick in two easy steps. Finally, the following illustration shows a more abstract way to illustrate pointers.

More abstract way to illustrate pointers

We will mostly use the last approach, if you will have doubts, you can always refer to the following “associative” illustration:

Two approaches to our illustrations

Pay attention to the last lines:

int* pa = &arr[0];
int* pxa = pa + 1;

We’ve added them on purpose, to torture your brain (in case you are new to pointers).


Hodog

Linked lists are data structures. That’s it. Go read more on Wikipedia. A linked list is usually described by its Node structure and standard operations on nodes (insert, remove, fetch/find):

struct ListNode
{
Type item;
ListNode* link; // next
};

ListNode describes (defines) a single list element. In case of arrays, we would use only the “item”. Linked list nodes usually require more data to be able to show us its neighbor node. For the sake of simplicity, let’s just define two types of data structures: 1) contiguous, 2) node-based, the latter being of our interest. Node-based data structure’s nodes are spread across memory, which means that we can use a node’s link to refer it to another node, eventually creating a whole network of hand shaking nodes. Arrays, in the contrary, are contiguous, ensuring their elements reside close to each other:

Illustrations are better than words

So, being more specific, let’s compare an array of integers with a linked list containing integer items. To define each list node, we will use a structure like this:

struct ListNode
{
int item;
ListNode* next;
};

The following code shows an array declaration and further element insertion, and then a linked list declaration and node insertions:

int arr[3]; // array of integers, containing three elements
arr[0] = 21;
arr[1] = 32;
arr[2] = 34;
ListNode* head = new ListNode(); // yep, magic happens with 'new'
head->item = 21;
head->next = new ListNode();
head->next->item = 32;
head->next->next = new ListNode();
head->next->next->item = 34; // a bit of confusing, isn't it?

If you are not familiar with thenew operator, go read on the web, we can’t spend much time on it in this article. Let’s just remind that there usually is a separation of memory into two types, stack and heap. Heap is also called dynamic memory, meaning it will be dynamically allocated whenever requested (by new operator). You might have seen already illustrations like below:

The illustration above is not much “valid”. In reality, both stack and heap are parts of the “same” virtual memory. Just imagine the same old memory separated into two parts, one of them being called stack memory, and the other one dynamic memory. So we will use a hardcore illustration (see below), also pay attention to the sequence of expressions.

So using this illustration, we can depict previous code (array and list creation/initialization). The following illustration contains step by step code mapping, take your time:

There are plenty of more professional illustrations on the web from proven professionals, depicting a better “real-life” situation. The main point of having [at least] two different memory areas for programs is to ease the execution of different blocks of code and managing variables’ lifetimes. See, stack is a “kind of” fixed-sized memory, it can’t be used to create a constantly increasing container elements. Whenever you declare a variable, you command the compiler to allocate a space for that strongly-typed variable, e.g. an integer, taking 4 bytes, or a double, taking 8 bytes and so on, but whenever you need something that “grows” continuously, i.e. you are not able to predict its size while declaring a name for it (say, keeping records of watched movies), but want to make sure it will have enough space (and not more than needed) to hold elements when the time comes (when you watch a new movie and add to the list of already watched movies), you can’t operate on the stack, as stack is a provision of memory for variables whose size is already known. Heap helps with this kind of problems. It provides you a bucket of available memory. Take a byte whenever you need one.

As linked list grows as it goes, it’s a real good candidate to use for storing elements or more precisely, a collection of elements. Imagine we declared a linked list to store movie titles of already watched movies (by Patrick). Patrick already watched “Interstellar”, so we could do something like this:

LinkedList movies;
movies.Insert("Interstellar");

While program runs (suppose it runs days and weeks) and Patrick keeps watching new movies, the list grows.

movies.Insert("Sponge Bob episode 10");
movies.Insert("Sponge Bob episode 4");

What we gonna do now is implement list insertion and deletion functions to show the main differences from arrays. Before the actual implementation, let’s discuss the types of list nodes. As mentioned above, a list node could be represented as a structure having two fields, one for the actual element’s value (the payload), and the other for linking to the next node. We will add another pointer to this structure, to make it able to keep records for both left and right neighbors.

struct ListNode
{
std::string item;
ListNode* next;
ListNode* prev;
};

Here’s how we most likely would illustrate such list node:

Simple illustration of doubly linked list node

The main issue with a list node is the additional pointer variables. Comparing with arrays, if we are storing 10 integers each taking 4 byte memory, an array would require just 10*4 bytes of memory to store the actual elements and an additional pointer to that memory’s starting address, which again, would take 4 or 8 bytes. So for storing 10 integers, we will need 44 bytes (or 48 bytes, depending on the pointer size). Before showing similar calculations for a doubly linked list, let’s assume a pointer will take 4 bytes of memory, i.e. we are operating in 32-bit system. Pointer takes 4 bytes of memory on 32-bit systems as it’s a variable, which allows to store address values, and an address value in 32-bit system has 32-bit length, i.e. 4 bytes. In case of working in a 64-bit system, a pointer would take 8 bytes of memory, as it would hold an address value, which length would be 64 bit, i.e. 8 bytes. For simplicity, we will assume pointers are 4 bytes. So, an array of 10 integers requires 44 bytes of memory. To store one integer in a doubly linked list node, we need 4 bytes for the ‘item’, and additional 8 bytes for ‘next’ (4 byte pointer) and ‘prev’ (4 byte pointer) pointers, which sums up to 12 bytes for just one node. And to store 10 such integers we will need 12*10 bytes = 120 bytes of memory (we’re ignoring additional ‘head’ or ‘tail’ pointers for now, also ignoring memory alignment issues).

A little details

Let’s make it even harder to imagine. Whenever we are storing two strings in a list like this:

struct ListNode
{
std::string item;
ListNode* next;
ListNode* prev;
ListNode() : next(nullptr), prev(nullptr) {} // meh
};
// .. other code
ListNode* node = new ListNode();
node->item = "Interstellar";
node->next = new ListNode();
node->next->item = "Sponge Bob";

We will picture that list as the left side of the following illustration, however, it will look really different in memory (the right side of the illustration). It is again, simplified, the real memory layout looks slightly different, but the illustration helps one to picture what’s going under the hood while declaring or creating linked list nodes.

More details, yet again, simplified

In a more professional approach, creating and operating with bare list nodes like we did above is not much accepted. Instead, a LinkedList class is suggested, which will hide all the implementation details (that’s why sometimes lists and other data structures are called ‘abstract data types’). We will show the bare minimum of a LinkedList class and will implement element insertion (with fancy illustrations) before we finally will reach Frankenstein’s list. So here’s what we are going to code:

See the additional pointers? Head and tail. We will use the tail pointer to insert or access list elements from the end, it also allows constant time insertion at the end. Head pointer is the main access point of the entire list.

template <typename T>
class LinkedList
{
protected:
struct Node
{
T item;
Node* next;
Node* prev;
Node() : next(nullptr), prev(nullptr) {}
Node(const T& i) : item(i), next(nullptr), prev(nullptr) {}
};
public:  
const T GetItemAt(int pos) const;
  void InsertAt(const T& elem, int pos);
void PushFront(const T& elem);
void PushBack(const T& elem);
  void RemoveAt(int pos);
  void Traverse(void (*visit)(const T&)) const;
private:
Node* head_;
Node* tail_;
int size_;
};

We have three functions for element insertion, to insert at the front, at the end and at any other position in the list. These three functions are emphasized just because we have to discuss three different cases of list element insertion. It’s because for each case we should take care for nodes’ next and prev pointers. Here’s an illustration of insertion at the front of the list:

As you can see, the most important thing while inserting at the front of the list is to properly update the ‘head’ pointer. Here’s a sample implementation:

void PushFront(const T& elem)
{
Node* node = new Node(elem);
if (head_) {
head_->prev = node;
} else {
tail_ = node;
}
node->next = head_;
head_ = node;
++size_;
}

Almost the same goes with the element insertion at the end:

void PushBack(const T& elem)
{
Node* node = new Node(elem);
if (!head_) {
head_ = node;
tail_ = node;
} else {
tail_->next = node;
node->prev = tail_;
tail_ = node;
}
++size_;
}

Final and even more interesting case is the insertion at any [valid] position of the list. Inserting at any position also involves inserting at the front or at back of the list, so we should check the corresponding cases to handle them properly.

Here’s a possible implementation:

void Insert(const T& elem, std::size_t pos)
{
if (pos == 0) {
return PushFront(elem);
}
if (pos >= size_) { // insert at the end even if pos is way bigger
return PushBack(elem);
}
Node* node = new Node(elem);
Node* cur = GetNodeAt_(pos);
node->next = cur;
node->prev = cur->prev;
cur->prev->next = node;
cur->prev = node;
++size_;
}

We used a GetNodeAt_ private function (private as it returns a Node* by its position), here’s its implementation:

Node* GetNodeAt_(std::size_t pos) const
{
// be careful, there might be a bug
std::size_t count = 0;
Node* cur = nullptr;
if (pos > size_ / 2) {
cur = tail_;
while (count != (size_ - pos)) {
cur = cur->prev;
++count;
}
} else {
cur = head_;
while (count != pos) {
cur = cur->next;
++count;
}
}
return cur;
}

As we store both head and tail pointers, we can halve the number of node checks based on the value of the position. If it’s bigger than the half of the list elements, than we could reach to it from tail, otherwise from head.

What about node removal? Almost the same as insertion, deletion of a node has three special cases, from the front, from the end and deletion of a node at any position. We won’t bring detailed illustrations or even a source code samples for deletion and the rest of the operations for simple doubly linked list. Try to come up with own implementation for a simple doubly linked list, while we will meet the Frankenstein’s monster!


Self-organizing list vs sorted list

Linked list operations allow some better performance comparing to arrays. For example the PushFront() function inserts a new item into list in constant time, meaning it does the same amount of work each time, despite the number of already existing elements in the list, e.g. has no dependency on N (magic letter used to identify the variable amount of data count in data structures and algorithms). To insert a new element at the front of an array, we have to move N already existing array elements to make a space for that new element. This makes new element insertion a very expensive operation for an array (we’re talking about the front insertion). But this doesn’t mean that lists are fast. That’s why programmers are trying to improve simple lists to achieve more horsepower. One of such improvements is self-organizing list. The idea is very simple, whenever client code accesses an element, that element advances its position (moves closer to the front), which makes it to be accessed faster the next time, and even faster after that.

And sorted list is a list which keeps its elements in a sorted order. By combining these two ideas, we are going to make a monster linked list, as promised in the beginning of this tutorial.

First, let’s look at the node of self-organizing sorted linked list (we gonna call it a Frankenstein’s list from now on):

struct Node
{
T item;
Node* next;
Node* prev;
Node* greater;
Node* lesser;
};

Two new pointers appeared, greater and lesser. These new pointers allow to traverse the list in a sorted order, both ascending and descending, that’s why there are two pointers ( greater for ascending order and lesser for descending). Here’s a illustration of a monstrous node:

Frankenstein list node

Having two more pointers for ascending and descending order of nodes also means we should have two more pointers as ‘head’ and ‘tail’ pointers for our entire list, i.e. a head pointer that points to the smallest node and a tail pointer that points to the biggest node. And these nodes have nothing to do with our ‘head’ and ‘tail’ pointers for accessing list nodes as usual (as we did it previously for doubly linked list). So, a class definition for Frankenstein’s list may look like the following:

template <typename T>
class FrankensteinList
{
protected:
struct Node
{
T item;
Node* next;
Node* prev;
Node* greater;
Node* lesser;
// assume a constructor initializing pointers to nullptr
};
public:
void Insert(const T& elem);
// other methods
// ...
private:
Node* head_; // same as for simple doubly linked list
Node* tail_; // same as before
Node* asc_head_; // access nodes in ascending order
Node* desc_head_; // access nodes in descending order
std::size_t size_;
};

Two new pointers, asc_head_ and desc_head_ , each of which points to corresponding node in proper order. Let’s picture a Frankenstein list consisting of integers 5, -3, 1, 22, 0, 8. While the head pointer points to the first node, i.e. 5, and the tail pointer points to the last node, 8, ascending head pointer must point to a node having minimum value among others. And in a similar manner, descending head pointer points to node with maximum value. Here’s an illustration, keep in mind, nodes have been “back inserted” in this order 5, -3, 1, 22, 0, 8.

Frankenstein’s list

This is the one of few cases where illustration is harder to understand than the actual code. Patrick’s first reaction after seeing this:

List, Patrick, Frankenstein’s list

Anyway, if we try to examine the list we’ll see that the whole chaos in the illustration is because of lesser and greater pointers. Try to find the asc_head pointer, it points to the minimum node in the list. If we try to go through greater pointer (in blue) of that node, we will reach to the next minimum node, which is greater than previous one, but is lesser than all other nodes in the list. So, greater pointer allows us to traverse through list nodes in ascending order. In the contrary, to traverse the list nodes in descending order, we will start from desc_head pointer and continue through lesser pointers. In the illustration above, desc_head points to 22, the biggest node in the list, and 22’s lesser pointer points to 8, the next biggest node, which is less than 22 but is greater than all other nodes in the list. For Patrick’s sake, let’s depict a much simpler example with 3 nodes.

Frankenstein’s list with three nodes

This is much better. So what’s the purpose of having such a monster? Well, first of all we should admit that the memory overhead is too much, because we store 4 pointers with a single data item, e.g. if we store an integer (4 bytes) in a Frankenstein’s list node it will take additional 16 bytes of memory just to save its next/prev and greater/smaller pointers, and overall a 20 byte of memory for one integer node. In a simplest case it’ll take ~2Kb to save 100 integers. In the contrary, an array will take ~0.4Kb. However, in real life we operate with big enough objects and sometimes not caring for small memory overheads over efficient data lookups is considered a good practice. Say, we have to save a full movie data, its title, release date, directors’ names, actors names, movie posters, quotes from the movie, URLs to it’s home page/trailer video/discussion threads/etc. And let’s assume that a single movie object will require ~1Mb of memory, so to save 100 movie objects in an array, we’d need exactly 100Mb memory + 4 byte pointer to array. To save 100 movie objects in Frankenstein’s list, we’d need 100Mb + 1.6Kb (we again assume that we operate in a 32-bit system). Overhead is arguably acceptable for big enough objects. The good part with Frankenstein’s list is that we always have its elements in sorted order, both ascending and descending. Imagine you put a large such list in a data grid, or some table, and viewing the items in a sorted order will run really smoothly.

The next big thing with Frankenstein’s list is its self-organizing nature. As mentioned earlier, the idea behind a self-organizing list is simple, whenever a node is being accessed, we advance its position to the front. This makes accessing the same element faster the next time. Let’s illustrate this idea for a simple list (for the sake of brevity we’ll omit greater/lesser pointers in the illustration below):

Self-organizing list

The illustration above shows how a node advances its position when it is being accessed. By advancing its position it makes faster to access the same element the next time. You can call it an “in-place caching”.

As always, there are many ways to implement a self-organizing list, one of them is keeping an access counter, e.g. recording how many times a particular element had been accessed and moving it to appropriate position. Another way to implement it is moving an accessed element right to the front, making it the first element in the list. The approach illustrated above makes a single swap with accessed node’s left node (the one closer to the front). So a Frankenstein’s list is a list with which self-organizes itself and keeps nodes’ sorted order. Here comes the code.

You can find the full source code here (a bunch of good tests needed though).

Let’s just discuss the interesting parts, node insertion. Again, insertion has three cases, at the front, at the back and at any other position.

void InsertAt(const T& item, const std::size_t pos)
{
if (pos == 0) {
return PushFront(item);
}
if (pos >= size_) { // you've seen this before
return PushBack(item);
}
Node* node = new Node(item);
std::size_t count = 0;
Node* cur = head_;
while (count != pos) {
++count;
cur = cur->next;
}
cur->prev->next = node;
node->prev = cur->prev;
cur->prev = node;
node->next = cur;
++size_;
  PutInSortedOrder_(node);
}

The most interesting part of the code above is the last line: PutInSortedOrder_(node). This private function makes sure to update the list’s sorted order by placing this new node in the right position:

void PutInSortedOrder_(Node* node)
{
if (!node) return;
if (!asc_head_) {
asc_head_ = node;
desc_head_ = node;
return;
}
  Node* cur = asc_head_;
while (cur->item < node->item && cur->greater) {
cur = cur->greater;
}
if (cur->item < node->item) {
node->greater = cur->greater;
cur->greater = node;
node->lesser = cur;
if (!node->greater) {
desc_head_ = node;
}
} else {
node->lesser = cur->lesser;
if (cur->lesser) {
cur->lesser->greater = node;
} else {
asc_head_ = node;
}
node->greater = cur;
cur->lesser = node;
}
}

The code above is not an elegant implementation, but it shows how do we deal with new node insertion. At first we should just insert the new node as before, by properly updating its and its neighbors’ prev and next pointers. After that, we should take care of the list’s overall sorted order, by updating new node’s and its “sorted” neighbors’ greater and lesser pointers, respectively. Node removal is similar to node removal of a doubly linked list, with a little quirk of taking care of the sorted order. Find a sample implementation here.

Finally, we should worry about Frankenstein’s list self-organizing nature.

const T GetAt(const std::size_t pos)
{
if (pos >= size_) {
throw std::out_of_range("Invalid position");
}
Node* cur = head_;
std::size_t count = 0;
while (count < pos) {
cur = cur->next;
++count;
}
AdvanceNode_(cur);
return cur->item;
}

Pay attention to the AdvanceNode_() function. It has been called just before returning the required item at corresponding position. AdvanceNode_() is a private function whose only purpose is to swap itself with the left neighbor. For the sake of brevity we won’t bring its implementation here, instead you can find it and other methods’ implementations at the same place.


Please leave a comment if you find any issue with the code or illustrations.