Software Engineering Must Know: Data Structures

Part 1

I’m going back to basics and brushing up on my data structures and algorithms. It’s essential to know these basic data structures, how to implement them and when to use them. If you are interviewing at any large tech company the following will be required knowledge.

Just as important is knowing the time and space complexity of each. I did not explain how to calculate time and space complexity as that is its own topic. You can easily Google “Big O notation” to learn more.

Part 1 covers data structures, a forthcoming part 2 will cover essential algorithms.

Array

Types of arrays

  • Static (fixed size set at declaration)
  • Dynamic (size not set at declaration) aka ArrayList in some languages (Java, C#…)
  • 1..n dimensions

Time Complexity

Dynamic Array Insertion (at end): O(1). Why O(1)? The array doubles in size every time it is full, and this happens half as frequently at each doubling. Here is how you work this out: If the array doubled 4 times from the original size and currently has size N and you work backwards on the number of elements copied at each doubling, then the total elements copied are N/2 + N/4 + N/8 + N/16 which is slightly less than N elements copied. If N total elements were copied, then insertion time for a single element is N/N which equals 1. True, when the array doubles in size the runtime is N for that insertion, but the amortized runtime is always 1.

Insert anywhere besides end: O(n). Inserting in the middle of an array requires copying N elements one index right.

Deletion: O(n). See below.

Index lookup: O(1). Because arrays store elements in contiguous memory, an element retrieved is 0th element address location + index requested, which is a pointer to memory and very fast.

Iteration: O(1) (to next element).

Search (unsorted array): O(n). See below.

Search (sorted array): O(log n). See below.

When to use an Array

  • Retrieving data by index.
  • Often iterating through many/all elements. Arrays are optimized such that elements are contiguous in memory, thus iteration is simply moving a pointer to the next address in memory. This reduces the chances of a cache miss.

When to avoid

  • Searching. An unsorted array would have O(n) search runtime, although a sorted array would have O(log n) if sorted in ascending order.
  • Deleting. To delete an element you must first locate the element then shift all elements to the right of the element left. If you were to delete the 0th element then you would have to shift N elements left. Thus, deletion runtime is O(n).

HashTable

Time Complexity

Insert: O(1). A reasonable implementation is to hash the key, usually to an integer, which corresponds to a location in an array, which is turn a pointer to a memory location. A rehashing operation is performed when the underlying array runs out of room, this will suffer from the “doubling and copying” problem of an array, but amortized runtime is still O(1).

Note: Hash collisions may occur if the hash function for more than one key return the same value. As a result the HashTable must then iterate over all keys at that memory location which is slow — worst case O(n).

Delete: O(1). Worst case is O(n) if the hash function results in high collision and a lookup requires traversing all keys.

Key lookup: O(1). Also suffers from O(n) worst case lookup the same as Delete.

Iteration: (to next element) O(n) or O(1). This depends on the underlying implementation.

Search: O(1). This is the same as a key lookup, but could suffer from O(n) worst case lookup the same as Delete.

When to use a HashTable

  • Search. If you frequently need to retrieve values by key.
  • Insertion.
  • Deletion.

When to avoid a HashTable

  • Low latency read/write. Because a HashTable has nodes stored in non-contiguous memory cache misses could be frequent.
  • Extremely large data sets as cache misses will be frequent.

Linked List

Types of Linked Lists

  • Singly linked
  • Doubly linked

Sample singly linked list implementation in JavaScript:

Doubly linked lists will have both a next and previous property on nodes. This allows traversing forward and backward.

Time Complexity

Insert (add method): O(1). Whether a node is added to the head or tail of a linked list, time complexity is 1.

Delete: O(n) (based on the above implementation). The implementation of delete can vary, and there are certainly optimizations to avoid iterating over every node of the list to find the node to delete.

Index lookup: O(n). Requires iterating through N nodes to find desired node.

Iteration: O(1) (to next element). Note that while iterating to the next element is O(1) just like an array, a linked list iteration may not be as fast since the next node memory location may not be in memory and would result in a slow cache miss.

Search: O(n). Requires iterating over all elements from head. Could be optimized to O(log n) if the linked list was also sorted and doubly linked.

When to use a Linked List

  • Insertion. Insertion is always O(1). While arrays are also O(1) insertion, Linked Lists never have the “doubling and copying” problem of arrays.
  • Deletion. The sample above is not optimized for most deletes. Imagine, however, that deleting a node simply updated that nodes next to node.next.next. You could bypass the iteration entirely and achieve O(1) deletion, except for deleting the tail which would require further work to optimize deletion.
  • Stack Implementation
  • Queue Implementation

When to avoid a Linked List

  • Node lookup by index.
  • Searching.

Stack

How stacks are implemented

Stacks are Last In First Out (LIFO) data structures. A Linked List can be used as the underlying data structure for a stack. Stacks commonly have the methods Push, Pop, Peek, IsEmpty that can operate on an underlying Linked List. The Big O values below are based on a Linked List implementation.

Time Complexity

Push: O(1). Pushing is simply pointing the Linked List head to the new node.

Pop: O(1). Popping is accessing and deleting the Linked List head.

Peek: O(1). Peeking is accessing and returning the Linked List head.

IsEmpty: O(1). If Linked List head is not null, the stack is not empty.

When to use a Stack

When you need to access data in Last In First Out order. An example would be a stack of cards in a game, the last card dealt is on top, and it’s the first card you pick up.

When to avoid a Stack

There is no specific reason for not using a stack, rather the decision to use a stack is algorithmic as opposed to performance reasons.

Queue

How queues are implemented

Queues are First In First Out (FIFO) data structures. A Linked List can be used as the underlying data structure for a queue. Queues commonly have the methods Enqueue, Dequeue, Peek, IsEmpty that can operate on an underlying Linked List. The Big O values below are based on a Linked List implementation.

Time Complexity

Enqueue: O(1). Adding an item to a queue is a matter of updating the tail of the Linked List.

Dequeue: O(1). Accesses and deletes the Linked List head.

Peek: O(1). Accesses the Linked List head.

IsEmpty: O(1). If the Linked List head is not null, the Queue is not empty.

When to use a Queue

The same as a Stack, the decision to use a queue is algorithmic, not for performance reasons. An example of a Queue would be bids for an auction, bids are processed in FIFO order.

When to avoid a Queue

Again, the same as a Stack, there is no reason to not use a queue, unless a FIFO data structure is not needed.

Graph

A graph is a collection of nodes (aka vertices) and lines (aka edges). A graph can be directed (the lines have direction) or undirected (the lines do not have direction).

Types of Graphs

  • Directed/Undirected (lines between nodes do or do not have direction)
  • Cyclic/Acyclic (a route can be formed from a node back to the same node)

Additionally, graph edges may contain weight, value and/or cost. To do this you could store an object on the edge such as {w: 2, v: 15, c: 2.7} instead of a boolean.

There are two ways to represent a graph: Adjacency list (AL), Adjacency Matrix (AM)

Space Complexity: AL is more space efficient than AM in a sparsely populated graph. That is, a graph with few edges. Eventually with enough edges the AM will become more space efficient because an edge can be stored in a single bit. Determining when the AM becomes more space efficient is dependent on edge density and the amount of space that an AL edge takes. In general, use an adjacency list when the graph is sparsely populated, and an adjacency matrix when the graph is densely populated.

Consider the following scenario, assuming edges store two references: A graph with 40 nodes has 40² possible edges. On a 64 bit system an AL edge takes 64*2 (references for startNode and endNode)=16 bytes. 6 edges take 6*16 bytes = 96 bytes. Thus, with a 40 node graph, space complexity is favorable with an AM with 7 or more edges.
Sample Directed Graph (also is acyclic)

Adjacency List Implementation

Time Complexity

Insertion (add vertex or edge): O(V).

Deletion (vertex): O(V).

Deletion (edge): O(V + e). All vertexes + all edges for the given vertex.

Determine if edge exists from vertex m -> n: O(m). Because a given vertex could have edges to all nodes and you would need to iterate over all edges to find n.

Breadth First Search (BFS): O(V + E). Each vertex must be visited once, and for each vertex, each edge must be visited once.

Depth First Search (DFS): O(V + E). Again, each vertex and edge must be visited once.

Space Complexity

O(V + E) where V = vertices, E = edges.

Adjacency Matrix Implementation

The above graph in an adjacency matrix representation in JavaScript:

Time Complexity

Insertion (vertex): O(V*2)

Insertion (edge): O(1)

Deletion (vertex): O(V²)

Deletion (edge): O(1)

Determine if edge exists from vertex m -> n: O(1).

Breadth First Search (BFS): O(V²)

Depth First Search (DFS): O(V²)

Space Complexity: O(V²)

When to use a Graph

Decision is based on requirements, not speed. If you need to record edges between nodes, you need a graph.

When to avoid a Graph

n/a

Tree

Trees are a subset of graphs with no cycles.

Types of Trees

  • Binary Tree: each node has 0, 1 or 2 child nodes
  • Complete Binary Tree: Every level is filled except perhaps the final level, and the final level is filled left to right.
  • Full Binary Tree: Every node has 0 or 2 descendents.
  • Perfect Binary Tree: Every node except leaf nodes have 2 descendents, and every level is fully filled.
  • Binary Search Tree: left descendents ≤ node < right decendents
  • Balanced: “mostly balanced” in that the depth delta of any tree path is not greater than 1.
  • Unbalanced: depth delta of any tree path has no limit.
  • Binary heaps: A heap is a complete binary tree where each nodes children are smaller than the parent node (max heap) or each nodes children are larger than the parent node (min heap). Heaps are useful for a number of things including heap sorting, which I will cover in part two of this series on algorithms.
    Trie (Prefix Tree): useful for storing an entire language for auto complete. Each node can have up to ALPHABET_SIZE + 1 (for *) children to choose from next.

When to use Trees

Binary Search Trees (BST) are especially useful for searching as they have O(log n) search time.

BST’s may be traversed in-order, pre-order or post-order. recursive pseudo code for each to demonstrate:

function traverseInOrder(n)
traverseInOrder(n.left)
visit (n)
traverseInOrder(n.right)
function traversePreOrder(n)
visit(n)
traversePreOrder(n.left)
traversePreOrder(n.right)
function traversePostOrder(n)
traversePostOrder(n.left)
traversePostOrder(n.right)
visit(n)