Better Python Programming for Data Scientists, Part 3: Data Structures
In Parts 1 and 2 of this series, we explored Python fundamentals, including the built-in data types.
In part 3, we are going to look at data structures. A data structure is a collection of data elements organized in a particular format that supports basic operations for accessing those elements. Some of the Python built-in data types like lists, tuples, sets, and dictionaries are data structures, as are the commonly used Pandas data frames and series, and Numpy arrays.
In this post we are going to look at the broader spectrum of data structures used in computer science and how they are implemented in Python. You’ll see how the data structures commonly used in data science fit into the bigger picture. There are other, more advanced data structures not covered here. For those, I refer you to the Geeks For Geeks overview of the subject.
As you learn about data structures, one thing to keep in mind is the difference between a data structure and an abstract data type. A data structure is a specific format used to store data. An abstract data type is a mathematical model that has particular behavior and properties and can be implemented by data structures, or in some cases by the simpler built-in data types such as integers.
Array
An array is an indexed sequence of values that all have the same data type. These elements are stored in adjacent memory locations and consist of an (index, element) tuple. Because each element is the same data type (for example a 32-bit integer), the amount of memory required for each element is known in advance and the memory location can be calculated as start location + index * memory size. In many programming languages, the size and data type of an array must be specified when it is created in order to allocate the correct amount of adjacent memory locations. When all of the memory is used, depending on the implementation the array may be dynamically resized by copying it to a new, larger array.
As you have probably guessed, Numpy’s array data structure is implemented as the standard CS array. Pandas series are also arrays with labeled axes, and an “object” data type if they contain mixed data.
Arrays are also very similar to Python lists, but they are not exactly the same since a list does not require all of the elements to have the same data type. Under the hood, lists are implemented as an array of pointers to memory locations that contain the actual elements, which is how it supports multiple data types. Laurent Luce, one of the creators of Python, provides an explanation of how this works here if you want to read more about it.
Array operations:
- Traverse: iterate through the array element by element.
- Insert: add a new element at the given index.
- Delete: remove the element at the given index.
- Search: find the element by index or the index by the element value.
- Update: modify the element at the given index.
Because arrays are stored in adjacent memory locations, accessing the elements in an array by index (sometimes called peeking) is a constant-time operation.
However, this requirement for adjacent locations is a drawback of the data structure if the array is large.
Linked List
Linked lists solve the problem of memory allocation by storing the location of the next element in the list with the current one. That is, a linked list is a sequence of nodes consisting of (element, next location) pairs. The head, or location of the first node, is explicitly stored as a property of the linked list, and the last node has Null as the next location. Since the location of an item cannot be calculated by index, the elements cannot be directly accessed by index as in an array. Instead, the linked list must be traversed until the target element is reached (or in the case of insertion, the last node is reached) unless the location of the node is already known. However, there is no limit on the size of a linked list, no need to specify the size in advance, and no need to copy data to resize it.
A linked list has the same operations as an array, but with some differences in implementation:
- Traverse: iterate through the linked list starting at the head and accessing each node based on the location pointer stored in the previous one.
- Insert: nodes can be added at the beginning by changing the stored head and pointing to the previous head in the new node. Alternatively, they can be added at the end by pointing the last node to the new one, or in the middle by swapping the stored locations as shown in the example below.
- Delete: nodes can be removed through the inversion of the insert operations.
- Search and Update: the nodes must be traversed to find the target element.
Here we have an example linked list implementation with insert operations:
If you find the syntax confusing, remember that when an object is stored in a variable (like self.head
), the variable is actually storing a pointer to the object’s memory location. Here we are not overwriting objects, we are just changing the location pointer.
A doubly linked list addresses some of the inefficiency of a linked list by storing both the previous and next locations so that the list can be traversed in either direction.
Arrays and linked lists serve as core data structures that other data structures are implemented with.
Bag
A bag is an unordered collection to which new elements can be added but existing ones cannot be removed. They allow duplicates, and because of this are sometimes called multisets. The collection can be traversed, and individual elements accessed by searching.
To understand the concept behind this data structure, consider the example of a running standard deviation calculation. A running mean can be computed without storing individual values since you just need to update their sum and quantity to calculate the current mean. However, a running standard deviation requires storing the values since for each of them the difference between the value and the current mean must be calculated. We don’t want to remove any of these values at any point, and we don’t care about their order, we just need to store them and be able to iterate over them.
Bags can be implemented with arrays or linked lists.
Stacks and Queues
A stack is an ordered collection to which elements can be added and removed following a Last In First Out (LIFO) ordering. That is, the most recently added item is operated on first, like a stack of plates in the cupboard. The two key operations of a stack are:
- Push: add a new item to the top of the stack.
- Pop: remove an item from the top of the stack.
A queue is an ordered collection in which items are added and removed in a First In First Out (FIFO) ordering. That is, items are removed in the order in which they were added, like how lines (at least in theory) work in real life. The operations for a queue are:
- Push: add a new item to the back of the queue.
- Pop: remove the item from the front of the queue
Stacks and queues are linear and can be implemented with either arrays or linked lists. They are also available in the Python collections module through the deque
(“deck”) class. This is a left-to-right ordered generalized implementation that can push or pop from either direction.
Here we have an example of working with the deque class:
Binary Tree
A binary tree is a tree in which each node has at most two child nodes (usually called the left and right child). In concept, they are somewhat similar to a doubly linked list in that they have as properties the data of the node and two pointers, one to each child. Some of the key concepts for binary trees are:
- Root: the top-level node that does not have a parent node.
- Leaf: a terminal node.
- Depth/height, h: how many levels the tree has. A tree consisting of a root with no leaves has a height of 0.
- Full: a tree in which every node has either 0 or 2 child nodes. A full tree has at least 2h + 1 nodes and at most 2^(h+1) -1.
- Perfect: a tree in which every interior node has 2 children and all leaves are at the same depth/level.
- Complete: a tree in which every level except the last is full, and the last-level nodes are all as far to the left as possible; a perfect tree is complete but a complete tree does not have to be perfect or full
- Balanced: the left and right branches of every node differ in height by at most 1
- Degenerate: every parent has at most 1 child.
The operations of a tree are:
- Traversal: this can be depth-first (going all the way to the terminal leaf of a branch before visiting the next branch) or breadth-first (visiting the closest nodes first).
- Insertion: new nodes can easily be inserted as a child of a leaf, or as an internal by updating the pointers of the existing nodes.
- Deletion: a node with no children can easily be deleted, and if it has one child, the child can become a child of the removed node’s parent. Depending on the implementation, algorithms exist for rearranging the tree if a deleted node has two children.
For implementation, binary trees can be created either with a node class and pointer variables similar to our linked list example, or they can be implemented as an array in breadth-first order. In this implementation, the node at index i
has children at 2i + 1
and 2i + 2
, while its parent is at floor((i-1)/2)
. A breadth-first array is more compact since the pointers to the child nodes are calculated rather than explicitly stored, but it is more expensive to grow and can waste space.
Just as arrays and linked lists underlie more complex linear data structures, binary trees are used to implement non-linear ones.
Binary Search Tree
Binary Search Trees (BSTs) are a special type of binary tree. They are ordered so that at each internal node, the values in the left branch are less than the node value, and the values in the right branch are greater than the node value. This data structure is useful for fast search and information retrieval because the time complexity is O(h). How well it performs depends on how balanced the tree is, but on average operations performed with a BST are O(log n).
A BST that automatically keeps its height optimal when items are inserted or deleted is known as a self-balancing binary search tree.
Heap
Heaps are another special type of binary tree. There are two types: min-heaps and max-heaps. Like BSTs, these are sorted trees, but each node is either greater than (for max-heap) or less than (for min-heap) its child values. This property is recursive throughout the tree. Insertion and deletion operations require maintaining the ordering by swapping values until order has been restored in a process known as heapify.
Heaps are complete binary trees and can be implemented as breadth-first arrays. They are used to implement priority queues, in which the “priority” of the element based on its value rather than insertion order determines the sequence in which it leaves the queue. Unlike BSTs, heaps do not guarantee that everything on one side of the tree will be greater than or less than everything on the other side, the recursive ordering applies along the full depth of a branch from root to leaf. The priority ordering means that the min or max of the collection of values can be identified in constant time.
Binary trees, BSTs and heaps are not included in the Python standard library but there are various libraries you can install to leverage them if you don’t want to implement the data structures yourself.
Hash Table
A hash table, also known as a dictionary, is a data structure in which a hash function is used to compute an index for a bucket to store a key-value pair.
If you’re thinking, Wait a minute, a dictionary… That’s right! We finally have a data structure that is one of the Python built-in data types. Additionally, Pandas data frames can be thought of as dictionaries of series (labeled arrays).
Because the index for a pair is computed from the key, there is the chance that two different keys could yield the same hash value. This is known as a collision, and there are two main strategies for handling it:
- Separate chaining: each index contains a pointer to a linked list (or a self-balancing BST for ordered storage). As collisions happen, the new items are added to the linked list, which must be traversed to find an individual key-value pair when the pair needs to be accessed.
- Open addressing: each index contains a pointer to the actual element. When a collision happens, a sequence is followed to find the next open slot in a process known as probing. For example, in a linear probe the interval between slots to check is fixed, and usually 1.
Open addressing is more compact than separate chaining, but as the example above shows, can be inefficient for finding elements if there are many collisions.
Aside from the built-in dictionary data type, hash tables can also be used in machine learning algorithms such as locality sensitive hashing, which is used to group similar items and is one algorithm for Approximate Nearest Neighbors problems. You can see an example use of this for multi-armed bandits in the MABWiser library (documentation, source code). It can also be used in natural language processing for use cases like identifying similar word embeddings for translation or entire web pages for filtering pages that are too similar from search results.
A Note on Associative Arrays
An associative array is the more general abstract data type also known as a map, a symbol table, or a dictionary. They store key-value pairs, and under the hood can be implemented with hash tables, linked lists, or binary trees, with hash tables being the most common. As previously mentioned, Python 3.6+ uses a hash table for its dictionary data type. If you want to learn more, the Wikipedia article on it provides a thorough overview of the different implementation methods.
Matrix and Sparse Matrix
When referring to a matrix as a data structure, it is a special type of two-dimensional array, or an array of arrays, that is noteworthy as the elements must all be the same size. That is, each array within the array represents either a row with the same number of columns (row-major order), or a column with the same number of rows (column-major order).
Sparse matrices, however, represent a unique problem. Because arrays are a fixed size with the memory allocated at creation, they can be very inefficient for sparse data. Various solutions involving other data structures have been developed to solve this:
- Dictionary of keys (DOK): each key is a (row, column) pair.
- List of lists (LIL): each row is a sorted list of column index and value pairs.
- Coordinate list (COO): a sorted list in which each element is a (row, column, value) tuple, ordered by row then column.
- Compressed sparse row (CSR): three one-dimensional arrays store the values, the corresponding column indices of those values, and the starting and ending slice indices for the row data in the value and column arrays.
- Compressed sparse column (CSC): similar to CSR but tracks column slice indices rather than rows.
To give an example of the CSR format, consider the matrix:
[[ 1, 0, 0, 0][0, 0, 2, 0,][3, 4, 0, 0]]
We would have the arrays:
- values = [1, 2, 3, 4]
- column_index= [0, 2, 0, 1]
- row_index = [0, 1, 2, 4]
This translate to row 1 is values[0:1] and column_index[0:1], row 2 is values[1:2] and column_index[1:2], and row 3 is values[2:3] and column_index[2:4]. This format is faster for accessing the row data and performing calculations. Scipy provides implementations of each of these formats, but scikit-learn classes that return sparse matrices such as the TF-IDF Vectorizer default to using the CSR format.
Graphs
The last stop on our tour of data structures is the graph. Graphs are both abstract data types and data structures. They consist of a set of vertices or nodes and the edges that connect them. Edges represent the relationships between nodes, and they can be directed or undirected. Edges can have a value, called the cost or weight, associated with them.
When using a graph to represent data, often (though certainly not always) they can be conceptualized as the nodes capturing the nouns of the data while the edges are the verbs. For example, in the graph of a social network website, each person would be a node. An undirected edge might represent a mutual relationship such as being friends or some other reciprocal connection, and a directed edge might represent following or subscribing to a person’s posts. That is A --> B translates to “A follows B”.
Another common example for graphs is the game of Six Degrees of Kevin Bacon (https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon). In this game, a random Hollywood celebrity is selected, and through films they have acted in, directed, or produced (our directed edges) they can be connected to Kevin Bacon through up to 5 other celebrity-nodes (where Kevin Bacon is the 6th node in the “6 degrees).
The primary operations of a graph data structure is to report the list of neighbors of a vertex as well as any stored properties such as the cost and directionality.
There are three types of data structures commonly used to implement graphs:
- Adjacency list: each vertex is stored as an object that contains a list of adjacent vertices. Additional data can be stored in the object as well. If edges are also objects, they can store additional data as well, and each vertex contains a list of edges while each edge contains a pair of vertices. Specific implementations include an array of linked lists or a hash table.
- Adjacency matrix: a two-dimensional matrix where the rows are source vertices and columns are destination vertices. Any additional data must be externally stored. The value stored can be the cost of the edge. The cost or directionality can be stored as the value in the matrix, with -1 for leaving and 1 for entering.
- Incidence matrix: a two-dimensional matrix where the rows are vertices and the columns are edges. The cost or directionality can be stored as the value in the matrix, with -1 for leaving and 1 for entering.
If we consider the above undirected graph, these representations would be:
Adjacency list
0: {1, 3, 4}1: {0}2: {4, 5}3: {0, 4}4: {0, 2, 3, 5}5: {2, 4}
Adjacency matrix
Node 0 1 2 3 4 50 0 1 0 1 1 01 1 0 0 0 0 02 0 0 0 0 1 13 1 0 0 0 1 0 4 1 0 1 1 0 15 0 0 1 0 1 0
Note the symmetry of the adjacency matrix for an undirected graph.
Incidence matrix
Node E1 E2 E3 E4 E5 E6 E70 1 1 1 0 0 0 01 1 0 0 0 0 0 02 0 0 0 0 1 0 13 0 1 0 1 0 0 04 0 0 1 1 1 1 05 0 0 0 0 0 1 1
Which data structure representation is best depends on how densely connected the network is and what data needs to be stored with the relationships. Adjacency lists are often used for sparse graphs, while adjacency matrices are often used for dense ones. As we can see in the example above for an incidence matrix the number of columns grows with the number of edges making it less suitable for a dense graph, while for a sparse graph there would be the inefficiency of a sparse matrix.
Knowledge graphs have recently emerged as a useful way of organizing information for machine learning. Check out this post from the Standford AI Lab blog for an introduction to the subject.
Graphs are not directly available in standard Python, but data structures such as an adjacency list or matrix are simple to implement and graph databases such as Neo4j have Python APIs.
Conclusion
Data structures is one of the subjects that is foundational in computer science but is easy to skip if you enter data science from another field. However, if you need to write a lot of custom functionality such as creating a library or a complex pipeline, choosing the right data structures for the problem can make your code simpler and more efficient. This post has given an overview of the fundamental data structures you would learn about in an undergraduate computer science course.
If you want to learn more about these data structures and their use in algorithms, the book site for Algorithms by Robert Sedgewick and Kevin Wayne contains a condensed version of the text that is an excellent resource and the corresponding course is available for free on Coursera (Part 1 | Part 2). The book and course use Java but figuring out how to implement the data structures and algorithms in Python from the Java examples is a good exercise to help you master them.
Stay tuned for the rest of the Better Python Programming for Data Scientists series!
Python Fundamentals | More Python Fundamentals | Data Structures | Object-Oriented Programming 1 | Object-Oriented Programming 2 | Parallelization | Strategies and Tools | Books