Data Structures You Need NOW for Data Science and Machine Learning Algorithms

Rebecca Camejo · Geek Culture · Jul 26, 2021 · 9 min read

In the last couple of years, I have noticed an incredible surge in the number of students pursuing data science and machine learning skills, yet many of them only know Python packages. That is not true machine learning; you must know the theory. But even theory is not enough. You also need a good working knowledge of data structures. Maybe you’re wondering: where do I start? What do I need most? To be specific, I will focus on the data structures I have used the most while programming machine learning algorithms in Python.

Well first, you need to know the basics. There are two different types of data structures: linear and non-linear.

Image by GeeksforGeeks

Linear Data Structures

Definition: A type of data structure that arranges data items sequentially, with each element stored adjacent to the next.

Array

An array is the most basic and common data structure around town. You will use arrays constantly in machine learning, whether it’s:

  • Turning a column of a Pandas DataFrame into a list for preprocessing or analysis
  • Using an array of tuples to order the frequency of words present in a dataset
  • Using a list of tokenized words to begin clustering topics
  • Creating multi-dimensional matrices for word embeddings
  • and more…!

Each element can be uniquely identified by its index in the array. The lowest index, arr[0], corresponds to the first element, and the highest index corresponds to the last element.

Python has a set of built-in methods that you can use on lists/arrays.

Now, a Python array is a little bit different from arrays in other programming languages. Python ‘lists’ offer more flexibility than fixed-type arrays because they can contain different types of data and their length can vary. If you are programming your machine learning algorithms in Python, I highly recommend starting off by becoming extremely comfortable with arrays.
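To make this concrete, here is a minimal sketch of those list operations using pandas and NumPy; the column name, the word counts, and the embedding size are made up for illustration:

# A minimal sketch of common list/array operations in an ML workflow.
# The column name "text_length" and the word counts are made-up examples.
import numpy as np
import pandas as pd

df = pd.DataFrame({"text_length": [120, 87, 240, 15]})

# Turn a DataFrame column into a plain Python list for preprocessing
lengths = df["text_length"].tolist()

# Built-in list methods
lengths.append(64)          # add an element to the end
lengths.sort(reverse=True)  # sort in place, largest first
print(lengths[0])           # index access: the largest value

# An array of (word, frequency) tuples, ordered by frequency
word_counts = [("model", 42), ("data", 37), ("graph", 12)]
word_counts.sort(key=lambda pair: pair[1], reverse=True)

# A multi-dimensional matrix (e.g. toy word embeddings) as a NumPy array
embeddings = np.random.rand(3, 50)   # 3 words, 50 dimensions
print(embeddings.shape)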

Stacks and Queues

Stacks are used to program your undo and redo buttons on the computer because they function like a stack of books. It makes no sense to add a book to the bottom of the stack (the first element); you can only access the most recent one that has been added. Addition and removal occur at the top of the stack. Think of it as last in, first out (LIFO).

Queues work differently: they are a first in, first out (FIFO) structure. Think of people waiting in line; first come, first served. However, the queue module in Python provides three main types: FIFO (Queue), LIFO (LifoQueue, i.e. a stack), and PriorityQueue. In a PriorityQueue, the elements are kept sorted and the lowest-valued element is retrieved first.

SimpleQueue() is unbounded, while Queue() can have an upper bound.

Queue objects (Queue, LifoQueue, SimpleQueue, or PriorityQueue) provide the following public methods (note that SimpleQueue does not implement full()); a short example follows the list:

  • Queue.qsize() — returns the size of the queue
  • Queue.empty() — returns True if the queue is empty, False otherwise
  • Queue.full() — returns True if the queue is full, False otherwise
  • Queue.put(item) — puts item in the Queue
  • Queue.get() — removes and returns an item from the queue
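Here is a quick sketch of these methods in action, using a bounded PriorityQueue with some arbitrary values:

# A minimal sketch of the public queue methods, using a PriorityQueue.
from queue import PriorityQueue

q = PriorityQueue(maxsize=3)   # the Queue-style classes accept an upper bound

q.put(5)
q.put(1)
q.put(3)

print(q.qsize())   # 3
print(q.full())    # True, we hit the maxsize bound
print(q.get())     # 1, the lowest-valued element comes out first
print(q.get())     # 3
print(q.get())     # 5
print(q.empty())   # True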

The most interesting thing about the queue module in Python is how it can be used for multithreading. This is so useful for machine learning because data collection, web-scraping, and common pre-processing tasks can be time-intensive. Comparing the different parallel paradigms (multithreading and multiprocessing) and choosing the right one for the job is what lets you maximize the efficiency of your machine learning pipeline.

Here is an example of how to use multithreading with Queues, where the function worker is run by 30 threads simultaneously.
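Below is a minimal sketch of that pattern: 30 threads each running worker, pulling tasks off a shared Queue. The numeric task IDs stand in for real work such as URLs to scrape or documents to pre-process.

# A minimal sketch of a multithreaded worker pool built on queue.Queue.
import threading
from queue import Queue

NUM_THREADS = 30

def worker(q: Queue) -> None:
    while True:
        task = q.get()          # blocks until an item is available
        if task is None:        # sentinel value: time to shut down
            q.task_done()
            break
        print(f"processing {task}")
        q.task_done()           # mark the task as finished

q = Queue()
for task_id in range(100):
    q.put(task_id)

threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

q.join()                        # wait until every queued task has been processed
for _ in threads:
    q.put(None)                 # one shutdown sentinel per thread
for t in threads:
    t.join()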

Queues are a great resource in machine learning for:

  • Curating a list of websites to be scraped for data
  • Queueing documents for data parsing
  • Handling a large amount of data in an organized file system (not repeating processes on documents)
  • and more…!

Building a queue can eat up your memory if the queued items are large. If the tasks are generated faster than they are completed, I recommend using a thread-safe queue that stores items on disk, with an optional in-memory buffer. Check out this GitHub repository if you’re serious about using queues but want to conserve memory: https://github.com/GP89/FileQueue.

For newer programmers, I recommend trying LeetCode easy questions to gain some practice with queues, and then try to solve this problem https://leetcode.com/problems/task-scheduler/ using a PriorityQueue.

FUN FACT: You can always implement your own queue class using Python lists!
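For example, here is a minimal sketch of a FIFO queue class backed by a plain Python list. It is fine for learning, though list.pop(0) is O(n), so collections.deque is the better choice in practice.

# A minimal sketch of a FIFO queue built on a plain Python list.
class ListQueue:
    def __init__(self):
        self._items = []

    def put(self, item):
        self._items.append(item)      # enqueue at the back

    def get(self):
        if self.empty():
            raise IndexError("get from an empty queue")
        return self._items.pop(0)     # dequeue from the front (O(n) for a list)

    def empty(self):
        return len(self._items) == 0

    def qsize(self):
        return len(self._items)

q = ListQueue()
q.put("a")
q.put("b")
print(q.get())   # "a", first in, first out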

Linked List

A linked list is a sequence of nodes, where each node is an object that contains a value and a pointer to the next node. There are also doubly linked lists, in which each node stores the address of the next node as well as the previous one.

Image by GeeksforGeeks

In data science and machine learning, linked lists are best when you need to insert a large amount of data in constant O(1) time per insertion, especially when you don’t know in advance how many items the list will hold. However, you have to be sure you don’t need random access to elements, since reaching the i-th node takes O(n) time.

Another benefit of using a linked list is that there is no contiguous-memory requirement: the nodes can live anywhere in memory. For a regular array (or a Python list), the underlying storage has to be allocated as one contiguous block of memory.
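Here is a minimal sketch of a singly linked list with constant-time insertion at the head; the class and method names are my own choices for illustration.

# A minimal sketch of a singly linked list with O(1) insertion at the head.
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None      # pointer to the next node (None marks the end)

class LinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, value):
        node = Node(value)    # constant time: no shifting, no resizing
        node.next = self.head
        self.head = node

    def __iter__(self):
        current = self.head
        while current is not None:
            yield current.value
            current = current.next

ll = LinkedList()
for v in [3, 2, 1]:
    ll.insert_front(v)
print(list(ll))   # [1, 2, 3]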

Non-Linear Data Structures

Maps (Dictionary of Keys in Python)

Definitely the most famous data structure around, maps (HashMap in Java, dictionaries in Python, unordered_map in C++, etc.) are the go-to when trying to minimize run-time in algorithms.

Dictionaries in Python are extremely useful in data science and machine learning because many functions and algorithms return dictionaries. They are usually used to map key-value pairs where there can be multiple values per key; in other words, keys are mapped to sets or lists. This is super useful for word embeddings of various dimensions (25, 50, 100, 200, etc.).

The only rules are that each key must be unique (if not, its value will be overwritten) and that entries are not sorted by key (since Python 3.7, dictionaries preserve insertion order, but nothing more).
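For example, here is a minimal sketch of dictionaries used this way; the words and the tiny 4-dimensional vectors are made up for illustration:

# A minimal sketch of a dictionary mapping words to embedding vectors.
# The words and 4-dimensional vectors are made up; real embeddings are
# typically 25-300 dimensions.
embeddings = {
    "cat": [0.21, -0.43, 0.10, 0.77],
    "dog": [0.19, -0.40, 0.15, 0.70],
}

# Keys must be unique: assigning again simply overwrites the old value
embeddings["cat"] = [0.22, -0.41, 0.11, 0.76]

# Mapping a key to a list of values, e.g. a topic to its keywords
topic_keywords = {}
topic_keywords.setdefault("pets", []).append("cat")
topic_keywords.setdefault("pets", []).append("dog")
print(topic_keywords)   # {'pets': ['cat', 'dog']}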

Dictionaries can also be helpful when implementing sparse matrices (very common in machine learning). Sparsity refers to matrices that contain mostly zero values (few pairwise interactions), as opposed to dense matrices, where most of the values are non-zero. This concept is useful in network theory, and I ran across sparse matrices A LOT in natural language processing. Almost every one-hot encoding technique produces sparse vectors.

Using standard matrix structures (2D arrays) would mean wasting processing time and precious memory on the zeros. A list of lists is sometimes used instead, but it still wastes memory. Dictionaries, however, can save the day: for the keys, we can use tuples containing the row and column indices, and the values represent the actual non-zero entries of the matrix.
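Here is a minimal sketch of that dictionary-of-keys idea, with arbitrary matrix values; scipy.sparse implements the same pattern in its dok_matrix class.

# A minimal sketch of a dictionary-of-keys (DOK) sparse matrix:
# only the non-zero entries are stored, keyed by (row, column) tuples.
dense = [
    [0, 0, 3],
    [0, 0, 0],
    [1, 0, 0],
]

sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}
print(sparse)                 # {(0, 2): 3, (2, 0): 1}

# Missing keys are implicitly zero
print(sparse.get((1, 1), 0))  # 0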

Graphs

Visualization of a paper citation network

Network theory has been by far one of the most interesting things I have studied in my data science journey so far. Using the networkx package and working with Gephi for visualizations has made me fall in love with graphs, especially since they can so easily be loaded in as Python dictionaries, where each key is a node and the value is the list of nodes it is connected to. This makes it extremely easy to find the shortest path between nodes. Graphs are such an elegant data structure that can provide amazing visualizations and collect real information from all types of data, even text.
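For example, here is a minimal sketch of a graph stored as an adjacency dictionary and handed to networkx to find a shortest path; the node names are made up.

# A minimal sketch of a graph as an adjacency dictionary loaded into networkx.
import networkx as nx

adjacency = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

G = nx.Graph(adjacency)                 # build an undirected graph from the dict
print(nx.shortest_path(G, "A", "D"))    # e.g. ['A', 'B', 'D']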

I really recommend diving head first into graph theory before taking on the networkx package. Then, attempt to build a social network graph based on your own tweet data with web-scraping, or using datasets from Kaggle!

The goal of data science and machine learning is to provide new insights. Computers continue to learn and find patterns in ways humans cannot do alone. Graphs are a great way for humans to move beyond their visual capabilities, as well as find and see connections in every aspect of human life.

Graphs can be loaded into various algorithms, notably neural networks, where tasks like regression, classification, and clustering can be performed.

Here are some beautiful ways graphs have been used in machine learning to solve real problems:

  • Knowledge Graphs
  • Social Network Graphs
  • Keyword Graphs

I have implemented a keyword graph based on Twitter data and used it for node classification and community detection in order to predict/hypothesize a summary of a potential event, and it’s by far the most fun I’ve ever had in machine learning.

I have also been able to locate influencers using social network graphs by applying centrality measures.

If you’re interested in all the beautiful things graphs can do, I encourage you to read articles about graph machine learning and check out this detailed breakdown.

Trees

Image by GeeksforGeeks

Above is a binary tree, in which each node can have at most two child nodes. Every tree has a unique node called the “root” (the start of the tree), and, unlike graphs, trees cannot contain cycles.
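A binary tree node is simple to sketch in Python; the class below is just an illustration.

# A minimal sketch of a binary tree node: a value plus at most two children.
class TreeNode:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left      # left child (or None)
        self.right = right    # right child (or None)

# A tiny tree:   1
#               / \
#              2   3
root = TreeNode(1, TreeNode(2), TreeNode(3))
print(root.left.value, root.right.value)   # 2 3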

Decision trees serve well for classification and regression tasks because they are able to capture complex non-linear relationships. However, they tend to memorize the noise present in the data, i.e. they overfit.

You must understand the structure of trees for your first machine learning models (usually classification and regression based). Knowing how to prune a tree helps to reduce overfitting and will improve your models.
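To make the pruning idea concrete, here is a small sketch using scikit-learn and its toy iris dataset (my choice of library here, purely for illustration). The ccp_alpha parameter controls cost-complexity pruning: larger values prune more aggressively.

# A small sketch of pruning a decision tree with cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree can memorize the training data (risk of overfitting)
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 prunes branches whose complexity isn't worth the gain
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print("unpruned depth:", full_tree.get_depth(), "accuracy:", full_tree.score(X_test, y_test))
print("pruned depth:  ", pruned_tree.get_depth(), "accuracy:", pruned_tree.score(X_test, y_test))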

Conclusion

Data structures are an essential part of programming, which is an essential skill in data science and machine learning. You have to start somewhere. You cannot credibly call yourself a data scientist or machine learning engineer if you have zero experience choosing appropriate data structures when trying to solve or analyze a problem.

To be the best ML/AI professional, you need more than theory. Here is a simple diagram showing the necessary skills involved to get there:

Image by Quora

Notice how programming comes first. You cannot implement any of your ideas efficiently if you don’t have a good understanding of the data structures involved. You can either learn along the way and waste time on trial and error, or take a serious Data Structures and Algorithms course that will give you the proper tools to think through a problem first. Here is a link to the best ones in 2021.

Above is a more detailed visualization of the skills necessary for data science. Notice how machine learning is within that scope, and how entire branches are dedicated to data pre-processing, data visualization, statistics, and mathematics. Data science is a career that takes years to develop; it does not involve simply knowing how to use Python packages. There is a reason data scientists are so highly valued and difficult to find. Many claim to possess the skills but haven’t actually developed the right ones, nor do they understand what it means to truly be a data scientist.

If you’re just starting out, experiment with pre-processing and collecting data. Do some fun web-scraping projects. Learn how to collect data on your own once you know how to use and manipulate the data sets you’re given. Then, make sure you know your data structures, because if you don’t, you’ll be one of the first ones weeded out of the interview process. But don’t be discouraged. There is plenty of time to learn, and there is no rush. This generation is going a mile a minute and consuming information far too quickly. Give your brain time to process. Just like your models need time to train, so do you. Accept that and keep moving at the right pace. Do not move on from data structures until you have a firm understanding of them.

Good luck on your machine learning journey and happy coding!
