Why Should Data Scientists Learn Algorithms and Data Structures?

Understanding concepts such as algorithmic complexity and proper use of data structures will enable you to write more optimal code.

Denis Lapchev
The Startup
5 min read · Oct 5, 2020



Introduction

Algorithms and data structures are considered core skills for software engineers. How useful are these skills for data scientists and analysts?

A typical data scientist spends most of their time in high-level languages such as Python/R/SQL and rarely needs to think about underlying implementations. This is due to the fact that the majority of data analysis and machine learning algorithms are already packaged in ready-to-use, heavily optimised libraries such as Scikit-Learn, Numpy, and Pandas (R fans have their own tools).

To answer the question of whether algorithms and data structures are worth your time, I will list a few tools you will have in your toolbox after taking a typical algorithms course.

Algorithmic Complexity

The first useful concept you will encounter is algorithmic complexity and Big-O notation, a method for understanding how well your code scales with the amount of data. This concept is important to data scientists, who need to process the ever-increasing amount of information produced daily.

By getting rid of less important details, you will be able to reason about the performance of an algorithm regardless of whether it is written in Python or C and executed on a laptop or NASA’s supercomputer. In a sense, it defines a basic vocabulary for the design and analysis of algorithms while suppressing architecture- and language-dependent details; these are considered a constant factor, not relevant to the big picture.

As an example, imagine you are asked to choose a machine learning algorithm for a certain classification task and are considering a Support Vector Machine (SVM) due to its ability to handle non-linear decision boundaries. You open the documentation to check the SVM’s details and notice the following:

[Figure: excerpt from Scikit-Learn’s documentation for SVM, noting its fit-time complexity]

After understanding algorithmic complexity, you might reconsider using SVM for large datasets because of its high compute requirements: the fit time scales between O(n²) and O(n³). This implies that if we double the amount of data, the training time is likely to increase by a factor of 4 to 8.
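
A quick way to see this in practice is to time the fit on progressively doubling samples. Below is a minimal sketch; the dataset sizes and generator parameters are arbitrary, chosen only for illustration:

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Time SVC training as the dataset size doubles; with quadratic-to-cubic
# fit complexity, each doubling should cost roughly 4-8x more time.
for n in (2_000, 4_000, 8_000):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    start = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)
    print(f"n={n:>5}: {time.perf_counter() - start:.2f}s")
```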

A small word of caution about Big-O notation: it is an asymptotic analysis, meaning it is, in theory, valid only for sufficiently large inputs. The constant factors it hides can be quite large in practice.

Graph Algorithms

Graph algorithms are relevant in the data science world, having applications in fraud detection, clustering, ranking, recommendation systems, and others.

Algorithm courses usually focus heavily on graphs, probably because graphs make it easy to demonstrate the common algorithmic paradigms.

Among graph algorithms you will encounter:

  • Generic graph search: systematic exploration of a graph’s vertices and edges.
  • Shortest path: computing the shortest path between any two vertices. Applicable to tasks such as network analysis, navigation, social media connections, and fraud detection.
  • Connected components and minimum cuts: detecting communities and clusters within a graph.
  • Minimum spanning tree: useful for clustering and image segmentation.

You will spend significant time studying graph algorithms and will be able to implement fast solutions.
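
As a taste of what that looks like, here is a minimal breadth-first search that finds a shortest path (by number of edges) in an unweighted graph; the adjacency-list graph below is a made-up toy example:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return a shortest path from start to goal in an unweighted graph."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal unreachable

# Toy graph as an adjacency list
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```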

Even if you are more likely to use existing implementations of graph algorithms than to write one from scratch, knowledge of their inner workings and limitations will help you choose the right algorithm for the task.

Data Structures

Python is an example of a programming language with convenient and versatile data structures. So convenient, in fact, that many people tend to use Python’s list as the default data storage for all their needs, which can result in significant performance bottlenecks in certain scenarios.
By studying data structures, you will understand their strengths and weaknesses and will be able to reason about each structure’s applicability to different tasks.
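
A minimal sketch of one such bottleneck: membership tests on a list scan every element (O(n)), while a set uses hashing for average O(1) look-ups. The sizes and probed value here are arbitrary:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Worst-case membership test: the probed value is absent,
# so the list must be scanned end to end every time.
list_time = timeit.timeit(lambda: -1 in as_list, number=100)
set_time = timeit.timeit(lambda: -1 in as_set, number=100)
print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```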

You will learn about useful data structures, such as:

  • Heaps: for use cases where you repeatedly need the object with the minimum (or maximum) value. Interesting applications of heaps include event-driven simulations and online (streaming) computation of the median (see the sketch after this list).
  • Binary search trees: a Swiss-army-knife data structure, highly efficient for a variety of operations such as insertion, selection, and search.
  • Hash tables: studying them will show you how sets and dictionaries are implemented behind the scenes and why they are preferable for look-up applications. You will learn about hash functions, collisions, and the security risks that arise from improper implementations.
  • Bloom filters: one of the more exciting data structures, a probabilistic structure suitable for space-efficient storage and fast look-up. A small caveat: it has a non-zero probability of false positives, i.e., it may claim that an object is present when it actually isn’t. For some applications, this drawback is not critical; for example, Bloom filters are used in web browsers for caching recently visited page addresses and in internet routers.
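
As promised in the heaps item above, here is a minimal sketch of the classic two-heap streaming median: a max-heap holds the lower half of the numbers and a min-heap the upper half. Python’s heapq module only provides min-heaps, so the lower half stores negated values; the input stream is made up for illustration:

```python
import heapq

class StreamingMedian:
    """Online median via two heaps: max-heap (low half), min-heap (high half)."""

    def __init__(self):
        self.low = []   # max-heap, simulated by negating values
        self.high = []  # min-heap

    def add(self, x):
        heapq.heappush(self.low, -x)
        # Keep every element of low <= every element of high
        heapq.heappush(self.high, -heapq.heappop(self.low))
        # Rebalance so low holds at most one extra element
        if len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2

sm = StreamingMedian()
for x in [5, 1, 9, 3, 7]:
    sm.add(x)
print(sm.median())  # 5
```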

Recognising bottlenecks and optimising computation by using appropriate data structures will allow you to speed up certain applications by an order of magnitude. This is especially relevant to data scientists who deal with simulations and optimisation problems.

Algorithmic Paradigms

You will understand the common ways of tackling a variety of algorithmic problems:

  • Divide and conquer algorithms: dividing a larger problem into smaller sub-problems and solving them recursively. Best demonstrated with sorting algorithms such as merge sort.
  • Randomised algorithms: algorithms that make random choices and are analysed using probability theory. These are especially interesting from a data science point of view. Example uses: QuickSort, or finding the i-th order statistic (e.g., the minimum, maximum, or median).
  • Greedy algorithms: intuitive algorithms, useful for applications such as scheduling and caching.
  • Dynamic programming: applicable to a wide variety of problems. Usually demonstrated with the “knapsack problem” [1], which behind the scenes is a constrained optimisation problem arising in many real-life scenarios (see the sketch after this list).
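
As a compact sketch of the dynamic programming item above, here is the standard 0/1 knapsack recurrence; the values, weights, and capacity below are made up for illustration:

```python
def knapsack(values, weights, capacity):
    """Max total value of items fitting into the capacity (0/1 knapsack)."""
    # best[c] = best value achievable with total weight <= c
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Iterate capacities downwards so each item is used at most once
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

print(knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # 220
```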

The advantage of studying algorithmic paradigms is that you learn how algorithms are designed independently of any particular programming language, and you will be able to apply this knowledge in your own work.

Intractable Problems

If you have ever been curious about what “NP-Completeness” [2] means, welcome to the world of intractable problems. You will understand which problems are considered intractable and why no known method guarantees an optimal solution short of brute-force search. These problems remain an active field of research today and are considered the hardest problems in theoretical computer science.

You will be able to recognise NP-Complete problems and will learn different approaches for obtaining approximate solutions, such as heuristics and local search. A typical example of an NP-Complete problem is the “Travelling Salesman” problem [3], illustrated below.
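
For instance, here is a minimal sketch of the nearest-neighbour heuristic for the travelling salesman problem: greedily visit the closest unvisited city. It offers no optimality guarantee, but it runs in polynomial time; the city coordinates are arbitrary:

```python
import math

def nearest_neighbour_tour(cities):
    """Greedy TSP heuristic: always move to the closest unvisited city."""
    unvisited = set(range(1, len(cities)))
    tour = [0]  # start at the first city
    while unvisited:
        last = cities[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

cities = [(0, 0), (2, 1), (1, 5), (5, 2), (6, 6)]
print(nearest_neighbour_tour(cities))  # [0, 1, 3, 4, 2]
```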

You will rarely encounter NP-Complete problems in practice; however, knowing that they exist and how to tackle them is beneficial.

Summary

In this article, we discussed several important concepts you will understand after learning algorithms and data structures:

  • Algorithmic Complexity
  • Graph Algorithms
  • Data Structures
  • Algorithmic Paradigms
  • Intractable Problems

Knowledge of these concepts will improve your understanding of the programming side of data science and will allow you to write better code.

References

[1] Knapsack Problem, Wikipedia, https://en.wikipedia.org/wiki/Knapsack_problem

[2] NP-Completeness, Wikipedia, https://en.wikipedia.org/wiki/NP-completeness

[3] Travelling Salesman Problem, Wikipedia, https://en.wikipedia.org/wiki/Travelling_salesman_problem
