Top 8 Must-Know Algorithms in Python for Data Scientists

Joe Jacob
8 min readMar 3, 2023

--

As a data scientist, I’ve come across various complex problems that require efficient and effective solutions. One key aspect of solving these problems is understanding different algorithms that can be used to process data and generate meaningful insights. In Python programming language, there are many algorithms that data scientists should learn. In this blog, I’ll be discussing the top 8 algorithms that you should know in Python to enhance your data science skills.

quicksort algorithm generated on carbon

Algorithm is a set of instructions or a procedure used to solve specific problems or tasks. In data science, algorithms are used to analyze data and extract meaningful insights. Python is one of the most popular programming languages used in data science, and learning the following algorithms can help you in improving your data science skills:

  1. Sorting algorithms — Used to sort data in ascending or descending order.
  2. Search algorithms — Used to find specific data in a set of data.
  3. Graph algorithms — Used to represent and analyze complex relationships between data points.
  4. Dynamic programming — Used to solve complex optimization problems by breaking them down into smaller sub-problems.
  5. Greedy algorithms — Used to make decisions based on immediate gain without considering long-term effects.
  6. Divide and conquer — Used to break down complex problems into smaller, more manageable sub-problems.
  7. Backtracking — Used to find solutions by trying different possibilities and then undoing them if they lead to dead-ends.
  8. Randomized algorithm — Used to introduce randomness into the solution-finding process to improve efficiency.

By understanding these algorithms, data scientists can tackle complex problems more efficiently and generate more accurate results. So, it’s essential to learn these algorithms in Python to enhance your data science skills and become a more effective problem solver.

These algorithms can be categorized based on the technique used to solve a specific problem, time and space complexity, and the type of problem it solves. I will try to explain them in the best possible way. The first type of algorithms is sorting algorithms, which are used to sort elements in a specific order.

1. Sorting Algorithms:

  • Quicksort: Quicksort is a divide-and-conquer algorithm that chooses a “pivot” element from the array and partitions the other elements into two sub-arrays, according to whether they are less than or greater than the pivot.
def quicksort(arr):
if len(arr) <= 1:
return arr
pivot = arr[len(arr) // 2]
left = [x for x in arr if x < pivot]
middle = [x for x in arr if x == pivot]
right = [x for x in arr if x > pivot]
return quicksort(left) + middle + quicksort(right)

print(quicksort([3,6,8,10,1,2,1]))
  • Mergesort: Merge Sort, on the other hand, is a divide-and-conquer algorithm that divides an array in two, sorts the two halves, and then merges them back together.
def merge_sort(arr):
if len(arr) <= 1:
return arr
else:
mid = len(arr) // 2
left = arr[:mid]
right = arr[mid:]
left = merge_sort(left)
right = merge_sort(right)
return merge(left, right)

def merge(left, right):
result = []
while len(left) > 0 and len(right) > 0:
if left[0] <= right[0]:
result.append(left[0])
left = left[1:]
else:
result.append(right[0])
right = right[1:]

if len(left) > 0:
result.extend(left)
if len(right) > 0:
result.extend(right)
return result

arr = [4, 7, 9, 12, 1, 5, 1]
sorted_arr = merge_sort(arr)
print(sorted_arr)
  • Heapsort: Heap Sort is a comparison-based sorting algorithm that builds a heap from the input elements and then repeatedly extracts the maximum element from the heap and places it at the end of the sorted output array.
def heapify(arr, n, i):
largest = i
left = 2 * i + 1
right = 2 * i + 2

# check if left child exists and is greater than the root
if left < n and arr[i] < arr[left]:
largest = left

# check if right child exists and is greater than the root
if right < n and arr[largest] < arr[right]:
largest = right

# swap the root with the largest child if needed
if largest != i:
arr[i], arr[largest] = arr[largest], arr[i]
heapify(arr, n, largest)

def heap_sort(arr):
n = len(arr)

# build a max-heap from the input array
for i in range(n // 2 - 1, -1, -1):
heapify(arr, n, i)

# extract elements one by one from the heap and place them in sorted order
for i in range(n - 1, 0, -1):
arr[i], arr[0] = arr[0], arr[i]
heapify(arr, i, 0)

return arr

# example usage
arr = [5, 3, 8, 4, 2, 1, 9, 7, 6]
print(heap_sort(arr))

2. Search Algorithm

  • Binary Search: Binary search is an algorithm that searches for a specific value in a sorted list or array. It works by repeatedly dividing the search interval in half until the value is found or determined to be not in the array. At each step, the algorithm compares the middle element of the interval with the target value. If the middle element is equal to the target, the search is successful and the algorithm returns the index of the middle element. If the middle element is greater than the target, the algorithm continues the search in the left half of the interval, and if it’s less than the target, the search continues in the right half of the interval. The algorithm repeats this process until the target value is found or the search interval is empty.
def binary_search(arr, target):
low = 0
high = len(arr) - 1
while low <= high:
mid = (low + high) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
low = mid + 1
else:
high = mid - 1
return -1
  • Hash Tables: Hash tables are another important data structure used in many algorithms. A hash table is a data structure that stores key-value pairs, where the keys are unique and the values can be any data type. A hash table uses a hash function to compute an index into an array of buckets, where the desired value can be found. The hash function takes the key as input and produces a hash code, which is then used to compute an index into the array. The hash function should produce the same hash code for any given key, and different hash codes for different keys. If two keys produce the same hash code (known as a hash collision), the hash table uses a collision resolution technique to store and retrieve the values.
class HashTable:
def __init__(self):
self.size = 100
self.table = [[] for _ in range(self.size)]

def hash_function(self, key):
return hash(key) % self.size

def set(self, key, value):
hash_value = self.hash_function(key)
bucket = self.table[hash_value]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
bucket.append((key, value))

def get(self, key):
hash_value = self.hash_function(key)
bucket = self.table[hash_value]
for k, v in bucket:
if k == key:
return v
raise KeyError(key)

3. Graph Algorithms

  • Dijkstra’s shortest path algorithm: Dijkstra’s shortest path algorithm is an algorithm for finding the shortest path between nodes in a graph.
import heapq

def dijkstra(graph, start):
distances = {node: float('inf') for node in graph}
distances[start] = 0
pq = [(0, start)]
while pq:
(dist, curr_node) = heapq.heappop(pq)
if dist > distances[curr_node]:
continue
for (neighbor, weight) in graph[curr_node].items():
new_dist = dist + weight
if new_dist < distances[neighbor]:
distances[neighbor] = new_dist
heapq.heappush(pq, (new_dist, neighbor))
return distances

# example graph represented as a dictionary
graph = {
'A': {'B': 1, 'C': 4},
'B': {'A': 1, 'C': 2, 'D': 5},
'C': {'A': 4, 'B': 2, 'D': 1},
'D': {'B': 5, 'C': 1}
}

# find shortest paths from node 'A' to all other nodes
shortest_paths = dijkstra(graph, 'A')

print(shortest_paths)

In this example, we define the dijkstra function that takes a graph represented as a dictionary and a starting node as input. The function uses a priority queue (implemented as a heap) to keep track of the nodes to be explored and their tentative distances from the starting node.

The algorithm initializes all node distances to infinity except the starting node, which is assigned a distance of 0. It then repeatedly extracts the node with the smallest tentative distance from the queue, and relaxes all its outgoing edges by updating the distances of their adjacent nodes if a shorter path is found. The process continues until all nodes have been explored or the target node (if specified) is reached.

Finally, the function returns a dictionary of the shortest distances from the starting node to all other nodes in the graph. In the example, we use this function to find the shortest paths from node ‘A’ to all other nodes in the graph.

4. Dynamic Programming

  • Fibonacci sequence: A classic example of a problem that can be solved using dynamic programming is the Fibonacci sequence.
def fibonacci(n):
if n <= 0:
return 0
elif n == 1:
return 1
else:
return fibonacci(n-1) + fibonacci(n-2)


print(fibonacci(10))

5. Greedy Algorithms

  • Huffman coding: Huffman coding is a way to compress data without losing any information. It’s like squeezing a sponge to make it smaller, but you can still get all the water back out. Huffman coding uses a smart way of assigning shorter codes to more frequently used symbols and longer codes to less frequently used symbols. This makes the compressed file smaller, as the most commonly used symbols take up less space. The algorithm is called “greedy” because it always chooses the best option at each step without considering the long-term consequences.
import heapq

def huffman_code(freq):
heap = [[wt, [sym, ""]] for sym, wt in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
for pair in left[1:]:
pair[1] = '0' + pair[1]
for pair in right[1:]:
pair[1] = '1' + pair[1]
heapq.heappush(heap, [left[0] + right[0]] + left[1:] + right[1:])
return dict(sorted(heapq.heappop(heap)[1:], key=lambda x: (len(x[-1]), x)))

# Example usage
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
code = huffman_code(freq)
print(code)

6. Divide and Conquer:

  • Merge Sort: already explained above

7. Backtracking

  • N-Queens problem: The N-Queens problem is a puzzle where you have a chessboard of size N by N, and you need to place N queens on the board so that no two queens can attack each other. This means that no two queens can be on the same row, column, or diagonal. To solve this puzzle, you can use a technique called backtracking, which involves trying out different solutions and undoing them if they lead to a dead end.

Let’s take the case of a 4x4 chessboard. In this case, we need to place 4 queens on the board such that no queen can attack any other queen.

The solution would look like this:

|   Q   |       |       |   Q   |
| | Q | Q | |
| | | | Q |
| Q | | | |

Each Q represents the position of a queen on the board. As you can see, in this solution, no two queens are in the same row, column, or diagonal. This is the solution to the 4-Queens problem on a 4x4 chessboard.

Similarly, we can solve the N-Queens problem for any given value of N, where N represents the size of the board and the number of queens to be placed on it.

8. Randomized Quicksort:

Randomized QuickSort is a modified version of the QuickSort algorithm that selects the pivot element randomly, rather than using a predetermined strategy.

import random

def quicksort(arr):
if len(arr) <= 1:
return arr
pivot = random.choice(arr)
left, equal, right = [], [], []
for x in arr:
if x < pivot:
left.append(x)
elif x == pivot:
equal.append(x)
else:
right.append(x)
return quicksort(left) + equal + quicksort(right)

print(quicksort([3, 6, 8, 10, 1, 2, 1]))

These algorithms are frequently used and it is essential for programmers to know them. Having knowledge of these algorithms and how they work can assist a programmer in making better choices while developing efficient solutions.

If you’re interested in staying updated on the latest trends and advancements in technology, be sure to follow me on LinkedIn 💼 , Twitter 🐦 and Medium 📝! I share informative articles, insightful posts, and valuable industry insights. Let’s stay connected and continue the conversation on the latest tech advancements and innovative solutions. Don’t forget to give a shout out and say hi! 🤖🚀

--

--

Joe Jacob

Self-taught Data Scientist | I want to help individuals and businesses to solve problems through technology. Sharing insights and inspiring innovation.