From Data Structures to Science: DSA in a Data-Driven World

Chesta Dhingra
Data And Beyond


In this DSA series, we venture beyond traditional interview questions to explore the foundational concepts of data structures and algorithms (DSA) and their practical application to real datasets. I have designed this series to show how DSA can be a powerful tool in the hands of a data scientist.

By integrating DSA principles with hands-on dataset analysis, I aim to provide a dual learning experience.

Let’s start with a brief introduction to each algorithm, followed by its implementation on the house pricing dataset.

1. The first algorithm we’ll discuss is merge sort, which sorts arrays or lists efficiently, in O(n log n) time. It divides the array into smaller parts, sorts those parts, and then merges them back together. Merge sort is based on the divide-and-conquer strategy. We’ll apply it to the SalePrice column of the dataset. The code is below.
## first import the pandas library to read the dataset
import pandas as pd
data = pd.read_csv("train.csv")

## turn series into list
a=data['SalePrice'].tolist()

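## merge() combines two adjacent sorted runs, arr[l..mid] and arr[mid+1..r],
## back into arr in sorted order

def merge(arr, l, mid, r):
    left = arr[l:mid+1]
    right = arr[mid+1:r+1]
    m = len(left)
    n = len(right)
    i = 0
    j = 0
    k = l
    ## take the smaller front element of the two runs each time;
    ## <= keeps equal values in their original order
    while i < m and j < n:
        if left[i] <= right[j]:
            arr[k] = left[i]
            i += 1
        else:
            arr[k] = right[j]
            j += 1
        k += 1
    ## copy whatever is left over in the left run
    while i < m:
        arr[k] = left[i]
        i += 1
        k += 1
    ## copy whatever is left over in the right run
    while j < n:
        arr[k] = right[j]
        j += 1
        k += 1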


def merge_sort(arr, l, r):
    if l < r:
        mid = (l+r)//2
        merge_sort(arr, l, mid)
        merge_sort(arr, mid+1, r)
        merge(arr, l, mid, r)


## merge_sort() sorts the list in place: it recursively splits the list
## into halves, and merge() then combines the sorted halves
## into one fully sorted list.

## apply it to the list "a" that we created above

merge_sort(a, 0, len(a)-1)


a[:5]

## output: the five lowest sale prices
# [34900, 35311, 37900, 39300, 40000]
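A quick way to gain confidence in a hand-written sort is to compare it with Python's built-in sorted(). The sketch below is only an illustration: merge_sort_copy is a hypothetical helper (a variant that returns a new list instead of sorting in place), and the prices are made-up values, not rows from train.csv.

```python
import random


def merge_sort_copy(values):
    """Return a new sorted list; the input list is left unchanged."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left = merge_sort_copy(values[:mid])
    right = merge_sort_copy(values[mid:])

    # Merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these two slices is non-empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Made-up prices for illustration only.
toy_prices = [208500, 181500, 223500, 140000, 250000, 143000, 181500]
result = merge_sort_copy(toy_prices)
print(result)  # [140000, 143000, 181500, 181500, 208500, 223500, 250000]
assert result == sorted(toy_prices)

# Compare against sorted() on a larger random sample as well.
random.seed(0)
random_prices = [random.randint(30000, 800000) for _ in range(1000)]
assert merge_sort_copy(random_prices) == sorted(random_prices)
```

The same kind of check works for the in-place version: copy the list first, sort the copy, and compare it with sorted() of the original.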

2. After sorting an array with merge sort, the next step is often to search for specific values in it. Binary search, another divide-and-conquer algorithm, is well suited to this task: each comparison halves the remaining search range. Binary search requires the array to be sorted beforehand, which makes it a natural follow-up to merge sort.

## create a binary search function that repeatedly halves the search range
## to find the value provided by the user

def find_elements(arr, val, low, high):
    ## empty range: the value is not in the list
    if low > high:
        return -1
    mid = (low+high)//2
    if arr[mid] == val:
        return mid
    elif arr[mid] < val:
        return find_elements(arr, val, mid+1, high)
    else:
        return find_elements(arr, val, low, mid-1)


## the function above returns the index of one matching element.
## the same value may appear more than once in the dataset,
## so we create one more function that collects the indices
## of all its occurrences

def find_occurrences(arr, val, index):
    ## value not found: there are no occurrences
    if index == -1:
        return []
    occurrences = [index]
    ## walk left from the match while the value repeats
    left = index-1
    while left >= 0 and arr[left] == val:
        occurrences.append(left)
        left -= 1
    ## walk right from the match while the value repeats
    right = index+1
    while right < len(arr) and arr[right] == val:
        occurrences.append(right)
        right += 1
    return sorted(occurrences)

## now we can apply both functions: first to find the desired element,
## and second to find all occurrences of that element in the sorted list "a"

low = 0
high = len(a)-1
val = 200000

index = find_elements(a, val, low, high)
all_occurrences = find_occurrences(a, val, index)
print(all_occurrences)

## this gives the following result
# [1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032]
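For comparison, Python's standard library ships binary search in the bisect module. The sketch below shows one way to get every index of a value in a sorted list with bisect_left and bisect_right; occurrence_range is a hypothetical helper name, and the list holds made-up prices rather than the real dataset.

```python
from bisect import bisect_left, bisect_right


def occurrence_range(sorted_values, target):
    """Return every index at which target appears in a sorted list."""
    # bisect_left gives the first position where target could be inserted,
    # bisect_right the position just past the last equal element.
    first = bisect_left(sorted_values, target)
    last = bisect_right(sorted_values, target)
    return list(range(first, last))


# Made-up, already sorted prices for illustration only.
toy_sorted = [140000, 181500, 200000, 200000, 200000, 223500, 250000]
print(occurrence_range(toy_sorted, 200000))  # [2, 3, 4]
print(occurrence_range(toy_sorted, 150000))  # []
```

Both boundary lookups are binary searches, so locating the two ends of a run of duplicates needs no step-by-step scan; only building the final list of indices takes time proportional to the number of matches.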

3. Ready to mix a little DSA magic with real-world data? We’re zooming in on finding the shortest distance, not on a map, but within a data column: given two values, how few positions separate an occurrence of one from an occurrence of the other? A single pass over the column, remembering the most recent position of each value, is enough to answer that. It’s like finding the shortest line at the grocery store, but for numbers. Let’s dive in!

class solution:
    def min_dist(self, arr, a, b, n=None):
        if n is None:
            n = len(arr)
        ## most recent positions at which a and b were seen
        index_a = None
        index_b = None
        min_distance = float("inf")

        for i in range(n):
            if arr[i] == a:
                index_a = i
                if index_b is not None:
                    current_dist = abs(index_a-index_b)
                    if current_dist < min_distance:
                        min_distance = current_dist
            if arr[i] == b:
                index_b = i
                if index_a is not None:
                    current_dist = abs(index_b-index_a)
                    if current_dist < min_distance:
                        min_distance = current_dist
        ## -1 signals that a or b never appeared in the list
        if index_a is None or index_b is None:
            min_distance = -1
        return min_distance

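## sale_prices is assumed here to be the SalePrice column in its original order
sale_prices = data['SalePrice'].tolist()
min_dist_dataset = solution().min_dist(sale_prices, a=200000, b=250000)
min_dist_dataset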
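To see the idea on something small enough to check by hand, here is a minimal sketch of the same single-pass technique as a standalone function. The name min_index_distance and the toy list are hypothetical, chosen only for this illustration.

```python
def min_index_distance(values, a, b):
    """Smallest gap in positions between an occurrence of a and one of b.

    Returns -1 if either value never appears.
    """
    last_a = None  # most recent index where a was seen
    last_b = None  # most recent index where b was seen
    best = None
    for i, value in enumerate(values):
        if value == a:
            last_a = i
        if value == b:
            last_b = i
        # Once both values have been seen, the closest pair ending at i
        # always uses the most recent index of the other value.
        if last_a is not None and last_b is not None:
            dist = abs(last_a - last_b)
            if best is None or dist < best:
                best = dist
    return -1 if best is None else best


# Made-up prices for illustration only.
toy = [200000, 181500, 250000, 140000, 200000, 250000, 223500]
print(min_index_distance(toy, 200000, 250000))  # 1
print(min_index_distance(toy, 200000, 999999))  # -1
```

In the toy list, 200000 sits at index 4 and 250000 at index 5, so the answer is 1. The whole thing is one pass over the list with constant extra memory.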

Up next in our DSA-meets-data-science adventure, we’ll tackle even more mind-bending concepts and dive into algorithms that’ll seriously level up your logic and coding game. Stay tuned, because we’re just getting started on this journey to become data wizards together!

Hope you had as much fun reading this as I did writing it! Got thoughts, feedback, or just want to chat about it? Drop your comments below. I’d love to hear from you!
