
Practical Optimizations for Pandas


In this guide, I am going to show you some of the most common pitfalls that can make otherwise perfectly good Pandas code too slow for time-sensitive applications, and walk through a set of tips and tricks to avoid them.

Let’s remind ourselves what pandas is, apart from a cute animal 🐼. It’s a widely used library for data analysis and manipulation that loads all the data into RAM.

In this article, I am going to use a dataset that contains meal invoices (one million rows) 📉.

df = load_dataset()
df.head()

Why Performance🤨

  • Fast is better than slow, because no one likes waiting for their code to run 🐇. …

Increasing productivity in Jupyter notebooks using debugging


Naive Way — Logging

We all encounter bugs while developing programs. Logging is a great way to debug code, and probably the most intuitive one. The usual approach is to add print statements to your code, which helps track down the source of many issues.

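import random

def find_max(values):
    # Track the largest value seen so far; `max` would shadow the built-in function
    current_max = 0  # starting at 0 assumes the inputs are non-negative
    print(f"Initial max is {current_max}")
    for val in values:
        if val > current_max:
            current_max = val
    return current_max

find_max(random.sample(range(100), 10))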

Advantages:

  • Easy
  • No installation required

Disadvantages:

  • Can be spammy
  • Hard to pinpoint error-prone locations
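
One way to soften the "spammy" drawback is Python's standard `logging` module, which lets you switch debug output on or off in one place. The following is a minimal sketch, not part of the original article: the logger name `find_max_demo` and the function name `find_max_logged` are made up for illustration, and the function mirrors `find_max` above.

```python
import logging
import random

# Show DEBUG messages; raise the level to logging.WARNING to silence them.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
logger = logging.getLogger("find_max_demo")  # hypothetical logger name


def find_max_logged(values):
    """Same logic as find_max, with debug messages instead of print calls."""
    current_max = 0  # assumes non-negative inputs, as in the original example
    logger.debug("Initial max is %s", current_max)
    for val in values:
        if val > current_max:
            logger.debug("New max %s replaces %s", val, current_max)
            current_max = val
    return current_max


result = find_max_logged(random.sample(range(100), 10))
```

The messages still have to be written by hand, so the second drawback (pinpointing error-prone locations) remains.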

Classical Way — PDB

We saw that print statements help us find relevant information about an issue, but they aren’t always enough to find the root cause. When we need something more powerful, it’s time to try Python’s built-in interactive debugger. …
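
As a rough illustration of what that can look like, the sketch below pauses the earlier `find_max` logic in pdb using the built-in `breakpoint()` (Python 3.7+). The function name `find_max_debug` and the `debug` flag are assumptions added here so the pause is opt-in; they are not taken from the article.

```python
def find_max_debug(values, debug=False):
    """Return the largest value; optionally pause in pdb on each new maximum."""
    current_max = 0  # assumes non-negative inputs, as in the original example
    for val in values:
        if val > current_max:
            if debug:
                # Opens the interactive pdb prompt. Common commands there:
                #   p val   print a variable
                #   n       run the next line
                #   c       continue until the next breakpoint
                #   q       quit the debugger
                breakpoint()
            current_max = val
    return current_max
```

Calling `find_max_debug([3, 9, 4], debug=True)` from a terminal or notebook cell should stop at the prompt the first time a larger value is found, with `val` and `current_max` available for inspection.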


Search In Practice- Approximate Nearest Neighbors


Nearest Neighbors Motivation

Today, as users consume more and more information from the internet at a moment’s notice, there is an increasing need for efficient ways to search. This is why “Nearest Neighbor” has become a hot research topic: it improves users’ chances of finding the information they are looking for in a reasonable time.

The use cases for “Nearest Neighbor” are endless, and it is used in many computer-science areas, such as image recognition, machine learning, and computational linguistics (1, 2 and more).


About

Eyal Trabelsi

Eyal is a data engineer at Salesforce with a passion for performance. His main areas of expertise are within data-intensive applications, improvement of process
