Battle of the Pandas: Why FireDuck Pandas Might Be Your New Best Friend

Let’s dive into a head-to-head comparison and see why FireDuck Pandas might just become your new go-to tool.

Manushi Mukhi
Accredian
4 min read · Jul 1, 2024


Data analysis is a critical skill in today’s data-driven world, and Python is one of the most popular programming languages for this purpose.

When it comes to data manipulation in Python, the pandas library is the go-to tool for many. But now there’s a new contender in town: FireDuck Pandas. In this article, we’ll compare FireDuck Pandas to classic pandas in a way that’s easy to understand and interesting to read, highlighting the key differences in speed, memory usage, and usability.

Meet the New Pandas With FireDucks

FireDucks, developed by an R&D team at NEC, is designed to speed up existing pandas code and maximize efficiency.

Highlights of FireDucks include:

  • Multithreaded execution that takes advantage of multiple CPU cores.
  • An integrated JIT compiler for optimized performance.
  • Full compatibility with the pandas API, with only a different import statement required (see the snippet after this list).
  • A lazy execution mode similar to Polars.
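
To make the compatibility point concrete, here is a minimal sketch of what switching looks like. It assumes FireDucks is installed (the PyPI package is named fireducks) and that printing a result is enough to trigger evaluation in lazy mode; the file path is just a placeholder.

# import pandas as pd            # before
import fireducks.pandas as pd    # after: the rest of the code stays the same

df = pd.read_csv('your_data.csv')   # placeholder path
print(df.head())                    # in lazy mode, work runs once output is needed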

Curious about how FireDucks performs compared to regular Pandas? Let’s explore its capabilities!

Commencing the Battle

To make the comparison concrete, we’ll generate a synthetic dataset and measure two things for the same set of operations: execution speed and peak memory usage, for both Normal Pandas and FireDuck Pandas.

Generating the Dataset

We’ll create a large synthetic dataset to simulate real-world data.

import pandas as pd
import numpy as np

# Generate a large synthetic dataset (1 million rows)
np.random.seed(42)
data = {
    'A': np.random.rand(1_000_000),
    'B': np.random.rand(1_000_000),
    'C': np.random.randint(0, 100, 1_000_000),
    'D': np.random.choice(['X', 'Y', 'Z'], 1_000_000)
}
df = pd.DataFrame(data)

# Save the dataset to a CSV file
df.to_csv('Data.csv', index=False)

Comparing Performance and Memory Usage

We’ll compare the two libraries on the following operations:

  • Reading data from CSV.
  • Performing a groupby operation.
  • Descriptive statistics.

First, we’ll compare the speed performance.

Speed Performance

Pandas Code

# Pandas
import pandas as pd
import time

# Reading data
start_time = time.time()
df = pd.read_csv('Data.csv')
read_time_pandas = time.time() - start_time

# Groupby operation
start_time = time.time()
grouped_df = df.groupby('D').mean()
groupby_time_pandas = time.time() - start_time

# Descriptive statistics
start_time = time.time()
desc_stats_pandas = df.describe()
desc_time_pandas = time.time() - start_time
print(read_time_pandas, groupby_time_pandas, desc_time_pandas)

FireDuck Pandas Code

# FireDuck Pandas
import fireducks.pandas as fpd
import time

# Reading data
start_time = time.time()
df_fd = fpd.read_csv('Data.csv')
read_time_fireduck = time.time() - start_time

# Groupby operation
start_time = time.time()
grouped_df_fd = df_fd.groupby('D').mean()
groupby_time_fireduck = time.time() - start_time

# Descriptive statistics
start_time = time.time()
desc_stats_fireduck = df_fd.describe()
desc_time_fireduck = time.time() - start_time
print(read_time_fireduck, groupby_time_fireduck, desc_time_fireduck)
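
One caveat worth adding here (not part of the original benchmark): time.time() measurements of a single run are noisy, and because FireDucks can defer execution, timing a statement in isolation may under-report the real cost. Below is a small timing helper as a sketch; it assumes that calling repr() on a result is enough to force materialization under FireDucks’ lazy mode.

import time

def time_op(fn, *args, repeats=3, **kwargs):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        repr(result)  # touch the result so lazy backends actually compute it
        best = min(best, time.perf_counter() - start)
    return best

# Example usage (works with either library):
# read_time_fireduck = time_op(fpd.read_csv, 'Data.csv')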

Memory Usage

We’ll use the memory_profiler package (pip install memory-profiler) to measure the peak memory usage of these operations.

Normal Pandas

import pandas as pd
from memory_profiler import memory_usage

# Reading data
mem_usage_pandas = memory_usage((pd.read_csv, ('Data.csv',)))
max_mem_usage_pandas = max(mem_usage_pandas)

# Groupby operation (measure the full groupby + mean, not just creating the lazy GroupBy object)
df = pd.read_csv('Data.csv')
mem_usage_pandas_groupby = memory_usage((lambda: df.groupby('D').mean(),))
max_mem_usage_pandas_groupby = max(mem_usage_pandas_groupby)

# Descriptive statistics
mem_usage_pandas_desc = memory_usage((df.describe,))
max_mem_usage_pandas_desc = max(mem_usage_pandas_desc)

FireDuck Pandas

import fireducks.pandas as fpd
from memory_profiler import memory_usage

# Reading data (note the trailing comma: the arguments must be passed as a tuple)
mem_usage_fireduck = memory_usage((fpd.read_csv, ('Data.csv',)))
max_mem_usage_fireduck = max(mem_usage_fireduck)

# Groupby operation (measure the full groupby + mean, as above)
df_fd = fpd.read_csv('Data.csv')
mem_usage_fireduck_groupby = memory_usage((lambda: df_fd.groupby('D').mean(),))
max_mem_usage_fireduck_groupby = max(mem_usage_fireduck_groupby)

# Descriptive statistics
mem_usage_fireduck_desc = memory_usage((df_fd.describe,))
max_mem_usage_fireduck_desc = max(mem_usage_fireduck_desc)
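
A practical note on top of the code above: memory_usage also accepts max_usage=True, which returns the peak directly, and the lazy-execution caveat from the timing section applies to memory as well, so the peak measured for fpd.read_csv alone can look unrealistically low if the read is deferred. The sketch below forces the frame to materialize inside the measured call; as before, it assumes repr() is enough to trigger evaluation.

from memory_profiler import memory_usage
import fireducks.pandas as fpd

def load_and_touch(path):
    # Read the CSV and touch the result so any deferred work actually runs
    frame = fpd.read_csv(path)
    repr(frame)
    return frame

# Depending on the memory_profiler version, max_usage=True returns either a
# float or a one-element list; normalize just in case.
peak = memory_usage((load_and_touch, ('Data.csv',)), max_usage=True)
peak_mb = peak if isinstance(peak, float) else max(peak)
print(f"Peak memory for FireDucks read + materialize: {peak_mb:.1f} MiB")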

Visualizing the Comparison

Now, let’s visualize the speed and memory usage comparisons.

import matplotlib.pyplot as plt

# Performance data
operations = ['Read CSV', 'Groupby Mean', 'Describe']
normal_pandas_times = [read_time_pandas, groupby_time_pandas, desc_time_pandas]
fireduck_pandas_times = [read_time_fireduck, groupby_time_fireduck,
                         desc_time_fireduck]

# Memory usage data
normal_pandas_mem = [max_mem_usage_pandas, max_mem_usage_pandas_groupby,
                     max_mem_usage_pandas_desc]
fireduck_pandas_mem = [max_mem_usage_fireduck, max_mem_usage_fireduck_groupby,
                       max_mem_usage_fireduck_desc]

# Plotting speed performance
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.bar(operations, normal_pandas_times, width=0.4,
        label='Normal Pandas', align='center')
plt.bar(operations, fireduck_pandas_times, width=0.4,
        label='FireDuck Pandas', align='edge')
plt.ylabel('Time (seconds)')
plt.title('Speed Performance Comparison')
plt.legend()

# Plotting memory usage
plt.subplot(1, 2, 2)
plt.bar(operations, normal_pandas_mem, width=0.4,
        label='Normal Pandas', align='center')
plt.bar(operations, fireduck_pandas_mem, width=0.4,
        label='FireDuck Pandas', align='edge')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage Comparison')
plt.legend()

# Displaying the plots
plt.tight_layout()
plt.show()

Comparison results: the charts produced above compare speed and memory usage for Normal Pandas vs. FireDuck Pandas.

Comparing the Contenders

Performance

Normal Pandas is single-threaded and performs well for datasets that fit comfortably in memory, making it ideal for small to medium-sized data. However, it can struggle with very large datasets. FireDuck Pandas uses multiple cores and a JIT compiler to run the same operations more efficiently, making it a better choice for big-data scenarios.

API and Usability

One of the strengths of Normal Pandas is its intuitive, easy-to-use API, which has made it enormously popular. FireDuck Pandas keeps the same API, so users familiar with Normal Pandas can switch with virtually no learning curve, as the snippet below illustrates.
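
To show what that means in practice, here is a sketch of a typical pandas-style pipeline on the synthetic dataset from earlier. It runs unchanged under either import; the only assumption is that these operations are covered by FireDucks’ pandas-compatible API, as the project advertises.

import fireducks.pandas as pd  # or: import pandas as pd

df = pd.read_csv('Data.csv')
summary = (
    df[df['C'] > 50]        # filter rows
    .groupby('D')['A']      # group by category
    .mean()                 # aggregate
    .sort_values()          # sort the result
)
print(summary)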

Community and Ecosystem

Normal Pandas has a vast user base and a rich ecosystem of libraries and resources. FireDuck Pandas is newer and still growing its community, but it benefits from compatibility with the existing pandas ecosystem.

Conclusion

In these tests, FireDuck Pandas delivered noticeably better performance than Normal Pandas on a large dataset, with broadly similar memory usage. While Normal Pandas remains a great choice for smaller datasets and is backed by a strong community, FireDuck Pandas is an excellent alternative for handling big-data scenarios more effectively.

With these insights, you can make an informed decision on which library to use based on your data size and performance requirements. Whether you stick with the tried-and-true Normal Pandas or explore the high-performance FireDuck Pandas, both libraries offer robust tools to tackle your data analysis needs.
