Introduction to FireDucks: Get performance beyond pandas with zero learning cost!
This article was written by Shigeru Kuroyanagi.
If you are a data scientist, you are probably familiar with pandas. It is one of the most popular libraries, and while I was looking for an easy way to speed it up, I discovered FireDucks and became interested in it.
1. Introduction
What is FireDucks?
FireDucks is a Python library developed by NEC to speed up data analysis written with pandas. It is said to draw on the high-performance programming technology and performance-tuning know-how that NEC has built up over more than 30 years of supercomputer development.
Comparison with pandas: Why FireDucks?
A pandas program can run into serious performance problems depending on how it is written. As a data scientist, however, I want to spend my time analyzing data rather than tuning code. It would be great if the library could reorder operations and speed up the program automatically. For example, if Process A => Process B is slow, it would run Process B => Process A instead (with the result guaranteed to be the same, of course). Data scientists are said to spend about 45% of their time preparing data, and while I was looking for a way to speed that up, I came across a module called FireDucks.
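To make the reordering idea concrete, here is a minimal sketch in plain pandas. The data and the column names (`user`, `score`) are made up for illustration. Sorting and then filtering gives the same rows as filtering and then sorting, but the second order sorts far fewer rows. This only illustrates the kind of rewrite described above; it says nothing about which rewrites FireDucks actually performs.

```python
import pandas as pd

# Hypothetical toy data; column names are made up for illustration.
df = pd.DataFrame({
    "user": ["a", "b", "c", "d", "e", "f"],
    "score": [50, 10, 40, 30, 60, 20],
})

# Process A => Process B: sort every row, then keep only the high scores.
sort_then_filter = df.sort_values("score")
sort_then_filter = sort_then_filter[sort_then_filter["score"] >= 40]

# Process B => Process A: keep the high scores first, then sort the smaller frame.
filter_then_sort = df[df["score"] >= 40].sort_values("score")

# Both orders give the same result; the second one sorts 3 rows instead of 6.
same = sort_then_filter.equals(filter_then_sort)
print(same)
```

On a frame with millions of rows, choosing the second order by hand is exactly the kind of tuning I would rather leave to the library.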
Purpose of this article
This article is an introduction to FireDucks, mainly covering the environment settings, installation, benefits and actual use of FireDucks.
2. FireDucks Basics
Environment
According to the FireDucks documentation, only Linux is supported. Since my main machine runs Windows, I tried it on WSL2 (Windows Subsystem for Linux), which lets you run Linux on Windows.
The environment I used is as follows.
- OS: Microsoft Windows 11 Pro
- Version: 10.0.22631 Build 22631
- System Model: Z690 Pro RS
- System Type: x64-based PC
- Processor: 12th Gen Intel(R) Core(TM) i3-12100, 3300 MHz, 4 Cores, 8 Logical Processors
- Baseboard Product: Z690 Pro RS
- Platform Role: Desktop
- Installed Physical Memory (RAM): 64.0 GB
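Since FireDucks targets Linux, it can help to confirm that the Python interpreter you are about to use really runs inside Linux or WSL rather than native Windows. The sketch below uses only the standard library. The check for "microsoft" in the kernel release string is a common heuristic for WSL, not an official API, and the function name is my own.

```python
import platform


def describe_platform() -> str:
    """Return 'linux', 'wsl', or 'other' for the current interpreter."""
    system = platform.system()
    release = platform.uname().release.lower()
    if system == "Linux" and "microsoft" in release:
        # Heuristic: WSL kernels usually report "microsoft" in the release string.
        return "wsl"
    if system == "Linux":
        return "linux"
    return "other"


print(describe_platform())
```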
Installing and Configuring FireDucks
Install WSL
I installed WSL by following the Microsoft documentation below; the Linux distribution is Ubuntu 22.04.1 LTS.
https://learn.microsoft.com/ja-jp/windows/wsl/install
Install FireDucks
Next, install FireDucks itself. Installation is very easy.
pip install fireducks
Installation takes a few minutes, since pyarrow, pandas and other libraries are installed along with FireDucks.
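After the install finishes, a quick way to confirm what ended up in the environment is to ask Python for the installed versions. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is my own, and it simply reports `None` for anything that is missing.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages mentioned in this article; any of them may be absent on your machine.
for name in ("fireducks", "pandas", "pyarrow"):
    print(name, installed_version(name))
```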
Brief usage and basic operation
Let's try it right away. You could start the python3 interpreter directly, but I use JupyterLab as my analysis environment, so I installed JupyterLab and tried it there.
I wanted to see how useful it is in practice, so I used the Video_Ads Engagement Dataset from Kaggle. The dataset is about 500 MB and contains 3 million video ad auctions.
Load and run data
I could run exactly the same code I had written for pandas. Loading was dramatically faster: in my environment it took 4 seconds with pandas, but only 74.5 ns with FireDucks. (FireDucks evaluates lazily, so this figure reflects the load call returning before the data is actually read.)
# Assumes the dataset has already been loaded into `df` (e.g. with pd.read_csv)
import fireducks.pandas as pd  # or: import pandas as pd

# 1. Analysis based on time period and creative duration
# Convert timestamp to date/time object
df['timestamp_converted'] = pd.to_datetime(df['timestamp'], unit='s')

# Define time period
def get_part_of_day(hour):
    if 5 <= hour < 12:
        return 'morning'
    elif 12 <= hour < 17:
        return 'afternoon'
    else:
        return 'evening'

# Add time period in new column
df['part_of_day'] = df['timestamp_converted'].apply(lambda x: get_part_of_day(x.hour))

# Calculate average creative duration by time period
df_duration_by_time = df.groupby('part_of_day')['creative_duration'].mean()
print(df_duration_by_time)
# 2. Campaign performance per advertiser
df_campaigns_per_advertiser = df.groupby('advertiser_id')['campaign_id'].nunique()
df_creatives_per_advertiser = df.groupby('advertiser_id')['creatives_id'].nunique()
print(df_campaigns_per_advertiser)
print(df_creatives_per_advertiser)
# 3. Language and website association (most common website per language)
df_common_website_per_language = df.groupby('placement_language')['website_id'].apply(lambda x: x.mode()[0])
print(df_common_website_per_language)
# 4. Analyze referrer information
def extract_domain(referrer):
    # If referrer is a float (e.g. NaN), return empty string
    if isinstance(referrer, float):
        return ''
    # Otherwise, extract domain name
    return referrer.split('/')[0]
df['referrer_domain'] = df['referrer_deep_three'].apply(extract_domain)
df_referrer_distribution = df['referrer_domain'].value_counts()
print(df_referrer_distribution)
All of this preprocessing and analysis took around 8 seconds with pandas, whereas FireDucks completed it within 4 seconds, a speed-up of almost 2x.
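If you want to reproduce this kind of comparison yourself, one thing to watch is that a lazily evaluating library may return before any work is done, so the timer has to include a step that really needs the values. Below is a minimal timing sketch under my own assumptions: the data is synthetic, the column names only mimic the ones above, and the function name is hypothetical. It runs with plain pandas as written; swapping the import is how I would compare the two libraries.

```python
import time

import pandas as pd  # swap for `import fireducks.pandas as pd` to compare


def timed_groupby(n_rows: int = 100_000):
    """Build synthetic data, run a groupby, and return (result_dict, seconds)."""
    # Synthetic stand-in for the ad dataset; the values are made up.
    df = pd.DataFrame({
        "advertiser_id": [i % 5 for i in range(n_rows)],
        "creative_duration": [float(i % 30) for i in range(n_rows)],
    })
    start = time.perf_counter()
    means = df.groupby("advertiser_id")["creative_duration"].mean()
    # Converting to a dict needs the actual values, so a lazily
    # evaluating backend cannot defer the work past the timer.
    result = means.to_dict()
    elapsed = time.perf_counter() - start
    return result, elapsed


result, elapsed = timed_groupby()
print(f"{len(result)} groups in {elapsed:.4f} s")
```

A single run like this is noisy, so for real measurements I would repeat it several times and look at the median.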
3. Benefits of using FireDucks
Improved performance
One of the most stressful things about pandas is the waiting: first while a large data set loads, and then again for complex operations such as groupby. FireDucks, on the other hand, uses lazy evaluation, so loading itself returns almost immediately and the work is done only where it is needed. I felt this significantly reduced the total waiting time.
As for other performance figures, the developers report speed-ups of up to 16 times over pandas in their official benchmarks. (I will compare the performance with various competing libraries next time.)
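For readers new to lazy evaluation, the toy class below shows the general idea in pure Python: operations are only recorded when you call them, and the data is read and processed when the result is finally requested. This is my own illustration with hypothetical names (`LazyColumn`, `collect`); it is not how FireDucks is implemented and does not use its API.

```python
from typing import Callable, List


class LazyColumn:
    """Toy lazily evaluated column: operations are recorded, not run."""

    def __init__(self, loader: Callable[[], List[int]]):
        self._loader = loader  # deferred "read" step
        self._ops: List[Callable[[List[int]], List[int]]] = []
        self.loads = 0  # counts how often the data was really read

    def map(self, fn: Callable[[int], int]) -> "LazyColumn":
        # Record the operation and return immediately.
        self._ops.append(lambda values: [fn(v) for v in values])
        return self

    def collect(self) -> List[int]:
        # Only now is the data loaded and every recorded step applied.
        self.loads += 1
        values = self._loader()
        for op in self._ops:
            values = op(values)
        return values


col = LazyColumn(lambda: [1, 2, 3]).map(lambda v: v * 10).map(lambda v: v + 1)
loads_before = col.loads  # nothing has been read yet
values = col.collect()    # the work happens here
print(loads_before, values, col.loads)
```

This is why the "load" step can appear to finish in nanoseconds: the cost moves to the point where the result is actually used.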
https://fireducks-dev.github.io/docs/benchmarks/
Zero learning cost
Being able to use exactly the same pandas notation without having to think about anything is a huge advantage. There are other libraries for accelerating data frames besides FireDucks, but their APIs take time to learn and are easy to forget.
For example, to add a column with polars, you have to write something like this.
# pandas
df["new_col"] = df["A"] + 1
# polars
df = df.with_columns((pl.col("A") + 1).alias("new_col"))
Almost no need to change existing code
I have several ETL jobs and other projects that use pandas, so it is attractive to get a performance improvement just by installing FireDucks and replacing the import statement.
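In practice, "replacing the import statement" can look like the sketch below. The `fireducks.pandas` import is the one described in the FireDucks documentation; the fallback to plain pandas is my own addition so the same script still runs on machines without FireDucks (for example on native Windows). The ETL step and its data are hypothetical.

```python
try:
    # Drop-in replacement described in the FireDucks documentation.
    import fireducks.pandas as pd
except ImportError:
    # Fall back to plain pandas where FireDucks is not installed.
    import pandas as pd

# Hypothetical ETL step; the values are made up for illustration.
raw = pd.DataFrame({
    "advertiser_id": [1, 1, 2, 2, 2],
    "campaign_id": [10, 11, 20, 20, 21],
})
campaigns = raw.groupby("advertiser_id")["campaign_id"].nunique()
summary = campaigns.to_dict()
print(summary)
```

Everything below the import is ordinary pandas code, which is the whole point: the rest of the project does not need to know which library is underneath.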
4. Example of using FireDucks for a project
Example of actual FireDucks usage
- ETL pipeline acceleration: If you have an existing ETL pipeline built on pandas, it is a good candidate for replacement.
- Faster batch processing: Are your batch jobs slow and finicky? If you have an existing batch process using pandas, you can expect it to run faster and cost less.
- EDA and analysis of data sets larger than 1 GB: Data volumes keep growing these days. In my experience, pandas often keeps you waiting a long time on data sets of this size. If processing time is a source of stress, FireDucks may be worth considering.
5. Community and Support
X (formerly Twitter)
https://twitter.com/fireducksdev
Official Documentation
6. Summary
I have summarized the strengths of FireDucks and how to install it. If you love pandas, I found FireDucks to be a good library that lets you carry over your pandas skills and benefit from the speed-ups. In the next article, I would like to look at it from a performance evaluation perspective.