Feather vs Pickle: A Comparative Analysis of Data Storage

Ankush Singh
3 min read · Jun 11, 2023


Feather vs Pickle format

Introduction

Data storage and retrieval are foundational to any data processing task. Among the many storage options available in Python, two formats come up often: Feather, a language-agnostic columnar format backed by Apache Arrow, and Pickle, Python’s own object serialization method.

In this blog post, we compare Feather and Pickle in the context of storing Pandas DataFrames, focusing on their performance in read/write operations. Let’s dive in!

Understanding Feather and Pickle

Feather, a part of Apache Arrow, provides a lightweight, fast, and easy-to-use binary file format for storing data frames. It uses the Arrow columnar memory format, enabling rapid access to data, even for large datasets.

On the other hand, Pickle is a Python-specific binary serialization format. It converts Python objects into a byte stream for storage or transmission. While often slower than Feather for large data frames, Pickle is versatile and can serialize almost any Python object.
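To make the "almost any Python object" point concrete, here is a small sketch using only the standard library. The nested dictionary is invented for illustration; it mixes tuples, sets, dates, and complex numbers, which a tabular format such as Feather is not designed to hold.

```python
import datetime
import pickle

# A nested structure mixing types that a tabular format cannot store directly.
original = {
    "name": "baseline",
    "layers": (64, 32),
    "tags": {"draft", "v1"},
    "created": datetime.date(2023, 6, 11),
    "scale": 1 + 2j,
}

# Serialize to a byte stream, then rebuild an equal object from those bytes.
payload = pickle.dumps(original, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(type(payload).__name__, restored == original)
```

The restored object is a new copy that compares equal to the original. As the Python documentation warns, only unpickle data that comes from a source you trust.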

Comparing Performance

To benchmark the performance of Feather and Pickle, we use a large dataset containing ten million records with three columns: ‘age’, ‘gender’, and ‘income’. We then time the write and read operations for both formats.

import time

import numpy as np
import pandas as pd
import pyarrow.feather as feather

# Simulate a large dataset: ten million rows, three columns
num_records = 10**7
df = pd.DataFrame({
    'age': np.random.randint(18, 100, size=num_records),
    'gender': np.random.choice(['Male', 'Female'], size=num_records),
    'income': np.random.uniform(30000, 80000, size=num_records)
})

# Save and load using Feather (Arrow format).
# time.perf_counter() is a monotonic, high-resolution clock suited to timing code.
write_start = time.perf_counter()
feather.write_feather(df, 'data.feather')
read_start = time.perf_counter()
df_feather = feather.read_feather('data.feather')
read_end = time.perf_counter()

print(f"Arrow Feather: Write Time - {read_start - write_start:.2f} seconds, Read Time - {read_end - read_start:.2f} seconds.")

# Save and load using Pickle
write_start = time.perf_counter()
df.to_pickle('data.pkl')
read_start = time.perf_counter()
df_pickle = pd.read_pickle('data.pkl')
read_end = time.perf_counter()

print(f"Pandas Pickle: Write Time - {read_start - write_start:.2f} seconds, Read Time - {read_end - read_start:.2f} seconds.")

Output

In our experiment, Feather generally outperformed Pickle for both write and read operations. A likely reason is Feather’s Arrow-based columnar layout, which stores the values of each column together. However, your results may vary with the size and column types of your data, your hardware, and the library versions you use, so it is worth benchmarking on your own workload.

Weighing the Pros and Cons

While speed is an essential factor, it’s not the only aspect to consider when choosing between Feather and Pickle.

Feather’s Advantages:

  • Speed: In the benchmark above, Feather was faster for a large data frame.
  • Language-Agnostic: Feather files can be read from any language with Apache Arrow support, such as Python, R, and C++.

Pickle’s Advantages:

  • Versatility: Pickle can serialize almost any Python object.
  • Integration: Pickle is part of Python’s standard library, so it requires no additional packages.

Conclusion

Feather’s speed makes it an excellent choice for large data frames, particularly in a multi-language environment, while Pickle’s ability to serialize almost any Python object can be invaluable when you need to store more than tabular data. As always, the choice between the two depends on the specific needs of your project.

We hope this comparison between Feather and Pickle provides a clear picture of when to use each format. Stay tuned for more insightful content around data processing in Python!

