Pandas DataFrame Storage Format Comparison
When we, as data professionals, work on projects, we have to store data as files on disk, whether as a checkpoint to continue our work later or to send our results to others.
There are many storage formats to choose from.
In this article, I will share my test results comparing the reading speed, writing speed, and file size of a Pandas DataFrame saved in different storage formats.
Here is a link to the Colab notebook that I used for the tests in this article.
Create a test DataFrame.
First, I created a test Pandas DataFrame that contains different types of data: datetime, string, float, and integer columns.
import pandas as pd
import numpy as np
import string

# Config DF
df_length = 10**6
start_date = '2023-01-01'
all_string = list(string.ascii_letters + string.digits)
string_length = 10**1
min_number = 0
max_number = 10**3

# Create Columns
date_col = pd.date_range(start=start_date, periods=df_length, freq='H')
str_col = [''.join(np.random.choice(all_string, string_length)) for i in range(df_length)]
float_col = np.random.rand(df_length)
int_col = np.random.randint(min_number, max_number, size=df_length)

# Create DataFrame
df = pd.DataFrame({'date_col': date_col,
                   'str_col': str_col,
                   'float_col': float_col,
                   'int_col': int_col})
df.info()
df.head()
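Running df.info() should print something close to the following (the memory figure carries a “+” because usage for the string column is only estimated, and the exact layout varies slightly between pandas versions):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   date_col   1000000 non-null  datetime64[ns]
 1   str_col    1000000 non-null  object
 2   float_col  1000000 non-null  float64
 3   int_col    1000000 non-null  int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 30.5+ MB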
Test storing in different formats.
Next, I created a test function that writes the DataFrame to a file, reads it back, and measures the elapsed time and the file size. (Note that Parquet, ORC, and Feather support requires pyarrow, and HDF5 requires PyTables.)
import time
import os

def check_read_write_size(df, file_name, compression=None):
    file_format = file_name.split('.')[-1]
    # Write
    begin = time.time()
    if file_name.endswith('.csv'): df.to_csv(file_name, index=False, compression=compression)
    elif file_name.endswith('.parquet'): df.to_parquet(file_name, compression=compression)
    elif file_name.endswith('.pickle'): df.to_pickle(file_name, compression=compression)
    elif file_name.endswith('.orc'): df.to_orc(file_name)
    elif file_name.endswith('.feather'): df.to_feather(file_name)
    elif file_name.endswith('.h5'): df.to_hdf(file_name, key='df')
    write_time = time.time() - begin
    # Read
    begin = time.time()
    if file_name.endswith('.csv'): pd.read_csv(file_name, compression=compression)
    elif file_name.endswith('.parquet'): pd.read_parquet(file_name)
    elif file_name.endswith('.pickle'): pd.read_pickle(file_name, compression=compression)
    elif file_name.endswith('.orc'): pd.read_orc(file_name)
    elif file_name.endswith('.feather'): pd.read_feather(file_name)
    elif file_name.endswith('.h5'): pd.read_hdf(file_name)
    read_time = time.time() - begin
    # File Size
    file_size_mb = os.path.getsize(file_name) / (1024 * 1024)
    return [file_format, compression, read_time, write_time, file_size_mb]
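For example, a single call writes the file, reads it back, and returns the format, compression, read and write times (in seconds), and file size (in MB):
# A single test case: Parquet with snappy compression
check_read_write_size(df, 'df.parquet', compression='snappy')
# -> ['parquet', 'snappy', <read_time>, <write_time>, <file_size_mb>]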
Then, I ran the function over a list of test cases and stored the results in another Pandas DataFrame.
test_case = [
    ['df.csv', 'infer'],
    ['df.csv', 'gzip'],
    ['df.pickle', 'infer'],
    ['df.pickle', 'gzip'],
    ['df.parquet', 'snappy'],
    ['df.parquet', 'gzip'],
    ['df.orc', 'default'],
    ['df.feather', 'default'],
    ['df.h5', 'default'],
]

result = []
for file_name, compression in test_case:
    result.append(check_read_write_size(df, file_name, compression=compression))

result_df = pd.DataFrame(result, columns=['format', 'compression', 'read_time', 'write_time', 'file_size'])
result_df
Test results.
The chart and table below show the results of the test.
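If you want to re-create the chart from result_df yourself, here is a minimal sketch, assuming matplotlib is available (my own plotting code, not part of the test itself):
import matplotlib.pyplot as plt

# One bar chart per metric; label each bar with format + compression
labels = result_df['format'] + ' (' + result_df['compression'] + ')'
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, metric in zip(axes, ['read_time', 'write_time', 'file_size']):
    ax.bar(labels, result_df[metric])
    ax.set_title(metric)
    ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()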
We can summarize the popular storage formats as follows.
CSV
- Uncompressed CSV has the largest file size.
- Compressed CSV is quite small, but still not the smallest.
- CSV has the slowest reading speed.
- CSV's writing speed is nearly the slowest as well.
Pickle
- Overall, it performs reasonably well.
- However, compressed Pickle has the slowest writing speed of all.
Feather
- Feather has the fastest reading and writing speeds, and its file size is reasonable.
ORC
- ORC produces the smallest files of all the formats.
- Its reading and writing speeds are very fast, close to the fastest.
Parquet
- Overall, Parquet is fast and compact, but it is neither the fastest nor the smallest.
Conclusion.
So, given these results, should we all switch to ORC or Feather and stop using CSV? The answer is…
“IT DEPENDS ON YOUR ECOSYSTEM”.
If you are working on a solo project, then simply picking the fastest or smallest format makes perfect sense.
But most of the time, we have to work with others, so there are more factors to consider than just speed and size. For example:
- Uncompressed CSV may be slow and produce the largest files, but it is very easy to exchange when you need to send your data to another system.
- Parquet is bigger and slower than ORC, but it is supported by a very wide range of ecosystems (see the sketch after this list).
- ORC is slower than Feather, but it is widely supported by data lake tools.
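As an illustration of the ecosystem point, a Parquet file written by Pandas can be queried directly by many other engines without going through Pandas at all. A minimal sketch, assuming DuckDB is installed (my example, not from the original test):
import duckdb

# Query the Parquet file from the test directly; no Pandas round-trip needed
duckdb.sql("SELECT COUNT(*) AS rows, AVG(float_col) AS avg_float FROM 'df.parquet'").show()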
Hopefully, this article helps you choose the right format for your next project.