Which File Format is Best for Your Data Science Project?

Rushiraj Gadhvi
6 min read · Feb 18, 2023


Photo by Viktor Talashuk on Unsplash

The selection of an appropriate file format is a critical aspect of data science, as it can significantly impact the efficiency of storing, accessing, and manipulating data.

In this blog, we will look at some common and not-so-common file formats and put them to the test.

I will measure the time it takes to read and write data in each of the formats listed below, and record the amount of storage each uses. The dataset consists of used car data obtained from vehicle listings on Craigslist.org, with a size of 1.5 GB. These tests were conducted on an M1 MacBook Air. (Link to Dataset)
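Before diving in, here is a minimal sketch of the setup the snippets below assume. The variables df_clean and datadir come from the measurement code; the folder path and CSV file name are placeholders for wherever you saved the downloaded dataset.

import os
import timeit

import pandas as pd

datadir = "data"  # hypothetical path to the folder holding the dataset
# "vehicles.csv" is a placeholder name for the downloaded Craigslist listings
df_clean = pd.read_csv(os.path.join(datadir, "vehicles.csv"))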

1. XML (Extensible Markup Language)

XML is a text-based file format for storing and exchanging structured data. It uses tags to define elements and attributes to describe their properties. XML is widely used in web applications and data exchange formats. It is human-readable and can be easily parsed using programming languages such as Python and Java. However, XML can result in larger file sizes than binary formats, making it less efficient for storing and transferring large datasets.

code used to measure:

# Time one write, then three consecutive reads (timeit with number=3
# returns the total across all runs), and record the file size in bytes.
t_xml_write = timeit.timeit("df_clean.to_xml(datadir + '/vehicle.xml')", number=1, globals=globals())
t_xml_read = timeit.timeit("pd.read_xml(datadir + '/vehicle.xml')", number=3, globals=globals())
size = os.path.getsize(datadir + '/vehicle.xml')

output:

write : 28.964869250019547 sec
read : 34.683035082998686 sec
size : 285330571 B

Similarly, we measured the same three factors for each of the formats below, following the pattern sketched next.
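The per-format measurements all follow the same shape, so they can be wrapped in a small helper. This is a sketch rather than the original benchmarking code:

def benchmark(write_stmt, read_stmt, path):
    # One timed write, three timed reads (timeit returns the total),
    # and the size of the resulting file in bytes.
    t_write = timeit.timeit(write_stmt, number=1, globals=globals())
    t_read = timeit.timeit(read_stmt, number=3, globals=globals())
    return t_write, t_read, os.path.getsize(path)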

2. Parquet

Parquet is a columnar file format developed for big data processing. It is designed to store and retrieve data efficiently by partitioning and compressing columns of data. Parquet is widely used in big data processing frameworks such as Apache Hadoop and Apache Spark. It is suitable for large datasets and supports complex data types such as nested structures and arrays. However, Parquet requires specialized software for reading and writing, which can be a disadvantage for data sharing and collaboration.

write : 0.6625609590555541 sec
read : 0.6355639159446582 sec
size : 16028344 B
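For reference, a sketch of the corresponding pandas calls; Parquet support requires pyarrow or fastparquet to be installed:

df_clean.to_parquet(datadir + '/vehicle.parquet')  # compressed with snappy by default
df_parquet = pd.read_parquet(datadir + '/vehicle.parquet')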

3. Feather

Feather is a lightweight and efficient file format for storing and exchanging data frames. It is designed to be fast to read and write, with low overhead and compatibility across multiple programming languages, including R and Python. Feather stores data in a binary format, making it smaller and faster to read and write than text-based formats such as CSV. However, Feather may not be suitable for storing complex data types or datasets with more than one data frame.

write : 0.6625609590555541 sec
read : 0.500286458991468 sec
size : 43255130 B
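A sketch of the equivalent Feather calls; pandas relies on pyarrow for Feather support:

df_clean.to_feather(datadir + '/vehicle.feather')  # binary, requires a default index
df_feather = pd.read_feather(datadir + '/vehicle.feather')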

4. CSV (Comma Separated Values)

CSV is a simple and widely used file format for storing and exchanging tabular data. It separates values in a table using commas and stores each row on a new line. CSV is supported by almost all spreadsheet software, programming languages, and databases. It is easy to read, edit, and convert to other file formats, and it takes up less storage space than verbose text formats such as XML and JSON. However, CSV has limitations, such as the inability to store complex data types like images and audio, and difficulty handling missing or inconsistent data.


write : 3.1790936249890365 sec
read : 2.1943200839450583 sec
size : 75330898 B
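A sketch of the CSV round trip in pandas; index=False keeps the row index from being written as an extra column:

df_clean.to_csv(datadir + '/vehicle.csv', index=False)
df_csv = pd.read_csv(datadir + '/vehicle.csv')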

5. JSON (JavaScript Object Notation)

JSON is a lightweight and versatile file format for storing and exchanging structured data. It is a text-based format that uses key-value pairs to represent data objects. JSON is easy to read, write, and parse using programming languages such as Python and JavaScript, and it is a popular format for storing data collected from APIs and web applications. However, JSON may not be suitable for large or complex datasets, as it can produce large files and become slow to parse.

write : 1.0698593749548309 sec
read : 11.388405709003564 sec
size : 171589259 B
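A sketch of the JSON round trip; DataFrame.to_json defaults to the "columns" orientation, which maps each column name to a dict of row values:

df_clean.to_json(datadir + '/vehicle.json')
df_json = pd.read_json(datadir + '/vehicle.json')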

6. HDF5 (Hierarchical Data Format)

HDF5 is a binary file format for storing and sharing scientific data. It is designed to store large and complex datasets and supports compression, parallel I/O, and data chunking. HDF5 is widely used in scientific research, such as astronomy, bioinformatics, and physics. It also supports metadata storage and versioning, making it useful for long-term data archiving. However, HDF5 requires specialized software for reading and writing, and its complexity can make it difficult for beginners to use.

write : 0.5861069170059636 sec
read : 1.0490607499959879 sec
size : 77031920 B
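A sketch of the HDF5 round trip; pandas needs the PyTables package, and each dataset inside the file is addressed by a key:

df_clean.to_hdf(datadir + '/vehicle.h5', key='vehicles', mode='w')
df_hdf = pd.read_hdf(datadir + '/vehicle.h5', key='vehicles')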

7. Pickle

Pickle is Python's native serialization format, used to store Python objects in binary form. It is highly flexible and can store complex data structures and objects, making it well suited to machine learning models and other advanced data analysis applications. However, Pickle is only compatible with Python, so it cannot be used to share data with other programming languages.

write : 0.3262063749716617 sec
read : 0.5055029590148479 sec
size : 72680026 B
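A sketch of the Pickle round trip; note that unpickling executes arbitrary code, so only load pickle files from sources you trust:

df_clean.to_pickle(datadir + '/vehicle.pkl')
df_pickle = pd.read_pickle(datadir + '/vehicle.pkl')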

8. Excel

Excel is a proprietary file format developed by Microsoft for storing spreadsheet data. It is a powerful tool for data analysis and visualization and supports advanced formatting and calculations. Excel is widely used in business and finance for its familiarity and user-friendly interface. However, Excel files can be large and slow to load, making the format inefficient for large datasets. Additionally, the format is proprietary and is best supported by Microsoft's own software, which can be a disadvantage for collaborative work.

write : 149.02949962503044 sec
read : 203.35776358301518 sec
size : 56452335 B
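A sketch of the Excel round trip; pandas needs an engine such as openpyxl to handle .xlsx files:

df_clean.to_excel(datadir + '/vehicle.xlsx', index=False)
df_excel = pd.read_excel(datadir + '/vehicle.xlsx')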

Plotting the data

When we plot a bar graph of file formats against their read times, it becomes apparent that HDF5, Pickle, Parquet, and Feather require significantly less time than traditional formats such as Excel, JSON, XML, and CSV. Even with the slowest formats excluded from the graph, traditional formats such as CSV still take more than three times as long to read as modern formats like Parquet and Feather, which finish in under a second.

On the write-time chart, Excel and XML dominate the scale, so the differences among the remaining formats are hardly noticeable. In terms of file size, Parquet is noticeably smaller than the other formats, while XML has the largest storage footprint, followed by JSON.

When we exclude XML and Excel from the write-time comparison, we can see that Pickle, HDF5, Parquet, and Feather are notably faster than the rest and similar to one another. Moreover, Parquet remains the most space-efficient format.
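The charts themselves take only a few lines of matplotlib. This sketch plots the read times reported above (rounded); it is an illustration, not the original plotting code:

import matplotlib.pyplot as plt

# Read times (seconds) taken from the results above, rounded
read_times = {
    'Feather': 0.50, 'Pickle': 0.51, 'Parquet': 0.64, 'HDF5': 1.05,
    'CSV': 2.19, 'JSON': 11.4, 'XML': 34.7, 'Excel': 203.4,
}

fig, ax = plt.subplots()
ax.bar(list(read_times), list(read_times.values()))
ax.set_ylabel('read time (s)')
ax.set_title('Read time by file format')
plt.show()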

Conclusion

Depending on the nature of the data and the requirements of the project, certain file formats may be more suitable than others. However, determining the ideal format can be influenced by various factors that are specific to each project, such as the size of the dataset, the complexity of the data, and the tools used for data analysis.

Personally, Parquet is my go-to format for data science projects.

Keep learning, keep coding!
