Using HDF5 with Python

Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5), data models, and libraries designed to handle and manage extremely large amounts of data and complex data collections. It is widely used in several organisations and was in use long before anyone was using the terms Big Data, NoSQL or open source!

I started using HDF5 when I was spending a lot of time reading and writing large amounts of data (in the millions) to an SQL database. HDF5 was of great help to me in saving and loading large datasets. It is really fast when compared to other file formats such as .xls, .csv, etc.

Here I will be explaining how to use HDF5 with the Python programming language. Several research projects use Python, and adding HDF5 to it would be an additional benefit. So, let's start!!

OPERATING SYSTEM USED

Ubuntu 16.04

SOFTWARE REQUIRED

Python 2.7

DEPENDENCIES

The following libraries are required for using HDF5 with Python:

  1. Pandas
  2. Tables

Pandas is a Python library that is widely used for data analysis, and it comes with HDF5 support with the help of an additional library named Tables (PyTables).

Note: In my next story, I will explain how to use HDF5 with Windows. Installation is a bit complicated.

Let's begin!!

Python is installed by default in Ubuntu. So there is no need to install Python again. The dependencies can be installed from the terminal as follows:

$ sudo pip install pandas
$ sudo pip install tables

Note: If the above step didn't work for pandas, you can install pandas the following way:

$ sudo apt-get install python-pandas

Once installed, you can check whether HDF5 is working by typing the following in the Python shell / terminal:
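A minimal sketch of such a check, assuming both packages were installed under their usual import names (pandas and tables):

```python
# Importing both packages is enough to confirm the installation.
import pandas as pd
import tables

# Print the installed versions as a quick sanity check.
print(pd.__version__)
print(tables.__version__)
```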

If the above code works without any error, then you have installed pandas and tables correctly.

Now we need to import some libraries required to work with HDF5. The code is as follows:
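One plausible set of imports is sketched here. The aliases np and pd are just the common convention, not something this post prescribes:

```python
import numpy as np   # used to generate random sample data
import pandas as pd  # provides DataFrame, HDFStore and read_hdf
```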

Then we will use pandas to create a dataframe, which will serve as the data that we are going to save. Here we use numpy to generate random numbers. Numpy gets installed along with pandas.
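As a rough sketch, a dataframe with 5 rows and the columns A, B and C (matching the store listing shown later) could be built like this. The variable name df and the choice of np.random.rand are my assumptions:

```python
import numpy as np
import pandas as pd

# 5 rows x 3 columns of uniform random numbers in [0, 1)
df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
print(df)
```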

Now let's save the dataframe to the HDF5 file:
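A sketch of the save step follows. The file name dataset.h5 and the key d1 come from the store listing in this post. The arguments format='table' and data_columns=True are inferred from that listing (frame_table, dc->[A,B,C]), and mode='w' is my addition so that the example always starts from an empty file:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])

# Open (or create) the HDF5 file; mode='w' starts from an empty file.
store = pd.HDFStore('dataset.h5', mode='w')

# format='table' makes the dataset appendable;
# data_columns=True stores every column as a queryable data column.
store.put('d1', df, format='table', data_columns=True)
```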

This doesn't save using the default format; it saves as a frame_table. The advantage of using it is that we can later append values to the dataframe. The trade-off is speed: it is slower than the default format.

Viewing the store:
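One way to get such a listing is store.info(). In recent pandas versions print(store) shows only the class and file path, so this sketch asks for the details explicitly. The setup lines just recreate the store described above so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Setup: recreate the store with one appendable dataset 'd1'.
df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
store = pd.HDFStore('dataset.h5', mode='w')
store.put('d1', df, format='table', data_columns=True)

# info() returns a text summary with one line per key.
print(store.info())
```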

<class 'pandas.io.pytables.HDFStore'>
File path: dataset.h5
/d1 frame_table (typ->appendable,nrows->5,ncols->3,indexers->[index],dc->[A,B,C])

We can even access the dataframe directly from the HDF5 file:
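A short sketch of that lookup, again with its own setup. The name loaded is only for illustration:

```python
import numpy as np
import pandas as pd

# Setup: a store holding the dataset 'd1'.
df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
store = pd.HDFStore('dataset.h5', mode='w')
store.put('d1', df, format='table', data_columns=True)

# Dictionary-style access reads the dataset back as a DataFrame;
# store.get('d1') does the same thing.
loaded = store['d1']
print(loaded)
```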

Since we used the table format, we can append values or a dataframe to the already existing dataframe:
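A sketch of an append, assuming the new rows have the same columns as the stored ones. The name more is hypothetical:

```python
import numpy as np
import pandas as pd

# Setup: a store holding an appendable dataset 'd1' with 5 rows.
df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
store = pd.HDFStore('dataset.h5', mode='w')
store.put('d1', df, format='table', data_columns=True)

# Append 5 more rows with the same columns to the existing dataset.
more = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
store.append('d1', more, format='table', data_columns=True)

print(len(store['d1']))  # number of rows now stored under 'd1'
```

Note that the appended rows keep their own index values, so the index of the stored dataframe is no longer unique after this step.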

To close the store:
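Closing is a single call. The sketch below also prints is_open to confirm the file handle has been released:

```python
import numpy as np
import pandas as pd

# Setup: open a store and write something to it.
store = pd.HDFStore('dataset.h5', mode='w')
store.put('d1', pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C']),
          format='table')

# Flush pending writes and release the file.
store.close()
print(store.is_open)
```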

I will be explaining two different methods for reading an HDF5 file. I use and would also recommend the first method:

  1. Method 1: using HDFStore()
  2. Method 2: using pd.read_hdf()

Method 2 will not work without a key if the HDF5 file has multiple datasets inside. It will raise a ValueError stating that the HDF file contains multiple datasets. Passing the key of the dataset you want to read_hdf() avoids the error.
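Both methods can be sketched side by side. The setup writes a small file first so the snippet runs on its own. The names from_store and from_read are hypothetical, and the store is closed before read_hdf() is called because that function opens the file again in read-only mode:

```python
import numpy as np
import pandas as pd

# Setup: write a small file with a single dataset 'd1'.
df = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
df.to_hdf('dataset.h5', key='d1', mode='w', format='table')

# Method 1: open the file as a store and look datasets up by key.
store = pd.HDFStore('dataset.h5')
from_store = store['d1']
store.close()

# Method 2: read one dataset straight into a DataFrame.
# Passing key= explicitly also works when the file holds several datasets.
from_read = pd.read_hdf('dataset.h5', key='d1')

print(from_store.shape, from_read.shape)
```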

Now that we have loaded the HDF5 file using method 1, let's add more dataframes using the default format:
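A sketch of this step follows. The shapes (7 x 4 and 14 x 3) are taken from the store listing in this post, while the column names and the random data are placeholders of mine:

```python
import numpy as np
import pandas as pd

# Setup: a file that already holds the appendable dataset 'd1'.
pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C']).to_hdf(
    'dataset.h5', key='d1', mode='w', format='table', data_columns=True)

# The default mode 'a' keeps the existing contents of the file.
store = pd.HDFStore('dataset.h5')

# No format argument is given, so these use the default (fixed) format.
store['d2'] = pd.DataFrame(np.random.rand(7, 4), columns=['A', 'B', 'C', 'D'])
store['d3'] = pd.DataFrame(np.random.rand(14, 3), columns=['X', 'Y', 'Z'])

print(store.keys())
```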

Let's look at the store:

<class 'pandas.io.pytables.HDFStore'>
File path: dataset.h5
/d1 frame_table (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/d2 frame (shape->[7,4])
/d3 frame (shape->[14,3])

We won't be able to append values or dataframes to d2 and d3 because we added them using the default format.

We can access the individual dataframes from the HDF5 store as follows:
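Besides looking up a single key, it can be handy to walk over all keys in the file. A sketch, with a small two-dataset store as setup; the names d2 and shapes are only for illustration:

```python
import numpy as np
import pandas as pd

# Setup: one appendable and one fixed-format dataset.
store = pd.HDFStore('dataset.h5', mode='w')
store.put('d1', pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C']),
          format='table')
store['d2'] = pd.DataFrame(np.random.rand(7, 4), columns=['A', 'B', 'C', 'D'])

# Look up one dataset by key.
d2 = store['d2']

# Or walk over every key in the file and collect each frame's shape.
shapes = {key: store[key].shape for key in store.keys()}
print(shapes)
```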

Finally, let's close the HDF5 store.
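The explicit call is store.close(). As an alternative that I find convenient (not something the steps above rely on), HDFStore can also be used in a with-block, which closes the file automatically. The name reader is hypothetical:

```python
import numpy as np
import pandas as pd

store = pd.HDFStore('dataset.h5', mode='w')
store['d2'] = pd.DataFrame(np.random.rand(7, 4), columns=['A', 'B', 'C', 'D'])
store.close()  # flushes pending writes and releases the file

# Alternative: a with-block closes the store automatically on exit.
with pd.HDFStore('dataset.h5', mode='r') as reader:
    d2 = reader['d2']

print(store.is_open, reader.is_open)
```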

You can get the notebook here.

Happy coding!!!