Toolbox Talk: Storing data using HDF5 files

As an early-career researcher, I am still figuring out the best way to get stuff done. Every day it seems as though my to-do list gets longer and longer: for every item I tick off, two more take its place, and I have no doubt it will only get worse. I am all about finding the best, most efficient and connected technologies to help alleviate some unnecessary legwork, so I have decided to write more on the programs, apps and integrations that I use every day in this Toolbox series. This post focuses on how I store and organise the currency of science: big data!


Data: the invisible bits and bytes of modern life.

Data is an invisible and yet fundamental driver of our digital life in 2019, and the amount of data worldwide is growing every single day. In fact, the International Data Corporation (IDC) estimates the amount of data in the world will grow from an estimated 33 zettabytes (one zettabyte is equivalent to a trillion gigabytes) in 2018 to 175 zettabytes by 2025, an almost unfathomable amount of bits and bytes. Every industry is grappling with the growing mountain of data, and science is no different. Scientists use data as the fuel that powers insight, discovery, and innovation. This demands new infrastructure from the institutes in which we gather data; but it also demands new approaches to the way we as scientists collect, store and analyse data.

The global datasphere is predicted to reach 175 zettabytes within a decade.
Source: Data Age 2025

During the first five years of my university career, everything I needed for my degree fit safely on a single 16 GB USB stick. Then I went and learnt how to do single-molecule microscopy during the second year of my PhD, and suddenly I was generating 16 GB of data in a single experiment. To cope with this rapid explosion, I invested in the hardware needed to store this data as text or csv files and stumbled my way through establishing a folder-based filing system.

It was during this stint early in my PhD that I first got a taste of programming to deal with this data (for more, check out some of my other programming posts here). However, this still centred on csv/text files for importing data, calculating results, and exporting the final findings. This meant that the data along the way, including input, intermediate and final files, could still be opened in human-friendly programs such as Microsoft Excel, which kept supervisors and collaborators without programming experience comfortable.

Recently, however, this storage pattern was no longer adequate for the size of the results I was dealing with. Enter: the HDF5 format. If you’ve never heard of this format before, it’s somewhat similar to an Excel document, without the proprietary software tag. Read on for more about what it is, why it’s useful for those dealing with large datasets, and how I implemented a storage workflow in HDF5 format using Python.

What on earth is HDF5?

Hierarchical Data Format (HDF) is a collection of file formats designed to store large amounts of data in an organised manner. Similar to the way .txt refers to text files or .pdf refers to Portable Document Format files, HDF files carry extensions in the style of .hdf, .hdf5 and .h5. The ongoing development and accessibility of the HDF file format is maintained by the non-profit organisation “The HDF Group”, meaning the tools to store and use HDF files will never rely on proprietary (and often expensive!) software.

More specifically, HDF5 files consist of Datasets that can store arrays of data (think individual sheets in an Excel document), Groups which can store datasets or other groups (think a folder of Excel documents, or folders of folders), and metadata consisting of mapped key-value pairs for attributes of the data (think a detailed description notes page for each sheet/document).


An example HDF5 file structure which contains groups, datasets and associated metadata. Image adapted from neonscience
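To make this structure concrete, here is a minimal sketch using the h5py package (one of several Python bindings, discussed further below). The file, group, dataset and attribute names are purely illustrative:

import h5py
import numpy as np

# Groups act like folders; datasets store arrays; attributes hold metadata
with h5py.File('experiment.h5', 'w') as f:
    grp = f.create_group('microscopy')               # a 'folder' within the file
    dset = grp.create_dataset('intensities', data=np.random.rand(100, 3))  # a 'sheet'
    grp.attrs['instrument'] = 'TIRF microscope'      # metadata attached to the group
    dset.attrs['exposure_ms'] = 50                   # metadata attached to the dataset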

Pros and Cons of the HDF5 format

As with anything, there are good bits and bad bytes about using HDF5 to store data.

Some of the Cons:

  • Unable to easily open HDF5 files with Excel or Notepad (although there are some tools being developed that could help overcome this)
  • No inbuilt calculation or manipulation options
  • Not yet widely used by researchers in the life sciences, including many senior researchers, making it difficult to share data with collaborators unfamiliar with the format

A few of the Pros:

  • Hierarchical format allows data to be stored logically in a single file with folder-like architecture
  • Allows pre-processed data, such as date-time data, to be stored efficiently without losing the effect of the preprocessing, as would otherwise happen in .csv format (see the sketch after this list)
  • Fast. Compared to other file formats, reading and writing HDF5 files is speedy. For example, writing to HDF5 can be some 16 times faster than writing to a simple csv file, not to mention the extra overhead that comes with Excel documents. For a comparison with other common formats, check out the Pandas docs
  • Storing metadata within the file architecture ensures that these attributes are accessible to anyone wanting or needing to access the data at a later date, and doesn’t rely on bundling additional “description” files to make sense of all the parameters and conditions under which the data was collected.
  • There are advanced options to create datasets that can be edited (rows added etc.) and others that are read-only. This provides an extra level of flexibility that allows, for example, intermediate datasets to be appended to while making sure the raw data is not inadvertently changed.
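To see that preprocessing point in action, here is a small sketch (file names are illustrative): a DatetimeIndex survives the round trip through HDF5, but is flattened back to plain strings by csv.

import pandas as pd

# A dataframe indexed by parsed datetime values
df = pd.DataFrame({'signal': [0.1, 0.5, 0.9]},
                  index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03']))

# Round trip through HDF5: the DatetimeIndex survives intact
df.to_hdf('roundtrip.h5', key='df', mode='w')
print(pd.read_hdf('roundtrip.h5', 'df').index.dtype)   # datetime64[ns]

# Round trip through csv: dates come back as plain strings
df.to_csv('roundtrip.csv')
print(pd.read_csv('roundtrip.csv', index_col=0).index.dtype)   # object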

Sounds great… but how do I use HDF5 files?

If you have no experience with programming, you can get started by downloading HDFView, freeware available from the HDF Group that allows you to open and edit HDF5 files. In the interest of full disclosure, I have never used this software myself, but it appears to provide a graphical user interface much like Excel (without the price tag!).

Example screen capture of HDFView in action.

If you have a programming language of choice, chances are there are bindings for the HDF5 API – my tool of choice is, of course, Python! Even within Python, however, there are a few choices for how to interface with the HDF5 machinery. Given that I use Pandas most often for data wrangling, it makes sense to leverage Pandas’ built-in support. This includes a method provided by the dataframe itself, which allows dataframes to be written directly to HDF5 files:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df.to_hdf('data_filename.h5', key='Key_1', mode='w')

This method also allows another object, such as a Series s, to be written to the same file under a second key:

s = pd.Series([1, 2, 3, 4])
s.to_hdf('data_filename.h5', key='Key_2')

In this case, the hierarchy is quite simple: two datasets stored within a single file. To get your data back, it is simple to read it into a dataframe again:

new_df = pd.read_hdf('data_filename.h5', 'Key_1')
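If you ever forget which keys a file contains, HDFStore can also be used as a context manager to peek inside (a small sketch using the file from above):

with pd.HDFStore('data_filename.h5') as store:
    print(store.keys())   # ['/Key_1', '/Key_2']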

For storing multiple dataframes in this manner (i.e. as key-dataframe pairs), I have created a utility function using HDFStore (another interface provided by Pandas) that collects dataframes which have been loaded into a dictionary, then saves them to a single HDF5 file:

def dict_to_h5(filename, dictionary, **kwargs):
    # Save each key-dataframe pair to a single HDF5 file,
    # attaching any keyword arguments as metadata on each dataset
    store = pd.HDFStore(filename)
    for key, df in dictionary.items():
        store.put(key, df)
        store.get_storer(key).attrs.metadata = kwargs
    store.close()
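As a usage example (the dataframes and metadata values here are hypothetical), the function takes a dictionary of dataframes plus any metadata as keyword arguments, and the metadata can be read back via the same storer attribute:

dataframes = {'raw': df, 'normalised': df / df.max()}
dict_to_h5('results.h5', dataframes, date='2019-06-01', operator='me')

# Reading the metadata back out again
with pd.HDFStore('results.h5') as store:
    print(store.get_storer('raw').attrs.metadata)   # {'date': '2019-06-01', 'operator': 'me'}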

NB: using HDFStore in Pandas requires PyTables v3.0.0 or higher. To update an existing installation, run the “pip install --upgrade tables” command in your terminal.

These are both simple examples in which there is a ‘flat’ data storage pattern. This is great for storing a few large dataframes, or sequentially processed intermediate dataframes for a single result. In most cases, however, you will find it useful to introduce an additional level of hierarchical organisation, similar to the folder storage system we are familiar with from OS interfaces. To do this, the keys should have a file-path structure which lists the group hierarchy that the dataset is to be added to. For example:

import numpy as np
import pandas as pd

hdf = pd.HDFStore('storage.h5')
hdf.put('tables/t1', pd.DataFrame(np.random.rand(20, 5)))
hdf.put('tables/t2', pd.DataFrame(np.random.rand(10, 3)))
hdf.put('new_tables/t1', pd.DataFrame(np.random.rand(15, 2)))

This will yield the following data structure within the single HDF5 file:

/new_tables/t1  frame        (shape->[15,2])
/tables/t1      frame        (shape->[20,5])
/tables/t2      frame        (shape->[10,3])
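Individual dataframes can then be pulled back out by their full path-style key (and the store closed once you are done):

t1 = hdf.get('tables/t1')   # equivalently, hdf['tables/t1']
hdf.close()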

Python also has other packages specifically for HDF5 manipulation, such as h5py and PyTables, which offer a few more specialised tools. If you’d like to know more about these packages, or how to generate more complicated hierarchies, check out the list of resources below.

A few final thoughts.

Data is a part of life as a scientist, and as life scientists we must strive for better, more versatile approaches to storing and handling the ever-growing datasets produced by our experiments. For me, HDF5 has provided a simple and elegant way to interface with my data. It allows me to store raw and computationally-expensive intermediate ‘checkpoints’, and also means that I can maintain a single HDF5 file that combines data and metadata in a single place. The barrier to entry is quite low, and there are plenty of versatile ways to access data stored in HDF format.

Resources


Do you have a favourite way to store, access and manipulate large datasets? Get in touch on Twitter and let me know!

Banner image credit: @gabons via Unsplash