A Look at the Nanopore fast5 Format

Shian Su
May 31, 2019


The fast5 format is the native container for data coming out of Oxford Nanopore Technologies’ (ONT) various nanopore sequencers. It is meant to contain the raw electrical signal levels measured by the nanopores, from which various information can be extracted. This article is my exploration into what the fast5 format contains, and the implications this has for methods research.

As part of my recently started PhD research project, I am planning to use and develop tools for the analysis of DNA methylation using nanopore sequencing. This type of analysis requires inference based on the raw “squiggle” signals that come off the nanopore sequencers, and these are stored in the HDF5-based fast5 format.

There appeared to be efforts to do away with the fast5 format, as hinted at by a slide in the tweet below,

and again at 1:12:50 of the Clive Brown NCM Plenary 2018.

However, as basecallers still have room for improvement, and basecalled files do not contain sufficient information for methylation calling, there is still a need for the raw signals. As far as I am aware, there is no public formal specification of the fast5 format, at least not at the level of clarity of the SAM format specification or Illumina’s raw BCL file documentation. The closest I could discover are the YAML files contained in the schemas of the ont_fast5_validator GitHub project from ONT.

Understanding what is inside a fast5 file requires some understanding of HDF5 files, which I am also learning about for the first time at the time of writing, so please correct me on any mistakes.

HDF5

HDF stands for Hierarchical Data Format, which reflects the structuring of the data in a nested form, essentially the same shape as JSON. The data inside an HDF5 file is stored as nested elements, giving enormous flexibility in the structure of the data stored. It is supported by all languages commonly used in scientific programming.

The terms used by HDF5 are “Groups”, “Datasets” and “Attributes”. Groups contain other Groups or Datasets, whereas Datasets contain homogeneous multi-dimensional arrays of data. Groups and Datasets may have Attributes, which provide metadata. So you end up with a folder-like structure that either contains more folders or eventually data. Given its similarity to JSON, its main advantage is the on-disk nature of the data files.

In general with JSON, you parse the entire object into memory and navigate through it to get to the data you want. With HDF5 you can query the file to extract only the pieces of information you need, allowing you to work with the data piecewise. This is what makes HDF5 so attractive for scientific programming: it has a flexible schema and handles multidimensional blocks of data extremely well.

HDF5 employs a clever chunking strategy for its multidimensional data. Gzipped data in general cannot be randomly accessed: you have to start from the beginning and decompress until you reach what you want. In the pathological case, getting the very last value requires decompressing the full dataset.

We see below one possible realisation of HDF5’s chunking strategy. A 9×9 grid of data can be broken into nine 3×3 chunks, and each chunk is then individually gzipped. Obtaining the last element then requires fully decompressing only the final chunk. Common access patterns like single rows or columns can also be serviced efficiently without decompressing all the data, and the chunking strategy can even be customised to better optimise for likely access patterns.

Figure 1: Contiguous Dataset (source: HDF5Group)
Figure 2: Chunked Dataset (source: HDF5Group)
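The chunk arithmetic above can be sketched in a few lines of Python. This is only an illustration of the indexing logic (the 9×9 grid and 3×3 chunks are the example from the figures, not anything HDF5-specific):

```python
# Sketch of the chunk arithmetic for a 9x9 dataset split into 3x3 chunks.
# With chunked storage, reading element (i, j) only requires decompressing
# the one chunk that holds it, not the whole dataset.

CHUNK = 3  # chunk edge length

def chunk_for(i, j, chunk=CHUNK):
    """Return the (chunk row, chunk column) holding element (i, j)."""
    return (i // chunk, j // chunk)

# The last element of the 9x9 grid lives in the bottom-right chunk:
print(chunk_for(8, 8))  # -> (2, 2)

# A whole row touches only the three chunks in its chunk-row, so a row
# read decompresses 3 of the 9 chunks rather than the entire dataset:
row_chunks = {chunk_for(4, j) for j in range(9)}
print(sorted(row_chunks))  # -> [(1, 0), (1, 1), (1, 2)]
```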

This is only one piece of data inside the HDF5 file; the other pieces can sit undisturbed on disk while this is happening. If you were working with JSON, you would have loaded all the other pieces of data in as well.

fast5

The fast5 format is a specification over HDF5, imposing a specific structure on the contents of an HDF5 file. These files are used to store the output of nanopore sequencers. The main data is the “squiggles”: picoamp measurements taken thousands of times per second at the nanopores. Each read resulting from sequencing a molecule is stored as its own fast5 file. That’s the gist of it, so what is actually inside a fast5 file?

We can find out using h5ls. This is not a “pure” fast5 file as would come off a sequencer, I don’t have one of those handy.

$ h5ls -r read.fast5
/ Group
/Analyses Group
/Analyses/Basecall_1D_000 Group
/Analyses/Basecall_1D_000/BaseCalled_template Group
/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/Analyses/Basecall_1D_000/Summary Group
/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/Analyses/RawGenomeCorrected_000 Group
/Analyses/RawGenomeCorrected_000/BaseCalled_template Group
/Analyses/RawGenomeCorrected_000/BaseCalled_template/Alignment Group
/Analyses/RawGenomeCorrected_000/BaseCalled_template/Events Dataset {470}
/Analyses/Segmentation_000 Group
/Analyses/Segmentation_000/Summary Group
/Analyses/Segmentation_000/Summary/segmentation Group
/PreviousReadInfo Group
/Raw Group
/Raw/Reads Group
/Raw/Reads/Read_362 Group
/Raw/Reads/Read_362/Signal Dataset {7127/Inf}
/UniqueGlobalKey Group
/UniqueGlobalKey/channel_id Group
/UniqueGlobalKey/context_tags Group
/UniqueGlobalKey/tracking_id Group

There are three main branches of data stored in the fast5: Analyses, Raw, and UniqueGlobalKey. Raw stores the raw signal levels; Analyses stores analysis results such as basecalls, signal correction and segmentation information; UniqueGlobalKey stores run metadata in its channel_id, context_tags and tracking_id groups.

I’m mainly interested in the raw data, so we can use h5dump -A -g "/Raw" to have a look at it.

$ h5dump -A -g "/Raw" read.fast5
HDF5 "read.fast5" {
GROUP "/Raw" {
   GROUP "Reads" {
      GROUP "Read_362" {
         ATTRIBUTE "duration" {
            DATATYPE  H5T_STD_U32LE
            DATASPACE  SCALAR
            DATA {
            (0): 7127
            }
         }
         ATTRIBUTE "median_before" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SCALAR
            DATA {
            (0): 242.657
            }
         }
         ATTRIBUTE "read_id" {
            DATATYPE  H5T_STRING {
               STRSIZE 38;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): "ef9d76d4-f5e8-41ba-89bf-e3d8d1666094"
            }
         }
         ATTRIBUTE "read_number" {
            DATATYPE  H5T_STD_U32LE
            DATASPACE  SCALAR
            DATA {
            (0): 362
            }
         }
         ATTRIBUTE "start_mux" {
            DATATYPE  H5T_STD_U8LE
            DATASPACE  SCALAR
            DATA {
            (0): 3
            }
         }
         ATTRIBUTE "start_time" {
            DATATYPE  H5T_STD_U64LE
            DATASPACE  SCALAR
            DATA {
            (0): 612061
            }
         }
         DATASET "Signal" {
            DATATYPE  H5T_STD_I16LE
            DATASPACE  SIMPLE { ( 7127 ) / ( H5S_UNLIMITED ) }
         }
      }
   }
}
}

So there are other pieces of important information stored inside “Attribute” fields that help with processing the raw signals. The names of the attributes should be relatively self-explanatory.

To access these values in programs, you can go through either the official ont_fast5_api for Python or the unofficial C++ API at mateidavid/fast5.
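Neither API is strictly necessary just to poke at the values: since fast5 is plain HDF5, any generic HDF5 library can read it. Below is a minimal sketch using h5py; the file and its values are toy stand-ins that mirror the layout from the h5ls listing above, since a real read file is not assumed to be on hand.

```python
import os
import tempfile

import h5py  # generic HDF5 bindings for Python

path = os.path.join(tempfile.mkdtemp(), "toy.fast5")

# Build a toy single-read file so the example is self-contained; a real
# fast5 from a sequencer would already have this layout (see h5ls above).
with h5py.File(path, "w") as f:
    read = f.create_group("Raw/Reads/Read_362")  # intermediates are created
    read.attrs["read_id"] = "ef9d76d4-f5e8-41ba-89bf-e3d8d1666094"
    read.attrs["duration"] = 3
    read.create_dataset("Signal", data=[489, 510, 502], dtype="i2",
                        maxshape=(None,))  # extendible, like {7127/Inf}

# Reading works the same way on a real single-read fast5.
with h5py.File(path, "r") as f:
    read = f["Raw/Reads/Read_362"]
    signal = read["Signal"][:]         # only this dataset is read from disk
    duration = read.attrs["duration"]  # metadata lives in attributes

print(signal.tolist())  # -> [489, 510, 502]
```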

The signals are originally measured in picoamps (pA) but stored as 16-bit integer values. Transforming them back into pA values requires an offset and scaling; I discovered the transformation in the source code of ONT’s fast5 API: pA_val = scale * (raw + offset), where raw is the 16-bit value stored inside Raw/Reads/Read_####/Signal and scale is calculated as range/digitisation. The range, digitisation and offset values can be found in the Attributes of /UniqueGlobalKey/channel_id.

$ h5dump -g "/UniqueGlobalKey/channel_id" read.fast5
HDF5 "read.fast5" {
GROUP "/UniqueGlobalKey/channel_id" {
   ATTRIBUTE "channel_number" {
      DATATYPE  H5T_STRING {
         STRSIZE 4;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "381"
      }
   }
   ATTRIBUTE "digitisation" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SCALAR
      DATA {
      (0): 8192
      }
   }
   ATTRIBUTE "offset" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SCALAR
      DATA {
      (0): 18
      }
   }
   ATTRIBUTE "range" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SCALAR
      DATA {
      (0): 1534.59
      }
   }
   ATTRIBUTE "sampling_rate" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SCALAR
      DATA {
      (0): 4000
      }
   }
}
}

Without this conversion the raw signal values between two samples cannot be meaningfully compared.
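The conversion is straightforward to apply. Here is a sketch in Python, using the attribute values from the channel_id dump above; the raw value 489 is an arbitrary example, not taken from a real read.

```python
# Convert raw 16-bit signal values to picoamps, per the formula found in
# ONT's fast5 API: pA = scale * (raw + offset), where scale = range / digitisation.
# The constants below come from the /UniqueGlobalKey/channel_id dump above.

RANGE = 1534.59
DIGITISATION = 8192.0
OFFSET = 18.0

def to_pA(raw_values):
    """Map raw integer signal values to picoamp measurements."""
    scale = RANGE / DIGITISATION
    return [scale * (r + OFFSET) for r in raw_values]

# An arbitrary raw value of 489 works out to roughly 95 pA:
print(to_pA([489]))
```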

multifast5

Multi-fast5 files, which first appeared around September 2018, are the current stable format for nanopore sequencing going forward. The format batches up thousands of reads into a single multi-fast5 file that keeps the same extension. Because the extension is the same, the official way to determine whether you’re looking at a multi-fast5 is to run `h5dump -a file_version` on your file

$ h5dump -a file_version batch_0.fast5
HDF5 "batch_0.fast5" {
ATTRIBUTE "file_version" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "2.0"
}
}
}

We see that this file is version 2.0; all previous versions are to be treated as single-read fast5 files.
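That rule is simple enough to capture in a small helper. A sketch in Python, where the version string is assumed to have already been read from the file_version attribute (via h5dump or an HDF5 library, as above):

```python
# Decide single- vs multi-read fast5 from the file_version attribute,
# following the rule above: version 2.0 and later means multi-read.

def is_multi_fast5(file_version: str) -> bool:
    """Return True if the file_version string indicates a multi-read fast5."""
    major = int(file_version.split(".")[0])
    return major >= 2

print(is_multi_fast5("2.0"))  # -> True
print(is_multi_fast5("0.6"))  # -> False
```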

The new structure of these files creates a Group at the top level for each read. I’ve packed 10 reads into a multi-fast5, and here is the output of h5ls.

$ h5ls batch_0.fast5
read_1 Group
read_2 Group
read_3 Group
read_4 Group
read_5 Group
read_6 Group
read_7 Group
read_8 Group
read_9 Group
read_10 Group

Digging deeper into each of these groups reveals essentially the same structure as the single read shown above, except that the contents of UniqueGlobalKey (channel_id, context_tags and tracking_id) now sit directly under each read group, and the signal lives at Raw/Signal.

$ h5ls -r batch_0.fast5
/ Group
/read_1 Group
/read_1/Analyses Group
/read_1/Analyses/Basecall_1D_000 Group
/read_1/Analyses/Basecall_1D_000/BaseCalled_template Group
/read_1/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/read_1/Analyses/Basecall_1D_000/Summary Group
/read_1/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/read_1/Analyses/RawGenomeCorrected_000 Group
/read_1/Analyses/RawGenomeCorrected_000/BaseCalled_template Group
/read_1/Analyses/RawGenomeCorrected_000/BaseCalled_template/Alignment Group
/read_1/Analyses/RawGenomeCorrected_000/BaseCalled_template/Events Dataset {581}
/read_1/Analyses/Segmentation_000 Group
/read_1/Analyses/Segmentation_000/Summary Group
/read_1/Analyses/Segmentation_000/Summary/segmentation Group
/read_1/PreviousReadInfo Group
/read_1/Raw Group
/read_1/Raw/Signal Dataset {12065/Inf}
/read_1/channel_id Group
/read_1/context_tags Group
/read_1/tracking_id Group
/read_2 Group
/read_2/Analyses Group
/read_2/Analyses/Basecall_1D_000 Group
/read_2/Analyses/Basecall_1D_000/BaseCalled_template Group
/read_2/Analyses/Basecall_1D_000/BaseCalled_template/Fastq Dataset {SCALAR}
/read_2/Analyses/Basecall_1D_000/Summary Group
/read_2/Analyses/Basecall_1D_000/Summary/basecall_1d_template Group
/read_2/Analyses/RawGenomeCorrected_000 Group
/read_2/Analyses/RawGenomeCorrected_000/BaseCalled_template Group
/read_2/Analyses/RawGenomeCorrected_000/BaseCalled_template/Alignment Group
/read_2/Analyses/RawGenomeCorrected_000/BaseCalled_template/Events Dataset {282}
/read_2/Analyses/Segmentation_000 Group
/read_2/Analyses/Segmentation_000/Summary Group
/read_2/Analyses/Segmentation_000/Summary/segmentation Group
/read_2/PreviousReadInfo Group
/read_2/Raw Group
/read_2/Raw/Signal Dataset {4304/Inf}
/read_2/channel_id Group
/read_2/context_tags Group
/read_2/tracking_id Group
...more

This format contains the same information but uses less disk space, and is expected to be compatible with future nanopore tools.

EDIT (29 Nov 2021): a more comprehensive, technical and up-to-date description of multi-fast5 files can be found here. If there are any contradictions with information here, assume the newer document is correct and please let me know.

CRAM

There is a relatively quiet effort to convert fast5 files to CRAM and back.

CRAM is an alternative to SAM and BAM files that leverages reference genomes and column-based rearrangement of the data to achieve better compression. According to the linked slides, ~30GB of fast5 files with basecalling data included can be compressed down to ~9.7GB using the CRAM format. I’m not 100% sure, but it seems the signal data is stored inside SAM tags when using this approach.

Update Log

27-Jan-2019

  • Corrected quantization to digitisation. (Thanks @kxk302)

17-Jun-2019

  • Added h5dump of channel_id attributes

11-Jun-2019

  • Added details about fast5 to pA conversion

05-Jun-2019

  • Added information about CRAM
