Summary of SLOW5 ASCII format
This is a summary of the latest version of the SLOW5 ASCII file format (.slow5). For the full specification, and for information on the SLOW5 binary format (called BLOW5), refer to the PDF links here.
A SLOW5 ASCII file is a plain text file that uses the American Standard Code for Information Interchange (ASCII) encoding (locale: C/POSIX, code set: US-ASCII). The file extension is .slow5. A SLOW5 file contains a header followed by the sequencing data. Two example structures are provided below: a SLOW5 ASCII file with a single read group, and a SLOW5 ASCII file with multiple read groups (i.e., multiple sequencing runs). The column/row borders, spacing and cell colours are added only to improve readability. The actual format uses tabs (‘\t’) and newlines (‘\n’) as delimiters.
Example of a SLOW5 ASCII file with a single read group:
#slow5_version | 1.0.0 | |||||||
---|---|---|---|---|---|---|---|---|
#num_read_groups | 1 | |||||||
@asic_id | 0004A30B00232BEC | |||||||
@exp_start_time | 2020-01-01T00:00:00Z | |||||||
@flow_cell_id | FAH00000 | |||||||
@run_id | 855cdb | |||||||
… | … | |||||||
#char* | uint32_t | double | double | double | double | uint64_t | int16_t* | … |
#read_id | read_group | digitisation | offset | range | sampling_rate | len_raw_signal | raw_signal | … |
read0 | 0 | 8192 | 6 | 1467.6 | 4000 | 123456 | 498,492,… | … |
read1 | 0 | 8192 | 5 | 1467.6 | 4000 | 2000 | 491,491,… | … |
… | … | … | … | … | … | … | … | … |
readN | 0 | 8192 | 3 | 1467.6 | 4000 | 3000 | 400,400,… | … |
Example of a SLOW5 ASCII file with multiple read groups:
#slow5_version | 1.0.0 | |||||||
---|---|---|---|---|---|---|---|---|
#num_read_groups | 3 | |||||||
@asic_id | 0004A30B00232BEC | 1004A30B00232BEC | 2004A30B00232BEC | |||||
@exp_start_time | 2020-01-01T00:00:00Z | 2020-01-01T00:00:00Z | 2020-01-01T00:00:00Z | |||||
@flow_cell_id | FAH00000 | FAH00001 | FAH00002 | |||||
@run_id | 855cdb | 855cd1 | 855cdc | |||||
… | … | … | … | |||||
#char* | uint32_t | double | double | double | double | uint64_t | int16_t* | … |
#read_id | read_group | digitisation | offset | range | sampling_rate | len_raw_signal | raw_signal | … |
read-0 | 1 | 8192 | 6 | 1467.6 | 4000 | 4000 | 498,492,… | … |
read-1 | 0 | 8192 | 5 | 1467.6 | 4000 | 2000 | 491,491,… | … |
… | … | … | … | … | … | … | … | … |
read-N | 2 | 8192 | 3 | 1467.6 | 4000 | 3000 | 400,400,… | … |
SLOW5 Header
The SLOW5 header stores metadata regarding the experiment. Header lines start with either ‘#’ or ‘@’. The header contains two parts: the global header and the data header.
Global header
Lines starting with ‘#’ form the global header.
- The first line of a SLOW5 ASCII file is a key-value pair that specifies the SLOW5 version. The key is separated from the value using a tab ‘\t’.
- The second line specifies the number of read groups in the file. Observe that in the single read group example (the first table above), the value of num_read_groups is 1. In the second example, with three read groups (the second table above), the value is 3.
- The last line of the header is always the field names for the subsequent per-read records.
- The second last line of the header specifies the data types of each field for the subsequent per-read records (i.e., for the fields named in the last line of the header). Further information about the fields is provided in the SLOW5 Data section below.
Data header
The header lines that start with ‘@’ form the data header. These header lines contain ONT data attributes that are shared across multiple reads in a sequencing run (read group). For instance, the run_id and the flow_cell_id are common to all the reads in the read group and are therefore stored in the data header.
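The header layout described above can be sketched as a small parser. This is a minimal illustration, not the reference implementation: the function name `parse_slow5_header` and the toy header lines are assumptions made for the example, and the attribute values are copied from the multi-read-group table above with the elided attributes left out.

```python
def parse_slow5_header(lines):
    """Split SLOW5 ASCII header lines into global, data and column parts.

    Hypothetical helper for illustration; it assumes a well-formed header.
    """
    data_header = {}  # '@' attributes: one value per read group
    hash_lines = []   # all '#' lines, in file order

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            key, *values = line[1:].split("\t")
            data_header[key] = values
        elif line.startswith("#"):
            hash_lines.append(line[1:].split("\t"))
        else:
            break  # first record line: the header is over

    # The last two '#' lines hold the data types and the field names.
    types, names = hash_lines[-2], hash_lines[-1]
    # The remaining '#' lines are key-value pairs (version, read group count).
    global_header = {key: value for key, value in hash_lines[:-2]}
    return global_header, data_header, list(zip(names, types))


# Toy header with three read groups (only two '@' attributes kept).
example = [
    "#slow5_version\t1.0.0\n",
    "#num_read_groups\t3\n",
    "@asic_id\t0004A30B00232BEC\t1004A30B00232BEC\t2004A30B00232BEC\n",
    "@run_id\t855cdb\t855cd1\t855cdc\n",
    "#char*\tuint32_t\tdouble\tdouble\tdouble\tdouble\tuint64_t\tint16_t*\n",
    "#read_id\tread_group\tdigitisation\toffset\trange\tsampling_rate"
    "\tlen_raw_signal\traw_signal\n",
]
global_header, data_header, columns = parse_slow5_header(example)
```

In this sketch, the value at index i of each data header attribute belongs to read group i, which is how a record's read_group field can be used to look up its run-level metadata.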
SLOW5 Data
After the SLOW5 header, the actual data is encoded. Each line contains information about a single read and we refer to this as a record.
Primary fields
These fields are mandatory and must be arranged in the order that they appear below:
Col | Field name | Data type | Description | Example value |
---|---|---|---|---|
1 | read_id | char* | A unique identifier for the read. | 00592138-f120-4ab5-9916-c5567adb8e29 |
2 | read_group | uint32_t | Read group identifier. | 0 |
3 | digitisation | double | Number of quantisation levels in the Analog to Digital Converter (ADC). That is, if the ADC is 12 bit, digitisation is 4096 (2^12). | 8192 |
4 | offset | double | The ADC offset error. This value is added when converting the signal to picoamperes. | 10 |
5 | range | double | The full scale measurement range in picoamperes. | 1441.389893 |
6 | sampling_rate | double | Sampling frequency of the ADC, i.e., the number of data points collected per second. | 4000 |
7 | len_raw_signal | uint64_t | The number of samples in the raw signal (length of the raw_signal vector below). | 59676 |
8 | raw_signal | int16_t* | The raw signal, i.e., the direct acquisition values from the ADC, as a comma-separated list. | 1039,588,588,593,586,… |
Primary fields contain all the information required for a typical nanopore signal-level analysis. The raw signal can be converted to picoamperes using the following equation:
signal_in_pico_ampere = (raw_signal + offset) * range / digitisation
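As a rough sketch of how the equation applies to a record, the snippet below splits one tab-delimited line into the eight primary fields and scales the raw signal. The helper names `parse_record` and `to_picoamperes` are hypothetical, and the toy record is made up for illustration: it reuses example values from the table above with a five-sample signal.

```python
def parse_record(line):
    """Parse the eight primary fields of one SLOW5 ASCII record (sketch)."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "read_id": fields[0],
        "read_group": int(fields[1]),
        "digitisation": float(fields[2]),
        "offset": float(fields[3]),
        "range": float(fields[4]),
        "sampling_rate": float(fields[5]),
        "len_raw_signal": int(fields[6]),
        "raw_signal": [int(x) for x in fields[7].split(",")],
    }
    # len_raw_signal should match the number of comma-separated samples.
    if record["len_raw_signal"] != len(record["raw_signal"]):
        raise ValueError("len_raw_signal does not match raw_signal length")
    return record


def to_picoamperes(record):
    """Apply (raw_signal + offset) * range / digitisation to every sample."""
    scale = record["range"] / record["digitisation"]
    return [(raw + record["offset"]) * scale for raw in record["raw_signal"]]


# Made-up record: primary fields only, signal shortened to five samples.
line = "read0\t0\t8192\t10\t1441.389893\t4000\t5\t1039,588,588,593,586\n"
record = parse_record(line)
pa = to_picoamperes(record)
```

Auxiliary fields, when present, would follow the eighth column and are ignored by this sketch.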
Auxiliary fields
These fields are optional and not bound by any strict order. The following are some common auxiliary data fields in the SLOW5 format:
Field name | Data type | Description | Example value |
---|---|---|---|
channel_number | char* | The channel number. A flow cell has multiple channels allowing multiple DNA/RNA strands to be sequenced in parallel. For instance, a MinION flow cell has 512 channels and thus can sequence 512 strands in parallel. | 504 |
median_before | double | The estimated median current level immediately preceding the read. In most cases this can be used as an estimate of the open pore level. The open-pore state is when there is no strand inside the pore. | 238.78225708007812 |
read_number | int32_t | A unique number within each channel counted upwards from zero. Note that not all reads generated are “strand” reads, but only strand reads are written to the final fast5 file, so some read numbers may be absent. | 17981 |
start_mux | uint8_t | The MUX setting for the channel when the read began. Each channel contains one or more wells. For instance, a MinION flow cell has 4 wells per channel. The wells within a channel are connected to a multiplexer (MUX), a switch that controls which of the four wells in the channel is controlled and read out for sequencing. | 4 |
start_time | uint64_t | The start time of the read. The unit for start_time is ‘number of signal samples’, so start_time has to be divided by the sampling rate (sampling_rate) to get the start time in seconds (i.e., the time since the run was started). | 335845487 |
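
The unit conversion described for start_time can be written out as a one-line helper. This is only a sketch: `start_time_seconds` is a hypothetical name, and pairing the example start_time with a sampling_rate of 4000 (the example value from the primary fields table) is an assumption made for illustration.

```python
def start_time_seconds(start_time, sampling_rate):
    """Convert start_time from signal samples to seconds since run start."""
    return start_time / sampling_rate


# Example values taken from the tables above (assumed to belong together).
seconds = start_time_seconds(335845487, 4000)
hours = seconds / 3600  # same instant expressed in hours
```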
Please cite the following in your publications when using the S/BLOW5 file format:
Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al. Fast nanopore sequencing data analysis with SLOW5. Nat Biotechnol 40, 1026-1029 (2022). https://doi.org/10.1038/s41587-021-01147-4
@article{gamaarachchi2022fast,
title={Fast nanopore sequencing data analysis with SLOW5},
author={Gamaarachchi, Hasindu and Samarakoon, Hiruna and Jenner, Sasha P and Ferguson, James M and Amos, Timothy G and Hammond, Jillian M and Saadat, Hassaan and Smith, Martin A and Parameswaran, Sri and Deveson, Ira W},
journal={Nature biotechnology},
volume={40},
pages={1026--1029},
year={2022},
publisher={Nature Publishing Group}
}