NAME
f5c(1) - Ultra-fast methylation calling and event alignment tool for nanopore sequencing data with optional GPU (CUDA) acceleration
SYNOPSIS
- indexing:
f5c index --slow5 [slow5_file] [read.fastq|fasta]
f5c index -d [fast5_folder] [read.fastq|fasta]
- methylation calling:
f5c call-methylation -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] --slow5 [reads.blow5] > [meth.tsv]
f5c call-methylation -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] > [meth.tsv] # for fast5, --pore r10 required if R10.4.1
f5c meth-freq -i [meth.tsv] > [freq.tsv]
- event alignment:
f5c eventalign -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] --slow5 [reads.blow5] > [events.tsv]
f5c eventalign -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] > [events.tsv] # for fast5, specify --rna for direct RNA data, --pore r10 for R10.4.1 data
- resquiggle:
f5c resquiggle [OPTIONS] [reads.fastq|fasta] [reads.blow5]
DESCRIPTION
Given a set of base-called nanopore reads and associated raw signals, f5c call-methylation detects methylated cytosines at genomic CpG sites and f5c eventalign aligns raw nanopore signals (events) to the reference k-mers. f5c can optionally utilise CUDA-enabled NVIDIA graphics cards for acceleration. f5c is a heavily re-engineered and optimised implementation of the call-methylation and eventalign modules in Nanopolish. f5c v1.2 onwards supports the latest nanopore R10.4.1 chemistry (make sure to specify --pore r10 if the input is FAST5; the chemistry is autodetected for S/BLOW5 input). For best performance and ease of use, it is recommended to use f5c on the BLOW5 format.
COMMANDS
index:
Build an index for accessing the base sequence and raw signal for a given read ID (optimised nanopolish index).
call-methylation:
Classify nucleotides as methylated or not at genomic CpG sites (optimised nanopolish call-methylation).
meth-freq:
Calculate methylation frequency at genomic CpG sites (optimised nanopolish calculate_methylation_frequency.py).
freq-merge:
Merge multiple methylation frequency tsv files.
eventalign:
Align nanopore events to reference k-mers (optimised nanopolish eventalign).
resquiggle:
Align raw signals to basecalled reads. Introduced in f5c v1.1.
OPTIONS
index
f5c index [OPTIONS] --slow5 signals.blow5 reads.fastq
f5c index [OPTIONS] -d nanopore_raw_file_directory reads.fastq
Build an index for accessing the base sequence and raw signal for given read IDs. f5c index is an extended and optimised version of nanopolish index by Jared Simpson. The output of f5c index (for fast5) is equivalent to that from nanopolish index.
-h, --help:
Print the help to the standard out.
-d, --directory:
Path to the directory containing fast5 files. This option can be given multiple times. Both multi-fast5 and single-fast5 are supported. The specified directory will be recursively searched for files ending with the .fast5 extension.
-s, --sequencing-summary:
The sequencing summary file generated by the Guppy base-caller. This option is not encouraged as inconsistencies in the format of sequencing summary files lead to issues during subsequent steps.
-f, --summary-fofn:
File containing the paths to the sequencing summary files (one per line). This option is not encouraged as inconsistencies in the format of sequencing summary files lead to issues during subsequent steps.
-t INT:
Number of threads used for bgzf compression [default value: 1]. Increasing the number of threads makes indexing faster. Ideally, this should be the number of CPU cores.
--iop INT:
Number of I/O processes to read fast5 files [default value: 1]. Increasing the number of I/O processes makes indexing significantly faster, especially on HPC with RAID systems (multiple disks) where this can be as high as 64. Note that unless the value of --iop is 1, options -s and -f are ignored.
--slow5 FILE:
slow5 file containing raw signals. The -d, -s, -f and --iop options are not required when indexing using a slow5 file.
--skip-slow-idx:
Do not build the .idx for the slow5 file (useful when a slow5 index is already available). Introduced in f5c v1.1.
--verbose INT:
Verbosity level for the log messages [default value: 0].
--version:
Print the version number to the standard out.
call-methylation
f5c call-methylation [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa --slow5 signals.blow5
f5c call-methylation [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa # for fast5, --pore r10 required if R10.4.1
Classify nucleotides as methylated or not at genomic CpG sites (optimised nanopolish call-methylation). Note that the list below contains the options for both CPU-only and CPU-GPU versions of f5c. Options related to the GPU (CUDA) do NOT apply to the CPU-only version.
basic options:
-r FILE:
The file containing the base-called reads in FASTQ or FASTA format. Can be gzip compressed files.
-b FILE:
The file containing the alignment records sorted based on genomic coordinates in BAM format.
-g FILE:
The file containing the reference genome in FASTA format.
-w STR:
Only process the specified genomic region STR. STR should be in the format chr:start-end. From v0.7 onwards, STR can be a bed file (.bed extension) containing multiple regions. If this option is not specified, the whole genome will be processed.
-t INT:
Number of processing threads [default value: 8]. Ideally, this should be the number of CPU cores.
-K INT:
Maximum number of reads loaded at once to the memory [default value: 512]. A larger value maximises multithreading performance at the cost of increased peak RAM.
-B FLOAT[K/M/G]:
Maximum number of bases loaded at once to the memory [default value: 2.0M]. A larger value maximises multithreading performance at the cost of increased peak RAM.
-h:
Print the help to the standard out.
-o FILE:
The file to write the output. If this option is not specified, the output will be written to the standard out.
-x STR:
Parameter profile to be used for maximising the performance to a particular computer system. The profile parameters are always applied before other options, i.e., the user can override these parameters explicitly. Some example profiles are laptop, desktop, hpc. See profiles for the full list and details.
--iop INT:
Number of I/O processes to read FAST5 files [default value: 1]. Increase this value if reading FAST5 limits the overall performance. A higher value (can be as high as 64) is always preferred for systems with multiple disks (RAID) and network file systems.
--pore STR:
Set the pore chemistry. Specify r9 for R9.4 data and r10 for R10.4 data [default value: r9 for FAST5, autodetected for SLOW5]. Introduced in f5c v1.2.
--slow5 FILE:
Read raw signals from a slow5 file instead of fast5 files. The --iop option is not required for slow5.
--min-mapq INT:
Minimum mapping quality of an alignment (MAPQ in the BAM record) to be considered for methylation calling [default value: 20].
--secondary=yes|no:
Whether secondary alignments are considered or not for methylation calling [default value: no].
--verbose INT:
Verbosity level for the log messages [default value: 0].
--version:
Print the version number to the standard out.
--disable-cuda=yes|no:
Disable running on the GPU or not [default value: no]. If this option is set to yes, GPU acceleration is disabled.
--cuda-dev-id INT:
CUDA device identifier to run GPU kernels on [default value: 0]. The device identifier of the first GPU is 0, the second GPU is 1 and so on. This can be found by invoking the nvidia-smi command. Currently, only a single GPU can be specified. To utilise multiple GPUs, you have to manually invoke multiple f5c commands on different datasets with a different device identifier.
--cuda-max-lf FLOAT:
Process reads with read-length less than or equal to the product of cuda-max-lf and the average read length in the current batch on GPU. The rest is processed on CPU [default value: 3.0]. Useful for tuning the CPU-GPU load balance for atypical datasets. Refer to performance guidelines for details.
--cuda-avg-epk FLOAT:
The average number of events-per-kmer used for allocating the arrays in GPU memory [default value: 2.0]. Useful for tuning the CPU-GPU load balance for atypical datasets. Refer to performance guidelines for details.
--cuda-max-epk FLOAT:
Process the reads with events-per-kmer less than or equal to cuda-max-epk on GPU. The rest is processed on CPU [default value: 5.0]. Useful for tuning the CPU-GPU load balance for atypical datasets. Refer to performance guidelines for details.
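The CPU-GPU split described for --cuda-max-lf comes down to simple arithmetic. The sketch below is not f5c code: it takes a made-up batch of read lengths and uses awk to compute the batch average, the cutoff (cuda-max-lf multiplied by that average) and the side each read would fall on under the rule stated above. The function name split_batch and the read lengths are purely illustrative.

```shell
# Illustration only: mimics the --cuda-max-lf rule on made-up read lengths.
# Reads with length <= cuda-max-lf * (average length in the batch) go to the GPU.
split_batch() {
    # $1 = cuda-max-lf value; read lengths arrive on stdin, one per line
    awk -v lf="$1" '
        { len[NR] = $1; sum += $1 }
        END {
            avg = sum / NR
            cutoff = lf * avg
            printf "average=%.0f cutoff=%.0f\n", avg, cutoff
            for (i = 1; i <= NR; i++)
                printf "%d\t%s\n", len[i], (len[i] <= cutoff ? "GPU" : "CPU")
        }'
}

# Hypothetical batch: four typical reads and one ultra-long read.
printf '%s\n' 8000 10000 12000 10000 160000 | split_batch 3.0
```

With the default factor of 3.0, the single 160 kbase read pulls the average up to 40 kbases, so the cutoff becomes 120 kbases and only that read is left for the CPU. Lowering the factor moves more of the long reads to the CPU.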
advanced options:
--skip-ultra FILE:
Skip ultra-long reads and write those alignment entries to the bam file provided as the argument. Ultra-long reads refer to reads longer than 100 kbases by default, unless specified by the --ultra-thresh option below. Useful for tuning the CPU-GPU load balance for datasets containing many ultra-long reads. Also useful to cap the peak RAM usage in systems with limited memory. After the execution, ultra-long reads can be separately processed, i.e., f5c can be again invoked on the produced bam file as the input. Refer to performance guidelines for details.
--ultra-thresh INT:
Threshold to skip ultra-long reads [default value: 100000]. This option is to be used in conjunction with --skip-ultra above.
--skip-unreadable=yes|no:
Whether to skip any unreadable fast5 files or to terminate the program [default value: yes]. If yes, the programme will continue to run while skipping unreadable fast5 files. If no, the programme will terminate with an error when an unreadable fast5 file is found.
--kmer-model FILE:
Custom nucleotide k-mer model file. The file should adhere to the format in r9.4_450bps.nucleotide.6mer.template.model. The maximum supported k-mer size is 6.
--meth-model FILE:
Custom methylation k-mer model file. The file should adhere to the format in r9.4_450bps.cpg.6mer.template.model. The maximum supported k-mer size is 6.
--meth-out-version INT:
Format version of the output methylation tsv file. If set to 1, the columns printed adhere to the output format of early Nanopolish versions. If set to 2, the columns adhere to the latest Nanopolish output format (which additionally includes the strand column and renames the header num_cpgs to num_motifs) [default value: 1].
--min-recalib-events INT:
Minimum number of events to recalibrate (decrease if your reads are very short and could not calibrate) [default value: 200]. Introduced in f5c v0.8.
--cuda-mem-frac FLOAT:
Fraction of free GPU memory to allocate [default value: 0.9 for non-tegra GPUs and 0.7 for tegra GPUs]. On GPUs with dedicated RAM (e.g., GeForce, Tesla and Quadro) almost all available free GPU memory can be allocated. A slightly lower value such as 0.9 is preferred instead of 1.0 to prevent unexpected crashes. In GPUs with integrated memory shared with RAM (e.g., Tegra GPUs that are in Jetson boards), this value should be at most 0.7 to allow enough free RAM for both f5c and other programmes.
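The two-pass use of --skip-ultra described above can be sketched as a pair of invocations. This is a hedged outline and not a tested recipe: all file names are hypothetical, the function two_pass_methylation is illustrative, and whether the BAM file written in the first pass needs any preparation (for instance indexing) before reuse is not covered by this page. By default the sketch only prints the commands.

```shell
# Dry run by default: commands are printed, not executed.
# Set F5C=f5c in the environment to run them for real.
F5C=${F5C:-echo f5c}

two_pass_methylation() {
    # $1 = sorted BAM, $2 = reference, $3 = reads, $4 = BLOW5 file (hypothetical paths)
    # Pass 1: process the normal reads; alignment entries of reads longer than
    # the threshold are written to ultra.bam instead of being processed.
    $F5C call-methylation -b "$1" -g "$2" -r "$3" --slow5 "$4" \
        --skip-ultra ultra.bam --ultra-thresh 100000 -o meth.pass1.tsv
    # Pass 2: invoke f5c again with the BAM file produced in pass 1 as the input.
    $F5C call-methylation -b ultra.bam -g "$2" -r "$3" --slow5 "$4" \
        -o meth.pass2.tsv
}

two_pass_methylation reads.sorted.bam ref.fa reads.fastq reads.blow5
```

Splitting the work this way keeps the ultra-long reads out of the main run, which is where the page suggests they disturb the CPU-GPU load balance and raise the peak RAM.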
developer options:
--print-events=yes|no:
Print the event table (the output of the event detection step) to the standard out.
--print-banded-aln=yes|no:
Print the event alignment (the output of the adaptive banded event alignment step) to the standard out.
--print-scaling=yes|no:
Print the estimated scaling values to the standard out.
--print-raw=yes|no:
Print the raw signal to the standard out.
--debug-break INT:
Terminate the programme after processing the specified batch number. E.g., if 0 is specified, the programme breaks after processing the 0th batch.
--profile-cpu=yes|no:
Process section by section and separately print the time spent on different steps such as the event detection, ABEA and HMM. This option is used for profiling the workloads on the CPU.
--write-dump=yes|no:
Write the fast5 dump to a file or not. The file name is hardcoded to f5c.tmp.bin and will be written to the current working directory. The raw signal data in the fast5 files that is required for subsequent processing will be serially written to f5c.tmp.bin.
--read-dump=yes|no:
Read from a fast5 dump file or not. This is used to read from a dump file generated using --write-dump above. The raw signal data will be serially loaded from the dump file instead of the fast5 files.
meth-freq
f5c meth-freq [OPTIONS] -i methcalls.tsv
Calculate methylation frequency at genomic CpG sites from a tsv file containing methylation calls produced by f5c call-methylation. This is an optimised version of the nanopolish calculate_methylation_frequency.py script.
-c FLOAT:
Call threshold for the log likelihood ratio [default value: 2.5]. If abs(log_lik_ratio) < c, those sites are considered ambiguous and ignored when computing called_sites and called_sites_methylated. If log_lik_ratio >= c, those are considered methylated (called_sites_methylated).
-i FILE:
Input file containing methylation calls in tsv format (output of f5c call-methylation). Read from stdin if not specified. Any tsv file produced by f5c call-methylation is supported, regardless of what was specified for --meth-out-version (with/without strand column and/or num_cpgs/num_motifs), and the format is automatically detected.
-o FILE:
Output file to write the methylation frequencies in tsv format. Write to stdout if not specified.
-s:
Split groups. If not specified, the default behaviour is to compute the methylation frequency per group (a group contains nearby CpG sites considered together when calling methylation). If methylation frequency is required at an individual base resolution, this option must be specified to split the groups.
-h:
Print the help to the standard out.
--version:
Print the version number to the standard out.
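The call threshold described for -c can be illustrated on toy numbers. The awk sketch below is not the meth-freq implementation and does not read the real f5c tsv layout: it assumes a simplified input with one made-up log likelihood ratio per line for a single site, and it assumes the frequency is the ratio of called_sites_methylated to called_sites. The function name call_sites is illustrative.

```shell
# Illustration of the -c call threshold on a simplified, made-up input
# (one log likelihood ratio per line); this is not the real f5c tsv layout.
call_sites() {
    # $1 = call threshold c; log_lik_ratio values arrive on stdin
    awk -v c="$1" '
        {
            llr = $1 + 0
            mag = (llr < 0) ? -llr : llr
            if (mag < c) next          # ambiguous call: ignored
            called++                   # counted in called_sites
            if (llr >= c) meth++       # counted in called_sites_methylated
        }
        END {
            printf "called_sites=%d called_sites_methylated=%d", called, meth
            # Assumed definition of the frequency: methylated calls / all calls.
            if (called > 0) printf " frequency=%.3f", meth / called
            printf "\n"
        }'
}

printf '%s\n' 5.2 -3.1 1.0 -0.4 2.5 -7.9 | call_sites 2.5
```

Of the six values, 1.0 and -0.4 fall inside the ambiguous band and are dropped, 5.2 and 2.5 count as methylated, and -3.1 and -7.9 count as called but unmethylated. Raising the threshold widens the ambiguous band and reduces called_sites.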
freq-merge
f5c freq-merge [OPTIONS] input1.tsv input2.tsv ...
Merge multiple methylation frequency tsv files (output files from f5c meth-freq) into a single tsv file. Useful to combine the results when meth-freq was run separately on batches of reads, for instance, when performing real-time methylation calling or an SGE array job. Can also be used to merge methylation frequency tsv files from different samples as long as the reference genome used was the same.
For each methylation calling output (.tsv) file, run meth-freq separately (without concatenating the input tsv files manually). Then feed those output (.tsv) files to this tool to obtain the final methylation frequency file.
-o FILE:
Output file to write the methylation frequencies in tsv format. Write to stdout if not specified.
-h:
Print the help to the standard out.
--version:
Print the version number to the standard out.
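The batch workflow described for freq-merge (meth-freq once per methylation calling output, then a single merge) can be sketched as a small loop. The file names and the function name merge_batches are hypothetical, and the sketch only prints the commands unless F5C is set to the real binary.

```shell
# Dry run by default: commands are printed, not executed.
# Set F5C=f5c in the environment to run them for real.
F5C=${F5C:-echo f5c}

merge_batches() {
    # Arguments: per-batch methylation call files (hypothetical names).
    freq_files=""
    for calls in "$@"; do
        freq="${calls%.tsv}.freq.tsv"
        # meth-freq runs once per batch; the call files are never concatenated.
        $F5C meth-freq -i "$calls" -o "$freq"
        freq_files="$freq_files $freq"
    done
    # Combine the per-batch frequency files into the final result.
    # $freq_files is deliberately unquoted so that it splits into separate
    # arguments (this simple sketch assumes file names without whitespace).
    $F5C freq-merge -o merged.freq.tsv $freq_files
}

merge_batches batch1.tsv batch2.tsv batch3.tsv
```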
eventalign
f5c eventalign [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa --slow5 signals.blow5
f5c eventalign [OPTIONS] -r reads.fa -b alignments.bam -g genome.fa # for fast5, specify --rna for direct RNA data, --pore r10 for R10.4.1 data
Align nanopore events to reference k-mers (optimised nanopolish eventalign). Note that the list below contains the options for both CPU-only and CPU-GPU versions of f5c. Options related to the GPU (CUDA) do NOT apply to the CPU-only version.
basic options:
Same as those for call-methylation and thus not repeated here.
advanced options:
--skip-ultra FILE:
Same as for call-methylation.
--ultra-thresh INT:
Same as for call-methylation.
--skip-unreadable=yes|no:
Same as for call-methylation.
--kmer-model FILE:
Same as for call-methylation.
--summary FILE:
Write the summaries of the alignment of each read to the file specified.
--paf:
Write output in PAF format. Introduced in f5c v1.3. Output explained in https://hasindu2008.github.io/f5c/docs/output.
--sam:
Write the alignment output in SAM format instead of tsv.
--sam-out-version INT:
SAM output version (set to 1 to revert to the old nanopolish style format) [default value: 2]. Introduced in f5c v1.3. The new SAM output is explained in https://hasindu2008.github.io/f5c/docs/output.
--m6anet:
Write output in m6anet format. Introduced in f5c v1.5.
--print-read-names:
Print read IDs instead of indexes.
--scale-events:
Scale events to the model, rather than vice-versa.
--samples:
Write the raw samples for the event to the tsv output.
--signal-index:
Write the raw signal start and end index values for the event to the tsv output.
--rna:
Specify that this dataset is direct RNA.
--collapse-events:
Collapse events that stay on the same reference k-mer. Introduced in f5c v0.8.
--min-recalib-events INT:
Same as for call-methylation.
--cuda-mem-frac FLOAT:
Same as for call-methylation.
developer options:
Same as those for call-methylation and thus not repeated here.
resquiggle
f5c resquiggle [OPTIONS] reads.fastq signals.blow5
Align raw signals to basecalled reads. Introduced in f5c v1.1. Output format is explained in https://hasindu2008.github.io/f5c/docs/output.
options:
-t INT:
Same as for call-methylation.
-K INT:
Same as for call-methylation.
-B FLOAT[K/M/G]:
Same as for call-methylation.
-h:
Same as for call-methylation.
-o FILE:
Same as for call-methylation.
-x STR:
Same as for call-methylation.
-c:
Print in PAF format.
--verbose INT:
Same as for call-methylation.
--version:
Same as for call-methylation.
--kmer-model FILE:
Same as for call-methylation.
--rna:
The dataset is direct RNA.
--pore STR:
Same as for call-methylation.
--disable-cuda=yes|no:
Same as for call-methylation.
--cuda-dev-id INT:
Same as for call-methylation.
--cuda-mem-frac FLOAT:
Same as for call-methylation.
EXAMPLES
- download and extract the dataset including sorted alignments:
wget -O f5c_na12878_test.tgz "https://f5c.page.link/f5c_na12878_test"
tar xf f5c_na12878_test.tgz
- index, call methylation and get methylation frequencies:
f5c index -d chr22_meth_example/fast5_files chr22_meth_example/reads.fastq
f5c call-methylation -b chr22_meth_example/reads.sorted.bam -g chr22_meth_example/humangenome.fa -r chr22_meth_example/reads.fastq > chr22_meth_example/result.tsv
f5c meth-freq -i chr22_meth_example/result.tsv > chr22_meth_example/freq.tsv
- event alignment:
f5c eventalign -b chr22_meth_example/reads.sorted.bam -g chr22_meth_example/humangenome.fa -r chr22_meth_example/reads.fastq > chr22_meth_example/events.tsv
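As noted under --cuda-dev-id, using more than one GPU means invoking f5c once per dataset with a different device identifier. The sketch below shows one way that could be scripted. It assumes a hypothetical layout in which each dataset directory holds reads.sorted.bam, reads.fastq and reads.blow5 and all datasets share ref.fa; the function name per_gpu_commands is illustrative. By default it only prints the commands.

```shell
# Dry run by default: commands are printed, not executed.
# Set F5C=f5c in the environment to run them for real.
F5C=${F5C:-echo f5c}

per_gpu_commands() {
    # Arguments: dataset directories (hypothetical layout, see the text above).
    dev=0
    for dir in "$@"; do
        $F5C call-methylation -b "$dir/reads.sorted.bam" -g ref.fa \
            -r "$dir/reads.fastq" --slow5 "$dir/reads.blow5" \
            --cuda-dev-id "$dev" -o "$dir/meth.tsv"
        dev=$((dev + 1))   # the next dataset goes to the next GPU
    done
}

per_gpu_commands sampleA sampleB
```

As written, the invocations run one after another. To keep several GPUs busy at the same time, each real invocation would have to be started in the background (for example with a trailing & followed by wait), which is left out here to keep the dry-run output in a fixed order.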
AUTHOR
Hasindu Gamaarachchi wrote the framework of f5c and the CUDA code, and integrated them with adapted components from Jared T. Simpson’s Nanopolish, with tremendous support from Chun Wai Lam, Gihan Jayatilaka and Hiruna Samarakoon.
LICENSE
f5c is licensed under the MIT License. f5c reuses code and methods from Nanopolish which is also under the MIT License. The event detection code in f5c is from Oxford Nanopore’s Scrappie basecaller which is under Mozilla Public License 2.0. Some code snippets have been taken from Minimap2 and Samtools that are under the MIT License.
If you use f5c, please cite Gamaarachchi, H., Lam, C.W., Jayatilaka, G. et al. GPU accelerated adaptive banded event alignment for rapid comparative nanopore signal analysis. BMC Bioinformatics 21, 343 (2020). https://doi.org/10.1186/s12859-020-03697-x
SEE ALSO
Full documentation: https://hasindu2008.github.io/f5c/docs/overview
Source code: https://github.com/hasindu2008/f5c/
Publication: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03697-x