f5c GPU performance


We compared the performance of the f5c CPU version with the GPU-accelerated version on a variety of systems, covering embedded systems, laptops, workstations and servers (see the table below).

| System name | Info | CPU | CPU cores/threads | RAM | GPU | GPU memory |
|---|---|---|---|---|---|---|
| Jetson TX2 | NVIDIA Jetson TX2 embedded module | ARMv8 ARM Cortex-A57 + NVIDIA Denver2 | 6/6 | 8 GB | Tegra | shared with RAM |
| Acer E5 laptop | Acer F5-573G | i7-7500U | 2/4 | 8 GB | GeForce 940M | 4 GB |
| XPS15 laptop | Dell XPS 15 laptop | i7-8750H | 6/12 | 16 GB | GeForce 1050 Ti | 4 GB |
| Old workstation | Unbranded workstation | i7-6700K | 4/8 | 32 GB | Tesla C2075 | 6 GB |
| HP workstation | HP Z640 workstation | Xeon E5-1630 | 4/8 | 32 GB | Kepler K40 | 12 GB |
| Dell server | Dell server | Xeon Silver 4114 | 20/40 | 376 GB | Tesla V100 | 16 GB |

Speedup of adaptive banded event alignment

The adaptive banded event alignment algorithm consumed the majority (~70%) of the runtime of the whole methylation calling process when run on the CPU. Hence, as a first step, we offloaded this alignment to the GPU. To maximise the benefit, our GPU-accelerated version follows a heterogeneous approach: both the CPU and the GPU are utilised simultaneously. The performance of adaptive banded event alignment is shown below.
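To illustrate why using the CPU and the GPU together can beat using the GPU alone, here is a minimal sketch of a static work split. It is not f5c's actual scheduling logic: the function names `split_batch` and `batch_time`, and the throughput figures (100 and 400 reads per second), are hypothetical and chosen only for the arithmetic. The idea is to give each device a share of the reads proportional to its throughput so that both finish at about the same time.

```python
def split_batch(n_reads, cpu_rate, gpu_rate):
    """Split n_reads so the CPU and the GPU finish at about the same time.

    cpu_rate and gpu_rate are hypothetical throughputs in reads per second.
    Returns (reads_for_cpu, reads_for_gpu).
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_share = round(n_reads * gpu_rate / (cpu_rate + gpu_rate))
    return n_reads - gpu_share, gpu_share


def batch_time(n_cpu, n_gpu, cpu_rate, gpu_rate):
    """Wall-clock time when both devices run concurrently: the slower side dominates."""
    return max(n_cpu / cpu_rate, n_gpu / gpu_rate)


# Assumed numbers for illustration only (not measured values).
n_cpu, n_gpu = split_batch(20_000, cpu_rate=100.0, gpu_rate=400.0)
gpu_only = batch_time(0, 20_000, 100.0, 400.0)
hetero = batch_time(n_cpu, n_gpu, 100.0, 400.0)
print(n_cpu, n_gpu, gpu_only, hetero)
```

Under these assumed rates the CPU takes a fifth of the reads, and the batch finishes in 40 s instead of the 50 s needed when the GPU works alone. A real implementation would also have to account for transfer overheads and for reads that differ in length.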


Overall speedup of processing (excluding I/O)

The overall performance of the whole methylation calling process (excluding I/O) is shown below.

[Figure: proc]
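As a rough guide to how alignment speedup translates into processing speedup, Amdahl's law can be applied to the ~70% figure quoted earlier. The sketch below assumes that exactly 70% of the CPU-only processing time is alignment and that the remaining 30% runs at unchanged speed. Both are simplifications, and the function name `overall_speedup` is introduced here for illustration only.

```python
def overall_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: speedup of the whole run when only a fraction is accelerated.

    accel_fraction: share of the original runtime that is accelerated (0..1).
    accel_speedup:  speedup achieved on that share (> 0, may be infinite).
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be between 0 and 1")
    if accel_speedup <= 0:
        raise ValueError("accel_speedup must be positive")
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)


# Assumed alignment speedups; 0.7 is the alignment share quoted in the text.
for s in (2, 5, 10, float("inf")):
    print(s, round(overall_speedup(0.7, s), 2))
```

Even with an infinitely fast alignment step, the processing speedup under these assumptions cannot exceed 1/0.3, roughly 3.3x, because the remaining 30% of the work is untouched.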

Overall speedup (including I/O)

The overall performance of the whole methylation calling process (including I/O) is shown below.

[Figure: overall]

Note that the overall speedup on the Dell server was limited by slow fast5 I/O relative to the processing capability of the Tesla V100.
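The effect of an I/O bottleneck can be sketched with a simple model. It assumes that I/O time is the same with and without the GPU and that I/O is not overlapped with processing. The function name `speedup_with_io` and the timings used are hypothetical, not measurements from these experiments.

```python
def speedup_with_io(io_time, cpu_proc_time, proc_speedup):
    """End-to-end speedup when I/O time is fixed and not overlapped with processing.

    io_time:        seconds spent on I/O (assumed identical in both runs).
    cpu_proc_time:  seconds of processing in the CPU-only run.
    proc_speedup:   speedup of the processing part in the GPU-accelerated run.
    """
    if proc_speedup <= 0:
        raise ValueError("proc_speedup must be positive")
    before = io_time + cpu_proc_time
    after = io_time + cpu_proc_time / proc_speedup
    return before / after


# Assumed timings: 60 s of I/O and 240 s of CPU-only processing.
print(speedup_with_io(60.0, 240.0, 4.0))   # processing 4x faster
print(speedup_with_io(60.0, 240.0, 8.0))   # processing 8x faster
```

In this model, doubling the processing speedup from 4x to 8x raises the end-to-end speedup by much less than a factor of two, since the fixed I/O time increasingly dominates the run.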

Note: In all cases, the programs were executed with the maximum number of hardware threads supported by the CPU of the particular system. The small chr20 dataset (~150M bases in total, ~20K reads) was used.

Acknowledgement: Thanks to @GPrathap for running the experiments on the Acer laptop.