
      ROCK: digital normalization of whole genome sequencing data



            Due to advances in high-throughput sequencing technologies, generating whole genome sequencing (WGS) data with high coverage depth (e.g. ≥500×) is now common, especially for non-eukaryotic genomes. Such high-coverage WGS data generally ensures that most nucleotide positions of the genome are sequenced a sufficient number of times without error. However, performing bioinformatic analyses (e.g. sequencing error correction, whole genome de novo assembly) on such highly redundant data requires substantial running time and memory.

            To reduce redundancy within a WGS dataset, the most obvious approach is to randomly downsample the high-throughput sequencing reads (HTSRs). Nevertheless, this naive strategy is inefficient, as it does not reduce variation in sequencing depth and thereby erodes the coverage of genome regions that are already under-covered (if any). To cope with this problem, a simple greedy algorithm, named digital normalization, was designed to downsample HTSRs specifically over genome regions that are over-covered. Given an upper-bound threshold κ>1, it returns a subset of HTSRs inducing an expected coverage depth of at most εκ across the genome (where ε>1 is a small constant). By discarding highly redundant HTSRs while retaining sufficient and homogeneous coverage depth (≈ εκ), this algorithm strongly decreases both the running time and the memory required to subsequently analyze WGS data, often with little impact on the expected results.
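The greedy procedure described above can be sketched in a few lines. This is a minimal illustration of the general digital normalization idea (keep a read only while the abundance of its k-mers, tracked over the reads kept so far, stays below the threshold κ), not ROCK's actual C++ implementation; the function names and the tiny k value are illustrative choices.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield all overlapping k-mers of a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def digital_normalization(reads, k, kappa):
    """Greedy digital normalization sketch: keep a read only if the
    median abundance of its k-mers, counted over the reads kept so
    far, is below the threshold kappa; otherwise discard it as
    redundant.  Returns the retained subset of reads."""
    counts = defaultdict(int)   # k-mer -> abundance among kept reads
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k
        abundances = sorted(counts[km] for km in read_kmers)
        median = abundances[len(abundances) // 2]
        if median < kappa:
            kept.append(read)
            for km in read_kmers:
                counts[km] += 1
    return kept
```

On five identical copies of a read with k=4 and κ=2, the sketch keeps only the first two copies and discards the rest, which is exactly the over-coverage reduction the paragraph describes.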

            Interestingly, the digital normalization algorithm can be enhanced in several ways, so that the final subset contains fewer but higher-quality HTSRs. ROCK (Reducing Over-Covering K-mers) was therefore developed with the key purpose of implementing a fast, accurate and easy-to-use digital normalization procedure. Written in C++, ROCK achieves fast running times using only a single thread. To improve the digital normalization procedure, ROCK also implements two novel strategies: (i) downsampling the HTSRs based on their Phred scores, and (ii) a final step that filters out low-covering HTSRs. Thanks to these improvements, ROCK can be used as a preprocessing step prior to fast genome de novo assembly. The source code is available under the GNU Affero General Public License v3.0 at https://gitlab.pasteur.fr/vlegrand/ROCK.
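One way to realize strategy (i) is to rank reads by their Phred quality before the greedy pass, so that the normalization, which preferentially retains reads seen earlier, keeps the highest-quality ones. The sketch below only illustrates this ordering idea under the standard Phred+33 FASTQ encoding; the function names are hypothetical and ROCK's actual quality-based downsampling may differ.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, decoded from its ASCII-encoded
    FASTQ quality string (Phred+33 by default)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def prioritize_by_quality(reads_with_quals):
    """Order (sequence, quality) pairs by decreasing mean Phred score,
    so a subsequent greedy normalization pass retains high-quality
    reads in preference to low-quality redundant ones."""
    return sorted(reads_with_quals,
                  key=lambda rq: mean_phred(rq[1]),
                  reverse=True)
```

For instance, a read whose quality string is all 'I' characters (Phred 40) would be ranked ahead of one whose string is all '!' characters (Phred 0).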

            [ In: Lemaitre C, Becker E, Derrien T (eds), Proceedings of JOBIM 2022, Posters & Demos, Rennes, France, 5-8 July, p. 21]


            Author and article information

            ScienceOpen Posters
            23 July 2022
            [1 ] Institut Pasteur, Université Paris Cité, Plateforme HPC, F-75015 Paris, France
            [2 ] Prédicteurs moléculaires et nouvelles cibles en oncologie, INSERM, Gustave Roussy, Université Paris-Saclay, Villejuif, France
            [3 ] Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
            Author notes

            This work has been published open access under the Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

            The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
            Data structures & Algorithms, Bioinformatics & Computational biology
            high-throughput sequencing, digital normalization, k-mer

