Sparse and Skew Hashing of K-Mers

Abstract

Motivation

A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings. In such cases, the memory consumption and query efficiency of the data structure are a concrete challenge.

Results

To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0, n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.
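To make the terms above concrete, here is a minimal C++ sketch of an associative k-mer dictionary and of minimizer extraction. It is not the SSHash implementation: the names `encode`, `minimizer`, and `ToyDictionary` are hypothetical, the minimizer order is assumed to be lexicographic, and a sorted array with binary search stands in for minimal perfect hashing. It only illustrates the interface, where each of the n stored k-mers maps to a distinct identifier in [0, n) and absent k-mers are reported as such.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Pack a DNA string over {A, C, G, T} (length <= 32) into 2 bits per base.
// With A < C < G < T, integer order equals lexicographic order for
// strings of the same length.
uint64_t encode(const std::string& s) {
    assert(s.size() <= 32);
    uint64_t x = 0;
    for (char c : s) {
        uint64_t code = 0;
        switch (c) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: assert(false && "unexpected character");
        }
        x = (x << 2) | code;
    }
    return x;
}

// Minimizer of a k-mer: its smallest substring of length m.
// Assumption: "smallest" means lexicographically smallest; other orders
// (for example, by a hash of the m-mer) are equally possible.
std::string minimizer(const std::string& kmer, size_t m) {
    assert(m > 0 && m <= kmer.size());
    std::string best = kmer.substr(0, m);
    for (size_t i = 1; i + m <= kmer.size(); ++i) {
        std::string candidate = kmer.substr(i, m);
        if (encode(candidate) < encode(best)) best = candidate;
    }
    return best;
}

// Toy associative dictionary: the identifier of a k-mer is its rank in
// the sorted array of encoded keys (a stand-in for perfect hashing).
struct ToyDictionary {
    std::vector<uint64_t> keys;

    explicit ToyDictionary(const std::vector<std::string>& kmers) {
        for (const auto& s : kmers) keys.push_back(encode(s));
        std::sort(keys.begin(), keys.end());
        keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    }

    // Returns the identifier in [0, n), or -1 if the k-mer is absent.
    int64_t lookup(const std::string& kmer) const {
        uint64_t x = encode(kmer);
        auto it = std::lower_bound(keys.begin(), keys.end(), x);
        if (it == keys.end() || *it != x) return -1;
        return static_cast<int64_t>(it - keys.begin());
    }
};
```

In this sketch a lookup costs a binary search over all keys and the keys are stored uncompressed, so it shows the query semantics only, not the space/time trade-off discussed above.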

Availability

The C++ implementation of the dictionary is available at https://github.com/jermp/sshash.

Contact

giulio.ermanno.pibiri@isti.cnr.it

Author and article information

Contributors

Giulio Ermanno Pibiri

Journal

Publisher: bioRxiv

Publication date (Electronic preprint): January 18 2022

Article

DOI: 10.1101/2022.01.15.476199

SO-VID: 3cb7c5c9-6f28-4d33-b591-dbfd48a54570

ScienceOpen disciplines: Quantitative & Systems biology, Biophysics
