      The minimizer Jaccard estimator is biased and inconsistent

      research-article


          Abstract

          Motivation

          Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences.

          Results

          We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.
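The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' code (their scripts are at the repository linked below). It assumes that the estimator is the Jaccard similarity of the two sets of minimizer k-mers, that a window is `w` consecutive k-mers, and that k-mers are ordered by a seeded hash. The function names, the hash choice, and the demo parameters are all hypothetical.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the list of k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer, seed=0):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minimizer_sketch(seq, k, w, seed=0):
    """Set of k-mers that have the smallest hash in some window of w consecutive k-mers.

    Returns an empty set if seq has fewer than w k-mers.
    """
    ks = kmers(seq, k)
    hashes = [kmer_hash(x, seed) for x in ks]
    sketch = set()
    for start in range(len(ks) - w + 1):
        best = min(range(start, start + w), key=lambda i: hashes[i])
        sketch.add(ks[best])
    return sketch


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 by convention if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def true_jaccard(a, b, k):
    """Jaccard similarity of the full k-mer sets of two sequences."""
    return jaccard(set(kmers(a, k)), set(kmers(b, k)))


def minimizer_jaccard(a, b, k, w, seed=0):
    """Jaccard similarity of the two minimizer sketches (the assumed estimator)."""
    return jaccard(minimizer_sketch(a, k, w, seed), minimizer_sketch(b, k, w, seed))


if __name__ == "__main__":
    # Hypothetical demo: a random sequence and a copy with random substitutions.
    rng = random.Random(1)
    a = "".join(rng.choice("ACGT") for _ in range(2000))
    b = "".join(rng.choice("ACGT") if rng.random() < 0.05 else c for c in a)
    print("true Jaccard:     ", true_jaccard(a, b, k=12))
    print("minimizer Jaccard:", minimizer_jaccard(a, b, k=12, w=20))
```

With `w=1` every k-mer is its own minimizer, so the two values coincide. For larger `w` they can differ, and the abstract's claim is that this difference does not vanish in expectation as the sequences grow.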

          Availability and implementation

          Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce.

          Supplementary information

          Supplementary data are available at Bioinformatics online.


                Author and article information

                Contributors
                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                July 2022
                27 June 2022
                Volume: 38
                Issue: Suppl 1, ISCB ISMB 2022 Proceedings
                Pages: i169-i176
                Affiliations
                Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
                Department of Biology, The Pennsylvania State University, University Park, PA, USA
                Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
                Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
                Author notes
                To whom correspondence should be addressed. E-mail: pzm11@psu.edu

                Authors are listed in alphabetical order.

                Article
                btac244
                10.1093/bioinformatics/btac244
                9235516
                35758786
                b69104c2-7fa8-4071-88c3-bae2ce49ff97
                © The Author(s) 2022. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Page count
                Pages: 8
                Funding
                Funded by: National Science Foundation, DOI 10.13039/100000001;
                Award ID: 2029170
                Award ID: 1453527
                Award ID: 1931531
                Award ID: 1356529
                Categories
                ISCB/ISMB 2022
                Genome Sequence Analysis
                AcademicSubjects/SCI01060

                Bioinformatics & Computational biology