32
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          An important assessment prior to genome assembly and related analyses is genome profiling, where the k-mer frequencies within raw sequencing reads are analyzed to estimate major genome characteristics such as size, heterozygosity, and repetitiveness. Here we introduce GenomeScope 2.0 ( https://github.com/tbenavi1/genomescope2.0), which applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. We describe and evaluate a practical implementation of the polyploid-aware mixture model that quickly and accurately infers genome properties across thousands of simulated and several real datasets spanning a broad range of complexity. We also present a method called Smudgeplot ( https://github.com/KamilSJaron/smudgeplot) to visualize and estimate the ploidy and genome structure of a genome by analyzing heterozygous k-mer pairs. We successfully apply the approach to systems of known variable ploidy levels in the Meloidogyne genus and the extreme case of octoploid Fragaria×ananassa.

          Abstract

          Prior to genome assembly, the raw sequencing reads must be analyzed for assessment of major genome characteristics such as genome size, heterozygosity, and repetitiveness. For this purpose, the authors introduce GenomeScope 2.0, an extension of GenomeScope for polyploid genomes, and Smudgeplot, which can estimate a genome’s ploidy.

          Related collections

          Most cited references37

          • Record: found
          • Abstract: found
          • Article: not found

          A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

          Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics

            Abstract Genomics promises comprehensive surveying of genomes and metagenomes, but rapidly changing technologies and expanding data volumes make evaluation of completeness a challenging task. Technical sequencing quality metrics can be complemented by quantifying completeness of genomic data sets in terms of the expected gene content of Benchmarking Universal Single-Copy Orthologs (BUSCO, http://busco.ezlab.org). The latest software release implements a complete refactoring of the code to make it more flexible and extendable to facilitate high-throughput assessments. The original six lineage assessment data sets have been updated with improved species sampling, 34 new subsets have been built for vertebrates, arthropods, fungi, and prokaryotes that greatly enhance resolution, and data sets are now also available for nematodes, protists, and plants. Here, we present BUSCO v3 with example analyses that highlight the wide-ranging utility of BUSCO assessments, which extend beyond quality control of genomics data sets to applications in comparative genomics analyses, gene predictor training, metagenomics, and phylogenomics.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Phased diploid genome assembly with single-molecule real-time sequencing.

              While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
                Bookmark

                Author and article information

                Contributors
                tbenavi1@jhu.edu
                Journal
                Nat Commun
                Nat Commun
                Nature Communications
                Nature Publishing Group UK (London )
                2041-1723
                18 March 2020
                18 March 2020
                2020
                : 11
                : 1432
                Affiliations
                [1 ]ISNI 0000 0001 2171 9311, GRID grid.21107.35, Johns Hopkins University, ; Baltimore, MD USA
                [2 ]ISNI 0000 0001 2165 4204, GRID grid.9851.5, University of Lausanne, ; Lausanne, CH Switzerland
                [3 ]ISNI 0000 0001 2223 3006, GRID grid.419765.8, Swiss Institute of Bioinformatics, ; Lausanne, CH Switzerland
                [4 ]ISNI 0000 0004 0387 3667, GRID grid.225279.9, Cold Spring Harbor Laboratory, Cold Spring Harbor, ; New York, NY USA
                Author information
                http://orcid.org/0000-0003-3137-8439
                http://orcid.org/0000-0003-1470-5450
                Article
                14998
                10.1038/s41467-020-14998-3
                7080791
                32188846
                bfbeb655-d8b8-4ec4-a167-b648a52618db
                © The Author(s) 2020

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                : 27 August 2019
                : 10 February 2020
                Categories
                Article
                Custom metadata
                © The Author(s) 2020

                Uncategorized
                genomic analysis,software,computational biology and bioinformatics,genome informatics

                Comments

                Comment on this article