      Identifying and removing haplotypic duplication in primary genome assemblies

      brief-report


          Abstract

          Motivation

          Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

          Results

          Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.
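          As a rough illustration of how read depth and sequence similarity can be combined, the toy sketch below flags a contig as a haplotig candidate when its mean depth falls well below the diploid depth and most of it aligns to another contig. This is not the purge_dups algorithm: the `Contig` fields, the function name `flag_haplotig_candidates`, the thresholds and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    mean_depth: float       # average read depth across the contig
    best_match_frac: float  # fraction (0..1) covered by its best alignment to another contig


def flag_haplotig_candidates(contigs, diploid_depth, depth_ratio=0.75, min_match_frac=0.8):
    """Return names of contigs that look like haplotypic duplicates.

    Assumption (illustrative only): a duplicated haplotype carries roughly
    half the diploid depth, so any contig below depth_ratio * diploid_depth
    that is also mostly covered by an alignment to another contig is flagged.
    """
    depth_cutoff = diploid_depth * depth_ratio
    return [
        c.name
        for c in contigs
        if c.mean_depth < depth_cutoff and c.best_match_frac >= min_match_frac
    ]


# Hypothetical input: four contigs from an assembly with diploid depth ~60x.
contigs = [
    Contig("ctg1", 60.0, 0.10),  # full depth, unique sequence
    Contig("ctg2", 31.0, 0.95),  # half depth, nearly all aligned elsewhere
    Contig("ctg3", 29.0, 0.20),  # half depth, but little similarity to other contigs
    Contig("ctg4", 58.0, 0.90),  # full depth repeat, not a haplotig
]
print(flag_haplotig_candidates(contigs, diploid_depth=60.0))
```

          Requiring both signals is the point of the sketch: low depth alone would also flag `ctg3`, and similarity alone would also flag the full-depth repeat `ctg4`.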

          Availability and implementation

          The source code is written in C and is available at https://github.com/dfguan/purge_dups.

          Supplementary information

          Supplementary data are available at Bioinformatics online.


            Author and article information

            Contributors
            Role: Associate Editor
            Journal
            Bioinformatics
            Oxford University Press
            1367-4803
            1367-4811
            01 May 2020
            23 January 2020
            Volume: 36
            Issue: 9
            Pages: 2896-2898
            Affiliations
            [b1] Department of Computer Science and Technology, Center for Bioinformatics, Harbin Institute of Technology, Harbin 150001, China
            [b2] Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK
            [b3] Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK
            Author notes
            To whom correspondence should be addressed: ydwang@hit.edu.cn or rd109@cam.ac.uk
            Author information
            http://orcid.org/0000-0002-2715-4187
            http://orcid.org/0000-0003-2237-513X
            http://orcid.org/0000-0002-9130-1006
            Article
            Publisher ID: btaa025
            DOI: 10.1093/bioinformatics/btaa025
            PMC ID: 7203741
            PubMed ID: 31971576
            © The Author(s) 2020. Published by Oxford University Press.

            This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

            History
            Received: 13 August 2019
            Revised: 17 December 2019
            Accepted: 19 January 2020
            Page count
            Pages: 3
            Funding
            Funded by: National Key Research and Development Program of China, DOI 10.13039/501100012166;
            Award ID: 2017YFC0907503
            Award ID: 2018YFC0910504
            Award ID: 2017YFC1201201
            Funded by: Wellcome Trust, DOI 10.13039/100004440;
            Award ID: WT207492
            Award ID: WT206194
            Categories
            Applications Notes
            Genome Analysis

            Bioinformatics & Computational biology