Identifying and removing haplotypic duplication in primary genome assemblies

Abstract

Motivation

Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often causes assemblers to create two copies of a region rather than one, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem, but they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

Results

Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.
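
To illustrate how read depth and sequence similarity can be combined, the following is a minimal Python sketch, not the purge_dups implementation. It assumes each contig already has a mean read depth and a best alignment to a longer contig. The names `Contig`, `classify`, `haploid_diploid_cutoff` and `min_aligned_fraction`, and the threshold values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contig:
    name: str
    mean_depth: float            # mean read depth across the contig
    best_hit: Optional[str]      # name of the best-matching longer contig, if any
    aligned_fraction: float      # fraction of this contig covered by that match


def classify(contig: Contig,
             haploid_diploid_cutoff: float,
             min_aligned_fraction: float = 0.8) -> str:
    """Label a contig using two signals (illustrative thresholds only).

    A contig sequenced at roughly half the diploid depth that also matches
    another contig is a candidate duplicate: a 'haplotig' if almost fully
    contained in the match, an 'overlap_candidate' if only partly matched.
    Everything else is kept as 'primary'.
    """
    low_depth = contig.mean_depth < haploid_diploid_cutoff
    if not low_depth or contig.best_hit is None:
        return "primary"
    if contig.aligned_fraction >= min_aligned_fraction:
        return "haplotig"
    if contig.aligned_fraction > 0:
        return "overlap_candidate"
    return "primary"


if __name__ == "__main__":
    # Made-up contigs; depths and fractions are not from the paper.
    contigs = [
        Contig("ctg1", 60.0, None, 0.0),
        Contig("ctg2", 29.0, "ctg1", 0.95),
        Contig("ctg3", 31.0, "ctg1", 0.10),
    ]
    for c in contigs:
        print(c.name, classify(c, haploid_diploid_cutoff=45.0))
```

The point of the sketch is that neither signal is sufficient alone: sequence similarity without the depth check would also flag genuine repeats, and low depth without a match says nothing about where the duplicate copy lies.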

Availability and implementation

The source code is written in C and is available at https://github.com/dfguan/purge_dups.

Supplementary information

Supplementary data are available at Bioinformatics online.

Author and article information

Contributors

Alfonso Valencia: Role: Associate Editor

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 01 May 2020

Publication date (Electronic): 23 January 2020

Publication date PMC-release: 23 January 2020

Volume: 36

Issue: 9

Pages: 2896-2898

Affiliations

[b1] Department of Computer Science and Technology, Center for Bioinformatics, Harbin Institute of Technology, Harbin 150001, China

[b2] Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK

[b3] Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK

Author notes

To whom correspondence should be addressed. ydwang@hit.edu.cn or rd109@cam.ac.uk

Author information

Shane A McCarthy http://orcid.org/0000-0002-2715-4187

Kerstin Howe http://orcid.org/0000-0003-2237-513X

Richard Durbin http://orcid.org/0000-0002-9130-1006

Article

Publisher ID: btaa025

DOI: 10.1093/bioinformatics/btaa025

PMC ID: 7203741

PubMed ID: 31971576

SO-VID: e3b96554-ed69-4a8c-bae9-a67d42cc89a3

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 13 August 2019

Date revision received: 17 December 2019

Date accepted: 19 January 2020

Page count

Pages: 3

Funding

Funded by: National Key Research and Development Program of China, DOI 10.13039/501100012166

Award ID: 2017YFC0907503

Award ID: 2018YFC0910504

Award ID: 2017YFC1201201

Funded by: Wellcome Trust, DOI 10.13039/100004440

Award ID: WT207492

Award ID: WT206194
