      Is Open Access

      Long-read sequencing and de novo assembly of a Chinese genome



          Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
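The contig and scaffold N50 values quoted above summarize assembly contiguity: N50 is the length L such that sequences of length at least L together contain at least half of the assembled bases. The following is a minimal sketch of that calculation, not the authors' pipeline; the function name `n50` and the toy lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units), not data from the HX1 assembly.
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(toy_contigs))  # prints 8: the two longest contigs cover 16 of 32 bases
```

Applied to contig lengths the function gives the contig N50; applied to scaffold lengths, which include N-base gaps, it gives the scaffold N50.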


          Short-read sequencing has inherent limitations in the characterisation of long repeat elements. Shi and Guo et al. combine single-molecule real-time sequencing and IrysChip to construct a Chinese reference genome that fills many gaps in the reference genome, and identify novel spliced genes.



                Author and article information

                Nature Communications (Nat Commun)
                Nature Publishing Group
                30 June 2016
                Volume 7
                [1] Guangdong-Hongkong-Macau Institute of CNS Regeneration, Jinan University, Guangzhou 510632, China
                [2] Ministry of Education Joint International Research Laboratory of CNS Regeneration, Jinan University, Guangzhou 510632, China
                [3] Co-innovation Center of Neuroregeneration, Nantong University, Nantong 226001, China
                [4] Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California 90089, USA
                [5] Department of Genome Sciences, Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
                [6] Genetic, Molecular, and Cellular Biology Program, Keck School of Medicine, University of Southern California, Los Angeles, California 90089, USA
                [7] Wuhan Institute of Biotechnology, Wuhan 430000, China
                [8] Department of Pediatrics, The Ohio State University, and The Research Institute at Nationwide Children's Hospital, Columbus, Ohio 43205, USA
                [9] Nextomics Biosciences, Wuhan 430000, China
                [10] School of Chemical Engineering and Pharmacy, Wuhan Institute of Technology, Wuhan 430000, China
                [11] Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Huazhong University of Science and Technology, Wuhan 430022, China
                [12] Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, New York, New York 11797, USA
                [13] USDA/ARS Children's Nutrition Research Center, Department of Pediatrics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
                [14] Departments of Systems Biology and Biomedical Informatics, Columbia University, New York, New York 10032, USA
                [15] Department of Psychiatry & Behavioral Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California 90033, USA
                [16] National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, Maryland 20894, USA
                [17] Department of Ophthalmology, The University of Hong Kong, Hong Kong, China
                [18] State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong, China
                Author notes

                These authors contributed equally to this work.

                Copyright © 2016, Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.

                This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit



