Long-read sequencing and de novo assembly of a Chinese genome

Abstract

Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
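The contig and scaffold N50 values quoted above are standard assembly contiguity statistics: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The following is a minimal sketch of that calculation. The function name `n50` and the toy lengths are hypothetical and chosen for illustration only; they are not the HX1 data or the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (arbitrary units), not real assembly data.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # total is 54; 10 + 9 + 8 = 27 reaches half, so this prints 8
```

Applied to a real assembly, the input would be the list of contig (or scaffold) lengths in base pairs, for example as read from a FASTA index.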

Abstract

Short-read sequencing has inherent limitations in the characterisation of long repeat elements. Shi and Guo et al. combine single-molecule real-time sequencing and IrysChip to construct a Chinese reference genome that fills many gaps in the reference genome, and identify novel spliced genes.

Author and article information

Journal

Journal ID (nlm-ta): Nat Commun

Journal ID (iso-abbrev): Nat Commun

Title: Nature Communications

Publisher: Nature Publishing Group

ISSN (Electronic): 2041-1723

Publication date (Electronic): 30 June 2016

Publication date Collection: 2016

Volume: 7

Electronic Location Identifier: 12065

Affiliations

[1] Guangdong-Hongkong-Macau Institute of CNS Regeneration, Jinan University, Guangzhou 510632, China

[2] Ministry of Education Joint International Research Laboratory of CNS Regeneration, Jinan University, Guangzhou 510632, China

[3] Co-innovation Center of Neuroregeneration, Nantong University, Nantong 226001, China

[4] Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California 90089, USA

[5] Department of Genome Sciences, Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA

[6] Genetic, Molecular, and Cellular Biology Program, Keck School of Medicine, University of Southern California, Los Angeles, California 90089, USA

[7] Wuhan Institute of Biotechnology, Wuhan 430000, China

[8] Department of Pediatrics, The Ohio State University, and The Research Institute at Nationwide Children's Hospital, Columbus, Ohio 43205, USA

[9] Nextomics Biosciences, Wuhan 430000, China

[10] School of Chemical Engineering and Pharmacy, Wuhan Institute of Technology, Wuhan 430000, China

[11] Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Huazhong University of Science and Technology, Wuhan 430022, China

[12] Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, New York, New York 11797, USA

[13] USDA/ARS Children's Nutrition Research Center, Department of Pediatrics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA

[14] Departments of Systems Biology and Biomedical Informatics, Columbia University, New York, New York 10032, USA

[15] Department of Psychiatry & Behavioral Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California 90033, USA

[16] National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, Maryland 20894, USA

[17] Department of Ophthalmology, The University of Hong Kong, Hong Kong, China

[18] State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong, China

Author notes

[a] libingzh@gmail.com

[b] hrmaskf@hku.hk

[c] kaiwang@usc.edu

[*] These authors contributed equally to this work.

Author information

Hui Yang http://orcid.org/0000-0001-7325-9425

James A. Knowles http://orcid.org/0000-0002-3307-5741

Article

Publisher Item ID: ncomms12065

DOI: 10.1038/ncomms12065

PMC ID: 4931320

PubMed ID: 27356984

SO-VID: 7317ced9-46dd-459a-856e-4750b7456d6e

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
