Whole Genome Assembly of the Snout Otter Clam,  Lutraria rhynchaena , Using Nanopore and Illumina Data, Benchmarked Against Bivalve Genome Assemblies

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Introduction Production of cultured bivalve molluscs was 17.1 million tons in 2016 accounting for 21.4% of global aquaculture production (FAO, 2018). The lack of genomic resources coupled with limited understanding of the molecular basis of gene expression and phenotypic variation have limited advances in aquaculture-based productivity of marine bivalves. Understanding the molecular basis of phenotypic variation and gene function is therefore important for selective breeding programs for traits such as increased growth and disease resistance. Similarly, whole genome assemblies support GWAS studies to identify trait-specific loci and for genomic-based selective breeding. To this end, whole-genome sequencing has been conducted on several commercial bivalve species, including the edible oysters Crassostrea virginica (Gómez-Chiarri et al., 2015), Crassostrea gigas (Gerdol et al., 2015), pearl oysters (Takeuchi et al., 2012) and clams (Mun et al., 2017). However, in general, genomic data for bivalve molluscs, which includes a taxonomically diverse group of species, are sparse (Takeuchi et al., 2012; Zhang et al., 2012; Murgarella et al., 2016; Du et al., 2017; Li et al., 2017; Mun et al., 2017; Sun et al., 2017; Uliano-Silva et al., 2017; Wang et al., 2017; Li et al., 2018; Powell et al., 2018; Renaut et al., 2018; Bai et al., 2019). In this study we present the first genomic resources for a species of clam from the superfamily Mactroidea and for a Vietnamese shellfish species and generate a draft reference genome to form the basis of on-going selective breeding studies. This study also demonstrates the efficacy of using Oxford Nanopore Technology (ONT) reads to scaffold bivalve genome assemblies and shows the value of these relatively inexpensive long reads for spanning large repetitive regions and overcoming complex assembly issues caused by high heterozygosity, which typically confounds short read only assemblies. The quality of our assembly is also benchmarked against other bivalve genome assemblies and we present an initial phylogenomic analysis for the class Bivalvia, which illustrates the value and potential of the increasing number of high quality genomic data sets for phylogenetics. Materials and Methods Extraction, Library Preparation, and Sequencing Muscle tissue was collected from a sample obtained from a snout otter clam farm in Van Dong district, Quang Ninh province in Vietnam. For short read sequencing of the clam genome, genomic DNA (gDNA) was extracted according to Sokolov’s improved method (Sokolov, 2000). Two sequencing libraries were prepared. These include a PCR-free library prepared using NuGen Celero DNA-Seq Library Preparation Kit (Tecan Genomics, San Carlos, CA) according to manufacturer’s instruction and a PCR-based library prepared with NEBNext Ultra DNA Library Preparation Kit (New England Biolabs, Ipwich, MA). For transcriptomic data, RNA was extracted from digestive gland tissue using the Zymo Quick-RNA Miniprep kit, followed by RNA library constructed using the Nugen Universal Plus mRNA-Seq Kit (Tecan Genomics, San Carlos, CA) according to manufacturer’s instruction. Both DNA and RNA libraries were sequenced on an Illumina NovaSeq6000 platform at the Deakin Genomics Centre with 2 × 150 bp configuration. For Nanopore long read sequencing, gDNA was extracted from the muscle tissue of the same individual using the column-based Zymo Quick DNA miniprep plus. Approximately 1 µg of the purified gDNA was used as the input for library preparation using the LSK109 kit followed by sequencing on a FLO-MIN106 revD SpotON R9.4 Flow Cell for 48 h. Genome Size Estimation Short reads generated by the Illumina NovaSeq SGS platform were preprocessed for genome size estimation. The fastp v0.19.4 tool (Chen et al., 2018) was used to trim polyG tails from the 3’ end of short reads (–poly_g_min_len 1), followed by adapter and quality trimming with Trimmomatic v0.36 (Bolger et al., 2014) (ILLUMINACLIP:2:30:10, AVGQUAL:20). Reads of mitochondrial origin were removed via alignment against the Lutraria rhynchaena mitogenome (Gan et al., 2016) using bowtie2 v2.3.3.1 (default parameters) (Langmead and Salzberg, 2012). Jellyfish v2.2.6 (Marçais and Kingsford, 2011) was used to count the frequency of 19-, 21- and 25-mers in the preprocessed reads and finally, these k-mer histograms were uploaded to the GenomeScope webserver (max kmer coverage disabled) (Vurture et al., 2017) for an estimation of the L. rhynchaena haploid genome size. De Novo Hybrid Assembly of the Snout Otter Clam Genome Long reads generated by the ONT MinION sequencing device were basecalled and trimmed (adapter and quality) with Guppy v3.0.3 (high accuracy mode, min_qscore 7), which is available via the ONT community site (https://community.nanoporetech.com). A hybrid de novo assembly approach was performed with the MaSuRCA v3.2.8 assembler (USE_LINKING_MATES = 0, cgwErrorRate = 0.15, KMER_COUNT_THRESHOLD = 1) (Zimin et al., 2013), using the previously trimmed long Nanopore reads and polyG-trimmed short Illumina reads as input. The MaSuRCA assembler first reduces high coverage Illumina reads into longer super-reads, then aligns these to long Nanopore reads to build even longer mega-reads (Zimin et al., 2017). These are then assembled by the CABOG assembler within MaSuRCA. Additionally, this version of MaSuRCA uses the high coverage of the longer more error prone Nanopore reads to build consensus sequences for regions not captured by short accurate Illumina reads. As a result, this approach combines the benefits from the accuracy of Illumina reads and the length of Nanopore reads. Completeness of the assembly was assessed with BUSCO v3.0.2 (Simão et al., 2015), using single-copy orthologs from the Metazoan dataset (metazoa_odb9). As bivalve genomes have been previously shown to be highly heterozygous (Zhang et al., 2012; Wang et al., 2017; Powell et al., 2018), an observation also apparent from the bimodal distribution of k-mer profiles obtained from the GenomeScope analysis in this study ( Data Sheet 1 ), it was necessary to optimize the assembly by removing haplotigs or duplications in the haploid representation of this genome. This was achieved by aligning the long reads back to the assembly with minimap2 v2.15 (-x map_ont, –secondary = no) (Li, 2018) and passing the alignment through the Purge Haplotigs pipeline v1.0.4 (-l 5, -m 20, -h 150) (Roach et al., 2018) to remove artifactual scaffolds. Estimation of Heterozygosity and Repeat Content Presence of single nucleotide polymorphisms (SNP) in the snout otter clam genome was estimated by aligning short reads to the final curated assembly with bowtie2 v2.3.3.1 (Chen et al., 2018) and called with the mpileup (–max-depth 200) and call (-mv) commands within bcftools v1.9 (Li, 2011), retaining high quality calls with sufficient read depth (QUAL ≥ 20, DP ≥ 10, AF ≥ 0.25). Repeat families within the final assembly were identified de novo with RepeatModeler v1.0.11 (default parameters) (Smit and Hubley, 2019), which uses both RECON v1.08 (Bao and Eddy, 2002) and RepeatScout v1.0.5 (Price et al., 2005) to search for repeats within a given assembly. From this, a set of consensus sequences for each identified family was then used to mask repeats within the assembly with RepeatMasker v4.0.9 (-e rmblast) (Smit et al., 2018). Benchmarking Against Other Bivalve Genomes The assembly sizes, scaffold N50 lengths and other characteristics of the L. rhynchaena assembly were compared with 13 other published bivalve genomes representing six orders and eight families. These comprise: Argopecten purpuratus (Li et al., 2018), Bathymodiolus platifrons (Sun et al., 2017), Chlamys farreri (Li et al., 2017), C. gigas (Zhang et al., 2012), Limnoperna fortunei (Uliano-Silva et al., 2017), Modiolus philippinarum (Sun et al., 2017), Mytillus galloprovincialis (Murgarella et al., 2016), Patinopecten yessoensis (Wang et al., 2017), Pinctada fucata (Takeuchi et al., 2012; Du et al., 2017), Ruditapes philippinarum (Mun et al., 2017), Saccostrea glomerata (Powell et al., 2018), Scapharca broughtonii (Bai et al., 2019) and Venustaconcha ellipsiformis (Renaut et al., 2018). Assessments of repeat content and BUSCO completeness (metazoa_odb9) were repeated for these 13 bivalve genomes as described above. Transcriptome Assembly One RNA-seq library was sequenced and a transcriptome was generated to assist in gene prediction. Paired-end, strand-specific RNA reads were polyG-, adapter- and quality-trimmed with the same tools and parameters as outlined for genomic reads. The resulting reads were assembled de novo with the Trinity v2.8.5 assembler (–SS_lib_type FR) (Grabherr et al., 2011). Short reads were subsequently mapped back to the assembly with bowtie2 v2.3.3.1 (Langmead and Salzberg, 2012). Additionally, GMAP v2019-06-10 (default parameters) (Wu and Watanabe, 2005) was used to align the assembled transcriptome to the current genome assembly. Gene Prediction and Annotation Gene prediction was carried out with the MAKER v2.31.10 annotation pipeline (Holt and Yandell, 2011). Providing this pipeline with the previously identified repeat families, the assembled transcript sequences and a set of protein sequences from the 13 published bivalve genomes ( Data Sheet 2 ) (Takeuchi et al., 2012; Zhang et al., 2012; Murgarella et al., 2016; Du et al., 2017; Li et al., 2017; Mun et al., 2017; Sun et al., 2017; Uliano-Silva et al., 2017; Wang et al., 2017; Li et al., 2018; Powell et al., 2018; Renaut et al., 2018; Bai et al., 2019), MAKER identifies repeats and subsequently aligns transcripts and proteins to the genome to produce initial gene models and sequences in its first iteration (est2genome = 1, protein2genome = 1). These gene models were used to train two ab initio gene prediction softwares AUGUSTUS (using BUSCO with –long and metazoa_odb9) (Stanke et al., 2006) and SNAP (-categorize 1000, -export 1000) (Korf, 2004), followed by a second iteration of MAKER incorporating gene models generated from ab initio predictors. Gene models were again retrained with a second iteration of AUGUSTUS and SNAP before MAKER was repeated for a third iteration. Sequences with Annotation Edit Distance (AED) values ≤0.5 were retained. AED values are evidence-based (Eilbeck et al., 2009), whereby a small value suggests a lesser degree of difference between the predicted gene and the protein and/or transcript evidence used during prediction. These protein sequences were further aligned against proteins in UniProtKB (Swiss-Prot and TrEMBL) (Consortium, 2018) for homology using DIAMOND v0.9.24.125 (–max_target_seqs 1, –evalue 1e −10 ) (Buchfink et al., 2014) and searched for protein domains and signatures with InterProScan v5.36-75.0 (-t p) (Jones et al., 2014). Maximum Likelihood and Bayesian Phylogenetic Analyses BUSCO-predicted protein-coding genes were used to construct a multi-gene supermatrix in order to infer the phylogenetic position of L. rhynchaena in relation to bivalve species for which published genomes are available ( Data Sheet 2 ), with the exception of M. galloprovincialis. This was excluded from this analysis since its assembly reported substantially lower BUSCO completeness. The owl limpet (Lottia gigantea), two-spot octopus (Octopus bimaculoides) and fruit fly (Drosophila melanogaster) were included as outgroup species. Each orthologous group was aligned with MAFFT v7.394 (default parameters) (Katoh and Standley, 2013) and trimmed with Gblocks v0.91b (default parameters) (Castresana, 2000), allowing no gaps in the alignments. Only orthologous protein groups with 100% representation (i.e. contains proteins from all 13 bivalves and three outgroups) were incorporated to form the final supermatrix. This supermatrix (22,668 amino acids, 129 orthologous groups), partitioned by proteins (-spp), was supplied to IQ-TREE v1.5.5 (-bb 1000, -alrt 1000, -m TESTMERGE) (Nguyen et al., 2014) for model testing and to infer phylogenetic relationships using maximum likelihood (ML) with nodal support assessed using Ultrafast Bootstrap support (UFBoot) (Minh et al., 2013) and SH-aLRT (Guindon et al., 2010) values. The same supermatrix was supplied to ExaBayes v1.5 (Aberer et al., 2014) to infer a Bayesian tree (BI) from four independent runs of two million generations. After discarding 25% of initial samples as burn-in, convergence was determined when the average standard deviation of split frequencies (asdsf) value fell below 1%. Results and Discussion Genome Size and Sequencing Coverage This study generated a large volume of genomic data made up of 122 Gbp of short paired-end reads (150 bp) and 14 Gbp of long reads (average: 4,960 bp, longest: 58,804 bp, shortest: 200 bp) ( Data Sheet 3 ). In addition, 34 Gbp of transcriptomic reads were generated to facilitate gene prediction. Raw Illumina and Nanopore reads generated in this study are available in the Sequence Read Archive (SRP201027) under BioProject PRJNA548223. The snout otter clam genome size was estimated to range from 545 to 547 Mbp based on 19-, 21- and 25-mers as summarised in Data Sheet 1 , with histograms provided in Data Sheet 4 . Based on these estimates this study generated short and long read data with sequencing depths of over 200× and 25×, respectively. The haploid genome size for the snout otter clam is at the lower end of a large range of estimated genome sizes for the species of the family Mactridae and also within the phylum Bivalvia more generally ( Data Sheet 5 ). Genome and Genes of the Snout Otter Clam The snout otter genome assembly has used the largest volume of ONT data generated for a bivalve species to date and joins the blood clam (S. broughtonii) (Bai et al., 2019) as the only other bivalve genome assembly to incorporate the use of long Nanopore reads and has generated the largest volume of Nanopore data for a bivalve species to date. As the first study to use a combination of only Illumina paired-end reads and Nanopore long reads, it also demonstrates the efficacy of the addition of Nanopore reads as a cost-effective option to achieve a high quality and contiguous genome assembly in combination with Illumina reads. This hybrid assembly approach has generated one of the best bivalve draft genomes currently available consisting of only 1,502 scaffolds (N50 length: 1.84 Mbp) and a total assembly size of 586.5 Mbp ( Table 1 ), slightly exceeding the kmer-based genome size estimates. The assembly displayed a high overall completeness of 95.9%. However, 2.9% of the metazoan BUSCOs detected in this assessment occurred in duplicates within the assembly, likely to be a result of the elevated heterozygosity of the genome, which is not uncommon for marine invertebrate species. These duplicated regions can cause issues in downstream analyses such as variant discovery and therefore were removed from the assembly before its use for other applications. This then produced a final curated assembly of 544 Mbp contained in 622 scaffolds (N50 length: 2.14 Mbp) ( Table 1 ), to which 95.6% of Illumina short reads were successfully aligned using bowtie2 v2.3.3.1 (Langmead and Salzberg, 2012). BUSCO reports a similar overall completeness of 95.8% but with 1.5% BUSCOs detected in duplicates, half of that reported prior to the haplotig purging strategy. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession VIBL00000000 under BioProject PRJNA548223. Table 1 Summary of assembly and annotation of the snout otter clam genome. Lutraria rhynchaena MaSuRCA MaSuRCA + Purge Haplotigs Genome Size Estimation Based on 21-mer counts 546,627,519 bp Genome Assembly Assembly size 586,528,440 bp 543,903,147 bp Number of scaffolds 1,502 622 Scaffold N50 length 1,841,940 bp 2,143,760 bp Average scaffold length 390,498 bp 874,442 bp Longest scaffold 13,022,122 bp 13,022,122 bp Shortest scaffold 1,560 bp 2,218 bp Genome Completeness (Metazoan BUSCOs: 978 total) Complete BUSCOs 95.9% 95.8% Complete and single BUSCOs 93.0% 94.3% Complete and duplicated BUSCOs 2.9% 1.5% Fragmented BUSCOs 0.3% 0.3% Missing BUSCOs 3.8% 3.9% Genome Annotation Number of predicted PCGs – 26,380 Number of annotated PCGs – 23,701 Hits to UniProtKB – 18,595 Hits to protein domains/signatures – 23,114 Average length of PCGs – 379.61 aa Longest PCG – 9,456 aa Variant calling analysis detected a total 4,903,576 SNPs, which translates into a heterozygosity estimate of 0.90%. In contrast, heterozygosity was estimated at 1.60% using the kmer-based approach with GenomeScope ( Data Sheet 1 ). Since variant calling methods are typically more conservative than kmer-based ones, the former approach may have excluded some true positives and, thereby leading to an underestimation of genetic variation. Nevertheless, the heterozygosity rates estimated by both methods (0.90 and 1.60%) is within the range of values that have been reported for other bivalves (0.51 to 2.02%). Repeat modelling and masking of the genome masked 29.4% of the L. rhynchaena assembly. This study also generated 34.6 Gbp of short paired-end, strand specific RNA reads. The resultant transcriptome assembly produced a set of 295,234 transcripts containing 83.9% complete BUSCO genes. A total of 96.4% of these RNA reads were mapped back to this set of assembled transcripts and of these, 79% were aligned to the genome in a splice-aware manner. Both alignment rates indicate that the quality of the set of transcripts is sufficient to improve the gene prediction. Based on hints and evidence from the clam transcripts and protein sequences from other bivalves, gene prediction from the assembled genome resulted in a final set of 26,380 protein-coding genes (AED ≤0.5), 89.8% of which were functionally annotated (i.e. a protein would have at least one associated functional annotation) (Table 1). The assembled transcriptome, predicted protein-coding genes and annotation information are available as Data Sheet 6 and Data Sheet 7 . Comparison Against Other Bivalves A summary of assembly sizes, scaffold N50 lengths, sequencing technologies and other information for all 13 assemblies is available in Data Sheet 2 , with a comparison of repeat content, genome and assembly sizes and BUSCO completeness for all 14 bivalve genomes is visualised in Figure 1 . The assembly for M. galloprovincialis has a high level of missing BUSCO genes (14.8% complete, 29.6% fragmented, 55.6% missing) whereas the V. ellipsiformis assembly contained relatively more fragmented BUSCO genes (22.8%). This potentially points to limitations of the data types used, since the former was assembled with only short Illumina paired-end reads, while the latter had both paired-end (PE) and mate-pair (MP) reads but was assisted by only 0.3× of PacBio long reads. While three other assemblies also had only PE and MP reads, most of the remaining assemblies employed the use of other scaffolding strategies including a greater volume of long PacBio reads, fosmid and bacterial artificial chromosome (BAC) libraries (see Data Sheet 2 for details). Furthermore, there is a large discrepancy in estimated genome size and assembly size for R. philippinarum ( Figure 1 ), where the reported assembly is almost twice the size of the expected genome size. For this assembly, a high proportion of duplicated BUSCO genes was also detected (19.1%), highlighting the potential importance of haplotig purging within bivalve assemblies to remove paralogous scaffolds. Nevertheless, this step should be executed with caution as there is also the potential to “over-purge” and exclude actual parts of the genome. Figure 1 Comparison of Lutraria rhynchaena against other published bivalve assemblies. (A) Comparison of BUSCO completeness, estimated vs assembled genome sizes and repeat content. (B) Evolutionary relationships among bivalve species inferred from maximum likelihood (ML) and Bayesian (BI) methods. Rooted with Drosophila melanogaster, the shown BI topology was inferred from a supermatrix of 22,668 amino acid characters was constructed from the alignments of 129 orthologous protein groups. Closed circles indicate maximum nodal support values (BPP/SH-aLRT/UFBoot). Phylogeny of Bivalvia Based on Nuclear Loci Phylogenies inferred by both methods showed strong UFBoot, SH-aLRT and Bayesian posterior probability (BPP) support for most nodes, the exception being the sister relationship between the bivalve orders Mytilida and Pectinida ( Figure 1 ). The snout otter clam, L. rhynchaena, is the only representative for Mactridae and forms a sister relationship with R. philippinarum from the Veneridae, consistent with their placement in the order Venerida. This initial phylogenomic analysis of the Bivalvia contains species from the three subclasses Imparidentia, Palaeoheterodonta, and Pteriomorpha. All bivalve species are placed within groups consistent with their taxonomic classification at the family and ordinal levels and observed relationships among these three subclasses are consistent with findings from recent studies based on Sanger and transcriptome sequencing (Kocot et al., 2011; González Vanessa et al., 2015; Combosch et al., 2017; Lemer et al., 2019). While these studies include greater taxon sampling from a range of bivalve subclasses, the data matrices used to infer phylogeny are often gappy with a higher level of missing data (∼16 to 75%). In contrast, our study successfully used information from a substantial number of genes (> 100) across across a wide density of bivalve taxa with minimal gaps. Nevertheless, our approach to phylogenomics is limited for the time being by the scarcity in genome level resources for bivalve species generally and specifically in relation to missing representatives from important bivalve subclasses including Protobranchia, Archiheterodonta and Anomalodesmata. This initial phylogenomic analysis for the Bivalvia demonstrates the value and potential of this approach as a greater number of high quality genomic data sets become available that can be used for phylogenetic studies, especially for the resolution of basal relationships in what has been a challenging group (Plazzi et al., 2011; Sharma et al., 2012) Multiple sequence alignment and the resulting Newick tree from our analyses are available as Data Sheet 8. Conclusion We present the first draft genome of the snout otter clam (L. rhynchaena) based on relatively large volumes of short and long genomic reads from Illumina and ONT platforms, respectively, providing sufficient sequencing depth to facilitate the generation of one of the best quality genome assemblies for a bivalve mollusk. The use of long Nanopore reads in a hybrid assembly, presents an effective yet economical approach to achieving a good quality assembly and highlights the importance of long reads in spanning large repeat regions and resolving other complex regions that arises for species with high levels of heterozygosity. This highly contiguous and complete assembly makes this draft genome an important and valuable resource to support ongoing genomic and molecular-based breeding studies for aquaculture. In addition, a transcriptomic data set was generated and assembled to support more refined gene prediction, further adding to the currently scarce transcriptomic resources available for the Mactridae family. Ultimately, we expect results of this study to be used as valuable genomic references for a range of genetic, genomic, phylogenetic and population studies of the snout otter clam and other bivalve species. Data Availability Statement The datasets generated for this study can be found in the NCBI BioProject PRJNA548223 (WGS accession: VIBL00000000, SRA accession: SRP201027). Author Contributions BT and CA conceived the project. TT collected the samples. YL and HG performed the sequencing. MT analyzed the data. MT and CA wrote the manuscript. MT, CA, BT, YL, HG, TT and LC read and reviewed the manuscript. Funding This study was funded through a grant to Binh Thanh Thai from the Government of Vietnam through its Aquaculture Biotechnology programs (01/2017/HÐ-TS-CNSH) and the Deakin Genomics Centre, Deakin University, Australia. Conflict of Interest The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Related collections

Most cited references 27

Record: found
Abstract: found
Article: not found

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Heng Li (2011)

Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of the next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty. We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors. http://samtools.sourceforge.net. hengli@broadinstitute.org.

0 comments Cited 2350 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

Aleksey V. Zimin, Daniela Puiu, Ming-Cheng Luo … (2017)

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads , which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii , a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.

0 comments Cited 188 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

ExaBayes: Massively Parallel Bayesian Tree Inference for the Whole-Genome Era

Andre Aberer, Kassian Kobert, Alexandros Stamatakis (2014)

Modern sequencing technology now allows biologists to collect the entirety of molecular evidence for reconstructing evolutionary trees. We introduce a novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size. Our software introduces a nonblocking parallelization of Metropolis-coupled chains, modifications for efficient analyses of data sets comprising thousands of partitions and memory saving techniques. We report on first experiences with Bayesian inferences at the whole-genome level using the SuperMUC supercomputer and simulated data.

0 comments Cited 184 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Han Ming Gan: URI : https://loop.frontiersin.org/people/59566

Christopher M. Austin: URI : https://loop.frontiersin.org/people/369881

Mun Hua Tan: URI : https://loop.frontiersin.org/people/706837

Journal

Journal ID (nlm-ta): Front Genet

Journal ID (iso-abbrev): Front Genet

Journal ID (publisher-id): Front. Genet.

Title: Frontiers in Genetics

Publisher: Frontiers Media S.A.

ISSN (Electronic): 1664-8021

Publication date (Electronic): 20 November 2019

Publication date Collection: 2019

Volume: 10

Electronic Location Identifier: 1158

Affiliations

[1] ¹Fisheries and Technical Economical College , Tu Son, Vietnam

[2] ²Centre of Integrative Ecology, School of Life and Environmental Sciences, Deakin University , Geelong, VIC, Australia

[3] ³Deakin Genomics Centre, Deakin University , Geelong, VIC, Australia

[4] ⁴Faculty of Biology, Ha Noi National University of Education , Ha Noi, Vietnam

[5] ⁵Science and Technique Department, Hung Vuong University , Viet Tri, Vietnam

Author notes

Edited by: Shikai Liu, Ocean University of China, China

Reviewed by: ShengPing Zhong, Guangxi University of Chinese Medicine, China; Yulin Jin, Emory University, United States

*Correspondence: Mun Hua Tan, mun.tan@ 123456deakin.edu.au

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Article

DOI: 10.3389/fgene.2019.01158

PMC ID: 6880199

PubMed ID: 31824566

SO-VID: 769cc7bc-796b-4c7f-9f16-b8b14b331409

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

History

Date received : 23 August 2019

Date accepted : 22 October 2019

Page count

Figures: 1, Tables: 1, Equations: 0, References: 54, Pages: 8, Words: 3840

Comments

Comment on this article

scite_

Cited by 9

See all cited by

Most referenced authors 1,319

See all reference authors

Whole Genome Assembly of the Snout Otter Clam, Lutraria rhynchaena, Using Nanopore and Illumina Data, Benchmarked Against Bivalve Genome Assemblies

Read this article at

Abstract

Related collections

Genome Engineering using CRISPR

Most cited references 27

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

ExaBayes: Massively Parallel Bayesian Tree Inference for the Whole-Genome Era

Author and article information

Contributors

Journal

Affiliations

Author notes

Article

History

Page count

Categories

Comments

Comment on this article

Similar content 61

Cited by 9

Most referenced authors 1,319