TeaGVD: A comprehensive database of genomic variations for uncovering the genetic architecture of metabolic traits in tea plants

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Introduction Tea plant (Camellia sinensis (L.) O. Kuntze) is one of the most important nonalcoholic beverage crops. As a result of distinctive metabolites beneficial to health, tea plant is also widely used to uncover the molecular mechanisms underlying the synthesis of specific metabolites, such as catechins and caffeine (Jin et al., 2017; Jiang et al., 2019; Zhu et al., 2021). The rapid development of high-throughput sequencing technologies has led to an exponential increase in the volume of biological sequence data of tea plants over the past decade, providing valuable insights into the diversity and evolution of tea germplasms and the mechanism of important metabolites and agronomic traits in tea plant. Tea genome is large (~3 Gb) and complex, harboring a large number of repetitive sequences and high heterozygosity due to its self-incompatibility. Recently, the completion and availability of genome assemblies of tea plant have accelerated the investigations of evolutionary dynamics of whole-genome duplication, tandem duplication, and long terminal repeat retrotransposons that resulted in the diversification of tea germplasms (Wang et al., 2020; Xia et al., 2020; Chen et al., 2020; Zhang et al., 2020a; Zhang et al., 2020c; Wang et al., 2021; Zhang et al., 2021). Meanwhile, large-scale resequencing and RNA-seq projects of tea germplasms have been performed and enabled novel insights into the diversity, evolution and domestication in tea germplasms (Wang et al., 2020; Xia et al., 2020; Yu et al., 2020; Zhang et al., 2020c; Zhang et al., 2021). Genome-wide linkage study and genome-wide association study (GWAS) have revealed numerous sites and genes controlling relevant agronomical traits of tea plant, such as leaf traits (An et al., 2021; Lu et al., 2021) and metabolites (Zhang et al., 2020b), which provide an important foundation for further decoding the molecular mechanism of traits in tea plant. However, the lack of a standardized data processing and visualizing platform hinders the availability of such data. The construction of a user-friendly web-based platform for big data deposition, integration, accession and visualization has become a crucial issue for maximizing these valuable sequence data. Recently, several specialized web-based databases have been developed for the storage and utilization of biological sequence data in tea plant, such as TPIA (Xia et al., 2019), TeaPGDB (Lei et al., 2021), TeaCoN (Zhang et al., 2020b), and TeaAS (Mi et al., 2021). However, these databases did not comprehensively integrate a large-scale genomic variation of various tea genetic resources and genotype-to-phenotype associations (G2Ps) for understanding the complex traits in tea plants, hindering the availability of big omics data. Here, we collected and identified more than 70 million genomic variations and 17,974 high-quality G2Ps for 464 tea metabolites. A comprehensive and user-friendly database of genomic variations for tea plants, TeaGVD (http://www.teaplant.top/teagvd), has been developed for storage, retrieval, visualization and utilization of these data, which will facilitate understanding of the genetic architecture of metabolic and agronomic traits, molecular assistant breeding, and molecular design breeding in tea plants. Materials and methods Data sources Currently, the raw reads of whole-genome sequencing (WGS), GBS data and RNA-seq data from eight datasets of tea germplasms comprising 1,229 accessions were collected ( Table S1 ). All the species and varieties in Camellia L. Sect. Thea (L.) Dyer were covered, including C. sinensis (L.) O. Kuntze var. sinensis, var. assamica (Masters) Kitamura, var. pubilimba Chang, C. taliensis (W. W. Smith) Melchior, C. tachangensis F. C. Zhang, C. crassicolumna Chang, and C. gymnogyna Chang (Chen et al., 2000). Four datasets of WGS germplasms representing genetic diversity and improvement of tea plants were downloaded from NCBI with BioProject accession numbers PRJNA646044, PRJNA597714, PRJNA665594, and PRJNA716079 (Wang et al., 2020; Xia et al., 2020; Lu et al., 2021; Zhang et al., 2021). GBS data were downloaded from the Genome Sequence Archive in National Genomics Data Center, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences with CRA001438 (Niu et al., 2020). Other datasets were RNA-seq data and downloaded from PRJNA595795 and PRJNA562973 with 217 and 136 tea accessions, respectively (Yu et al., 2020; Zhang et al., 2020c). In addition, GA and eight catechin compounds in three leaf samples of 176 tea accessions (Zhang et al., 2020c) and 437 annotated metabolites detected by UPLC-QTOF MS of 136 tea accessions (Yu et al., 2020) were integrated into the database. Because a high-quality chromosome-level genome assembly is basis for identification of genomic variations and genome-wide association analysis, the reference genome (C. sinensis var. sinensis ‘Shuchazao’), functional annotation and gene expression were downloaded from the Tea Plant Information Archive (http://tpdb.shengxin.ren/) (Xia et al., 2019). Two previously published draft genomes of C. sinensis var. sinensis ‘Shuchazao’ and C. sinensis var. assamica ‘Yunkang 10’ have widely applied in genetic and functional studies in tea plants (Xia et al., 2017; Wei et al., 2018). For users’ convenience, a total of 31,780 orthologous gene sets were identified for the three tea genome assemblies by using BLASTP (Altschul et al., 1997) based on the bidirectional best hit (BBH) method ( Table S2 ). Data processing To identify the genomic variation of tea germplasms accurately, the raw reads were trimmed by Sickle (https://github.com/najoshi/sickle) with default parameters to remove low-quality sequences. In WGS germplasms, the trimmed reads were aligned to the tea pant reference genome using Burrows Wheeler Aligner (BWA) (Li and Durbin, 2009) and PCR duplicates were filtered by Sambamba (Tarasov et al., 2015) with parameters “–overflow-list-size 1000000 –hash-table-size 1000000”. After filtering low-quality alignments, SNP and InDel were identified by SAMtools (Li et al., 2009) and FreeBayes (Garrison and Marth, 2012). In GBS germplasms, the trimmed reads were aligned to the tea pant reference genome using BWA (Li et al., 2009) and SNP and InDel were identified by HaplotypeCaller of GATK with parameters “–minimum-mapping-quality 30 -ERC GVCF –dont-use-soft-clipped-bases” (McKenna et al., 2010). In RNA-seq germplasms, the trimmed RNA-seq reads were mapped to the reference genome using HISAT2 with default parameters (Kim et al., 2019). PCR duplicates were removed by Picard (https://broadinstitute.github.io/picard). SNP and InDel calling was performed by HaplotypeCaller of GATK (McKenna et al., 2010). These SNPs and InDels were further filtered by VCFtools with parameters “–max-missing 0.5 –minQ 30 –maf 0.05” (Danecek et al., 2011). The identified genomic variations were annotated by SnpEff (Cingolani et al., 2012), ANNOVAR (Wang et al., 2010) and VEP (McLaren et al., 2016) based on the gene annotation file of the tea plant genome with default parameters. To explore the genetic diversity of tea germplasms, the SNP density, nucleotide diversity (θπ), and Tajima’s D statistics of 461 WGS germplasms were calculated by VCFtools (Danecek et al., 2011). In addition, GWAS was performed with EMMAX (Kang et al., 2010) and GAPIT (Wang and Zhang, 2021) with GLM, MLM, CMLM and FarmCPU model to find genetic variations or genes associated with a particular metabolic trait. The threshold of significant candidate loci (lead SNPs) was determined by GEC software (Li et al., 2012). The LD Score regression intercept and heritability were estimated by LDSC software (https://github.com/bulik/ldsc). Implementation The interactive web interface of TeaGVD was built based on Flask, a lightweight Python Web framework (https://palletsprojects.com/p/flask/), and it integrated all pre-processed data. The frontend pages were developed and visualized by HTML5, CSS5, jQuery, Bootstrap (https://getbootstrap.com/), ECharts (https://echarts.apache.org/), and Bokeh (https://bokeh.org/). The BLAST tool was implemented using SequenceServer (Priyam et al., 2019). In the PCR primer design tool, Primer3 (Koressaar and Remm, 2007) was used to pick PCR primers based on the reference genome with customization. Database contents To take advantages of omics data in tea plant, the sequencing data of 1,229 accessions of tea germplasms were collected and analyzed using a standardized pipeline. In total, more than 70 million genomic variations (SNPs and InDels) were identified from the sequencing data ( Table 1 ). The missing rate and level of heterozygosity were 20.27% and 16.73%, respectively. Among these, 6,193,642, 30,938, and 944,449 genomic variations were present in or around gene regions (e.g., exon, intron, upstream, and downstream), accounting for 8.74%, 17.66% and 77.15% of these in WGS, GBS, and RNA-seq data, respectively. In addition, 17,974 high-quality G2Ps for 464 tea metabolites have been identified by GWAS. To facilitate the exploration of these data, we developed a comprehensive and user-friendly database of genomic variations in tea plants (TeaGVD) that was built and organized into three functional modules for various data types and applications, including Genotype, Phenotype, and Tools modules ( Figure 1A ). These modules provide user-friendly web interfaces to retrieve and visualize genomic variations and their related information. In the Genotype module, users can retrieve available SNP/InDel information by multiple search strategies with filter parameters. Moreover, TeaGVD can figure out the polymorphic SNPs/InDels between two or more germplasms rapidly by comparison of varieties, which is convenient to develop molecular markers. In the Phenotype module, TeaGVD shows the detailed trait values, value distribution, and GWAS results for each available metabolite. Users also can further explore candidate genes and functional markers associated with the metabolite of interest by the Candidate Region and Lead SNP Genotype submodules, respectively. To better utilize these data, the BLAST, Extract Sequence, Primer Design, and Population Genetic Analysis (SNP density, nucleotide diversity, and Tajima’s D statistics) tools were established in the Tools module. Table 1 Statistics of genomic variations and genotype-to-phenotype associations for metabolites in tea plants. WGS Germplasms GBS Germplasms RNA-seq Germplasms Metabolites SNP InDel SNP InDel SNP InDel G2Ps Chr1 5,222,352 145,700 15,105 638 90,750 5,014 1,760 Chr2 4,969,532 138,453 12,520 585 90,360 5,013 1,558 Chr3 4,482,574 123,313 10,230 436 84,345 3,860 1,427 Chr4 4,646,325 126,954 12,721 557 76,367 4,124 1,464 Chr5 4,673,005 121,241 10,392 450 64,876 3,090 1,634 Chr6 4,086,500 117,347 10,738 513 75,703 4,050 1,754 Chr7 4,473,987 118,746 10,999 473 75,863 3,838 896 Chr8 4,022,025 100,780 10,857 447 49,403 2,284 757 Chr9 3,869,937 105,272 10,179 462 70,129 3,616 966 Chr10 4,006,236 106,247 9,073 394 58,059 2,810 688 Chr11 2,921,987 82,127 8,403 394 59,777 3,451 1,209 Chr12 3,814,285 101,332 9,220 389 48,865 2,520 674 Chr13 3,139,356 87,970 8,158 378 57,964 3,112 762 Chr14 2,985,432 85,252 8,134 381 54,914 2,925 1,310 Chr15 2,811,586 77,317 6,736 280 45,377 2,411 1,115 UN 8,859,214 246,973 14,293 648 162,059 7,287 – Total 68,984,333 1,885,024 167,758 7,425 1,164,811 59,405 17,974 Figure 1 The schematic, screenshots of representative resources in the TeaGVD. (A) The schematic of TeaGVD. (B) The distribution of trait value and detailed value in the available tea accessions. (C) The Manhattan plot, QQ plot, and significant candidate loci of GWAS for EC-GC dimer isomer 4. (D) The distribution, function, and expression of candidate genes in the candidate region (Chr1:190796254-191238806) that identified by GWAS. (E) The detail gene information and variation map of the candidate gene CSS0005646. (F) The genotype distribution of the significant candidate loci TEAV1S01r00071715. (G) Boxplot and statistic analysis of different genotypes in TEAV1S01r00071715 for EC-GC dimer isomer 4. Use cases These data and tools will facilitate understanding of the genetic architecture of metabolic traits and molecular breeding in tea plants. We take EC-GC dimer isomer 4 under NEG mode as an example. Histogram plot of value distribution and table of detail value for each tea germplasm are shown by selecting the corresponding trait in Trait Search ( Figure 1B ). GWAS results present the Multiple GWAS comparison, GWAS Manhattan plot, QQ plot, LDSC analysis, and significant candidate loci (lead SNP) associated with EC-GC dimer isomer 4, which can be dynamically visualized by clicking on given SNP/InDel links to various detailed information pages of variation ( Figure 1C ). On the basis of the GWAS results, we specified genomic coordinate (Chr1:190796254-191238806) in Candidate Region and identified 12 genes in the genomic region. The gene distribution, functional annotation, and expression of these genes are displayed in the web interface ( Figure 1D ). The given gene links direct users to gene detailed information interface, which includes a visualized variation map around the gene, basic gene information, gene annotation (GO, KEGG, and Pfam), and gene expression of eight tissues ( Figure 1E ). Among these, CSS0005646 (also known as CsMYB111) has been reported to be associated with anthocyanin, catechin, and flavanol biosynthesis (Li et al., 2022). In addition, TEAV1S01r00071715 significantly associated with EC-GC dimer isomer 4 (P-value < 2.16e-12) was identified by GWAS. Comparisons of different genotypes in TEAV1S01r00071715 showed that the content of EC-GC dimer isomer 4 of genotypes AA and AG was significantly higher than that of genotype GG (P-value < 0.01, two-sided Wilcoxon test; Figure 1G ) by lead SNP genotyping. We also found that genotype AA was only present in Yunnan, China, which was the center of origin for tea plants ( Figure 1F ). Data availability statement The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary material and http://www.teaplant.top/teagvd. Author contributions LC, M-ZY, J-DC, and W-ZH conceived and designed the study. J-DC, SC, Q-YC, J-QM, J-QJ and C-LM. performed the data analysis and web design. J-DC, W-ZH, LC, M-ZY, D-GM and SE prepared the manuscript. All authors contributed to the article and approved the submitted version. Funding This research was supported by the National Key Research and Development Program of China (2021YFD1200203), the Zhejiang Provincial Natural Science Foundation of China (LQ20C160010), the Zhejiang Science and Technology Major Program on Agricultural New Variety Breeding-Tea Plant (2021C02067) and the Fundamental Research Fund for Tea Research Institute of the Chinese Academy of Agricultural Sciences (1610212022009). Conflict of interest The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Publisher’s note All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Related collections

Most cited references 37

Record: found
Abstract: found
Article: found

Is Open Access

The Sequence Alignment/Map format and SAMtools

Heng Li, Bob Handsaker, Alec Wysoker … (2009)

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: rd@sanger.ac.uk

0 comments Cited 13782 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li, Richard Durbin (2009)

Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ∼10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: rd@sanger.ac.uk

0 comments Cited 10293 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Aaron McKenna, Matthew Hanna, Eric R. Banks … (2010)

Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

0 comments Cited 5525 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Jie-Dan Chen: URI : https://loop.frontiersin.org/people/1875967

Si Chen: URI : https://loop.frontiersin.org/people/2101041

Chun-Lei Ma: URI : https://loop.frontiersin.org/people/445333

Sezai Ercisli: URI : https://loop.frontiersin.org/people/739042

Ming-Zhe Yao: URI : https://loop.frontiersin.org/people/2091007

Liang Chen: URI : https://loop.frontiersin.org/people/369371

Journal

Journal ID (nlm-ta): Front Plant Sci

Journal ID (iso-abbrev): Front Plant Sci

Journal ID (publisher-id): Front. Plant Sci.

Title: Frontiers in Plant Science

Publisher: Frontiers Media S.A.

ISSN (Electronic): 1664-462X

Publication date (Electronic): 28 November 2022

Publication date Collection: 2022

Volume: 13

Electronic Location Identifier: 1056891

Affiliations

[1] ¹ National Center for Tea Improvement, Tea Research Institute of the Chinese Academy of Agricultural Science , Hangzhou, China

[2] ² Tea Research Institute, Lishui Academy of Agricultural and Forestry Sciences , Lishui, China

[3] ³ Research Institute of Climate Change and Agriculture, National Institute of Horticultural and Herbal Science , Jeju, South Korea

[4] ⁴ Department of Horticulture, Faculty of Agriculture, Ataturk University , Erzurum, Turkey

Author notes

Edited by: Fei Shen, Beijing Academy of Agricultural and Forestry Sciences, China

Reviewed by: Tangchun Zheng, Beijing Forestry University, China; Kai Fan, Fujian Agriculture and Forestry University, China

*Correspondence: Liang Chen, liangchen@ 123456tricaas.com ; Ming-Zhe Yao, yaomz@ 123456tricaas.com

†These authors have contributed equally to this work

This article was submitted to Functional and Applied Plant Genomics, a section of the journal Frontiers in Plant Science

Article

DOI: 10.3389/fpls.2022.1056891

PMC ID: 9742251

PubMed ID: 36518520

SO-VID: 1009927f-77f0-4383-bbff-4df6a1174a82

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

History

Date received : 29 September 2022

Date accepted : 08 November 2022

Page count

Figures: 1, Tables: 1, Equations: 0, References: 37, Pages: 6, Words: 2537

Comments

Comment on this article

scite_

Cited by 5

See all cited by

Most referenced authors 1,624

See all reference authors

TeaGVD: A comprehensive database of genomic variations for uncovering the genetic architecture of metabolic traits in tea plants

Read this article at

Abstract

Related collections

Plant MYBs

Most cited references 37

The Sequence Alignment/Map format and SAMtools

Fast and accurate short read alignment with Burrows–Wheeler transform

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Author and article information

Contributors

Journal

Affiliations

Author notes

Article

History

Page count

Categories

Comments

Comment on this article

Similar content 453

Cited by 5

Most referenced authors 1,624