
      Genome Sequencing of Fiber Flax Cultivar Atlant Using Oxford Nanopore and Illumina Platforms



Introduction

Flax (Linum usitatissimum L.) has been grown for seeds and fiber since ancient times (Vaisey-Genser and Morris, 2003). Fiber flax is taller than linseed and branches only in the upper part of the stem. Linseed branches from the middle part of the stem and produces many large seeds (Diederichsen and Richards, 2003). Flax seeds are rich in omega-3 fatty acids and lignans, the health benefits of which have been demonstrated in numerous studies (Caligiuri et al., 2014; Goyal et al., 2014; Kezimana et al., 2018; Parikh et al., 2019). Therefore, linseed is used in the food and pharmaceutical industries, animal feeds, and the production of eco-friendly paints and composites (Singh et al., 2011; Corino et al., 2014; Goyal et al., 2014; Campos et al., 2019; Fombuena et al., 2019).

Flax fibers are hollow tubes that consist mainly of cellulose; their high strength and durability make them suitable for the production of high-quality textiles (Vaisey-Genser and Morris, 2003). Flax fiber has a high absorbent capacity owing to the wicking and movement of moisture along its surface, enabling its use in cloth for hot climates, sails, tents, and rugs (Atton, 1989). However, long fiber can be obtained only from the unbranched part of the flax stem; therefore, despite their high quality, linen fibers have to a large extent been displaced by synthetic fibers (Muir and Westcott, 2003). Nevertheless, awareness of ecological problems has drawn attention to more sustainable materials, and interest in flax fibers is reviving.
Additionally, in the last few years, flax fiber has been actively used as a component of composite materials with good potential for automotive, aerospace, and packaging applications in which high fiber length is not very important (Zhu et al., 2013; Mokhothu and John, 2015; Wu et al., 2016; Dhakal and Sain, 2019; Fombuena et al., 2019; Goudenhooft et al., 2019; Zhang et al., 2020a).

The genome of linseed cultivar CDC Bethune was sequenced on an Illumina platform in 2012, using paired-end and mate-pair libraries. This resulted in an assembly of 302 Mb with scaffold N50 of about 700 kb, contig N50 of ~20 kb, and 81% coverage of the flax genome estimated at 370 Mb (Wang et al., 2012). Chromosome-level assembly for 15 chromosome pairs of CDC Bethune was obtained in 2018, using BioNano genome optical, BAC-based physical, and genetic mapping (You et al., 2018). Scaffold-level genome assemblies of linseed cultivar Longya-10, fiber cultivar Heiya-14, and pale flax were generated in 2020, based on Illumina sequencing, Hi-C technology, and genetic mapping (Zhang et al., 2020b). These results are extremely important for further progress in molecular studies of flax, the development of genome editing, and marker-assisted and genomic selection (Saha et al., 2019; Morello et al., 2020; You and Cloutier, 2020). A high-quality genome can be used as a reference for genome and transcriptome assemblies of different flax cultivars/lines, and the identification of polymorphisms and differences in gene expression within L. usitatissimum genotypes (Dmitriev et al., 2019, 2020b; Guo et al., 2019; Wu et al., 2019). Genome sequences of flax are necessary for the identification of particular gene families or repeat classes in species of the genus Linum and cultivars/lines of L. usitatissimum (Bolsheva et al., 2019; Novakovskiy et al., 2019; Ali et al., 2020; Dmitriev et al., 2020a).
Recent studies have shown that different genotypes of the same crop can diverge greatly at the genome level, not only in terms of SNPs and small indels but also long insertions and deletions, which can be identified by comparing high-quality genome assemblies (Zhang et al., 2019; Song et al., 2020). Next-generation sequencing platforms, such as Illumina, SOLiD, 454, Ion Torrent, and BGISEQ, have enabled the determination of genomic sequences for thousands of plant genotypes using short reads, whereas the development of third-generation sequencing platforms, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), which produce long reads of up to hundreds of thousands of bases, has facilitated accurate genome assembly (Goodwin et al., 2016; Li et al., 2017; Belser et al., 2018; Li and Harkess, 2018). Despite the wide use of third-generation sequencing approaches in studies of plant genomes, we did not find such sequencing data for flax. To fill this gap, we sequenced the genome of fiber flax cultivar Atlant using the ONT and Illumina platforms, combining long but error-prone reads with short high-accuracy reads, which is extremely important for high-quality genome assembly.

Materials and Methods

Plant Material

Fiber flax cultivar Atlant (alias: l. 23-4 Saldo × Mogilevskij) is characterized by high values of the parameters that determine fiber quality, including flexibility, metric number, linear density, and calculated relative breaking load. Additionally, the morphological and anatomical characteristics of this cultivar vary little between optimal and stress conditions, especially unfavorable soil pH (Ryzhov et al., 2012; Rozhmina et al., 2020). These characteristics of cultivar Atlant are important for the reliable production of high-quality fibers that meet the requirements of the textile industry.
Atlant seeds were obtained from the Institute for Flax (Torzhok, Russia), which is the originator of this cultivar. Seeds were sterilized in 1% sodium hypochlorite for 2 min and planted in 20 cm pots with sterile soil. Plants were grown in a climate chamber (Daihan LabTech, South Korea) for 2 weeks, and then leaves were collected from individual plants, frozen in liquid nitrogen, and stored at −80°C until DNA extraction.

DNA Extraction

The DNA extraction method included the homogenization of 0.1 g of leaves from a single plant in liquid nitrogen followed by DNA isolation using a DNA-EXTRAN-3 kit (Synthol, Russia), DNA precipitation with CTAB-containing buffer (1% CTAB, 50 mM Tris-HCl pH 8.0, and 10 mM EDTA), and purification in ion-exchange columns from the Blood and Cell Culture DNA Mini Kit (Qiagen, USA). The DNA concentration and quality were evaluated using a Qubit 2.0 fluorometer (Life Technologies, USA) and NanoDrop 2000C spectrophotometer (Thermo Fisher Scientific, USA). DNA length and the absence of RNA were assessed via electrophoresis in a 0.8% agarose gel.

Genome Sequencing on ONT Platform

Library preparation was performed using an SQK-LSK109 Ligation Sequencing Kit (ONT, UK) for 1D genomic DNA sequencing. Minor modifications were introduced to the basic protocol for library preparation: the incubation time was increased to 20 min at the DNA recovery step and to 60 min at the adaptor ligation step. A MinION (ONT) instrument with an R9.4.1 flow cell (ONT) was used for sequencing.

Genome Sequencing on Illumina Platform

DNA was fragmented on an S220 ultrasonic homogenizer (Covaris, USA), and 1 μg of fragmented DNA was used for library preparation using a NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, UK) with a size selection of adaptor-ligated DNA of ~600–800 bp.
The DNA library concentration and quality were evaluated on a Qubit 2.0 fluorometer (Life Technologies) and 2100 Bioanalyzer (Agilent Technologies, USA), respectively. Sequencing was performed on a HiSeq 2500 instrument (Illumina, USA) with a read length of 250 + 250 bp.

Preliminary Data Analysis

For successful Nanopore sequencing, DNA quality is crucial. We developed a protocol for the isolation of long high-purity DNA from a single flax plant and obtained DNA of ~50 kb with A260/A280 of 1.9 and A260/A230 of 2.0. The DNA concentrations measured with a NanoDrop spectrophotometer (Thermo Fisher Scientific) and Qubit fluorometer (Life Technologies) were similar, which is an important criterion of DNA purity. Sequencing of the obtained DNA on the ONT platform produced 8.4 Gb with a read N50 of 12 kb, corresponding to ~23× coverage of the flax genome. On the Illumina platform, 30× genome coverage was obtained with 22.6 million 250 + 250 bp paired-end reads. The raw data were deposited in the NCBI Sequence Read Archive (SRA) under the BioProject accession number PRJNA648016.

First, the MinION fast5 files were processed using Guppy 3.6.1 (https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revt_14dec2018) with the high-accuracy flip-flop algorithm (dna_r9.4.1_450bps_hac.cfg configuration file). Then, adapter sequences were removed using Porechop (https://github.com/rrwick/Porechop), and low-quality reads (average Q < 6) were filtered out using Trimmomatic 0.32 (Bolger et al., 2014). Illumina reads were also filtered (minimum read length of 50) and trimmed (trailing bases, minimum Q of 28) using Trimmomatic. Genome assemblies based on the Nanopore reads were performed using four assemblers: Canu 2.0 (Koren et al., 2017), Flye 2.7 (Kolmogorov et al., 2019), Shasta 0.5.0 (Shafin et al., 2020), and wtdbg2 2.5 (Ruan and Li, 2020).
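The coverage figures reported in this section can be reproduced with simple arithmetic: depth is the total number of sequenced bases divided by the genome size. The sketch below is only an illustration and rests on two assumptions that the text does not state explicitly: the genome size is taken as the ~370 Mb estimate cited from Wang et al. (2012), and "22.6 million" is read as the number of read pairs (which is the reading consistent with the reported 30×). The function names are hypothetical.

```python
# Back-of-the-envelope sequencing depth: total sequenced bases / genome size.
# Assumption: ~370 Mb flax genome estimate (Wang et al., 2012).
GENOME_SIZE_BP = 370e6


def coverage_from_yield(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Depth of coverage given the total sequencing yield in bases."""
    return total_bases / genome_size


def coverage_from_pairs(n_pairs: float, read_len: int,
                        genome_size: float = GENOME_SIZE_BP) -> float:
    """Depth of coverage for paired-end data (two reads of read_len per pair)."""
    return n_pairs * 2 * read_len / genome_size


# ONT: 8.4 Gb of reads, as reported in the text.
ont_cov = coverage_from_yield(8.4e9)

# Illumina: 22.6 million pairs of 250 + 250 bp reads (pairs is an assumption).
illumina_cov = coverage_from_pairs(22.6e6, 250)

print(f"ONT coverage:      ~{ont_cov:.1f}x")
print(f"Illumina coverage: ~{illumina_cov:.1f}x")
```

Under these assumptions the results land close to the reported ~23× and 30×; a different genome size estimate would shift both values proportionally.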
The default parameters were used, except for the minimum read length for Shasta (set to 3,000 bp) and the expected genome size for Flye and wtdbg2 (set to 400 Mb). The statistics for the genome assemblies were calculated using QUAST 5.0.2 (Gurevich et al., 2013) and are presented in Table 1. Canu produced the longest assembly (361.7 Mb for contigs and 393.9 Mb for unitigs, i.e., high-confidence contigs) and the largest contig (5 Mb), and was among the best in most Nx and Lx statistics. The highest N50 value, 365 kb, was obtained using wtdbg2; however, its total assembly length was only 212.2 Mb, little more than half the estimated size of the flax genome (Wang et al., 2012). Canu contigs ranked second in N50 (350 kb), whereas Canu unitigs reached 225 kb.

Table 1. QUAST statistics for genome assemblies of flax cultivar Atlant.

| Feature | Canu 2.0 contigs | Canu 2.0 unitigs | Flye 2.7 | Shasta 0.5.0 | wtdbg2 2.5 |
|---|---|---|---|---|---|
| Total assembly length, Mb | 361.7 | 393.9 | 346.1 | 290.8 | 212.2 |
| Number of contigs | 2458 | 4361 | 8278 | 5641 | 2306 |
| Largest contig, Mb | 5.0 | 1.7 | 3.3 | 2.0 | 3.0 |
| GC, % | 38.94 | 38.93 | 38.98 | 39.06 | 39.03 |
| N50, kb | 350 | 225 | 191 | 295 | 365 |
| NG50, kb | 412 | 286 | 222 | 264 | 117 |
| N75, kb | 154 | 84 | 51 | 142 | 112 |
| NG75, kb | 220 | 159 | 84 | 106 | – |
| L50 | 261 | 472 | 368 | 278 | 141 |
| LG50 | 201 | 319 | 295 | 323 | 397 |
| L75 | 648 | 1191 | 1205 | 634 | 407 |
| LG75 | 463 | 685 | 871 | 789 | – |

NG50/NG75 is the maximum length for which the subset of contigs of that length or longer covers at least 50%/75% of the reference genome (cultivar CDC Bethune, GenBank: GCA_000224295.2). LG50/LG75 is the number of contigs with a length equal to or greater than NG50/NG75, that is, the minimal number of contigs that cover 50%/75% of the reference genome. Unitigs are high-confidence contigs, according to Canu terminology.

The misassembly rates between our assemblies and the NCBI representative genome for L. usitatissimum (cultivar CDC Bethune, GenBank: GCA_000224295.2) were evaluated using QUAST (Supplementary Data 1).
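The Nx/Lx and NGx/LGx statistics defined in the notes to Table 1 can be illustrated with a short sketch. This is not QUAST's code; it is a minimal re-implementation of the definitions given above, and the function name `nx_lx` and the toy contig lengths are hypothetical. Passing a reference size switches from N/L to NG/LG.

```python
from typing import List, Optional, Tuple


def nx_lx(lengths: List[int], fraction: float = 0.5,
          reference_size: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return (Nx, Lx), or (NGx, LGx) when reference_size is given.

    Contigs are taken from longest to shortest until their cumulative
    length reaches `fraction` of the assembly length (Nx/Lx) or of the
    reference genome length (NGx/LGx). Returns None when the assembly
    is too short to reach the target.
    """
    base = reference_size if reference_size is not None else sum(lengths)
    target = fraction * base
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None


# Toy assembly (arbitrary units), total length 300.
contigs = [80, 70, 50, 40, 30, 20, 10]

print(nx_lx(contigs))                                     # N50, L50
print(nx_lx(contigs, fraction=0.75))                      # N75, L75
print(nx_lx(contigs, reference_size=450))                 # NG50, LG50
print(nx_lx(contigs, fraction=0.75, reference_size=450))  # NG75, LG75
```

In the toy example the last call returns None because the assembly is shorter than 75% of the assumed reference. This is presumably why Table 1 shows "–" for the NG75 and LG75 of the 212.2 Mb wtdbg2 assembly, and it also shows how an assembly can have a high N50 yet a low NG50.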
Canu gave the best coverage of the reference genome (~95% for both contigs and unitigs) and the largest alignment (662 kb for both contigs and unitigs), and was second only to Shasta in one of the key parameters, NA50, which is an analog of N50 for fragments successfully mapped to the reference. In terms of the rate of misassemblies larger than 1 kb and the duplication ratio, Canu ranked only third after Shasta and wtdbg2; however, wtdbg2 showed very low coverage of the reference genome (only 63.25%). It should be taken into account that we compared the Atlant assemblies with the genome of another cultivar; therefore, it is natural that some apparent under- and misassemblies are present. These statistics served primarily to compare the Atlant assemblies produced by the different tools.

Thereafter, to improve contig accuracy, the assemblies were polished using Nanopore reads with Racon 1.4.3 (Vaser et al., 2017) and/or Medaka 1.0.3 (https://github.com/nanoporetech/medaka) and using Illumina reads with the POLCA tool from the MaSuRCA 3.3.9 assembler (Zimin et al., 2017). Assembly completeness was evaluated as the content of universal single-copy genes inherent to land plants using BUSCO v4 with the embryophyta_odb10 dataset (Seppey et al., 2019). The results are presented in Figure 1. Before polishing, the best results were obtained for Canu unitigs (93.74%), Canu contigs (93.62%), and Flye contigs (93.56%), whereas the worst result was obtained for contigs assembled by wtdbg2 (59.73%). Polishing was most effective with the combination Racon + Medaka + POLCA, which improved the completeness of the assembly from 93.62 to 97.40% (Canu contigs). This result was the best among all assembler–polisher combinations.
Taken together, the Nx and BUSCO statistics and the misassembly rates suggested that the Canu genome assembly of flax cultivar Atlant polished according to the Racon + Medaka + POLCA scheme was the best, and it was used for further genome annotation.

Figure 1. BUSCO assessment results for genome assemblies of flax cultivar Atlant. Results for the following assemblers are presented: Canu 2.0 contigs, Canu 2.0 unitigs, Flye 2.7, Shasta 0.5.0, and wtdbg2 2.5 coupled with Racon 1.4.3, Medaka 1.0.3, and/or POLCA from MaSuRCA 3.3.9 polishers.

The large percentage of duplicated BUSCOs (68% for the polished Canu assembly) is noteworthy. This agrees well with the hypothesis that L. usitatissimum originated from the hybridization of two diploid Linum species, each of which contributed a complete set of chromosomes (Bolsheva et al., 2017).

In the NCBI genome database, assemblies of only three L. usitatissimum genomes are available: linseed cultivar CDC Bethune (representative genome, chromosome level, GenBank: GCA_000224295.2), linseed cultivar Longya-10 (scaffold level, GenBank: GCA_010665275.1), and fiber flax cultivar Heiya-14 (scaffold level, GenBank: GCA_010665265.1). No annotations have been submitted for any of the three genomes, which complicates the use of these data in studies of flax. In the present study, we annotated the assembled genome of fiber flax cultivar Atlant using the funannotate 1.8.0 pipeline (https://funannotate.readthedocs.io/en/latest/). Immediately before annotation, repeat masking was performed with TANTAN (http://cbrc3.cbrc.jp/~martin/tantan/). Approximately 7.6% of the genomic sequence was masked. For the annotation, we used our previously obtained transcriptome sequencing data for five different tissues of cultivar Atlant (NCBI SRA: SRX8380594, shoots of seedlings; SRX8380593, roots of seedlings; SRX8380592, flowers of adult plants; SRX8380591, stems of adult plants; and SRX8380590, leaves of adult plants).
To perform genome-guided transcriptome assembly, we mapped the RNA-Seq reads to the assembled genome using HISAT2 2.2.0 (Kim et al., 2019). About 96% of reads (54.0M of 56.2M) were successfully mapped. A total of 82,290 transcripts corresponding to 69,143 genomic loci were assembled using Trinity 2.8.5 in genome-guided mode. Based on the transcript data and mapped RNA-Seq reads, a total of 77,522 gene models were predicted using PASA 2.4.1, Augustus 3.3.3, GlimmerHMM 3.0.4, SNAP v. 2006-07-28, GeneMark 4.61, and CodingQuarry 2.0 (the results were combined and analyzed using EvidenceModeller 1.1.1). Of these, 1,182 were identified as tRNA genes. In total, 18,946 gene models were successfully annotated using the Pfam database (as of June 2020), 19,741 using eggNOG (as of June 2020), 953 using the BUSCO embryophyta_odb10 dataset, and 3,725 using UniProt (as of June 2020). The summary statistics of the functional annotation of predicted genes are presented in Supplementary Data 2. The assembled genome was deposited in the NCBI database under the BioProject accession number PRJNA648016.

Conclusions

In this study, the genome of fiber flax cultivar Atlant was sequenced for the first time, using both Oxford Nanopore and Illumina platforms. For successful Nanopore sequencing, a protocol for the extraction of pure high-molecular-weight DNA from the leaves of a single flax plant was developed. Sequencing of this DNA on the ONT platform resulted in 23× flax genome coverage (8.4 Gb, N50 = 12 kb). On the Illumina platform, 30× genome coverage was obtained (22.6 million 250 + 250 bp paired-end reads). Genome assemblies were performed using Canu, Flye, Shasta, and wtdbg2. Subsequent polishing with Racon, Medaka, and POLCA was used to improve contig accuracy. The most complete and accurate assembly was achieved by Canu with the polishing scheme Racon + Medaka + POLCA: total length = 361.7 Mb, N50 = 350 kb, and 97.40% completeness according to BUSCO.
The genome was annotated using the funannotate pipeline and our transcriptome sequencing data for five different tissues of cultivar Atlant. The results are useful for evaluating L. usitatissimum polymorphism at the genome level, identifying sequences specific to fiber flax, providing a reference for studies of fiber flax cultivars, and developing flax genomic selection and genome editing. The data can also be used to analyze flax DNA methylation at the whole-genome level, as information on this DNA modification can be derived from Nanopore reads.

Data Availability Statement

The raw sequencing data and the assembled genome are deposited in the NCBI database under the BioProject accession number PRJNA648016.

Author Contributions

AD, TR, and NM conceived and designed the work. EP, RN, AB, TR, NB, LP, ED, PK, AS, and NM performed the experiments. AD, EP, TR, AZ, OM, AK, GK, and NM analyzed the data. AD, EP, TR, GK, and NM wrote the manuscript. All authors read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

          Most cited references 51

          Trimmomatic: a flexible trimmer for Illumina sequence data

          Motivation: Although many next-generation sequencing (NGS) read preprocessing tools already existed, we could not find any tool or combination of tools that met our requirements in terms of flexibility, correct handling of paired-end data and high performance. We have developed Trimmomatic as a more flexible and efficient preprocessing tool, which could correctly handle paired-end data. Results: The value of NGS read preprocessing is demonstrated for both reference-based and reference-free tasks. Trimmomatic is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested. Availability and implementation: Trimmomatic is licensed under GPL V3. It is cross-platform (Java 1.5+ required) and available at http://www.usadellab.org/cms/index.php?page=trimmomatic Contact: usadel@bio1.rwth-aachen.de Supplementary information: Supplementary data are available at Bioinformatics online.
            Canu: scalable and accurate long-read assembly via adaptive k -mer weighting and repeat separation

              QUAST: quality assessment tool for genome assemblies.

              Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST-a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website. http://bioinf.spbau.ru/quast . Supplementary data are available at Bioinformatics online.

                Author and article information

                Front Genet
                Frontiers in Genetics
                Frontiers Media S.A.
                14 January 2021
                Volume: 11
                1 Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
                2 Federal Research Center for Bast Fiber Crops, Torzhok, Russia
                3 All-Russian Horticultural Institute for Breeding, Agrotechnology and Nursery, Moscow, Russia
                4 Moscow Institute of Physics and Technology, Moscow, Russia
                5 Peoples' Friendship University of Russia (RUDN University), Moscow, Russia
                Author notes

                Edited by: Liwu Zhang, Fujian Agriculture and Forestry University, China

                Reviewed by: Kishor Gaikwad, Indian Council of Agricultural Research (ICAR), India; Rajesh Kumar Gazara, Indian Institute of Technology Roorkee, India

                *Correspondence: Nataliya V. Melnikova, mnv-4529264@yandex.ru

                This article was submitted to Plant Genomics, a section of the journal Frontiers in Genetics

                †These authors have contributed equally to this work

                Copyright © 2021 Dmitriev, Pushkova, Novakovskiy, Beniaminov, Rozhmina, Zhuchenko, Bolsheva, Muravenko, Povkhova, Dvorianinova, Kezimana, Snezhkina, Kudryavtseva, Krasnov and Melnikova.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                Page count
                Figures: 1, Tables: 1, Equations: 0, References: 51, Pages: 7, Words: 4843
                Funded by: Russian Science Foundation 10.13039/501100006769
                Award ID: 16-16-00114
                Funded by: Ministry of Science and Higher Education of the Russian Federation 10.13039/501100012190
                Award ID: 075-00853-19-00
                Data Report

