• Record: found
  • Abstract: found
  • Article: found
Is Open Access

A haplotype-resolved draft genome of the European sardine (Sardina pilchardus)

Read this article at

      There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.



      The European sardine ( Sardina pilchardus Walbaum, 1792) is culturally and economically important throughout its distribution. Monitoring studies of sardine populations report an alarming decrease in stocks due to overfishing and environmental change, which has resulted in historically low captures along the Iberian Atlantic coast. Important biological and ecological features such as population diversity, structure, and migratory patterns can be addressed with the development and use of genomics resources.


      The genome of a single female individual was sequenced using Illumina HiSeq X Ten 10x Genomics linked reads, generating 113.8 gigabase pairs of data. Three draft genomes were assembled: 2 haploid genomes with a total size of 935 megabase pairs (N50 103 kilobase pairs) each, and a consensus genome of total size 950 megabase pairs (N50 97 kilobase pairs). The genome completeness assessment captured 84% of Actinopterygii Benchmarking Universal Single-Copy Orthologs. To obtain a more complete analysis, the transcriptomes of 11 tissues were sequenced to aid the functional annotation of the genome, resulting in 40,777 genes predicted. Variant calling on nearly half of the haplotype genome resulted in the identification of >2.3 million phased single-nucleotide polymorphisms with heterozygous loci.


      A draft genome was obtained, despite a high level of sequence repeats and heterozygosity, which are expected genome characteristics of a wild sardine. The reference sardine genome and respective variant data will be a cornerstone resource of ongoing population genomics studies to be integrated into future sardine stock assessment modelling to better manage this valuable resource.

      Related collections

      Most cited references 54

      • Record: found
      • Abstract: found
      • Article: not found

      The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

      Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
        • Record: found
        • Abstract: found
        • Article: found
        Is Open Access

        RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies

        Motivation: Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate the increasingly growing input datasets and to serve the needs of the user community. Results: I present some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code and a plethora of operations for conducting post-analyses on sets of trees. In addition, an up-to-date 50-page user manual covering all new RAxML options is available. Availability and implementation: The code is available under GNU GPL at Contact: Supplementary information: Supplementary data are available at Bioinformatics online.
          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          BLAST+: architecture and applications

          Background Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. Results We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. Conclusion The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.

            Author and article information

            [1 ]CCMAR Centre of Marine Sciences, University of Algarve, Campus de Gambelas, 8005–139 Faro, Portugal
            [2 ]CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, InBIO, Laboratório Associado, Universidade do Porto, Vairão, Portugal
            Author notes
            Correspondence address. Adelino V. M. Canário, CCMAR Centre of Marine Sciences, University of Algarve, Campus de Gambelas, 8005-139 Faro, Portugal E-mail: acanario@

            Authors contributed equally.

            Oxford University Press
            May 2019
            21 May 2019
            21 May 2019
            : 8
            : 5
            31112613 6528745 10.1093/gigascience/giz059 giz059
            © The Author(s) 2019. Published by Oxford University Press.

            This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

            Pages: 8
            Funded by: Foundation for Science and Technology 10.13039/501100001871
            Award ID: UID/Multi/04326/2016
            Funded by: European Regional Development Fund 10.13039/501100008530
            Award ID: 22153-01/SAICT/2016
            Funded by: National Infrastruture of Distributed Computing of Portugal
            Award ID: ALG-01-0145-FEDER-022121
            Award ID: ALG-01-0145-FEDER-022231
            Award ID: MAR2020
            Funded by: European Maritime and Fisheries Fund
            Award ID: MAR-01.04.02-FEAMP-0024
            Funded by: Horizon 2020 10.13039/100010661
            Award ID: 654008
            Data Note


            Comment on this article