
      Finding the missing honey bee genes: lessons learned from a genome upgrade

      BMC Genomics
      BioMed Central
      Apis mellifera, GC content, Gene annotation, Gene prediction, Genome assembly, Genome improvement, Genome sequencing, Repetitive DNA, Transcriptome


          Abstract

          Background

          The first generation of genome sequence assemblies and annotations has had a significant impact on our understanding of the biology of the sequenced species, the phylogenetic relationships among species, and the study of populations within and across species, and has informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution, which affected the quality of the assembly in some regions, and for having fewer genes in the initial gene set (OGSv1.0) than would be expected based on other sequenced insect genomes.

          Results

          Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete, and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remainder were inferred based on new RNAseq and protein data.
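The proportions quoted in the Results can be turned into approximate gene counts. The following is only a back-of-the-envelope sketch that uses the rounded figures from the abstract (~5000 added genes, a 50% increase, about 1/6 attributed to the assembly); the function and variable names are illustrative assumptions, and the outputs are estimates, not the exact OGSv1.0 or OGSv3.2 counts reported in the paper.

```python
# Rough arithmetic on the rounded figures quoted in the abstract.
# Every input is approximate, so every output is only an estimate.

ADDED_GENES = 5000          # "~5000 more protein-coding genes"
RELATIVE_INCREASE = 0.50    # "50% more than previously reported"
ASSEMBLY_FRACTION = 1 / 6   # "About 1/6 ... due to improvements to the assembly"


def implied_gene_counts(added, increase, assembly_fraction):
    """Derive approximate totals implied by the abstract's proportions."""
    previous_total = added / increase           # size of the earlier gene set
    new_total = previous_total + added          # size of the upgraded gene set
    from_assembly = added * assembly_fraction   # genes gained via the new assembly
    from_evidence = added - from_assembly       # genes gained via RNAseq/protein data
    return {
        "previous_total": previous_total,
        "new_total": new_total,
        "from_assembly": from_assembly,
        "from_evidence": from_evidence,
    }


counts = implied_gene_counts(ADDED_GENES, RELATIVE_INCREASE, ASSEMBLY_FRACTION)
for name, value in counts.items():
    print(f"{name}: ~{round(value):,}")
```

Under these assumptions, the earlier gene set would have held roughly 10,000 genes and the new one roughly 15,000, with on the order of 800 of the added genes traceable to the assembly and the rest to new RNAseq and protein evidence.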

          Conclusions

          Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.


                Author and article information

                Journal
                Title: BMC Genomics
                Publisher: BioMed Central
                ISSN: 1471-2164
                Published: 30 January 2014
                Volume: 15
                Article number: 86
                Affiliations
                [1] Division of Animal Sciences, Division of Plant Sciences, and MU Informatics Institute, University of Missouri, Columbia, MO 65211, USA
                [2] Department of Biology, Georgetown University, Washington, DC 20057, USA
                [3] Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, MS BCM226, One Baylor Plaza, Houston, TX 77030, USA
                [4] Institute of Evolutionary Genetics, Heinrich Heine University Duesseldorf, Universitaetsstrasse 1, 40225 Duesseldorf, Germany
                [5] Center for Genomic Regulation, Universitat Pompeu Fabra, C/Dr. Aiguader 88, E-08003 Barcelona, Catalonia, Spain
                [6] Division of Animal Sciences, University of Missouri, Columbia, MO 65211, USA
                [7] Laboratory of Zoophysiology, Ghent University, Krijgslaan 281 S2, B-9000 Ghent, Belgium
                [8] Laboratory of Protein Biochemistry and Biomolecular Engineering, Ghent University, K.L. Ledeganckstraat 35, B-9000 Ghent, Belgium
                [9] Department of Mental Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205-2103, USA
                [10] Bee Research Laboratory, BARC-E, USDA-Agricultural Research Service, Beltsville, MD 20705, USA
                [11] Department of Biochemistry & Molecular Biology, Centre for High-Throughput Biology, University of British Columbia, 2125 East Mall, Vancouver, BC, Canada
                [12] Department of Biology and Biochemistry, University of Houston, Houston, TX 77204-5001, USA
                [13] Ernst Moritz Arndt University Greifswald, Institute for Mathematics and Computer Science, Walther-Rathenau-Str. 47, 17487 Greifswald, Germany
                [14] Department of Crop Sciences and Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
                [15] Department of Entomology, Purdue University, 901 West State Street, West Lafayette, IN 47907-2089, USA
                [16] Department of Obstetrics, Gynecology & Reproductive Sciences, University of Pittsburgh, MAGEE 0000, Pittsburgh, PA 15260, USA
                [17] High-Performance Biological Computing (HPCBio), Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
                [18] Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY 10549, USA
                [19] Institute for Genomic Biology and Department of Bioengineering, University of Illinois at Urbana-Champaign, 1270 DCL, MC-278, 1304 W Springfield Ave, Urbana, IL 61801, USA
                [20] Research School of Biology, The Australian National University, Canberra ACT 0200, Australia
                [21] Institut für Zoologie, Molekulare Ökologie, Martin-Luther-Universität Halle-Wittenberg, Hoher Weg 4, D-06099 Halle (Saale), Germany
                [22] Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
                [23] National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 45, 8600 Rockville Pike, Bethesda, MD 20894, USA
                [24] Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
                [25] Institute for Genomic Biology, Department of Entomology, Neuroscience Program, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, USA
                [26] Department of Biology, University of North Carolina at Greensboro, 321 McIver Street, Greensboro, NC 27412, USA
                [27] Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
                [28] Extension Field Operations, Clemson University, 120 McGinty Ct, Clemson, SC 29634, USA
                [29] University of Geneva and Swiss Institute of Bioinformatics, CMU, Michel-Servet 1, Geneva CH-1211, Switzerland
                [30] Genformatic, 6301 Highland Hills Drive, Austin, TX 78731, USA
                [31] Department of Entomology, Neuroscience Program, Program in Ecology and Evolutionary Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
                Author notes
                HGSC production teams
                on behalf of Honey Bee Genome Sequencing Consortium
                Article
                Publisher ID: 1471-2164-15-86
                DOI: 10.1186/1471-2164-15-86
                PMC ID: 4028053
                PubMed ID: 24479613
                ccbdf311-5fe6-4617-b34d-2157d24c3b3a
                Copyright © 2014 Elsik et al.; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 11 September 2013
                Accepted: 27 January 2014
                Categories
                Research Article

                Genetics
