
      New Assembly, Reannotation and Analysis of the Entamoeba histolytica Genome Reveal New Genomic Features and Protein Content Information





          To keep genome information accurate and relevant, original genome annotations need to be evaluated and updated regularly. Manual reannotation of genomes is important because it can significantly reduce the propagation of errors and consequently diminish the time spent on mistaken research. For this reason, five years after the initial submission of the Entamoeba histolytica draft genome publication, we have re-examined the original 23 Mb assembly and the annotation of the predicted genes.

          Principal Findings

          The evaluation of the genomic sequence led to the identification of more than one hundred artifactual tandem duplications that were eliminated by re-assembling the genome. The reannotation was done using a combination of manual and automated genome analysis. The new 20 Mb assembly contains 1,496 scaffolds and 8,201 predicted genes, of which 60% are identical to the initial annotation and the remaining 40% underwent structural changes. Functional classification of 60% of the genes was modified based on recent sequence comparisons and new experimental data. We have assigned putative function to 3,788 proteins (46% of the predicted proteome) based on the annotation of predicted gene families, and have identified 58 protein families of five or more members that share no homology with known proteins and thus could be Entamoeba-specific. Genome analysis also revealed new features such as the presence of segmental duplications of up to 16 kb flanked by inverted repeats, and the tight association of some gene families with transposable elements.


          This new genome annotation and analysis represents a more refined and accurate blueprint of the pathogen genome, and provides an upgraded reference tool for the study of many important aspects of E. histolytica biology, such as genome evolution and pathogenesis.

          Author Summary

          Entamoeba histolytica is an anaerobic parasitic protozoan that causes amoebic dysentery. The parasites colonize the large intestine, but under some circumstances may invade the intestinal mucosa, enter the bloodstream and lead to the formation of abscesses such as amoebic liver abscesses. The draft genome of E. histolytica, published in 2005, provided the scientific community with the first comprehensive view of the gene set for this parasite and important tools for elucidating the genetic basis of Entamoeba pathogenicity. Because complete genetic knowledge is critical for drug discovery and potential vaccine development for amoebiasis, we have re-examined the original draft genome for E. histolytica. We have corrected the sequence assembly, improved the gene predictions and refreshed the functional gene assignments. As a result, this effort has led to a more accurate gene annotation, and the discovery of novel features, such as the presence of genome segmental duplications and the close association of some gene families with transposable elements. We believe that continuing efforts to improve genomic data will undoubtedly help to identify and characterize potential targets for amoebiasis control, as well as to contribute to a better understanding of genome evolution and pathogenesis for this parasite.


                Author and article information

                Role: Editor
                PLoS Negl Trop Dis
                PLoS Neglected Tropical Diseases
                Public Library of Science (San Francisco, USA)
                June 2010
                15 June 2010
                Volume: 4
                Issue: 6
                Article: e716
                [1] J. Craig Venter Institute, Rockville, Maryland, United States of America
                [2] Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
                [3] School of Biological Sciences, University of Liverpool, Liverpool, United Kingdom
                New York University School of Medicine, United States of America
                Author notes

                Conceived and designed the experiments: HAL EVC. Performed the experiments: HAL EVC. Analyzed the data: HAL DP JRM LMB NH EVC. Contributed reagents/materials/analysis tools: HAL DP JRM PA NH EVC. Wrote the paper: HAL EVC.

                Lorenzi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                Received: 5 February 2010
                Accepted: 26 April 2010
                Page count
                Pages: 12
                Research Article
                Genetics and Genomics/Bioinformatics
                Genetics and Genomics/Genome Projects
                Genetics and Genomics/Genomics

                Infectious disease & Microbiology

