
CD-HIT Suite: a web server for clustering and comparing biological sequences


Bioinformatics

Oxford University Press


      Abstract

      Summary: CD-HIT is a widely used program for clustering and comparing large biological sequence datasets. To further assist CD-HIT users, we significantly improved this program with more functions and better accuracy, scalability and flexibility. Most importantly, we developed a new web server, CD-HIT Suite, for clustering a user-uploaded sequence dataset or comparing it to another dataset at different identity levels. Users can now interactively explore the clusters within web browsers. We also provide downloadable clusters for several public databases (NCBI NR, Swissprot and PDB) at different identity levels.

      Availability: Free access at http://cd-hit.org

      Contact: liwz@sdsc.edu

      Supplementary information: Supplementary data are available at Bioinformatics online.
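
To make "clustering at different identity levels" concrete, the following is a minimal Python sketch of greedy incremental clustering, the general strategy CD-HIT is known for: sequences are processed from longest to shortest, and each one either joins the first existing representative it matches at or above the threshold or becomes a new representative. The function names `identity` and `greedy_cluster`, the toy sequences, and the `difflib`-based similarity score are illustrative assumptions. This is not the CD-HIT implementation, and the score is only a crude stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity: matched characters / shorter length.

    Assumption for illustration only; real tools compute identity from an alignment.
    Both sequences must be non-empty.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy incremental clustering keyed by representative sequence name."""
    clusters: dict[str, list[str]] = {}
    # Process sequences from longest to shortest (ties broken by name).
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            # Join the first representative that meets the identity threshold.
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences, not taken from any database.
toy = {
    "s1": "MKTAYIAKQRQISFVKSHFSRQ",
    "s2": "MKTAYIAKQRQISFVKSHFS",
    "s3": "GGGGGGGGGGWWWWWWWWWW",
}
for level in (0.9, 0.5):
    print(level, greedy_cluster(toy, level))
```

Lowering the threshold generally merges more sequences into fewer clusters, which is why clustered databases are typically offered at several identity levels.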


      Most cited references 9


      Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

      In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.

        A core gut microbiome in obese and lean twins

        The human distal gut harbors a vast ensemble of microbes (the microbiota) that provide us with important metabolic capabilities, including the ability to extract energy from otherwise indigestible dietary polysaccharides [1–6]. Studies of a small number of unrelated, healthy adults have revealed substantial diversity in their gut communities, as measured by sequencing 16S rRNA genes [6–8], yet how this diversity relates to function and to the rest of the genes in the collective genomes of the microbiota (the gut microbiome) remains obscure. Studies of lean and obese mice suggest that the gut microbiota affects energy balance by influencing the efficiency of calorie harvest from the diet, and how this harvested energy is utilized and stored [3–5]. To address the question of how host genotype, environmental exposures, and host adiposity influence the gut microbiome, we have characterized the fecal microbial communities of adult female monozygotic and dizygotic twin pairs concordant for leanness or obesity, and their mothers. Analysis of 154 individuals yielded 9,920 near full-length and 1,937,461 partial bacterial 16S rRNA sequences, plus 2.14 gigabases from their microbiomes. The results reveal that the human gut microbiome is shared among family members, but that each person’s gut microbial community varies in the specific bacterial lineages present, with a comparable degree of co-variation between adult monozygotic and dizygotic twin pairs. However, there was a wide array of shared microbial genes among sampled individuals, comprising an extensive, identifiable ‘core microbiome’ at the gene, rather than at the organismal lineage level. Obesity is associated with phylum-level changes in the microbiota, reduced bacterial diversity, and altered representation of bacterial genes and metabolic pathways. These results demonstrate that a diversity of organismal assemblages can nonetheless yield a core microbiome at a functional level, and that deviations from this core are associated with different physiologic states (obese versus lean).

          UniRef: comprehensive and non-redundant UniProt reference clusters.

          Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. Supplementary data are available at Bioinformatics online.

            Author and article information

            Affiliations
            California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, CA, USA
            Author notes
            * To whom correspondence should be addressed.

            Present address: Department of Medicine, University of California San Diego, La Jolla, CA, USA.

            Associate Editor: Burkhard Rost

            Journal
            Title: Bioinformatics
            Publisher: Oxford University Press
            ISSN: 1367-4803, 1367-4811
            Publication dates: 1 March 2010; 6 January 2010
            Volume: 26
            Issue: 5
            Pages: 680-682
            PMCID: PMC2828112
            PMID: 20053844
            DOI: 10.1093/bioinformatics/btq003
            © The Author(s) 2010. Published by Oxford University Press.

            This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

            Categories
            Applications Note
            Sequence Analysis

            Bioinformatics & Computational biology
