      Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

      Bioinformatics
      Oxford University Press (OUP)

          Abstract

          In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.
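
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two tasks it names: clustering one dataset and comparing two datasets. It assumes a greedy incremental scheme (sequences are processed longest first, and each one joins the first existing representative it matches at or above a threshold, otherwise it becomes a new representative). The function names `identity`, `greedy_cluster` and `compare_two_sets` are hypothetical, and the similarity measure is Python's `difflib` ratio, used here as a stand-in rather than the measure the cd-hit programs compute.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Toy similarity in [0, 1]; a stand-in, not the cd-hit identity measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Cluster one dataset (name -> sequence).

    Returns a list of (representative_name, [member_names]) pairs.
    """
    # Assumption: longest sequences are considered first; ties broken by name.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    clusters = []
    for name in order:
        for rep, members in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                members.append(name)  # joins the first matching representative
                break
        else:
            clusters.append((name, [name]))  # becomes a new representative
    return clusters


def compare_two_sets(db1, db2, threshold=0.9):
    """For each sequence in db2, report the first db1 sequence it matches."""
    hits = {}
    for name2, seq2 in db2.items():
        for name1, seq1 in db1.items():
            if identity(seq1, seq2) >= threshold:
                hits[name2] = name1
                break
    return hits


if __name__ == "__main__":
    proteins = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSR",   # near-duplicate of "a"
        "c": "GGGGGGGGGGWWWWWWWWWW",    # unrelated toy sequence
    }
    print(greedy_cluster(proteins))
    print(compare_two_sets({"a": proteins["a"]},
                           {"b": proteins["b"], "c": proteins["c"]}))
```

The same two functions work unchanged on nucleotide strings, which mirrors the protein/nucleotide split between the programs; the all-pairs loops here are quadratic and make no attempt at the speed the abstract reports.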

          Most cited references (3)

          UniProt: the Universal Protein knowledgebase.

          To provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information, the Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt) consortium. Our mission is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces. The central database will have two sections, corresponding to the familiar Swiss-Prot (fully manually curated entries) and TrEMBL (enriched with automated classification, annotation and extensive cross-references). For convenient sequence searches, UniProt also provides several non-redundant sequence databases. The UniProt NREF (UniRef) databases provide representative subsets of the knowledgebase suitable for efficient searching. The comprehensive UniProt Archive (UniParc) is updated daily from many public source databases. The UniProt databases can be accessed online (http://www.uniprot.org) or downloaded in several formats (ftp://ftp.uniprot.org/pub). The scientific community is encouraged to submit data for inclusion in UniProt.

            Clustering of highly homologous sequences to reduce the size of large protein databases.

            We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.

              The distribution and query systems of the RCSB Protein Data Bank.

              The Protein Data Bank (PDB; http://www.pdb.org) is the primary source of information on the 3D structure of biological macromolecules. The PDB's mandate is to disseminate this information in the most usable form and as widely as possible. The current query and distribution system is described and an alpha version of the future re-engineered system introduced.

                Author and article information

                Journal
                Bioinformatics
                Oxford University Press (OUP)
                1367-4803
                1460-2059
                June 26 2006
                July 01 2006
                May 26 2006
                July 01 2006
                Volume: 22
                Issue: 13
                Pages: 1658-1659
                Article
                DOI: 10.1093/bioinformatics/btl158
                PMID: 16731699
                © 2006