CD-HIT: accelerated for clustering the next-generation sequencing data

, , , , *

Bioinformatics

Oxford University Press

      Abstract

      Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions.

      Availability: http://cd-hit.org.

      Contact: liwz@sdsc.edu

      Supplementary information: Supplementary data are available at Bioinformatics online.
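      The abstract does not spell out the clustering procedure or the parallelization strategy. As a rough illustration of what "clustering sequences to reduce redundancy" means, the sketch below shows greedy incremental clustering, the general approach CD-HIT is known for: sequences are processed longest first, and each one either joins the first existing representative it resembles closely enough or becomes a new representative. The names `identity` and `greedy_cluster`, the 0.9 threshold, and the use of Python's `difflib` as the similarity measure are all assumptions made for this example. It is a serial toy, not the CD-HIT implementation, and it does not reproduce the accelerated version described above.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Stand-in similarity (an assumption, not CD-HIT's measure):
    matched characters divided by the length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict, threshold: float = 0.9) -> dict:
    """Map each representative name to the names of its cluster members."""
    clusters = {}
    # Longest first, so every representative is at least as long as its members.
    # Ties are broken by name to keep the result deterministic.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # redundant with an existing representative
                break
        else:
            clusters[name] = [name]  # no match found: start a new cluster
    return clusters


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    toy = {"a": "ACGTACGTACGT", "b": "ACGTACGTACG", "c": "TTTTTTTTTTTT"}
    print(greedy_cluster(toy))
```

      Keeping only the representatives (the dictionary keys) gives the non-redundant set. The inner loop over representatives is the expensive part on large inputs, which is why tools in this space rely on fast filters and parallel execution rather than a plain pairwise comparison like this one.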

      Most cited references 11

      Search and clustering orders of magnitude faster than BLAST.

       Robert Edgar (2010)
      Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch.

        Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

        In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283, Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to only protein sequences clustering, here we present several new programs using the same algorithm including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.

          A human gut microbial gene catalogue established by metagenomic sequencing.

          To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, approximately 150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.

            Author and article information

            Affiliations
            Center for Research in Biological Systems, University of California San Diego, La Jolla, CA 92093, USA
            Author notes
            *To whom correspondence should be addressed.

            Present address: Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania.

            Associate Editor: Inanc Birol

            Journal
            Bioinformatics
            Publisher: Oxford University Press
            ISSN: 1367-4803, 1367-4811
            Issue date: 1 December 2012
            Published online: 11 October 2012
            Volume: 28
            Issue: 23
            Pages: 3150-3152
            PMID: 23060610
            PMCID: PMC3516142
            DOI: 10.1093/bioinformatics/bts565
            © The Author 2012. Published by Oxford University Press.

            This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

            Counts
            Pages: 3
            Categories
            Applications Note
            Sequence Analysis

            Bioinformatics & Computational biology
