FASTA Herder: a web application to trim protein sequence sets

This is not the latest version of this article.


      Abstract

      The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs have high levels of identity among themselves, which hinders multiple sequence alignment computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled, producing a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca.
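The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of culling near full-length redundant homologs, not the authors' method. It assumes a greedy strategy (longest sequences become representatives first), a crude identity estimate from Python's `difflib` instead of a real alignment, and two hypothetical parameters, `min_identity` and `min_length_ratio`, whose default values are illustrative only.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def approx_identity(a, b):
    """Matched residues divided by the shorter length (crude, alignment-free)."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cull_redundant(records, min_identity=0.6, min_length_ratio=0.9):
    """Greedy culling: drop a sequence only if it is near full-length
    relative to an already kept representative AND similar enough to it.
    Both thresholds are assumptions made for this sketch."""
    representatives = []
    non_empty = [r for r in records if r[1]]
    # Longest first, so each representative is at least as long as later candidates.
    for header, seq in sorted(non_empty, key=lambda r: len(r[1]), reverse=True):
        redundant = any(
            len(seq) / len(rep) >= min_length_ratio
            and approx_identity(seq, rep) >= min_identity
            for _, rep in representatives
        )
        if not redundant:
            representatives.append((header, seq))
    return representatives


example = """>a
MKTAYIAKQRQISFVKSHFSRQ
>b
MKTAYIAKQRQISFVKSHFSRA
>c
MKT
>d
GGGGGGGGGGGGGGGGGGGGGG
"""
kept = [header for header, _ in cull_redundant(parse_fasta(example))]
```

In this toy input, `b` differs from `a` by one residue and has the same length, so it is culled, while the short fragment `c` is kept because it fails the length-ratio test. The coverage requirement is what makes it reasonable to lower the identity threshold without merging fragments into full-length proteins.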


            Author and article information

            Affiliations
            [ 1 ]Ottawa Hospital Research Institute, 501 Smyth Road, Ottawa, Ontario K1H 8L6, Canada
            [ 2 ]Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, 13125 Berlin, Germany
            [ 3 ]Department of Cellular and Molecular Medicine, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
            Author notes
            [*] Corresponding author's e-mail address: cpereziratxeta@gmail.com
            Journal: ScienceOpen Research (SO-LIFE)
            Publisher: ScienceOpen
            ISSN: 2199-1006
            Published: 28 April 2014
            Volume: 0 (ID: 5df5dc75-0b14-497d-804d-0075d0201d15)
            Issue: 0
            Pages: 1-3
            1152:XE
            DOI: 10.14293/A2199-1006.01.SOR-LIFE.67837.v1
            © 2014 Louis-Jeune et al.

            This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

            Counts
            Figures: 0, Tables: 0, References: 11, Pages: 3
            Categories
            Original Article
