      PFASUM: a substitution matrix from Pfam structural alignments

          Abstract

          Background

          Detecting homologous protein sequences and computing multiple sequence alignments (MSAs) are fundamental tasks in molecular bioinformatics. Both tasks usually require a substitution matrix, derived from a set of aligned sequences, that models evolutionary substitution events. Over recent years, the known sequence space has grown drastically, and several publications have demonstrated that this growth can yield significantly better-performing matrices. Nevertheless, matrices based on dated sequence datasets remain the de facto standard for both tasks, even though their data basis may limit their capabilities.

          We address this gap by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert-curated ground-truth data covering a large and diverse sequence space.
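
The abstract does not describe the PFASUM algorithm itself. As generic background only, the sketch below shows how a log-odds substitution score can be derived from aligned sequences in the familiar BLOSUM style, without sequence clustering or weighting. The function name `log_odds_scores`, the half-bit scaling and the toy alignment are assumptions made for illustration and are not taken from the article.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(alignment, scale=2.0):
    """Return {(a, b): score} for unordered residue pairs (a <= b).

    Scores are rounded log2 odds ratios multiplied by `scale`
    (scale=2.0 gives half-bit units). This is a generic log-odds
    sketch, NOT the PFASUM algorithm.
    """
    # Count every unordered residue pair within each alignment column.
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gaps
        for a, b in combinations(residues, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    # Observed frequency of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, taken from the pair frequencies.
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2

    scores = {}
    for (a, b), f in q.items():
        # Frequency expected if residues paired independently.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(f / expected))
    return scores


if __name__ == "__main__":
    toy_alignment = ["AAC", "AAC", "ASC"]  # made-up example data
    for pair, score in sorted(log_odds_scores(toy_alignment).items()):
        print(pair, score)
```

Pairs that never co-occur in the input are simply absent from the result; a production matrix would need pseudocounts or a much larger alignment collection to cover all 20 amino acids.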

          Results

          We show results for two use cases. First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that using PFASUM matrices can lead to significantly better homology search results than using conventional matrices. PFASUM matrices with relative entropies comparable to those of the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment that also compared matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set.
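
The comparison above pairs matrices by relative entropy. A minimal sketch of that quantity follows, assuming the standard log-odds definition H = Σ q_ab · log2(q_ab / e_ab) over unordered residue pairs; the abstract does not state the exact formula used. The function name and the two-letter toy alphabet are illustrative only.

```python
import math


def relative_entropy(q, p):
    """Relative entropy in bits of a log-odds substitution model.

    q: {(a, b): frequency} of unordered aligned pairs, summing to 1.
    p: {a: frequency} background residue frequencies, summing to 1.
    """
    h = 0.0
    for (a, b), f in q.items():
        if f == 0:
            continue  # a zero-frequency pair contributes nothing
        # Pair frequency expected under independence.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        h += f * math.log2(f / expected)
    return h


if __name__ == "__main__":
    background = {"A": 0.5, "B": 0.5}
    # Perfectly conserved toy model: residues only ever pair with themselves.
    conserved = {("A", "A"): 0.5, ("B", "B"): 0.5}
    # Toy model in which pairing is indistinguishable from chance.
    random_like = {("A", "A"): 0.25, ("A", "B"): 0.5, ("B", "B"): 0.25}
    print(relative_entropy(conserved, background))
    print(relative_entropy(random_like, background))
```

A higher value means the matrix is tuned to more closely related sequences, which is why matrices are matched on this quantity before being compared head to head.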

          Second, our results demonstrate that using PFASUM matrices for MSA construction improves MSA quality compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed with equal or higher quality when using MUSCLE with the PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. This rate increases to at least 76% for MSAs containing similar sequences.
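
The "equal or higher quality" rate quoted above can be read as a per-benchmark win-or-tie fraction. The sketch below computes such a fraction from two lists of per-MSA quality scores; the function name and the numbers are hypothetical, and the abstract does not specify which quality score was used.

```python
def equal_or_better_rate(candidate, baseline):
    """Fraction of benchmarks where candidate quality >= baseline quality."""
    if not candidate or len(candidate) != len(baseline):
        raise ValueError("score lists must be non-empty and of equal length")
    ties_or_wins = sum(c >= b for c, b in zip(candidate, baseline))
    return ties_or_wins / len(candidate)


if __name__ == "__main__":
    # Hypothetical per-MSA quality scores, not data from the article.
    with_new_matrix = [0.9, 0.8, 0.7, 0.6]
    with_old_matrix = [0.9, 0.7, 0.8, 0.5]
    print(equal_or_better_rate(with_new_matrix, with_old_matrix))
```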

          Conclusions

          We present the novel PFASUM substitution matrices, derived from manually curated MSA ground-truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the use of PFASUM matrices, and especially PFASUM60, for these specific tasks.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s12859-017-1703-z) contains supplementary material, which is available to authorized users.

                Author and article information

                Contributors
                keul@bio.tu-darmstadt.de
                martin.hess@gcc.tu-darmstadt.de
                michael.goesele@gcc.tu-darmstadt.de
                hamacher@bio.tu-darmstadt.de
                Journal
                BMC Bioinformatics
                BioMed Central (London)
                ISSN: 1471-2105
                5 June 2017
                Volume: 18
                Article number: 293
                Affiliations
                [1] ISNI 0000 0001 0940 1669, GRID grid.6546.1, Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, 64287 Darmstadt, Germany
                [2] ISNI 0000 0001 0940 1669, GRID grid.6546.1, Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, 64283 Darmstadt, Germany
                Author information
                http://orcid.org/0000-0001-5827-2736
                Article
                1703
                10.1186/s12859-017-1703-z
                5460430
                47613b45-57c1-4a71-bf28-182cfac9b9d5
                © The Author(s) 2017

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 1 March 2017
                Accepted: 22 May 2017
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/501100001659, Deutsche Forschungsgemeinschaft;
                Award ID: HA 5261/3-1
                Funded by: Forum for Interdisciplinary Research at Technische Universität Darmstadt
                Funded by: LOEWE Zentrum AdRIA (DE)
                Award ID: iNAPO
                Funded by: FundRef http://dx.doi.org/10.13039/501100000781, European Research Council;
                Award ID: noMAGIC
                Funded by: Deutsche Forschungsgemeinschaft and Open Access Publishing Fund of Technische Universität Darmstadt
                Categories
                Research Article
                Custom metadata

                Bioinformatics & Computational biology
                Keywords: substitution matrix, PFASUM, homologous sequence search, sequence alignment
