
      A comparison of internal validation techniques for multifactor dimensionality reduction

      BMC Bioinformatics
      BioMed Central


          Abstract

          Background

          It is hypothesized that common, complex diseases may be due to complex interactions between genetic and environmental factors, which are difficult to detect in high-dimensional data using traditional statistical approaches. Multifactor Dimensionality Reduction (MDR) is the most commonly used data-mining method to detect epistatic interactions. In all data-mining methods, internal validation procedures are important for obtaining prediction estimates, preventing model over-fitting, and reducing potential false positive findings. Currently, MDR uses cross-validation for internal validation. In this study, we use a three-way split (3WS) of the data, combined with a post-hoc pruning procedure, as an alternative to cross-validation for internal model validation, with the aim of reducing computation time without impairing performance. We compare the power to detect true disease-causing loci using MDR with 5- and 10-fold cross-validation to that of MDR with 3WS for a range of single-locus and epistatic disease models. Additionally, we analyze a dataset in HIV immunogenetics to demonstrate the results of the two strategies on real data.
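The difference between the two internal validation schemes can be illustrated with a minimal sketch of how the samples are partitioned. This is not the authors' implementation: the function names are hypothetical, and the 2:2:1 proportions for the three-way split are an illustrative assumption that the abstract does not state.

```python
import random


def kfold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold, so the model search
    is repeated k times, once per training set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def three_way_split(n_samples, proportions=(2, 2, 1), seed=0):
    """Return (train, test, validation) index lists from one partition.

    The proportions are an assumed example, not a value from the abstract.
    The model search is run a single time, on the training subset only.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    total = sum(proportions)
    n_train = n_samples * proportions[0] // total
    n_test = n_samples * proportions[1] // total
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    validation = idx[n_train + n_test:]
    return train, test, validation
```

With `kfold_splits(100, 5)` each of the five iterations trains on 80 samples and tests on 20, whereas `three_way_split(100)` yields one fixed partition of 40, 40, and 20 samples.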

          Results

          MDR with 3WS is approximately five times faster computationally than MDR with 5-fold cross-validation. Before pruning, the power to find exactly the true disease loci, without any false positive loci, is higher with 5-fold cross-validation than with 3WS. However, when models that contain the true disease-causing loci together with false positive loci are also counted, the power of 5-fold cross-validation is equivalent to that of 3WS. With a pruning procedure applied after the 3WS, the power of the 3WS approach to detect only the exact disease loci is equivalent to that of MDR with cross-validation. In the real data application, the cross-validation and 3WS analyses indicate the same two-locus model.
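A rough count of model evaluations suggests where a speed-up of this size could come from. The sketch below assumes that the dominant cost is the exhaustive search over locus combinations on the training data, that k-fold cross-validation repeats this search once per fold, and that 3WS performs it once. It ignores the differing subset sizes and the cheaper follow-up steps, so it is an order-of-magnitude illustration rather than a timing model; the function names and the example of 20 loci are hypothetical.

```python
from math import comb


def n_candidate_models(n_loci, max_order):
    """Number of locus combinations of order 1..max_order."""
    return sum(comb(n_loci, d) for d in range(1, max_order + 1))


def exhaustive_search_evaluations(n_loci, max_order, passes):
    """Total candidate-model evaluations over all search passes."""
    return passes * n_candidate_models(n_loci, max_order)


# Hypothetical example: 20 loci, models of up to two loci.
cv5 = exhaustive_search_evaluations(20, 2, passes=5)   # 5-fold CV
tws = exhaustive_search_evaluations(20, 2, passes=1)   # three-way split
ratio = cv5 / tws
```

Under these assumptions the ratio equals the number of folds, which is in line with the roughly fivefold difference reported for 5-fold cross-validation.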

          Conclusions

          Our results show that the two internal validation methods perform equivalently when pruning procedures are used. The specific pruning procedure should be chosen with the trade-off in mind: retaining all relevant genetic effects at the cost of including false positives, versus excluding false positives at the risk of missing important genetic factors. These results suggest that 3WS may be a powerful and computationally efficient approach to screen for epistatic effects, and that it could be used to identify candidate interactions in large-scale genetic studies.


                Author and article information

                Journal
                Title: BMC Bioinformatics
                Publisher: BioMed Central
                ISSN: 1471-2105
                Published: 22 July 2010
                Volume: 11
                Article number: 394
                Affiliations
                [1] Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
                [2] Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
                [3] Department of Genetics, North Carolina State University, Raleigh, NC 27695, USA
                Article
                Publisher ID: 1471-2105-11-394
                DOI: 10.1186/1471-2105-11-394
                PMC ID: 2920275
                PMID: 20650002
                Record ID: f12acf1e-77da-4cc7-ae43-ec96040a0409
                Copyright ©2010 Winham et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                14 August 2009
                22 July 2010
                Categories
                Research Article

                Bioinformatics & Computational biology
