
      Screening large-scale association study data: exploiting interactions using random forests


          Abstract

          Background

Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to screen them with a statistical test, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable; this measure takes into account interactions among variables without requiring model specification. Interactions increase the importance of the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
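The univariate screen described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each SNP is reduced to binary carrier status so that a 2x2 Fisher exact test applies, and the SNP names and toy data are hypothetical. Disease status is set to the XOR of two SNPs, an extreme case of interaction with no marginal effect at all.

```python
from itertools import product
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum over every table that is no more likely than the observed one.
    p = sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def screen(genotypes, status):
    """Return (snp, p_value) pairs sorted by ascending p-value."""
    results = []
    for snp, carriers in genotypes.items():
        pairs = list(zip(carriers, status))
        a = sum(1 for g, s in pairs if g and s)          # carrier, case
        b = sum(1 for g, s in pairs if g and not s)      # carrier, control
        c = sum(1 for g, s in pairs if not g and s)      # non-carrier, case
        d = sum(1 for g, s in pairs if not g and not s)  # non-carrier, control
        results.append((snp, fisher_exact_2x2(a, b, c, d)))
    return sorted(results, key=lambda pair: pair[1])


# Toy data (hypothetical): 80 individuals, every combination of three binary
# SNPs repeated 10 times. Disease status is the XOR of snp_a and snp_b.
rows = [combo for combo in product((0, 1), repeat=3) for _ in range(10)]
status = [a ^ b for a, b, _ in rows]
genotypes = {
    "snp_a": [r[0] for r in rows],
    "snp_b": [r[1] for r in rows],
    "snp_noise": [r[2] for r in rows],
    # A SNP with a plain marginal effect: equal to status, flipped for every
    # fifth individual.
    "snp_marginal": [s if i % 5 else 1 - s for i, s in enumerate(status)],
}

ranking = screen(genotypes, status)
p_values = dict(ranking)
for snp, p in ranking:
    print(f"{snp}: p = {p:.3g}")
```

In this toy setting the SNP with a marginal effect ranks first, while the two interacting SNPs are indistinguishable from the noise SNP: each has a perfectly balanced marginal table, so a p-value cutoff would discard them even though together they determine status completely.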

          Results

Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to the Fisher Exact test also increases. When the SNPs in the analysis do not interact, random forests perform similarly to the univariate Fisher Exact test as a screening tool.
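To illustrate why interacting SNPs can score highly under a forest-based importance measure, here is a small pure-Python sketch. It is written under stated assumptions and does not reproduce the study's analysis: SNPs are binary, trees are grown unpruned on bootstrap samples with a random subset of SNPs tried at each node, and importance is taken to be the mean drop in out-of-bag accuracy after permuting one SNP (one common variant of the random forest importance measure; the abstract does not say which variant or settings were used). The function names, parameter values, and toy data are all hypothetical.

```python
import random
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def grow(X, y, idx, mtry, rng):
    """Grow an unpruned tree; a node is a 0/1 leaf or (feature, left, right)."""
    labels = [y[i] for i in idx]
    if len(set(labels)) == 1:
        return labels[0]
    # Only features that still take both values in this node can split it.
    usable = [f for f in range(len(X[0])) if len({X[i][f] for i in idx}) == 2]
    if not usable:
        return int(2 * sum(labels) >= len(labels))  # majority vote

    def impurity(f):
        left = [y[i] for i in idx if X[i][f] == 0]
        right = [y[i] for i in idx if X[i][f] == 1]
        return len(left) * gini(left) + len(right) * gini(right)

    # Try a random subset of features and keep the purest split.
    best = min(rng.sample(usable, min(mtry, len(usable))), key=impurity)
    return (best,
            grow(X, y, [i for i in idx if X[i][best] == 0], mtry, rng),
            grow(X, y, [i for i in idx if X[i][best] == 1], mtry, rng))


def predict(tree, row):
    while isinstance(tree, tuple):
        tree = tree[1 + row[tree[0]]]  # 0 goes left, 1 goes right
    return tree


def forest_importance(X, y, n_trees=50, mtry=2, seed=0):
    """Mean drop in out-of-bag accuracy when each feature is permuted."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        oob = sorted(set(range(n)) - set(bag))       # held-out individuals
        tree = grow(X, y, bag, mtry, rng)
        base = sum(predict(tree, X[i]) == y[i] for i in oob)
        for f in range(p):
            shuffled = [X[i][f] for i in oob]
            rng.shuffle(shuffled)
            hits = 0
            for i, value in zip(oob, shuffled):
                row = list(X[i])
                row[f] = value
                hits += predict(tree, row) == y[i]
            totals[f] += (base - hits) / len(oob)
    return [t / n_trees for t in totals]


# Toy data (hypothetical): six binary SNPs in every combination, 5 copies each.
# Status is the XOR of SNP 0 and SNP 1; SNPs 2-5 are unassociated.
rows = [combo for combo in product((0, 1), repeat=6) for _ in range(5)]
status = [r[0] ^ r[1] for r in rows]
scores = forest_importance(rows, status)
for snp, score in enumerate(scores):
    print(f"SNP {snp}: importance = {score:.3f}")
```

Because each tree can split on one interacting SNP and then on the other further down the same path, permuting either SNP damages the predictions, and both receive high importance even though neither has a marginal effect in this toy data. Real genotype data are noisier and far higher-dimensional, so the separation would be much less clean than in this sketch.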

          Conclusions

In large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs, or between SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.


                Author and article information

Journal: BMC Genetics (BMC Genet)
Publisher: BioMed Central (London)
ISSN: 1471-2156
Published: 10 December 2004
Volume: 5
Article number: 32
Affiliations
[1] Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA
[2] Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
[3] Genizon BioSciences Inc., Montreal, Quebec, Canada
[4] Department of Psychiatry, Harvard Medical School, Boston, Massachusetts, USA
Article
Article ID: 1471-2156-5-32
DOI: 10.1186/1471-2156-5-32
PMCID: PMC545646
PMID: 15588316
                720e76e7-fd6f-4686-890d-fd9015dd8365
                Copyright © 2004 Lunetta et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History
25 June 2004
10 December 2004
                Categories
                Research Article

                Genetics
