
      PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations


          Abstract

          Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA) and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure-informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
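
          The abstract does not spell out how SNPs are scored, so the following is only a minimal sketch of one plausible reading: each SNP is ranked by the sum of its squared loadings on the top principal components, and the highest-ranked SNPs are kept. The scoring rule, the 0/1/2 genotype coding, the function names, and the simulated data are all assumptions made for illustration and are not taken from the article.

```python
import numpy as np


def pca_correlated_snps(genotypes, n_components, n_select):
    """Rank SNPs by squared loadings on the top principal components.

    genotypes: (individuals x SNPs) array; 0/1/2 allele counts are assumed here.
    Returns the indices of the n_select highest-scoring SNPs and all scores.
    No ancestry labels are used at any point.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each SNP column
    # Rows of vt are right singular vectors: one weight per SNP per component.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = (vt[:n_components] ** 2).sum(axis=0)
    top = np.argsort(scores)[::-1][:n_select]
    return top, scores


def simulate_two_populations(n_per_pop=50, n_snps=500, n_informative=20, seed=0):
    """Hypothetical toy data: two groups that differ only at the first SNPs."""
    rng = np.random.default_rng(seed)
    freqs_a = rng.uniform(0.2, 0.8, n_snps)
    freqs_b = freqs_a.copy()
    freqs_a[:n_informative] = 0.9  # strong allele-frequency difference
    freqs_b[:n_informative] = 0.1
    pop_a = rng.binomial(2, freqs_a, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, freqs_b, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])


genotypes = simulate_two_populations()
selected, snp_scores = pca_correlated_snps(genotypes, n_components=1, n_select=20)
print(sorted(selected.tolist()))
```

          On this toy data the selected indices should fall almost entirely among the simulated informative SNPs, even though the group labels were never supplied to the selection step.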

          Author Summary

          Genetic markers can be used to infer population structure, a task that remains a central challenge in many areas of genetics, such as population genetics and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Existing methods to identify structure-informative markers require prior knowledge of which predefined population each studied individual belongs to. In this paper, based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), we develop a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure-informative markers. Our method is very fast even when applied to datasets of hundreds of individuals and millions of markers. We evaluate this method on a large dataset of 11 populations from around the world, as well as data from the HapMap project. We show that, in most cases, we can achieve 99% genotyping savings while at the same time recovering the structure of the studied populations. Finally, we show that our algorithm can also be successfully applied to the identification of structure-informative markers when studying populations of complex ancestry.
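
          To make the "genotyping savings" idea concrete, here is a small sketch, under stated assumptions, of clustering individuals on a reduced marker panel and computing the fraction of genotyping avoided. The summary does not name the clustering algorithm, so plain k-means stands in for it; the panel size of 108, the simulated genotypes, and all function names are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    points = np.asarray(points, dtype=float)
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def genotyping_savings(n_selected, n_total):
    """Fraction of markers that no longer need to be genotyped."""
    return 1.0 - n_selected / n_total


def make_panel(n_per_pop=30, n_snps=10, seed=1):
    """Hypothetical reduced panel: 0/1/2 genotypes for two simulated groups."""
    rng = np.random.default_rng(seed)
    pop_a = rng.binomial(2, 0.9, size=(n_per_pop, n_snps))
    pop_b = rng.binomial(2, 0.1, size=(n_per_pop, n_snps))
    return np.vstack([pop_a, pop_b])


panel = make_panel()
cluster_labels = kmeans(panel, k=2)
print(cluster_labels)
# Keeping a hypothetical 108 of 10,805 SNPs corresponds to roughly 99% savings.
print(round(genotyping_savings(108, 10805), 4))
```

          The dataset size of 10,805 SNPs comes from the abstract; the choice of 108 retained markers is only an illustration of what a 99% reduction would mean at that scale.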


                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Genet
                pgen
                PLoS Genetics
                Public Library of Science (San Francisco, USA)
                1553-7390
                1553-7404
                September 2007
                21 September 2007
                Volume: 3
                Issue: 9
                Article: e160
                Affiliations
                [1] Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupoli, Greece
                [2] Division of General Internal Medicine, University of California San Francisco, San Francisco, California, United States of America
                [3] Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America
                [4] Comprehensive Cancer Center, University of California San Francisco, San Francisco, California, United States of America
                [5] Department of Biopharmaceutical Sciences, University of California San Francisco, San Francisco, California, United States of America
                [6] Department of Medicine, University of California San Francisco, San Francisco, California, United States of America
                [7] Lung Biology Center, Department of Medicine, University of California San Francisco, San Francisco, California, United States of America
                [8] Pulmonary/CCM Veterans Caribbean Healthcare System, University of Puerto Rico School of Medicine, San Juan, Puerto Rico, United States of America
                [9] Yahoo Research, Sunnyvale, California, United States of America
                [10] Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York, United States of America
                University of Alabama at Birmingham, United States of America
                Author notes
                * To whom correspondence should be addressed. E-mail: ppaschou@mbg.duth.gr
                Article
                07-PLGE-RA-0231R3 plge-03-09-12
                DOI: 10.1371/journal.pgen.0030160
                PMCID: 1988848
                PMID: 17892327
                774d8083-0ad3-4bc4-99c2-a898fc0389dc
                Copyright: © 2007 Paschou et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Received: 4 April 2007
                Accepted: 1 August 2007
                Page count
                Pages: 15
                Categories
                Research Article
                Computer Science
                Genetics and Genomics
                Homo (Human)
                Custom metadata
                Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, et al. (2007) PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 3(9): e160. doi: 10.1371/journal.pgen.0030160
