      Visualizing the structure of RNA-seq expression data using grade of membership models

      research-article


          Abstract

          Grade of membership models, also known as “admixture models”, “topic models” or “Latent Dirichlet Allocation”, are a generalization of cluster models that allow each sample to have membership in multiple clusters. These models are widely used in population genetics to model admixed individuals who have ancestry from multiple “populations”, and in natural language processing to model documents having words from multiple “topics”. Here we illustrate the potential for these models to cluster samples of RNA-seq gene expression data, measured on either bulk samples or single cells. We also provide methods to help interpret the clusters, by identifying genes that are distinctively expressed in each cluster. By applying these methods to several example RNA-seq applications we demonstrate their utility in identifying and summarizing structure and heterogeneity. Applied to data from the GTEx project on 53 human tissues, the approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to single-cell expression data from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and highlights genes involved in a variety of relevant processes—from germ cell development, through compaction and morula formation, to the formation of inner cell mass and trophoblast at the blastocyst stage. The methods are implemented in the Bioconductor package CountClust.
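As a rough illustration of the model described above, the sketch below fits a grade of membership model to a tiny made-up count matrix. It assumes the usual multinomial formulation, in which the reads of sample n fall on gene g with probability sum_k q[n][k] * theta[k][g], and fits it by plain maximum-likelihood EM with a few random restarts. This is not the CountClust implementation; the function names (`fit_gom`, `log_likelihood`), the toy counts, and the settings are all hypothetical.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def fit_gom(counts, k, n_iter=300, seed=0):
    """Fit a k-cluster grade of membership model to a samples x genes count matrix by EM."""
    rng = random.Random(seed)
    n_samples, n_genes = len(counts), len(counts[0])
    # q[n][j]: membership proportion of sample n in cluster j.
    # theta[j][g]: relative expression of gene g in cluster j.
    q = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(n_samples)]
    theta = [normalize([rng.random() + 0.1 for _ in range(n_genes)]) for _ in range(k)]
    for _ in range(n_iter):
        q_new = [[0.0] * k for _ in range(n_samples)]
        theta_new = [[1e-12] * n_genes for _ in range(k)]  # tiny floor keeps theta positive
        for n in range(n_samples):
            for g in range(n_genes):
                c = counts[n][g]
                if c == 0:
                    continue
                # E-step: split the c reads among clusters in proportion to q * theta.
                weights = [q[n][j] * theta[j][g] for j in range(k)]
                total = sum(weights)
                for j in range(k):
                    share = c * weights[j] / total
                    q_new[n][j] += share
                    theta_new[j][g] += share
        # M-step: renormalize the expected read assignments.
        q = [normalize(row) for row in q_new]
        theta = [normalize(row) for row in theta_new]
    return q, theta


def log_likelihood(counts, q, theta):
    """Multinomial log-likelihood of the counts, up to an additive constant."""
    ll = 0.0
    for n, row in enumerate(counts):
        for g, c in enumerate(row):
            if c > 0:
                p = sum(q[n][j] * theta[j][g] for j in range(len(theta)))
                ll += c * math.log(p)
    return ll


# Toy data (invented): two "pure" samples and one sample that mixes both profiles.
counts = [
    [50, 40, 0, 0],
    [0, 0, 45, 55],
    [25, 20, 22, 28],
]

# EM can stop at a poor local optimum, so keep the best of a few random restarts.
fits = [fit_gom(counts, k=2, seed=s) for s in range(5)]
q, theta = max(fits, key=lambda fit: log_likelihood(counts, *fit))

for n, row in enumerate(q):
    print("sample", n, [round(v, 3) for v in row])
```

Each row of `q` sums to 1. The two pure samples should load almost entirely on different clusters, while the third sample receives a partial membership in both, which is the behaviour a hard clustering cannot represent.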

          Author summary

          The gene expression profile of a biological sample (either from single cells or pooled cells) results from a complex interplay of multiple related biological processes. Consequently, for example, distal tissue samples may share a similar gene expression profile through some common underlying biological processes. Our goal here is to illustrate that grade of membership (GoM) models, an approach widely used in population genetics to cluster admixed individuals who have ancestry from multiple populations, provide an attractive approach for clustering biological samples of RNA sequencing data. The GoM model allows each biological sample to have partial memberships in multiple biologically-distinct clusters, in contrast to traditional clustering methods that partition samples into distinct subgroups. We also provide methods for identifying genes that are distinctively expressed in each cluster to help biologically interpret the results. Applied to a dataset of 53 human tissues, the GoM approach highlights similarities among biologically-related tissues and identifies distinctively-expressed genes that recapitulate known biology. Applied to gene expression data of single cells from mouse preimplantation embryos, the approach highlights both discrete and continuous variation through early embryonic development stages, and genes involved in a variety of relevant processes. Our study highlights the potential of GoM models for elucidating biological structure in RNA-seq gene expression data.
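The summary above does not say how distinctively expressed genes are scored, so the following is only one plausible sketch. It assumes a fitted matrix `theta` of strictly positive relative expression levels (clusters by genes) and ranks genes for a cluster by the smallest Poisson Kullback-Leibler divergence between that cluster and every other cluster. The function names, gene names, and numbers are hypothetical.

```python
import math


def distinctiveness(theta, k, g):
    """Smallest Poisson KL divergence between cluster k and any other cluster at gene g."""
    a = theta[k][g]
    scores = []
    for other, row in enumerate(theta):
        if other == k:
            continue
        b = row[g]
        # KL divergence from Poisson(a) to Poisson(b); requires a > 0 and b > 0.
        scores.append(a * math.log(a / b) + b - a)
    return min(scores)


def top_genes(theta, gene_names, k, n_top=1):
    """Return the n_top genes whose expression in cluster k differs most from all other clusters."""
    order = sorted(
        range(len(gene_names)),
        key=lambda g: distinctiveness(theta, k, g),
        reverse=True,
    )
    return [gene_names[g] for g in order[:n_top]]


# Invented relative expression levels for three clusters and four genes.
gene_names = ["geneA", "geneB", "geneC", "geneD"]
theta = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.10, 0.30, 0.50],
]

for k in range(len(theta)):
    print("cluster", k, top_genes(theta, gene_names, k))
```

Taking the minimum over the other clusters means a gene scores highly only if it separates the cluster from all of the others, not just from one. Note that this score is also positive when a gene is distinctively low in a cluster, so a practical version might additionally report the direction of the difference.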


                Author and article information

                Contributors
                Role: Editor
                Journal: PLoS Genetics (PLoS Genet)
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-7390, 1553-7404
                Published: 23 March 2017
                Volume 13, Issue 3, e1006599
                Affiliations
                [1 ]Department of Statistics, University of Chicago, Chicago, Illinois, United States of America
                [2 ]Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
                Stanford University, UNITED STATES
                Author notes

                The authors have declared that no competing interests exist.

                • Conceptualization: MS KKD.

                • Data curation: MS.

                • Formal analysis: KKD CJH.

                • Funding acquisition: MS.

                • Investigation: MS.

                • Methodology: MS KKD.

                • Project administration: MS.

                • Resources: MS.

                • Software: KKD CJH MS.

                • Supervision: MS.

                • Validation: KKD CJH MS.

                • Visualization: KKD CJH MS.

                • Writing – original draft: KKD CJH MS.

                • Writing – review & editing: KKD CJH MS.

                Author information
                http://orcid.org/0000-0002-3520-2345
                http://orcid.org/0000-0001-8961-7522
                Article
                Manuscript number: PGENETICS-D-16-01158
                DOI: 10.1371/journal.pgen.1006599
                PMCID: PMC5363805
                PMID: 28333934
                eda31989-9df4-4ed0-8906-d002c98907d6
                © 2017 Dey et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 30 May 2016
                Accepted: 24 January 2017
                Page count
                Figures: 4, Tables: 3, Pages: 22
                Funding
                The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to The Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F (HHSN261200800001E). The Brain Bank was supported by supplements to University of Miami grants DA006227 & DA033684 and to contract N01MH000028. Statistical Methods development grants were made to the University of Geneva (MH090941 & MH101814), the University of Chicago (MH090951, MH090937, MH101820, MH101825), the University of North Carolina - Chapel Hill (MH090936 & MH101819), Harvard University (MH090948), Stanford University (MH101782), Washington University St Louis (MH101810), and the University of Pennsylvania (MH101822). The paper is supported by the grant U01CA198933 from the NIH BD2K program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Developmental Biology > Embryology > Blastocysts
                Biology and Life Sciences > Developmental Biology > Embryology > Embryos
                Biology and Life Sciences > Genetics > Gene Expression
                Biology and life sciences > Molecular biology > Macromolecular structure analysis > RNA structure
                Biology and life sciences > Biochemistry > Nucleic acids > RNA > RNA structure
                Biology and Life Sciences > Developmental Biology > Cell Differentiation > Neuronal Differentiation
                Biology and life sciences > Molecular biology > Molecular biology techniques > Sequencing techniques > RNA sequencing
                Research and analysis methods > Molecular biology techniques > Sequencing techniques > RNA sequencing
                Biology and Life Sciences > Evolutionary Biology > Population Genetics
                Biology and Life Sciences > Genetics > Population Genetics
                Biology and Life Sciences > Population Biology > Population Genetics
                Biology and Life Sciences > Cell Biology > Cellular Types > Animal Cells > Neurons
                Biology and Life Sciences > Neuroscience > Cellular Neuroscience > Neurons
                Custom metadata
                Links to the data used in the analysis, along with references, are provided in the main text of the manuscript.

                Genetics