      Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

      brief-report
      Genome Biology
      BioMed Central

          Abstract

          Background

          Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing.
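The standard preprocessing described above can be sketched as follows. This is a minimal illustration, not the pipeline of any particular package: the function name `depth_normalize_log1p`, the choice of median depth as the scaling target, and the use of `log1p` as the nonlinear transformation are all assumptions made for this example.

```python
import numpy as np

def depth_normalize_log1p(counts):
    """Depth-normalize a cells x genes UMI count matrix, then apply log1p.

    Assumes every cell has at least one count (non-zero sequencing depth).
    """
    counts = np.asarray(counts, dtype=float)
    # Sequencing depth: total UMI count per cell (one value per row).
    depth = counts.sum(axis=1, keepdims=True)
    # Assumed convention: scale every cell to the median depth.
    size_factors = depth / np.median(depth)
    # log(1 + x) as a simple variance-stabilizing transformation.
    return np.log1p(counts / size_factors)

# Toy data: cell 1 is cell 0 sequenced twice as deeply.
counts = np.array([[10, 0, 5],
                   [20, 0, 10],
                   [1, 2, 3]])
normalized = depth_normalize_log1p(counts)
```

In the toy data, the first two cells differ only in sequencing depth, so they become identical after normalization.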

          Results

          We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth.
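The abstract does not state the analytic solution explicitly. The sketch below assumes the form implied by a rank-one offset model: the expected count for each cell and gene is the product of the cell's total count and the gene's total count, divided by the grand total, and residuals are scaled by a negative binomial variance with a single shared overdispersion parameter. The function name, the default `theta=100.0`, and clipping at the square root of the number of cells are assumptions for this example, not values taken from the text above.

```python
import numpy as np

def analytic_pearson_residuals(counts, theta=100.0, clip=None):
    """Pearson residuals for a cells x genes UMI count matrix.

    Assumes genes with zero total count were removed beforehand
    (they would give a 0/0 residual).
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    cell_totals = counts.sum(axis=1, keepdims=True)   # sequencing depth per cell
    gene_totals = counts.sum(axis=0, keepdims=True)   # total count per gene
    # Rank-one expected counts: depth times the gene's share of all counts.
    mu = cell_totals * gene_totals / counts.sum()
    # Negative binomial variance with one shared overdispersion parameter;
    # theta -> infinity recovers the Poisson variance mu.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    if clip is None:
        clip = np.sqrt(n_cells)                       # assumed clipping threshold
    return np.clip(residuals, -clip, clip)

# Toy example: an exactly rank-one matrix has no variation beyond depth.
rank_one_counts = np.outer([1, 2, 3], [2, 4, 6])
rank_one_residuals = analytic_pearson_residuals(rank_one_counts)

# Per-gene residual variance could then be used to rank genes.
gene_scores = rank_one_residuals.var(axis=0)
```

Because the toy matrix is exactly rank-one, the expected counts equal the observed counts and all residuals are zero. On real data, genes whose residual variance is large would be candidates for biologically variable genes.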

          Conclusions

          We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.

          Supplementary Information

          The online version contains supplementary material available at (10.1186/s13059-021-02451-7).

                Author and article information

                Contributors
                jan.lause@uni-tuebingen.de
                dmitry.kobak@uni-tuebingen.de
                Journal
                Genome Biology (Genome Biol)
                BioMed Central (London)
                ISSN: 1474-7596, 1474-760X
                Published: 6 September 2021
                Volume: 22
                Article number: 258
                Affiliations
                [1] Institute for Ophthalmic Research, University of Tübingen, Tübingen, Germany (GRID grid.411544.1; ISNI 0000 0001 0196 8249)
                [2] Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany (GRID grid.10392.39; ISNI 0000 0001 2190 1447)
                [3] Bernstein Center for Computational Neuroscience, University of Tübingen, Tübingen, Germany (GRID grid.455094.9)
                [4] Center for Integrative Neuroscience, University of Tübingen, Tübingen, Germany (GRID grid.10392.39; ISNI 0000 0001 2190 1447)
                Author information
                https://orcid.org/0000-0002-5639-7209
                Article
                Article ID: 2451
                DOI: 10.1186/s13059-021-02451-7
                PMCID: PMC8419999
                PMID: 34488842
                © The Author(s) 2021

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                Received: 1 December 2020
                Accepted: 2 August 2021
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/501100001659, Deutsche Forschungsgemeinschaft;
                Award ID: BE5601/4-1
                Funded by: FundRef http://dx.doi.org/10.13039/501100001659, Deutsche Forschungsgemeinschaft;
                Award ID: BE5601/6-1
                Funded by: FundRef http://dx.doi.org/10.13039/501100001659, Deutsche Forschungsgemeinschaft;
                Award ID: EXC 2064 ML 390727645
                Funded by: FundRef http://dx.doi.org/10.13039/501100002347, Bundesministerium für Bildung und Forschung;
                Award ID: 01GQ1601
                Funded by: FundRef http://dx.doi.org/10.13039/501100002347, Bundesministerium für Bildung und Forschung;
                Award ID: 01IS18039A
                Funded by: FundRef http://dx.doi.org/10.13039/100000025, National Institute of Mental Health;
                Award ID: 1U19MH114830-01
                Categories
                Short Report
                Genetics