      Is Open Access

      Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

      Preprint

          Abstract

          RNA sequencing (RNA-seq) has been rapidly adopted for the profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, and there is a need for guidance on current best practices. This protocol presents a "state-of-the-art" computational and statistical RNA-seq differential expression analysis workflow largely based on the free, open-source R language and Bioconductor software and, in particular, on two widely used tools, DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day using a standard desktop PC.

          Most cited references (22)

          Small-sample estimation of negative binomial dispersion, with applications to SAGE data.

          We derive a quantile-adjusted conditional maximum likelihood estimator for the dispersion parameter of the negative binomial distribution and compare its performance, in terms of bias, to various other methods. Our estimation scheme outperforms all other methods in very small samples, typical of those from serial analysis of gene expression studies, the motivating data for this study. The impact of dispersion estimation on hypothesis testing is studied. We derive an "exact" test that outperforms the standard approximate asymptotic tests.

            Moderated statistical tests for assessing differences in tag abundance.

            Digital gene expression (DGE) technologies measure gene expression by counting sequence tags. They are sensitive technologies for measuring gene expression on a genomic scale, without the need for prior knowledge of the genome sequence. As the cost of sequencing DNA decreases, the number of DGE datasets is expected to grow dramatically. Various tests of differential expression have been proposed for replicated DGE data using binomial, Poisson, negative binomial or pseudo-likelihood (PL) models for the counts, but none of these are usable when the number of replicates is very small. We develop tests using the negative binomial distribution to model overdispersion relative to the Poisson, and use conditional weighted likelihood to moderate the level of overdispersion across genes. Not only is our strategy applicable even with the smallest number of libraries, but it also proves to be more powerful than previous strategies when more libraries are available. The methodology is equally applicable to other counting technologies, such as proteomic spectral counts. An R package can be accessed from http://bioinf.wehi.edu.au/resources/
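
The overdispersion mentioned above can be made concrete with the usual negative binomial mean-variance relationship. The symbols below are assumptions introduced for illustration (the abstract defines none): $Y_g$ is the count for gene $g$, $\mu_g$ its mean and $\phi_g \ge 0$ its dispersion.

```latex
\operatorname{E}[Y_g] = \mu_g,
\qquad
\operatorname{Var}(Y_g) = \mu_g + \phi_g\,\mu_g^{2},
\qquad
\phi_g = 0 \;\Rightarrow\; \operatorname{Var}(Y_g) = \mu_g \ \text{(Poisson)}.
```

Under this parameterisation, any $\phi_g > 0$ gives a variance larger than the Poisson variance, and moderating the overdispersion across genes amounts to stabilising the per-gene estimates of $\phi_g$ by borrowing information from the other genes.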

              Independent filtering increases detection power for high-throughput experiments.

              With high-dimensional data, variable-by-variable statistical testing is often used to select variables whose behavior differs across conditions. Such an approach requires adjustment for multiple testing, which can result in low statistical power. A two-stage approach that first filters variables by a criterion independent of the test statistic, and then only tests variables which pass the filter, can provide higher power. We show that use of some filter/test statistic pairs presented in the literature may, however, lead to loss of type I error control. We describe other pairs which avoid this problem. In an application to microarray data, we found that gene-by-gene filtering by overall variance followed by a t-test increased the number of discoveries by 50%. We also show that this particular statistic pair induces a lower bound on fold-change among the set of discoveries. Independent filtering, that is, using filter/test pairs that are independent under the null hypothesis but correlated under the alternative, is a general approach that can substantially increase the efficiency of experiments.
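
The two-stage idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the threshold, the toy numbers and the choice of Benjamini-Hochberg adjustment are all assumptions, and the code does not check that the filter statistic is independent of the test statistic under the null hypothesis, which is the condition the abstract says is needed for type I error control.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def filter_then_adjust(filter_stat, pvalues, threshold):
    """Adjust only the p-values whose filter statistic reaches the threshold.

    Returns one entry per input: an adjusted p-value for variables that
    pass the filter, and None for variables that were filtered out.
    """
    kept = [i for i, s in enumerate(filter_stat) if s >= threshold]
    adjusted_kept = bh_adjust([pvalues[i] for i in kept])
    result = [None] * len(pvalues)
    for i, adj in zip(kept, adjusted_kept):
        result[i] = adj
    return result


# Hypothetical toy data: six genes, a filter statistic (for example an
# overall variance or mean count) and a p-value from some per-gene test.
filter_stat = [50, 40, 1, 2, 30, 1]
pvalues = [0.01, 0.02, 0.6, 0.7, 0.8, 0.9]

unfiltered = bh_adjust(pvalues)
filtered = filter_then_adjust(filter_stat, pvalues, threshold=10)

print(sum(p < 0.05 for p in unfiltered))
print(sum(p is not None and p < 0.05 for p in filtered))
```

In this toy data, no adjusted p-value falls below 0.05 when all six genes are tested, whereas two do once the three low-statistic genes are removed before adjustment, because the multiple-testing burden drops from six tests to three.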

                Author and article information

                Journal
                2013-02-15
                2013-06-20
                Article
                DOI: 10.1038/nprot.2013.099
                PMID: 23975260
                arXiv: 1302.3685
                http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                History
                Custom metadata
                q-bio.GN

                Genetics