1
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Short H2A histone variants are expressed in cancer

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Short H2A (sH2A) histone variants are primarily expressed in the testes of placental mammals. Their incorporation into chromatin is associated with nucleosome destabilization and modulation of alternate splicing. Here, we show that sH2As innately possess features similar to recurrent oncohistone mutations associated with nucleosome instability. Through analyses of existing cancer genomics datasets, we find aberrant sH2A upregulation in a broad array of cancers, which manifest splicing patterns consistent with global nucleosome destabilization. We posit that short H2As are a class of “ready-made” oncohistones, whose inappropriate expression contributes to chromatin dysfunction in cancer.

          Abstract

          Short H2A variants are testis-specific histones that destabilize nucleosomes during spermatogenesis. In this study, the authors show that these variants are expressed in an array of different cancers and identify splicing changes associated with nucleosome instability in these malignancies.

          Related collections

          Most cited references56

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

          Background RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

            Rationale Improvements in the efficiency of DNA sequencing have both broadened the applications for sequencing and dramatically increased the size of sequencing datasets. Technologies from Illumina (San Diego, CA, USA) and Applied Biosystems (Foster City, CA, USA) have been used to profile methylation patterns (MeDIP-Seq) [1], to map DNA-protein interactions (ChIP-Seq) [2], and to identify differentially expressed genes (RNA-Seq) [3] in the human genome and other species. The Illumina instrument was recently used to re-sequence three human genomes, one from a cancer patient and two from previously unsequenced ethnic groups [4-6]. Each of these studies required the alignment of large numbers of short DNA sequences ('short reads') onto the human genome. For example, two of the studies [4,5] used the short read alignment tool Maq [7] to align more than 130 billion bases (about 45× coverage) of short Illumina reads to a human reference genome in order to detect genetic variations. The third human re-sequencing study [6] used the SOAP program [8] to align more than 100 billion bases to the reference genome. In addition to these projects, the 1,000 Genomes project is in the process of using high-throughput sequencing instruments to sequence a total of about six trillion base pairs of human DNA [9]. With existing methods, the computational cost of aligning many short reads to a mammalian genome is very large. For example, extrapolating from the results presented here in Tables 1 and 2, one can see that Maq would require more than 5 central processing unit (CPU)-months and SOAP more than 3 CPU-years to align the 140 billion bases from the study by Ley and coworkers [5]. Although using Maq or SOAP for this purpose has been shown to be feasible by using multiple CPUs, there is a clear need for new tools that consume less time and computational resources. Table 1 Bowtie alignment performance versus SOAP and Maq Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) Bowtie -v 2 Server 15 m 7 s 15 m 41 s 33.8 1,149 - 67.4 SOAP 91 h 57 m 35 s 91 h 47 m 46 s 0.10 13,619 351× 67.3 Bowtie PC 16 m 41 s 17 m 57 s 29.5 1,353 - 71.9 Maq 17 h 46 m 35 s 17 h 53 m 7 s 0.49 804 59.8× 74.7 Bowtie Server 17 m 58 s 18 m 26 s 28.8 1,353 - 71.9 Maq 32 h 56 m 53 s 32 h 58 m 39 s 0.27 804 107× 74.7 The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit. Table 2 Bowtie alignment performance versus Maq with filtered read set Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed up Reads aligned (%) Bowtie PC 16 m 39 s 17 m 47 s 29.8 1,353 - 74.9 Maq 11 h 15 m 58 s 11 h 22 m 2 s 0.78 804 38.4× 78.0 Bowtie Server 18 m 20 s 18 m 46 s 28.3 1,352 - 74.9 Maq 18 h 49 m 7 s 18 h 50 m 16 s 0.47 804 60.2× 78.0 Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Maq and SOAP take the same basic algorithmic approach as other recent read mapping tools such as RMAP [10], ZOOM [11], and SHRiMP [12]. Each tool builds a hash table of short oligomers present in either the reads (SHRiMP, Maq, RMAP, and ZOOM) or the reference (SOAP). Some employ recent theoretical advances to align reads quickly without sacrificing sensitivity. For example, ZOOM uses 'spaced seeds' to significantly outperform RMAP, which is based on a simpler algorithm developed by Baeza-Yaetes and Perleberg [13]. Spaced seeds have been shown to yield higher sensitivity than contiguous seeds of the same length [14,15]. SHRiMP employs a combination of spaced seeds and the Smith-Waterman [16] algorithm to align reads with high sensitivity at the expense of speed. Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm to align reads. Bowtie uses a different and novel indexing strategy to create an ultrafast, memory-efficient short read aligner geared toward mammalian re-sequencing. In our experiments using reads from the 1,000 Genomes project, Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions (see Tables 1 and 2). Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome. The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM. The index is small enough to be distributed over the internet and to be stored on disk and re-used. Multiple processor cores can be used simultaneously to achieve even greater alignment speed. We have used Bowtie to align 14.3× coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores. Bowtie makes a number of compromises to achieve this speed, but these trade-offs are reasonable within the context of mammalian re-sequencing projects. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment. With its highest performance settings, Bowtie may fail to align a small number of reads with valid alignments, if those reads have multiple mismatches. If the stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance. For instance, the '--best' option will guarantee that all alignments reported are best in terms of minimizing mismatches in the seed portion of the read, although this option incurs additional computational cost. With its default options, Bowtie's sensitivity measured in terms of reads aligned is equal to SOAP's and somewhat less than Maq's. Command line options allow the user to increase sensitivity at the cost of greater running time, and to enable Bowtie to report multiple hits for a read. Bowtie can align reads as short as four bases and as long as 1,024 bases. The input to a single run of Bowtie may comprise a mixture of reads with different lengths. Bowtie description and results Bowtie indexes the reference genome using a scheme based on the Burrows-Wheeler transform (BWT) [17] and the FM index [18,19]. A Bowtie index for the human genome fits in 2.2 GB on disk and has a memory footprint of as little as 1.3 GB at alignment time, allowing it to be queried on a workstation with under 2 GB of RAM. The common method for searching in an FM index is the exact-matching algorithm of Ferragina and Manzini [18]. Bowtie does not simply adopt this algorithm because exact matching does not allow for sequencing errors or genetic variations. We introduce two novel extensions that make the technique applicable to short read alignment: a quality-aware backtracking algorithm that allows mismatches and favors high-quality alignments; and 'double indexing', a strategy to avoid excessive backtracking. The Bowtie aligner follows a policy similar to Maq's, in that it allows a small number of mismatches within the high-quality end of each read, and it places an upper limit on the sum of the quality values at mismatched alignment positions. Burrows-Wheeler indexing The BWT is a reversible permutation of the characters in a text. Although originally developed within the context of data compression, BWT-based indexing allows large texts to be searched efficiently in a small memory footprint. It has been applied to bioinformatics applications, including oligomer counting [20], whole-genome alignment [21], tiling microarray probe design [22], and Smith-Waterman alignment to a human-sized reference [23]. The Burrows-Wheeler transformation of a text T, BWT(T), is constructed as follows. The character $ is appended to T, where $ is not in T and is lexicographically less than all characters in T. The Burrows-Wheeler matrix of T is constructed as the matrix whose rows comprise all cyclic rotations of T$. The rows are then sorted lexicographically. BWT(T) is the sequence of characters in the rightmost column of the Burrows-Wheeler matrix (Figure 1a). BWT(T) has the same length as the original text T. Figure 1 Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column). This matrix has a property called 'last first (LF) mapping'. The ith occurrence of character X in the last column corresponds to the same text character as the ith occurrence of X in the first column. This property underlies algorithms that use the BWT index to navigate or search the text. Figure 1b illustrates UNPERMUTE, an algorithm that applies the LF mapping repeatedly to re-create T from BWT(T). The LF mapping is also used in exact matching. Because the matrix is sorted lexicographically, rows beginning with a given sequence appear consecutively. In a series of steps, the EXACTMATCH algorithm (Figure 1c) calculates the range of matrix rows beginning with successively longer suffixes of the query. At each step, the size of the range either shrinks or remains the same. When the algorithm completes, rows beginning with S0 (the entire query) correspond to exact occurrences of the query in the text. If the range is empty, the text does not contain the query. UNPERMUTE is attributable to Burrows and Wheeler [17] and EXACTMATCH to Ferragina and Manzini [18]. See Additional data file 1 (Supplementary Discussion 1) for details. Searching for inexact alignments EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches, which may be due to sequencing errors, genuine differences between reference and query organisms, or both. We introduce an alignment algorithm that conducts a backtracking search to quickly find alignments that satisfy a specified alignment policy. Each character in a read has a numeric quality value, with lower values indicating a higher likelihood of a sequencing error. Our alignment policy allows a limited number of mismatches and prefers alignments where the sum of the quality values at all mismatched positions is low. The search proceeds similarly to EXACTMATCH, calculating matrix ranges for successively longer query suffixes. If the range becomes empty (a suffix does not occur in the text), then the algorithm may select an already-matched query position and substitute a different base there, introducing a mismatch into the alignment. The EXACTMATCH search resumes from just after the substituted position. The algorithm selects only those substitutions that are consistent with the alignment policy and which yield a modified suffix that occurs at least once in the text. If there are multiple candidate substitution positions, then the algorithm greedily selects a position with a minimal quality value. Backtracking scenarios play out within the context of a stack structure that grows when a new substitution is introduced and shrinks when the aligner rejects all candidate alignments for the substitutions currently on the stack. See Figure 2 for an illustration of how the search might proceed. Figure 2 Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'. Boxed pairs of numbers denote ranges of matrix rows beginning with the suffix observed up to that point. A red X marks where the algorithm encounters an empty range and either aborts (as in EXACTMATCH) or backtracks (as in the inexact algorithm). A green check marks where the algorithm finds a nonempty range delimiting one or more occurrences of a reportable alignment for the query. In short, Bowtie conducts a quality-aware, greedy, randomized, depth-first search through the space of possible alignments. If a valid alignment exists, then Bowtie will find it (subject to the backtrack ceiling discussed in the following section). Because the search is greedy, the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality. The user may instruct Bowtie to continue searching until it can prove that any alignment it reports is 'best' in terms of number of mismatches (using the option --best). In our experience, this mode is two to three times slower than the default mode. We expect that the faster default mode will be preferred for large re-sequencing projects. The user may also opt for Bowtie to report all alignments up to a specified number (option -k) or all alignments with no limit on the number (option -a) for a given read. If in the course of its search Bowtie finds N possible alignments for a given set of substitutions, but the user has requested only K alignments where K < N, Bowtie will report K of the N alignments selected at random. Note that these modes can be much slower than the default. In our experience, for example, -k 1 is more than twice as fast as -k 2. Excessive backtracking The aligner as described so far can, in some cases, encounter sequences that cause excessive backtracking. This occurs when the aligner spends most of its effort fruitlessly backtracking to positions close to the 3' end of the query. Bowtie mitigates excessive backtracking with the novel technique of 'double indexing'. Two indices of the genome are created: one containing the BWT of the genome, called the 'forward' index, and a second containing the BWT of the genome with its character sequence reversed (not reverse complemented) called the 'mirror' index. To see how this helps, consider a matching policy that allows one mismatch in the alignment. A valid alignment with one mismatch falls into one of two cases according to which half of the read contains the mismatch. Bowtie proceeds in two phases corresponding to those two cases. Phase 1 loads the forward index into memory and invokes the aligner with the constraint that it may not substitute at positions in the query's right half. Phase 2 uses the mirror index and invokes the aligner on the reversed query, with the constraint that the aligner may not substitute at positions in the reversed query's right half (the original query's left half). The constraints on backtracking into the right half prevent excessive backtracking, whereas the use of two phases and two indices maintains full sensitivity. Unfortunately, it is not possible to avoid excessive backtracking fully when alignments are permitted to have two or more mismatches. In our experiments, we have observed that excessive backtracking is significant only when a read has many low-quality positions and does not align or aligns poorly to the reference. These cases can trigger in excess of 200 backtracks per read because there are many legal combinations of low-quality positions to be explored before all possibilities are exhausted. We mitigate this cost by enforcing a limit on the number of backtracks allowed before a search is terminated (default: 125). The limit prevents some legitimate, low-quality alignments from being reported, but we expect that this is a desirable trade-off for most applications. Phased Maq-like search Bowtie allows the user to select the number of mismatches permitted (default: two) in the high-quality end of a read (default: the first 28 bases) as well as the maximum acceptable quality distance of the overall alignment (default: 70). Quality values are assumed to follow the definition in PHRED [24], where p is the probability of error and Q = -10log p. Both the read and its reverse complement are candidates for alignment to the reference. For clarity, this discussion considers only the forward orientation. See Additional data file 1 (Supplementary Discussion 2) for an explanation of how the reverse complement is incorporated. The first 28 bases on the high-quality end of the read are termed the 'seed'. The seed consists of two halves: the 14 bp on the high-quality end (usually the 5' end) and the 14 bp on the low-quality end, termed the 'hi-half' and the 'lo-half', respectively. Assuming the default policy (two mismatches permitted in the seed), a reportable alignment will fall into one of four cases: no mismatches in seed (case 1); no mismatches in hi-half, one or two mismatches in lo-half (case 2); no mismatches in lo-half, one or two mismatches in hi-half (case 3); and one mismatch in hi-half, one mismatch in lo-half (case 4). All cases allow any number of mismatches in the nonseed part of the read and all cases are also subject to the quality distance constraint. The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices, as illustrated in Figure 3. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to find alignments for case 3: Phase 2 finds partial alignments with mismatches only in the hi-half and phase 3 attempts to extend those partial alignments into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. Figure 3 The three phases of the Bowtie algorithm for the Maq-like policy. A three-phase approach finds alignments for two-mismatch cases 1 to 4 while minimizing backtracking. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to find alignments for case 3: Phase 2 finds partial alignments with mismatches only in the hi-half, and phase 3 attempts to extend those partial alignments into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. Performance results We evaluated the performance of Bowtie using reads from the 1,000 Genomes project pilot (National Center for Biotechnology Information [NCBI] Short Read Archive:SRR001115). A total of 8.84 million reads, about one lane of data from an Illumina instrument, were trimmed to 35 bp and aligned to the human reference genome [NCBI build 36.3]. Unless specified otherwise, read data are not filtered or modified (besides trimming) from how they appear in the archive. This leads to about 70% to 75% of reads aligning somewhere to the genome. In our experience, this is typical for raw data from the archive. More aggressive filtering leads to higher alignment rates and faster alignment. All runs were performed on a single CPU. Bowtie speedups were calculated as a ratio of wall-clock alignment times. Both wall-clock and CPU times are given to demonstrate that input/output load and CPU contention are not significant factors. The time required to build the Bowtie index was not included in the Bowtie running times. Unlike competing tools, Bowtie can reuse a pre-computed index for the reference genome across many alignment runs. We anticipate most users will simply download such indices from a public repository. The Bowtie site [25] provides indices for current builds of the human, chimp, mouse, dog, rat, and Arabidopsis thaliana genomes, as well as many others. Results were obtained on two hardware platforms: a desktop workstation with 2.4 GHz Intel Core 2 processor and 2 GB of RAM; and a large-memory server with a four-core 2.4 GHz AMD Opteron processor and 32 GB of RAM. These are denoted 'PC' and 'server', respectively. Both PC and server run Red Hat Enterprise Linux AS release 4. Comparison to SOAP and Maq Maq is a popular aligner [1,4,5,26,27] that is among the fastest competing open source tools for aligning millions of Illumina reads to the human genome. SOAP is another open source tool that has been reported and used in short-read projects [6,28]. Table 1 presents the performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. The 'soap.contig' version of the SOAP binary was used. For comparison with SOAP, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values), and with '--maxns 5' to simulate SOAP's default policy of filtering out reads with five or more no-confidence bases. For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches in the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched read positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure that only the forward or mirror index is resident in memory at one time. The number of reads aligned indicates that SOAP (67.3%) and Bowtie -v 2 (67.4%) have comparable sensitivity. Of the reads aligned by either SOAP or Bowtie, 99.7% were aligned by both, 0.2% were aligned by Bowtie but not SOAP, and 0.1% were aligned by SOAP but not Bowtie. Maq (74.7%) and Bowtie (71.9%) also have roughly comparable sensitivity, although Bowtie lags by 2.8%. Of the reads aligned by either Maq or Bowtie, 96.0% were aligned by both, 0.1% were aligned by Bowtie but not Maq, and 3.9% were aligned by Maq but not Bowtie. Of the reads mapped by Maq but not Bowtie, almost all are due to a flexibility in Maq's alignment algorithm that allows some alignments to have three mismatches in the seed. The remainder of the reads mapped by Maq but not Bowtie are due to Bowtie's backtracking ceiling. Maq's documentation mentions that reads containing 'poly-A artifacts' can impair Maq's performance. Table 2 presents performance and sensitivity of Bowtie and Maq when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1, and the same observations about the relative sensitivity of Bowtie and Maq apply here. Read length and performance As sequencing technology improves, read lengths are growing beyond the 30-bp to 50-bp commonly seen in public databases today. Bowtie, Maq, and SOAP support reads of lengths up to 1,024, 63, and 60 bp, respectively, and Maq versions 0.7.0 and later support read lengths up to 127 bp. Table 3 shows performance results when the three tools are each used to align three sets of 2 M untrimmed reads, a 36-bp set, a 50-bp set and a 76-bp set, to the human genome on the server platform. Each set of 2 M is randomly sampled from a larger set (NCBI Short Read Archive: SRR003084 for 36-bp, SRR003092 for 50-bp, SRR003196 for 76-bp). Reads were sampled such that the three sets of 2 M have uniform per-base error rate, as calculated from per-base Phred qualities. All reads pass through Maq's 'catfilter'. Table 3 Varying read length using Bowtie, Maq and SOAP Length Program CPU time Wall clock time Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) 36 bp Bowtie 6 m 15 s 6 m 21 s 1,305 - 62.2 Maq 3 h 52 m 26 s 3 h 52 m 54 s 804 36.7× 65.0 Bowtie -v 2 4 m 55 s 5 m 00 s 1,138 - 55.0 SOAP 16 h 44 m 3 s 18 h 1 m 38 s 13,619 216× 55.1 50 bp Bowtie 7 m 11 s 7 m 20 s 1,310 - 67.5 Maq 2 h 39 m 56 s 2 h 40 m 9 s 804 21.8× 67.9 Bowtie -v 2 5 m 32 s 5 m 46 s 1,138 - 56.2 SOAP 48 h 42 m 4 s 66 h 26 m 53 s 13,619 691× 56.2 76 bp Bowtie 18 m 58 s 19 m 6 s 1,323 - 44.5 Maq 0.7.1 4 h 45 m 7 s 4 h 45 m 17 s 1,155 14.9× 44.9 Bowtie -v 2 7 m 35 s 7 m 40 s 1,138 - 31.7 The performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2 M untrimmed reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For each read length, the 2 M reads were randomly sampled from the FASTQ file downloaded from the Archive such that the average per-base error rate as measured by quality values was uniform across the three sets. All reads pass through Maq's "catfilter". Maq v0.7.1 was used for the 76-bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76-bp experiment because it does not support reads longer than 60 bp. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Bowtie is run both in its Maq-like default mode and in its SOAP-like '-v 2' mode. Bowtie is also given the '-z' option to ensure that only the forward or mirror index is resident in memory at one time. Maq v0.7.1 was used instead of Maq v0.6.6 for the 76-bp set because v0.6.6 cannot align reads longer than 63 bp. SOAP was not run on the 76-bp set because it does not support reads longer than 60 bp. The results show that Maq's algorithm scales better overall to longer read lengths than Bowtie or SOAP. However, Bowtie in SOAP-like '-v 2' mode also scales very well. Bowtie in its default Maq-like mode scales well from 36-bp to 50-bp reads but is substantially slower for 76-bp reads, although it is still more than an order of magnitude faster than Maq. Parallel performance Alignment can be parallelized by distributing reads across concurrent search threads. Bowtie allows the user to specify a desired number of threads (option -p); Bowtie then launches the specified number of threads using the pthreads library. Bowtie threads synchronize with each other when fetching reads, outputting results, switching between indices, and performing various forms of global bookkeeping, such as marking a read as 'done'. Otherwise, threads are free to operate in parallel, substantially speeding up alignment on computers with multiple processor cores. The memory image of the index is shared by all threads, and so the footprint does not increase substantially when multiple threads are used. Table 4 shows performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Table 4 Bowtie parallel alignment performance CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Speedup Bowtie, one thread 18 m 19 s 18 m 46 s 28.3 1,353 - Bowtie, two threads 20 m 34 s 10 m 35 s 50.1 1,363 1.77× Bowtie, four threads 23 m 9 s 6 m 1 s 88.1 1,384 3.12× Performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Index building Bowtie uses a flexible indexing algorithm [29] that can be configured to trade off between memory usage and running time. Table 5 illustrates this trade-off when indexing the entire human reference genome (NCBI build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. Table 5 Bowtie index building performance Physical memory target (GB) Actual peak memory footprint (GB) Wall clock time 16 14.4 4 h 36 m 8 5.84 5 h 5 m 4 3.39 7 h 40 m 2 1.39 21 h 30 m Performance results and memory footprints of running the Bowtie v0.9.6 indexer on the whole human genome (National Center for Biotechnology Information build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. See Additional data file 1 (Supplementary Discussion 3 and Supplementary Table 1) for details. The reported times compare favorably with alignment times of competing tools that perform indexing during alignment. Less than 5 hours is required for Bowtie to both build and query a whole-human index with 8.84 million reads from the 1,000 Genome project (NCBI Short Read Archive:SRR001115) on a server, more than sixfold faster than the equivalent Maq run. The bottom-most row illustrates that the Bowtie indexer, with appropriate arguments, is memory-efficient enough to run on a typical workstation with 2 GB of RAM. Additional data file 1 (Supplementary discussions 3 and 4) explains the algorithm and the contents of the resulting index. Software Bowtie is written in C++ and uses the SeqAn library [30]. The converter to the Maq mapping format uses code from Maq. Discussion Bowtie exhibits a large performance advantage over both Maq and SOAP when mapping reads to the human genome. Bowtie's sensitivity in terms of reads aligned is comparable to that of SOAP and slightly less than Maq's, although the user may use command-line options to trade slower running time for greater sensitivity. Unlike SOAP, Bowtie's 1.3 GB memory footprint allows it to run on a typical PC with 2 GB of RAM. Bowtie aligns Illumina reads to the human genome at a rate of over 25 million reads per hour. Multiple processor cores can run parallel Bowtie threads to achieve even greater alignment speed; experiments show a speed up of 3.12 for four threads on a typical Opteron server. Unlike many other short-read aligners, Bowtie creates a permanent index of the reference that may be re-used across alignment runs. Building the index is fast - Bowtie outperforms competing tools when aligning lanes of Illumina reads even with index construction time included. At 2.2 GB for the human genome, the on-disk size of a Bowtie index is small enough to distribute over the internet. The Bowtie website hosts pre-built indices for the human genome and several other model organisms including chimp, dog, rat, mouse, and chicken. Bowtie's speed and small memory footprint are due chiefly to its use of the Burrows-Wheeler index in combination with the novel, quality-aware, backtracking algorithm introduced here. Double indexing is used to avoid the performance penalty of excessive backtracking. Bowtie supports standard FASTQ and FASTA input formats, and comes with a conversion program that allows Bowtie output to be used with Maq's consensus generator and single nucleotide polymorphism caller. Bowtie does not yet support paired-end alignment or alignments with insertions or deletions, although both improvements are planned for the future. Paired-end alignment is not difficult to implement in Bowtie's framework, and we expect that Bowtie's performance advantage will be comparable to, though perhaps somewhat less than, that of unpaired alignment mode. Support for insertions and deletions is also a conceptually straightforward addition. Bowtie is free, open source software available from the Bowtie website [25]. Abbreviations bp: base pair; BWT: Burrows-Wheeler transform; CPU: central processing unit; FM: full-text minute-space; GB: gigabytes; LF: last first; NCBI: National Center for Biotechnology Information. Authors' contributions BL developed the algorithms, collected results, and wrote most of the software. CT wrote some of the software. CT and MP contributed to discussions on algorithms. BL, CT, MP, and SLS wrote the manuscript. Additional data files The following additional data are included with the online version of this article: a document containing supplementary discussions, tables, and figures pertaining to algorithms for navigating the Burrows-Wheeler transform, the full four-phase version of the alignment algorithm that incorporates the reverse-complement, index construction, and components of the index (Additional data file 1). Supplementary Material Additional data file 1 Presented are supplementary discussions, tables, and figures pertaining to algorithms for navigating the Burrows-Wheeler transform, the full four-phase version of the alignment algorithm that incorporates the reverse-complement, index construction, and components of the index. Click here for file
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              A scaling normalization method for differential expression analysis of RNA-seq data

              Background The transcriptional architecture is a complex and dynamic aspect of a cell's function. Next generation sequencing of steady state RNA (RNA-seq) gives unprecedented detail about the RNA landscape within a cell. Not only can expression levels of genes be interrogated without specific prior knowledge, but comparisons of expression levels between genes within a sample can be made. It has also been demonstrated that splicing variants [1,2] and single nucleotide polymorphisms [3] can be detected through sequencing the transcriptome, opening up the opportunity to interrogate allele-specific expression and RNA editing. An important aspect of dealing with the vast amounts of data generated from short read sequencing is the processing methods used to extract and interpret the information. Experience with microarray data has repeatedly shown that normalization is a critical component of the processing pipeline, allowing accurate estimation and detection of differential expression (DE) [4]. The aim of normalization is to remove systematic technical effects that occur in the data to ensure that technical bias has minimal impact on the results. However, the procedure for generating RNA-seq data is fundamentally different from that for microarray data, so the normalization methods used are not directly applicable. It has been suggested that 'One particularly powerful advantage of RNA-seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets' [5]. We demonstrate here that the reality of RNA-seq data analysis is not this simple; normalization is often still an important consideration. Current RNA-seq analysis methods typically standardize data between samples by scaling the number of reads in a given lane or library to a common value across all sequenced libraries in the experiment. For example, several authors have modeled the observed counts for a gene with a mean that includes a factor for the total number of reads [6-8]. These approaches can differ in the distributional assumptions made for inferring differences, but the consensus is to use the total number of reads in the model. Similarly, for LONG-SAGE-seq data, 't Hoen et al. [9] use the square root of scaled counts or the beta-binomial model of Vencio et al. [10], both of which use the total number of observed tags. For normalization, Mortazavi et al. [11] adjust their counts to reads per kilobase per million mapped (RPKM), suggesting it 'facilitates transparent comparison of transcript levels both within and between samples.' By contrast, Cloonan et al. [12] log-transform the gene length-normalized count data and apply standard microarray analysis techniques (quantile normalization and moderated t-statistics). Sultan et al. [2] normalize read counts by the 'virtual length' of the gene, the number of unique 27-mers in exonic sequence, as well as by the total number of reads. Recently, Balwierz et al. [13] illustrated that deepCAGE (deep sequencing cap analysis of gene expression) data follow an approximate power law distribution and proposed a normalization strategy that equates the read count distributions across samples. Scaling to library size as a form of normalization makes intuitive sense, given it is expected that sequencing a sample to half the depth will give, on average, half the number of reads mapping to each gene. We believe this is appropriate for normalizing between replicate samples of an RNA population. However, library size scaling is too simple for many biological applications. The number of tags expected to map to a gene is not only dependent on the expression level and length of the gene, but also the composition of the RNA population that is being sampled. Thus, if a large number of genes are unique to, or highly expressed in, one experimental condition, the sequencing 'real estate' available for the remaining genes in that sample is decreased. If not adjusted for, this sampling artifact can force the DE analysis to be skewed towards one experimental condition. Current analysis methods [6,11] have not accounted for this proportionality property of the data explicitly, potentially giving rise to higher false positive rates and lower power to detect true differences. The fundamental issue here is the appropriate metric of expression to compare across samples. The standard procedure is to compute the proportion of each gene's reads relative to the total number of reads and compare that across all samples, either by transforming the original data or by introducing a constant into a statistical model. However, since different experimental conditions (for example, tissues) express diverse RNA repertoires, we cannot always expect the proportions to be directly comparable. Furthermore, we argue that in the discovery of biologically meaningful changes in expression, it should be considered undesirable to have under- or oversampling effects (discussed further below) guiding the DE calls. The normalization method presented below uses the raw data to estimate appropriate scaling factors that can be used in downstream statistical analysis procedures, thus accounting for the sampling properties of RNA-seq data. Results and discussion A hypothetical scenario Estimated normalization factors should ensure that a gene with the same expression level in two samples is not detected as DE. To further highlight the need for more sophisticated normalization procedures in RNA-seq data, consider a simple thought experiment. Imagine we have a sequencing experiment comparing two RNA populations, A and B. In this hypothetical scenario, suppose every gene that is expressed in B is expressed in A with the same number of transcripts. However, assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B. Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes. Therefore, the correct normalization would adjust sample A by a factor of 2. The hypothetical example above highlights the notion that the proportion of reads attributed to a given gene in a library depends on the expression properties of the whole sample rather than just the expression level of that gene. Obviously, the above example is artificial. However, there are biological and even technical situations where such a normalization is required. For example, if an RNA sample is contaminated, the reads that represent the contamination will take away reads from the true sample, thus dropping the number of reads of interest and offsetting the proportion for every gene. However, as we demonstrate, true biological differences in RNA composition between samples will be the main reason for normalization. Sampling framework A more formal explanation for the requirement of normalization uses the following framework. Define Y gk as the observed count for gene g in library k summarized from the raw reads, μ gk as the true and unknown expression level (number of transcripts), L g as the length of gene g and N k as total number of reads for library k. We can model the expected value of Y gk as: S k represents the total RNA output of a sample. The problem underlying the analysis of RNA-seq data is that while N k is known, S k is unknown and can vary drastically from sample to sample, depending on the RNA composition. As mentioned above, if a population has a larger total RNA output, then RNA-seq experiments will under-sample many genes, relative to another sample. At this stage, we leave the variance in the above model for Y gk unspecified. Depending on the experimental situation, Poisson seems appropriate for technical replicates [6,7] and Negative Binomial may be appropriate for the additional variation observed from biological replicates [14]. It is also worth noting that, in practice, the L g is generally absorbed into the μ gk parameter and does not get used in the inference procedure. However, it has been well established that gene length biases are prominent in the analysis of gene expression [15]. The trimmed mean of M-values normalization method The total RNA production, S k , cannot be estimated directly, since we do not know the expression levels and true lengths of every gene. However, the relative RNA production of two samples, fk = Sk/Sk' , essentially a global fold change, can more easily be determined. We propose an empirical strategy that equates the overall expression levels of genes between samples under the assumption that the majority of them are not DE. One simple yet robust way to estimate the ratio of RNA production uses a weighted trimmed mean of the log expression ratios (trimmed mean of M values (TMM)). For sequencing data, we define the gene-wise log-fold-changes as: and absolute expression levels: To robustly summarize the observed M values, we trim both the M values and the A values before taking the weighted average. Precision (inverse of the variance) weights are used to account for the fact that log fold changes (effectively, a log relative risk) from genes with larger read counts have lower variance on the logarithm scale. See Materials and methods for further details. For a two-sample comparison, only one relative scaling factor (f k ) is required. It can be used to adjust both library sizes (divide the reference by and multiply non-reference by ) in the statistical analysis (for example, Fisher's exact test; see Materials and methods for more details). Normalization factors across several samples can be calculated by selecting one sample as a reference and calculating the TMM factor for each non-reference sample. Similar to two-sample comparisons, the TMM normalization factors can be built into the statistical model used to test for DE. For example, a Poisson model would modify the observed library size to an effective library size, which adjusts the modeled mean (for example, using an additional offset in a generalized linear model; see Materials and methods for further details). A liver versus kidney data set We applied our method to a publicly available transcriptional profiling data set comparing several technical replicates of a liver and kidney RNA source [6]. Figure 1a shows the distribution of M values between two technical replicates of the kidney sample after the standard normalization procedure of accounting for the total number of reads. The distribution of M values for these technical replicates is concentrated around zero. However, Figure 1b shows that log ratios between a liver and kidney sample are significantly offset towards higher expression in kidney, even after accounting for the total number of reads. Also highlighted (green line) is the distribution of observed M values for a set of housekeeping genes, showing a significant shift away from zero. If scaling to the total number of reads appropriately normalized RNA-seq data, then such a shift in the log-fold-changes is not expected. The explanation for this bias is straightforward. The M versus A plot in Figure 1c illustrates that there exists a prominent set of genes with higher expression in liver (black arrow). As a result, the distribution of M values (liver to kidney) is skewed in the negative direction. Since a large amount of sequencing is dedicated to these liver-specific genes, there is less sequencing available for the remaining genes, thus proportionally distorting the M values (and therefore, the DE calls) towards being kidney-specific. Figure 1 Normalization is required for RNA-seq data. Data from [6] comparing log ratios of (a) technical replicates and (b) liver versus kidney expression levels, after adjusting for the total number of reads in each sample. The green line shows the smoothed distribution of log-fold-changes of the housekeeping genes. (c) An M versus A plot comparing liver and kidney shows a clear offset from zero. Green points indicate 545 housekeeping genes, while the green line signifies the median log-ratio of the housekeeping genes. The red line shows the estimated TMM normalization factor. The smear of orange points highlights the genes that were observed in only one of the liver or kidney tissues. The black arrow highlights the set of prominent genes that are largely attributable for the overall bias in log-fold-changes. The application of TMM normalization to this pair of samples results in a normalization factor of 0.68 (-0.56 on log2 scale; shown by the red line in Figure 1b, c), reflecting the under-sampling of the majority of liver genes. The TMM factor is robust for lower coverage data where more genes with zero counts may be expected (Figure S1a in Additional file 1) and is stable for reasonable values of the trim parameters (Figure S1b in Additional file 1). Using TMM normalization in a statistical test for DE (see Materials and methods) results in a similar number of genes significantly higher in liver (47%) and kidney (53%). By contrast, the standard normalization (to the total number of reads as originally used in [6]) results in the majority of DE genes being significantly higher in kidney (77%). Notably, less than 70% of the genes identified as DE using standard normalization are still detected after TMM normalization (Table 1). In addition, we find the log-fold-changes for a large set of housekeeping genes (from [16]) are, on average, offset from zero very close to the estimated TMM factor, thus giving credibility to our robust estimation procedure. Furthermore, using the non-adjusted testing procedure, 8% and 70% of the housekeeping genes are significantly up-regulated in liver and kidney, respectively. After TMM adjustment, the proportion of DE housekeeping genes changes to 26% and 41%, respectively, which is a lower total number and more symmetric between the two tissues. Of course, the bias in log-ratios observed in RNA-seq data is not observed in microarray data (from the same sources of RNA), assuming the microarray data have been appropriately normalized (Figure S2 in Additional file 1). Taken together, these results indicate a critical role for the normalization of RNA-seq data. Table 1 Number of genes called differentially expressed between liver and kidney at a false discovery rate <0.001 using different normalization methods Library size normalization TMM normalization Overlap  Higher in liver 2,355 4,293 2,355  Higher in kidney 8,332 4,935 4,935  Total 10,867 9,228 7,290 House keeping genes (545)  Higher in liver 45 137 45  Higher in kidney 376 220 220  Total 421 357 265 TMM, trimmed mean of M values. Other datasets The global shift in log-fold-change caused by RNA composition differences occurs at varying degrees in other RNA-seq datasets. For example, an M versus A plot for the Cloonan et al. [12] dataset (Figure S3 in Additional file 1) gives an estimated TMM scaling factor of 1.04 between the two samples (embryoid bodies versus embryonic stem cells), sequenced on the SOLiD™ system. The M versus A plot for this dataset also highlights an interesting set of genes that have lower overall expression, but higher in embryoid bodies. This explains the positive shift in log-fold-changes for the remaining genes. The TMM scale factor appears close to the median log-fold-changes amongst a set of approximately 500 mouse housekeeping genes (from [17]). As another example, the Li et al. [18] dataset, using the llumina 1G Genome Analyzer, exhibits a shift in the overall distribution of log-fold-changes and gives a TMM scaling factor of 0.904 (Figure S4 in Additional file 1). However, there are sequencing-based datasets that have quite similar RNA outputs and may not need a significant adjustment. For example, the small-RNA-seq data from Kuchenbauer et al. [19] exhibits only a modest bias in the log-fold-changes (Figure S5 in Additional file 1). Spike-in controls have the potential to be used for normalization. In this scenario, small but known amounts of RNA from a foreign organism are added to each sample at a specified concentration. In order to use spike-in controls for normalization, the ratio of the concentration of the spike to the sample must be kept constant throughout the experiment. In practice, this is difficult to achieve and small variations will lead to biased estimation of the normalization factor. For example, using the spiked-in DNA from the Mortazavi et al. data set [11] would lead to unrealistic normalization factor estimates (Figure S6 in Additional file 1). As with microarrays, it is generally more robust to carefully estimate normalization factors using the experimental data (for example, [20]). Simulation studies To investigate the range in utility of the TMM normalization method, we developed a simulation framework to study the effects of RNA composition on DE analysis of RNA-seq data. To start, we simulate data from just two libraries. We include parameters for the number of genes expressed uniquely to each sample, and parameters for the proportion, magnitude and direction of differentially expressed genes between samples (see Material and methods). Figure 2a shows an M versus A plot for a typical simulation including unique genes and DE genes. By simulating different total RNA outputs, the majority of non-DE genes have log-fold-changes that are offset from zero. In this case, using TMM normalization to account for the underlying RNA composition leads to a lower number of false detections using a Fisher's exact test (Figure 2b). Repeating the simulation a large number of times across a wide range of simulation parameters, we find good agreement when comparing the true normalization factors from the simulation with those estimated using TMM normalization (Figure S7 in Additional file 1). Figure 2 Simulations show TMM normalization is robust and outperforms library size normalization. (a) An example of the simulation results showing the need for normalization due to genes expressed uniquely in one sample (orange dots) and asymmetric DE (blue dots). (b) A lower false positive rate is observed using TMM normalization compared with standard normalization. To further compare the performance of the TMM normalization with previously used methods in the context of the DE analysis of RNA-seq data, we extend the above simulation to include replicate sequencing runs. Specifically, we compare three published methods: length-normalized count data that have been log transformed and quantile normalized, as implemented by Cloonan et al. [12], a Poisson regression [6] with library size and TMM normalization and a Poisson exact test [8] with library size and TMM normalization. We do not compare directly with the normalization proposed in Balwierz et al. [13] since the liver and kidney dataset do not appear to follow a power law distribution and have quite distinct count distributions (Figure S8 in Additional file 1). Furthermore, in light of the RNA composition bias we observe, it is not clear whether equating the count distributions across samples is the most logical procedure. In addition, we do not directly compare the normalization to virtual length [2] or RPKM [11] normalization, since a statistical analysis of the transformed data was not mentioned. However, we illustrate with M versus A plots that their normalization does not completely remove RNA composition bias (Figures S9 and S10 in Additional file 1). For the simulation, we used an empirical joint distribution of gene lengths and counts, since the Cloonan et al. procedure requires both. We made the simulation data Poisson-distributed to mimic technical replicates (Figure S11 in Additional file 1). Figure 3a shows false discovery plots amongst the genes that are common to both conditions, where we have introduced 10% unique-to-group expression for the first condition, 5% DE at a 2-fold level, 80% of which is higher in the first condition. The approach that uses methodology developed for microarray data performs uniformly worse, as one might expect since the distributional assumptions for these methods are quite different. Among the remaining methods (Poisson likelihood ratio statistic, Poisson exact statistic), performance is very similar; again, the TMM normalization makes a dramatic improvement to both. Figure 3 False discovery plots comparing several published methods. The red line depicts the length-normalized moderated t-statistic analysis. The solid and dashed lines show the library size normalized and TMM normalized Poisson model analysis, respectively. The blue and black lines represent the LR test and exact test, respectively. It can be seen that the use of TMM normalization results in a much lower false discovery rate. Conclusions TMM normalization is a simple and effective method for estimating relative RNA production levels from RNA-seq data. The TMM method estimates scale factors between samples that can be incorporated into currently used statistical methods for DE analysis. We have shown that normalization is required in situations where the underlying distribution of expressed transcripts between samples is markedly different. The assumptions behind the TMM method are similar to the assumptions commonly made in microarray normalization procedures such as lowess normalization [21] and quantile normalization [22]. Therefore, adequately normalized array data do not show the effects of different total RNA output between samples. In essence, both microarray and TMM normalization assume that the majority of genes, common to both samples, are not differentially expressed. Our simulation studies indicate that the TMM method is robust against deviations to this assumption up to about 30% of DE in one direction. For many applications, this assumption will not be violated. One notable difference with TMM normalization for RNA-seq is that the data themselves do not need to be modified, unlike microarray normalization and some implemented RNA-seq strategies [11,12]. Here, the estimated normalization factors are used directly in the statistical model used to test for DE, while preserving the sampling properties of the data. Because the data themselves are not modified, it can be used in further applications such as comparing expression between genes. Normalization will be crucial in many other applications of high throughput sequencing where the DNA or RNA populations being compared differ in their composition. For example, chromatin immunoprecipitation (ChIP) followed by next generation sequencing (ChIP-seq) may require a similar adjustment to compare between samples containing different repertoires of bound targets. Interestingly, the PeakSeq method [23] uses a linear regression on binned counts across the genome to estimate a scaling factor between two ChIP populations to account for the different coverages. This is similar in principle to what is proposed here, but possibly less robust. We demonstrated that there are numerous biological situations where a composition adjustment will be required. In addition, technical artifacts that are not fully captured by the library size adjustment can be accounted for with the empirical adjustment. Furthermore, it is not clear that DNA spiked-in at known concentrations will allow robust estimation of normalization factors. Similar to previous high throughput technologies such as microarrays, normalization is an essential step for inferring true differences in expression between samples. The number of reads for a gene is dependent not only on the gene's expression level and length, but also on the population of RNA from which it originates. We present a straightforward and effective empirical method for normalization of RNA-seq data. Materials and methods TMM normalization details A trimmed mean is the average after removing the upper and lower x% of the data. The TMM procedure is doubly trimmed, by log-fold-changes (sample k relative to sample r for gene g) and by absolute intensity (A g ). By default, we trim the M g values by 30% and the A g values by 5%, but these settings can be tailored to a given experiment. The software also allows the user to set a lower bound on the A value, for instances such as the Cloonan et al. dataset (Figure S1 in Additional file 1). After trimming, we take a weighted mean of M g , with weights as the inverse of the approximate asymptotic variances (calculated using the delta method [24]). Specifically, the normalization factor for sample k using reference sample r is calculated as: The cases where Y gk = 0 or Y gr = 0 are trimmed in advance of this calculation since log-fold-changes cannot be calculated; G* represents the set of genes with valid M g and A g values and not trimmed, using the percentages above. It should be clear that . As Figure 2a indicates, the variances of the M values at higher total count are lower. Within a library, the vector of counts is multinomial distributed and any individual gene is binomial distributed with a given library size and proportion. Using the delta method, one can calculate an approximate variance for the M g , as is commonly done with log relative risk, and the inverse of these is used to weight the average. We compared the weighted with the unweighted trimmed mean as well as an alternative robust estimator (robust linear model) over a range of simulation parameters, as shown in Figure S4 in Additional file 1. Housekeeping genes Human housekeeping genes, as described in [16], were downloaded from [25] and matched to the Ensembl gene identifiers using the Bioconductor [26] biomaRt package [27]. Similarly, mouse housekeeping genes were taken to be the approximately 500 genes with lowest coefficient of variation, as calculated by de Jonge et al. [17]. Statistical testing For a two-library comparison, we use the sage.test function from the CRAN statmod package [28] to calculate a Fisher exact P-value for each gene. To apply TMM normalization, we replace the original library sizes with 'effective' library sizes. For two libraries, the effective library sizes are calculated by multiplying/dividing the square root of the estimated normalization factor with the original library size. For comparisons with technical replicates, we followed the analysis procedure used in the Marioni et al. study [6]. Briefly, it is assumed that the counts mapping to a gene are Poisson-distributed, according to: where represents the fraction of total reads for gene g in experimental condition z k . Their analysis utilizes an offset to account for the library size and a likelihood ratio (LR) statistic to test for differences in expression between libraries (that is, H0:μ g1 = μ g2). In order to use TMM normalization, we augment the original offset with the estimated normalization factor. The same LR testing framework is then used to calculate P-values for DE between tissues. We modified this analysis to use an exact Poisson test for testing the difference between two replicated groups. The strategy is similar in principle to the Fisher's exact test: conditioning on the total count, we calculated the probability of observing group counts as or more extreme than what we actually observed. The total and group total counts are all Poisson distributed. We re-implemented the method from Cloonan et al. [12] for the analysis of simulated data using a custom R [29] script. Simulation details The simulation is set up to sample a dataset from a given empirical distribution of read counts (that is, from a distribution of observed Y g ). The mean is calculated from the sampled read counts divided by the sum S k and multiplied by a specified library size N k (according to the model). The simulated data are then randomly sampled from a Poisson distribution, given the mean. We have parameters specifying the number of genes common to both libraries and the number of genes unique to each sample. Additional parameters specify the amount, direction and magnitude of DE as well as the depth of sequencing (that is, range of total numbers of reads). Since we have inserted known differentially expressed genes, we can rank genes according to various statistics and plot the number of false discoveries as a function of the ranking. Table S1 in Additional file 1 gives the parameter settings used for the simulations presented in Figures 2 and 3. Software Software implementing our method was released within the edgeR package [30] in version 2.5 of Bioconductor [26] and is available from [31]. Scripts and data for our analyses, including the simulation framework, have been made available from [32]. Abbreviations ChIP: chromatin immunoprecipation; DE: differential expression; LR: likelihood ratio; RPKM: reads per kilobase per million mapped; TMM: trimmed mean of M values. Authors' contributions MDR and AO conceived of the idea, analyzed the data and wrote the paper. Supplementary Material Additional file 1 A Word document with supplementary materials, including 11 supplementary figures and one supplementary table. Click here for file
                Bookmark

                Author and article information

                Contributors
                antoine.molaro@uca.fr
                jsarthy@fredhutch.org
                Journal
                Nat Commun
                Nat Commun
                Nature Communications
                Nature Publishing Group UK (London )
                2041-1723
                20 January 2021
                20 January 2021
                2021
                : 12
                : 490
                Affiliations
                [1 ]GRID grid.4280.e, ISNI 0000 0001 2180 6431, The Cancer Science Institute of Singapore, , National University of Singapore, ; Singapore, Singapore
                [2 ]GRID grid.270240.3, ISNI 0000 0001 2180 1622, Clinical Research Division, , Fred Hutchinson Cancer Research Center, ; Seattle, WA USA
                [3 ]GRID grid.270240.3, ISNI 0000 0001 2180 1622, Computational Biology Program, Public Health Sciences Division, , Fred Hutchinson Cancer Research Center, ; Seattle, WA USA
                [4 ]GRID grid.270240.3, ISNI 0000 0001 2180 1622, Basic Sciences Division, , Fred Hutchinson Cancer Research Center, ; Seattle, WA USA
                [5 ]GRID grid.34477.33, ISNI 0000000122986657, Department of Genome Sciences, , University of Washington, ; Seattle, WA USA
                [6 ]GRID grid.270240.3, ISNI 0000 0001 2180 1622, Howard Hughes Medical Institute, , Fred Hutchinson Cancer Research Center, ; Seattle, WA USA
                [7 ]GRID grid.494717.8, ISNI 0000000115480420, Present Address: Genetics, Reproduction and Development (GReD) Institute, , Université Clermont Auvergne, ; Clermont-Ferrand, France
                Author information
                http://orcid.org/0000-0003-4357-1404
                http://orcid.org/0000-0002-8046-1063
                http://orcid.org/0000-0002-7621-8685
                http://orcid.org/0000-0002-3756-3793
                http://orcid.org/0000-0001-5244-7865
                Article
                20707
                10.1038/s41467-020-20707-x
                7817690
                33473122
                667337d6-8235-4155-9adf-3de3300aed1d
                © The Author(s) 2021

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                : 17 September 2020
                : 9 December 2020
                Categories
                Article
                Custom metadata
                © The Author(s) 2021

                Uncategorized
                cancer epigenetics,cancer genomics
                Uncategorized
                cancer epigenetics, cancer genomics

                Comments

                Comment on this article