3
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Downregulation of Three Novel miRNAs in the Lymph Nodes of Sheep Immunized With the Brucella suis Strain 2 Vaccine

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Ovine and caprine brucellosis, both caused by Brucella melitensis, lead to substantial economic losses in the animal industry and health problems in human populations. Brucella suis strain 2 ( B.suis S2), as a live attenuated vaccine, is used extensively in China to prevent brucellosis. It has been proven that microRNA (miRNAs) are involved in the immunopathogenesis of brucellosis; however, the miRNA-driven mechanism of immune response to B.suis S2 in vivo remains unknown. To determine which new miRNAs are involved in the host immune response to B.suis S2 and elucidate the function of these miRNAs, we performed a comprehensive analysis of miRNA expression profiles in sheep immunized with B.suis S2 using the high-throughput sequencing approach. The submandibular lymphatic nodes from sheep seropositive for Brucella were collected at 7, 14, 21, 30, 60 and 90 days post-immunization. MiRNA sequencing analysis revealed that 282 differentially expressed miRNAs (|log 2 fold-change |>0.5 and p < 0.05) were significantly enriched in the immune pathways, including the NF-kappa B signaling pathway, B cell receptor signaling pathway, p53 signaling pathway and complement and coagulation cascades. Increasing the threshold to |log 2 fold change|>1 and p < 0.01 revealed 48 differentially expressed miRNAs, 31 of which were novel miRNAs. Thirteen of these novel miRNAs, which were differentially expressed for at least two time points, were detected via RT-qPCR assays. The novel_229, novel_609, novel_973 and oar-miR-181a assessed by RT-qPCR were detectable and consistent with the expression patterns obtained by miRNA sequencing. Functional analyses of these miRNAs demonstrated that their target genes participated in the immune response pathways, including the innate and adaptive immunity pathways. The immune-related target genes of novel_229 included ENSOARG00000000649 and TMED1, as well as LCN2, PDPK1 and LPO were novel_609 target genes. The immune-related target genes of novel_973 included C6orf58, SPPL3, BPIFB1, ENSOARG00000021083, MPTX1, CCL28, FGB, IDO1, OLR1 and ENSOARG00000020393. The immune-related target genes of oar-miR-181a included ENSOARG00000002722, ARHGEF2, MFAP4 and DOK2. These results will deepen our understanding of the host miRNA-driven defense mechanism in sheep immunized with B.suis S2 vaccine, and provide the valuable information for optimizing vaccines and developing molecular diagnostic targets.

          Related collections

          Most cited references51

          • Record: found
          • Abstract: found
          • Article: not found

          Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

          Rationale Improvements in the efficiency of DNA sequencing have both broadened the applications for sequencing and dramatically increased the size of sequencing datasets. Technologies from Illumina (San Diego, CA, USA) and Applied Biosystems (Foster City, CA, USA) have been used to profile methylation patterns (MeDIP-Seq) [1], to map DNA-protein interactions (ChIP-Seq) [2], and to identify differentially expressed genes (RNA-Seq) [3] in the human genome and other species. The Illumina instrument was recently used to re-sequence three human genomes, one from a cancer patient and two from previously unsequenced ethnic groups [4-6]. Each of these studies required the alignment of large numbers of short DNA sequences ('short reads') onto the human genome. For example, two of the studies [4,5] used the short read alignment tool Maq [7] to align more than 130 billion bases (about 45× coverage) of short Illumina reads to a human reference genome in order to detect genetic variations. The third human re-sequencing study [6] used the SOAP program [8] to align more than 100 billion bases to the reference genome. In addition to these projects, the 1,000 Genomes project is in the process of using high-throughput sequencing instruments to sequence a total of about six trillion base pairs of human DNA [9]. With existing methods, the computational cost of aligning many short reads to a mammalian genome is very large. For example, extrapolating from the results presented here in Tables 1 and 2, one can see that Maq would require more than 5 central processing unit (CPU)-months and SOAP more than 3 CPU-years to align the 140 billion bases from the study by Ley and coworkers [5]. Although using Maq or SOAP for this purpose has been shown to be feasible by using multiple CPUs, there is a clear need for new tools that consume less time and computational resources. Table 1 Bowtie alignment performance versus SOAP and Maq Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) Bowtie -v 2 Server 15 m 7 s 15 m 41 s 33.8 1,149 - 67.4 SOAP 91 h 57 m 35 s 91 h 47 m 46 s 0.10 13,619 351× 67.3 Bowtie PC 16 m 41 s 17 m 57 s 29.5 1,353 - 71.9 Maq 17 h 46 m 35 s 17 h 53 m 7 s 0.49 804 59.8× 74.7 Bowtie Server 17 m 58 s 18 m 26 s 28.8 1,353 - 71.9 Maq 32 h 56 m 53 s 32 h 58 m 39 s 0.27 804 107× 74.7 The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit. Table 2 Bowtie alignment performance versus Maq with filtered read set Platform CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Bowtie speed up Reads aligned (%) Bowtie PC 16 m 39 s 17 m 47 s 29.8 1,353 - 74.9 Maq 11 h 15 m 58 s 11 h 22 m 2 s 0.78 804 38.4× 78.0 Bowtie Server 18 m 20 s 18 m 46 s 28.3 1,352 - 74.9 Maq 18 h 49 m 7 s 18 h 50 m 16 s 0.47 804 60.2× 78.0 Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Maq and SOAP take the same basic algorithmic approach as other recent read mapping tools such as RMAP [10], ZOOM [11], and SHRiMP [12]. Each tool builds a hash table of short oligomers present in either the reads (SHRiMP, Maq, RMAP, and ZOOM) or the reference (SOAP). Some employ recent theoretical advances to align reads quickly without sacrificing sensitivity. For example, ZOOM uses 'spaced seeds' to significantly outperform RMAP, which is based on a simpler algorithm developed by Baeza-Yaetes and Perleberg [13]. Spaced seeds have been shown to yield higher sensitivity than contiguous seeds of the same length [14,15]. SHRiMP employs a combination of spaced seeds and the Smith-Waterman [16] algorithm to align reads with high sensitivity at the expense of speed. Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm to align reads. Bowtie uses a different and novel indexing strategy to create an ultrafast, memory-efficient short read aligner geared toward mammalian re-sequencing. In our experiments using reads from the 1,000 Genomes project, Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions (see Tables 1 and 2). Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome. The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM. The index is small enough to be distributed over the internet and to be stored on disk and re-used. Multiple processor cores can be used simultaneously to achieve even greater alignment speed. We have used Bowtie to align 14.3× coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores. Bowtie makes a number of compromises to achieve this speed, but these trade-offs are reasonable within the context of mammalian re-sequencing projects. If one or more exact matches exist for a read, then Bowtie is guaranteed to report one, but if the best match is an inexact one then Bowtie is not guaranteed in all cases to find the highest quality alignment. With its highest performance settings, Bowtie may fail to align a small number of reads with valid alignments, if those reads have multiple mismatches. If the stronger guarantees are desired, Bowtie supports options that increase accuracy at the cost of some performance. For instance, the '--best' option will guarantee that all alignments reported are best in terms of minimizing mismatches in the seed portion of the read, although this option incurs additional computational cost. With its default options, Bowtie's sensitivity measured in terms of reads aligned is equal to SOAP's and somewhat less than Maq's. Command line options allow the user to increase sensitivity at the cost of greater running time, and to enable Bowtie to report multiple hits for a read. Bowtie can align reads as short as four bases and as long as 1,024 bases. The input to a single run of Bowtie may comprise a mixture of reads with different lengths. Bowtie description and results Bowtie indexes the reference genome using a scheme based on the Burrows-Wheeler transform (BWT) [17] and the FM index [18,19]. A Bowtie index for the human genome fits in 2.2 GB on disk and has a memory footprint of as little as 1.3 GB at alignment time, allowing it to be queried on a workstation with under 2 GB of RAM. The common method for searching in an FM index is the exact-matching algorithm of Ferragina and Manzini [18]. Bowtie does not simply adopt this algorithm because exact matching does not allow for sequencing errors or genetic variations. We introduce two novel extensions that make the technique applicable to short read alignment: a quality-aware backtracking algorithm that allows mismatches and favors high-quality alignments; and 'double indexing', a strategy to avoid excessive backtracking. The Bowtie aligner follows a policy similar to Maq's, in that it allows a small number of mismatches within the high-quality end of each read, and it places an upper limit on the sum of the quality values at mismatched alignment positions. Burrows-Wheeler indexing The BWT is a reversible permutation of the characters in a text. Although originally developed within the context of data compression, BWT-based indexing allows large texts to be searched efficiently in a small memory footprint. It has been applied to bioinformatics applications, including oligomer counting [20], whole-genome alignment [21], tiling microarray probe design [22], and Smith-Waterman alignment to a human-sized reference [23]. The Burrows-Wheeler transformation of a text T, BWT(T), is constructed as follows. The character $ is appended to T, where $ is not in T and is lexicographically less than all characters in T. The Burrows-Wheeler matrix of T is constructed as the matrix whose rows comprise all cyclic rotations of T$. The rows are then sorted lexicographically. BWT(T) is the sequence of characters in the rightmost column of the Burrows-Wheeler matrix (Figure 1a). BWT(T) has the same length as the original text T. Figure 1 Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column). This matrix has a property called 'last first (LF) mapping'. The ith occurrence of character X in the last column corresponds to the same text character as the ith occurrence of X in the first column. This property underlies algorithms that use the BWT index to navigate or search the text. Figure 1b illustrates UNPERMUTE, an algorithm that applies the LF mapping repeatedly to re-create T from BWT(T). The LF mapping is also used in exact matching. Because the matrix is sorted lexicographically, rows beginning with a given sequence appear consecutively. In a series of steps, the EXACTMATCH algorithm (Figure 1c) calculates the range of matrix rows beginning with successively longer suffixes of the query. At each step, the size of the range either shrinks or remains the same. When the algorithm completes, rows beginning with S0 (the entire query) correspond to exact occurrences of the query in the text. If the range is empty, the text does not contain the query. UNPERMUTE is attributable to Burrows and Wheeler [17] and EXACTMATCH to Ferragina and Manzini [18]. See Additional data file 1 (Supplementary Discussion 1) for details. Searching for inexact alignments EXACTMATCH is insufficient for short read alignment because alignments may contain mismatches, which may be due to sequencing errors, genuine differences between reference and query organisms, or both. We introduce an alignment algorithm that conducts a backtracking search to quickly find alignments that satisfy a specified alignment policy. Each character in a read has a numeric quality value, with lower values indicating a higher likelihood of a sequencing error. Our alignment policy allows a limited number of mismatches and prefers alignments where the sum of the quality values at all mismatched positions is low. The search proceeds similarly to EXACTMATCH, calculating matrix ranges for successively longer query suffixes. If the range becomes empty (a suffix does not occur in the text), then the algorithm may select an already-matched query position and substitute a different base there, introducing a mismatch into the alignment. The EXACTMATCH search resumes from just after the substituted position. The algorithm selects only those substitutions that are consistent with the alignment policy and which yield a modified suffix that occurs at least once in the text. If there are multiple candidate substitution positions, then the algorithm greedily selects a position with a minimal quality value. Backtracking scenarios play out within the context of a stack structure that grows when a new substitution is introduced and shrinks when the aligner rejects all candidate alignments for the substitutions currently on the stack. See Figure 2 for an illustration of how the search might proceed. Figure 2 Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'. Boxed pairs of numbers denote ranges of matrix rows beginning with the suffix observed up to that point. A red X marks where the algorithm encounters an empty range and either aborts (as in EXACTMATCH) or backtracks (as in the inexact algorithm). A green check marks where the algorithm finds a nonempty range delimiting one or more occurrences of a reportable alignment for the query. In short, Bowtie conducts a quality-aware, greedy, randomized, depth-first search through the space of possible alignments. If a valid alignment exists, then Bowtie will find it (subject to the backtrack ceiling discussed in the following section). Because the search is greedy, the first valid alignment encountered by Bowtie will not necessarily be the 'best' in terms of number of mismatches or in terms of quality. The user may instruct Bowtie to continue searching until it can prove that any alignment it reports is 'best' in terms of number of mismatches (using the option --best). In our experience, this mode is two to three times slower than the default mode. We expect that the faster default mode will be preferred for large re-sequencing projects. The user may also opt for Bowtie to report all alignments up to a specified number (option -k) or all alignments with no limit on the number (option -a) for a given read. If in the course of its search Bowtie finds N possible alignments for a given set of substitutions, but the user has requested only K alignments where K < N, Bowtie will report K of the N alignments selected at random. Note that these modes can be much slower than the default. In our experience, for example, -k 1 is more than twice as fast as -k 2. Excessive backtracking The aligner as described so far can, in some cases, encounter sequences that cause excessive backtracking. This occurs when the aligner spends most of its effort fruitlessly backtracking to positions close to the 3' end of the query. Bowtie mitigates excessive backtracking with the novel technique of 'double indexing'. Two indices of the genome are created: one containing the BWT of the genome, called the 'forward' index, and a second containing the BWT of the genome with its character sequence reversed (not reverse complemented) called the 'mirror' index. To see how this helps, consider a matching policy that allows one mismatch in the alignment. A valid alignment with one mismatch falls into one of two cases according to which half of the read contains the mismatch. Bowtie proceeds in two phases corresponding to those two cases. Phase 1 loads the forward index into memory and invokes the aligner with the constraint that it may not substitute at positions in the query's right half. Phase 2 uses the mirror index and invokes the aligner on the reversed query, with the constraint that the aligner may not substitute at positions in the reversed query's right half (the original query's left half). The constraints on backtracking into the right half prevent excessive backtracking, whereas the use of two phases and two indices maintains full sensitivity. Unfortunately, it is not possible to avoid excessive backtracking fully when alignments are permitted to have two or more mismatches. In our experiments, we have observed that excessive backtracking is significant only when a read has many low-quality positions and does not align or aligns poorly to the reference. These cases can trigger in excess of 200 backtracks per read because there are many legal combinations of low-quality positions to be explored before all possibilities are exhausted. We mitigate this cost by enforcing a limit on the number of backtracks allowed before a search is terminated (default: 125). The limit prevents some legitimate, low-quality alignments from being reported, but we expect that this is a desirable trade-off for most applications. Phased Maq-like search Bowtie allows the user to select the number of mismatches permitted (default: two) in the high-quality end of a read (default: the first 28 bases) as well as the maximum acceptable quality distance of the overall alignment (default: 70). Quality values are assumed to follow the definition in PHRED [24], where p is the probability of error and Q = -10log p. Both the read and its reverse complement are candidates for alignment to the reference. For clarity, this discussion considers only the forward orientation. See Additional data file 1 (Supplementary Discussion 2) for an explanation of how the reverse complement is incorporated. The first 28 bases on the high-quality end of the read are termed the 'seed'. The seed consists of two halves: the 14 bp on the high-quality end (usually the 5' end) and the 14 bp on the low-quality end, termed the 'hi-half' and the 'lo-half', respectively. Assuming the default policy (two mismatches permitted in the seed), a reportable alignment will fall into one of four cases: no mismatches in seed (case 1); no mismatches in hi-half, one or two mismatches in lo-half (case 2); no mismatches in lo-half, one or two mismatches in hi-half (case 3); and one mismatch in hi-half, one mismatch in lo-half (case 4). All cases allow any number of mismatches in the nonseed part of the read and all cases are also subject to the quality distance constraint. The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices, as illustrated in Figure 3. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to find alignments for case 3: Phase 2 finds partial alignments with mismatches only in the hi-half and phase 3 attempts to extend those partial alignments into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. Figure 3 The three phases of the Bowtie algorithm for the Maq-like policy. A three-phase approach finds alignments for two-mismatch cases 1 to 4 while minimizing backtracking. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to find alignments for case 3: Phase 2 finds partial alignments with mismatches only in the hi-half, and phase 3 attempts to extend those partial alignments into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. Performance results We evaluated the performance of Bowtie using reads from the 1,000 Genomes project pilot (National Center for Biotechnology Information [NCBI] Short Read Archive:SRR001115). A total of 8.84 million reads, about one lane of data from an Illumina instrument, were trimmed to 35 bp and aligned to the human reference genome [NCBI build 36.3]. Unless specified otherwise, read data are not filtered or modified (besides trimming) from how they appear in the archive. This leads to about 70% to 75% of reads aligning somewhere to the genome. In our experience, this is typical for raw data from the archive. More aggressive filtering leads to higher alignment rates and faster alignment. All runs were performed on a single CPU. Bowtie speedups were calculated as a ratio of wall-clock alignment times. Both wall-clock and CPU times are given to demonstrate that input/output load and CPU contention are not significant factors. The time required to build the Bowtie index was not included in the Bowtie running times. Unlike competing tools, Bowtie can reuse a pre-computed index for the reference genome across many alignment runs. We anticipate most users will simply download such indices from a public repository. The Bowtie site [25] provides indices for current builds of the human, chimp, mouse, dog, rat, and Arabidopsis thaliana genomes, as well as many others. Results were obtained on two hardware platforms: a desktop workstation with 2.4 GHz Intel Core 2 processor and 2 GB of RAM; and a large-memory server with a four-core 2.4 GHz AMD Opteron processor and 32 GB of RAM. These are denoted 'PC' and 'server', respectively. Both PC and server run Red Hat Enterprise Linux AS release 4. Comparison to SOAP and Maq Maq is a popular aligner [1,4,5,26,27] that is among the fastest competing open source tools for aligning millions of Illumina reads to the human genome. SOAP is another open source tool that has been reported and used in short-read projects [6,28]. Table 1 presents the performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. The 'soap.contig' version of the SOAP binary was used. For comparison with SOAP, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values), and with '--maxns 5' to simulate SOAP's default policy of filtering out reads with five or more no-confidence bases. For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches in the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched read positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure that only the forward or mirror index is resident in memory at one time. The number of reads aligned indicates that SOAP (67.3%) and Bowtie -v 2 (67.4%) have comparable sensitivity. Of the reads aligned by either SOAP or Bowtie, 99.7% were aligned by both, 0.2% were aligned by Bowtie but not SOAP, and 0.1% were aligned by SOAP but not Bowtie. Maq (74.7%) and Bowtie (71.9%) also have roughly comparable sensitivity, although Bowtie lags by 2.8%. Of the reads aligned by either Maq or Bowtie, 96.0% were aligned by both, 0.1% were aligned by Bowtie but not Maq, and 3.9% were aligned by Maq but not Bowtie. Of the reads mapped by Maq but not Bowtie, almost all are due to a flexibility in Maq's alignment algorithm that allows some alignments to have three mismatches in the seed. The remainder of the reads mapped by Maq but not Bowtie are due to Bowtie's backtracking ceiling. Maq's documentation mentions that reads containing 'poly-A artifacts' can impair Maq's performance. Table 2 presents performance and sensitivity of Bowtie and Maq when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1, and the same observations about the relative sensitivity of Bowtie and Maq apply here. Read length and performance As sequencing technology improves, read lengths are growing beyond the 30-bp to 50-bp commonly seen in public databases today. Bowtie, Maq, and SOAP support reads of lengths up to 1,024, 63, and 60 bp, respectively, and Maq versions 0.7.0 and later support read lengths up to 127 bp. Table 3 shows performance results when the three tools are each used to align three sets of 2 M untrimmed reads, a 36-bp set, a 50-bp set and a 76-bp set, to the human genome on the server platform. Each set of 2 M is randomly sampled from a larger set (NCBI Short Read Archive: SRR003084 for 36-bp, SRR003092 for 50-bp, SRR003196 for 76-bp). Reads were sampled such that the three sets of 2 M have uniform per-base error rate, as calculated from per-base Phred qualities. All reads pass through Maq's 'catfilter'. Table 3 Varying read length using Bowtie, Maq and SOAP Length Program CPU time Wall clock time Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) 36 bp Bowtie 6 m 15 s 6 m 21 s 1,305 - 62.2 Maq 3 h 52 m 26 s 3 h 52 m 54 s 804 36.7× 65.0 Bowtie -v 2 4 m 55 s 5 m 00 s 1,138 - 55.0 SOAP 16 h 44 m 3 s 18 h 1 m 38 s 13,619 216× 55.1 50 bp Bowtie 7 m 11 s 7 m 20 s 1,310 - 67.5 Maq 2 h 39 m 56 s 2 h 40 m 9 s 804 21.8× 67.9 Bowtie -v 2 5 m 32 s 5 m 46 s 1,138 - 56.2 SOAP 48 h 42 m 4 s 66 h 26 m 53 s 13,619 691× 56.2 76 bp Bowtie 18 m 58 s 19 m 6 s 1,323 - 44.5 Maq 0.7.1 4 h 45 m 7 s 4 h 45 m 17 s 1,155 14.9× 44.9 Bowtie -v 2 7 m 35 s 7 m 40 s 1,138 - 31.7 The performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2 M untrimmed reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For each read length, the 2 M reads were randomly sampled from the FASTQ file downloaded from the Archive such that the average per-base error rate as measured by quality values was uniform across the three sets. All reads pass through Maq's "catfilter". Maq v0.7.1 was used for the 76-bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76-bp experiment because it does not support reads longer than 60 bp. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Bowtie is run both in its Maq-like default mode and in its SOAP-like '-v 2' mode. Bowtie is also given the '-z' option to ensure that only the forward or mirror index is resident in memory at one time. Maq v0.7.1 was used instead of Maq v0.6.6 for the 76-bp set because v0.6.6 cannot align reads longer than 63 bp. SOAP was not run on the 76-bp set because it does not support reads longer than 60 bp. The results show that Maq's algorithm scales better overall to longer read lengths than Bowtie or SOAP. However, Bowtie in SOAP-like '-v 2' mode also scales very well. Bowtie in its default Maq-like mode scales well from 36-bp to 50-bp reads but is substantially slower for 76-bp reads, although it is still more than an order of magnitude faster than Maq. Parallel performance Alignment can be parallelized by distributing reads across concurrent search threads. Bowtie allows the user to specify a desired number of threads (option -p); Bowtie then launches the specified number of threads using the pthreads library. Bowtie threads synchronize with each other when fetching reads, outputting results, switching between indices, and performing various forms of global bookkeeping, such as marking a read as 'done'. Otherwise, threads are free to operate in parallel, substantially speeding up alignment on computers with multiple processor cores. The memory image of the index is shared by all threads, and so the footprint does not increase substantially when multiple threads are used. Table 4 shows performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Table 4 Bowtie parallel alignment performance CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Speedup Bowtie, one thread 18 m 19 s 18 m 46 s 28.3 1,353 - Bowtie, two threads 20 m 34 s 10 m 35 s 50.1 1,363 1.77× Bowtie, four threads 23 m 9 s 6 m 1 s 88.1 1,384 3.12× Performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Index building Bowtie uses a flexible indexing algorithm [29] that can be configured to trade off between memory usage and running time. Table 5 illustrates this trade-off when indexing the entire human reference genome (NCBI build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. Table 5 Bowtie index building performance Physical memory target (GB) Actual peak memory footprint (GB) Wall clock time 16 14.4 4 h 36 m 8 5.84 5 h 5 m 4 3.39 7 h 40 m 2 1.39 21 h 30 m Performance results and memory footprints of running the Bowtie v0.9.6 indexer on the whole human genome (National Center for Biotechnology Information build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. See Additional data file 1 (Supplementary Discussion 3 and Supplementary Table 1) for details. The reported times compare favorably with alignment times of competing tools that perform indexing during alignment. Less than 5 hours is required for Bowtie to both build and query a whole-human index with 8.84 million reads from the 1,000 Genome project (NCBI Short Read Archive:SRR001115) on a server, more than sixfold faster than the equivalent Maq run. The bottom-most row illustrates that the Bowtie indexer, with appropriate arguments, is memory-efficient enough to run on a typical workstation with 2 GB of RAM. Additional data file 1 (Supplementary discussions 3 and 4) explains the algorithm and the contents of the resulting index. Software Bowtie is written in C++ and uses the SeqAn library [30]. The converter to the Maq mapping format uses code from Maq. Discussion Bowtie exhibits a large performance advantage over both Maq and SOAP when mapping reads to the human genome. Bowtie's sensitivity in terms of reads aligned is comparable to that of SOAP and slightly less than Maq's, although the user may use command-line options to trade slower running time for greater sensitivity. Unlike SOAP, Bowtie's 1.3 GB memory footprint allows it to run on a typical PC with 2 GB of RAM. Bowtie aligns Illumina reads to the human genome at a rate of over 25 million reads per hour. Multiple processor cores can run parallel Bowtie threads to achieve even greater alignment speed; experiments show a speed up of 3.12 for four threads on a typical Opteron server. Unlike many other short-read aligners, Bowtie creates a permanent index of the reference that may be re-used across alignment runs. Building the index is fast - Bowtie outperforms competing tools when aligning lanes of Illumina reads even with index construction time included. At 2.2 GB for the human genome, the on-disk size of a Bowtie index is small enough to distribute over the internet. The Bowtie website hosts pre-built indices for the human genome and several other model organisms including chimp, dog, rat, mouse, and chicken. Bowtie's speed and small memory footprint are due chiefly to its use of the Burrows-Wheeler index in combination with the novel, quality-aware, backtracking algorithm introduced here. Double indexing is used to avoid the performance penalty of excessive backtracking. Bowtie supports standard FASTQ and FASTA input formats, and comes with a conversion program that allows Bowtie output to be used with Maq's consensus generator and single nucleotide polymorphism caller. Bowtie does not yet support paired-end alignment or alignments with insertions or deletions, although both improvements are planned for the future. Paired-end alignment is not difficult to implement in Bowtie's framework, and we expect that Bowtie's performance advantage will be comparable to, though perhaps somewhat less than, that of unpaired alignment mode. Support for insertions and deletions is also a conceptually straightforward addition. Bowtie is free, open source software available from the Bowtie website [25]. Abbreviations bp: base pair; BWT: Burrows-Wheeler transform; CPU: central processing unit; FM: full-text minute-space; GB: gigabytes; LF: last first; NCBI: National Center for Biotechnology Information. Authors' contributions BL developed the algorithms, collected results, and wrote most of the software. CT wrote some of the software. CT and MP contributed to discussions on algorithms. BL, CT, MP, and SLS wrote the manuscript. Additional data files The following additional data are included with the online version of this article: a document containing supplementary discussions, tables, and figures pertaining to algorithms for navigating the Burrows-Wheeler transform, the full four-phase version of the alignment algorithm that incorporates the reverse-complement, index construction, and components of the index (Additional data file 1). Supplementary Material Additional data file 1 Presented are supplementary discussions, tables, and figures pertaining to algorithms for navigating the Burrows-Wheeler transform, the full four-phase version of the alignment algorithm that incorporates the reverse-complement, index construction, and components of the index. Click here for file
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Analyzing real-time PCR data by the comparative C(T) method.

            Two different methods of presenting quantitative gene expression exist: absolute and relative quantification. Absolute quantification calculates the copy number of the gene usually by relating the PCR signal to a standard curve. Relative gene expression presents the data of the gene of interest relative to some calibrator or internal control gene. A widely used method to present relative gene expression is the comparative C(T) method also referred to as the 2 (-DeltaDeltaC(T)) method. This protocol provides an overview of the comparative C(T) method for quantitative gene expression studies. Also presented here are various examples to present quantitative gene expression data using this method.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Gene ontology analysis for RNA-seq: accounting for selection bias

              Background Next generation sequencing of RNA (RNA-seq) gives unprecedented detail about the transcriptional landscape of an organism. Not only is it possible to accurately measure expression levels of transcripts in a sample [1], but this new technology promises to deliver a range of additional benefits, such as the investigation of alternative splicing [2], allele specific expression [3] and RNA editing [4]. However, in order to accurately make use of the data, it is vital that analysis techniques are developed to take into account the technical features of RNA-seq output. As many of the specific technical properties of RNA-seq data are not present in previous technologies such as microarrays, naive application of the same analysis methodologies, developed for these older technologies, may lead to bias in the results. In RNA-seq experiments the expression level of a transcript is estimated from the number of reads that map to that transcript. In many applications, the expected read count for a transcript is proportional to the gene's expression level multiplied by its transcript length. Therefore, even when two transcripts are expressed at the same level, differences in length will yield differing numbers of total reads. One consequence of this is that longer transcripts give more statistical power for detecting differential expression between samples [5]. Similarly, more highly expressed transcripts have a greater number of reads and greater power to detect differential expression. Hence, long or highly expressed transcripts are more likely to be detected as differentially expressed compared with their short and/or lowly expressed counterparts. The fact that statistical power increases with the number of reads is an unavoidable property of count data, which cannot be removed by normalization or re-scaling. Consequently, it is unsurprising that this selection bias has been shown to exist in a range of different experiments performed using different analysis methods, experimental designs and sequencing platforms [5]. When performing systems biology analyses, failure to account for this effect will lead to biased results. One simple, but extremely widely used, systems biology technique for highlighting biological processes is gene category over-representation analysis. In order to perform this analysis, genes are grouped into categories by some common biological property and then tested to find categories that are over represented amongst the differentially expressed genes. Gene Ontology (GO) categories are commonly used in this technique and there are many tools available for performing GO analysis - for example, EasyGO [6], GOminer [7], GOstat [8], GOToolBox [9], topGO [10], GSEA [11], DAVID [12] (see supplementary data 1 in Huang da et al. [12] for a more complete list). Although these tools have some differences in methodology [12], they all rely on similar underlying assumptions about the distribution of differentially expressed (DE) genes. Specifically, it is assumed that all genes are independent and equally likely to be selected as DE, under the null hypothesis. It is this assumption that makes the standard GO analysis methods unsuitable for RNA-seq data due to the bias in detecting differential expression for genes with a high number of reads. It follows then that when using a standard analysis, any category containing a preponderance of long genes will be more likely to show up as over-represented than a category with genes of average length. Similarly, categories with highly expressed genes are also more likely to be found as over-represented than categories of lowly expressed genes. Having identified this issue, it is possible to compensate for the effect of selection bias in the statistical test of a category's enrichment among differentially expressed genes. This paper will be concerned with developing a statistical methodology that enables the application of GO analysis to RNA-seq data by properly incorporating the effect of selection bias. Using published RNA-seq data, we will show that accounting for this effect leads to significantly different results, which agree much better with previous microarray studies and the known biology than the results of an uncorrected analysis. Results Because it is so widely used, we choose to focus our category testing on GO analysis. The techniques developed here apply more generally, however, to any analysis that looks for over-representation of some category of interest amongst differentially expressed genes. For example, alternative analyses might look for over-representation of KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, gene sets in the Molecular Signatures Database [11], or gene lists derived from earlier microarray experiments. To illustrate the methodology, the GOseq technique was applied to an experiment examining the effects of androgen stimulation on a human prostate cancer cell line, LNCaP [13]. This published data set includes more than 17 million short cDNA reads obtained for both the treated and untreated cell line and sequenced on Illumina's 1G genome analyzer. These reads were re-mapped back to the reference genome and the number of reads per gene was recorded (see Additional file 1 for details). Examples of the methodology throughout the paper are made using this data set and we show that the GOseq method makes a substantial difference to the results of the GO analysis, which are more consistent with prior knowledge of the system. GO analysis Standard methods for testing over-representation of a GO category assume that, under the null hypothesis, each gene has equal probability of being detected as DE. Under this assumption, the number of genes associated with a category that overlap with the set of DE genes follows a hypergeometric distribution. Hence the GO test can be conducted using Fisher's exact test, which uses the hypergeometric distribution, or Pearson's chi-square test, which is a computationally convenient approximation [12]. Because of the existence of selection bias, genes with more reads are more likely to be detected as DE, violating the assumption behind the hypergeometric distribution. Therefore, these standard test distributions should not be used for GO analysis with RNA-seq data. Instead, we need a new test that corrects for selection bias. Selection bias Transcript length bias will affect GO analysis if the sets of genes associated with GO categories contain a non-random set of genes, with either a preponderance of short or long genes. To test individual categories for a length distribution bias, we assessed the length of genes within 7,873 unique GO categories associated with human genes in the GO consortium database [14]. The length distribution of the genes in these categories varied widely, with some categories containing an over-representation of long genes and some with relatively short genes (Figure 1a). A Mann-Whitney U test (also known as a Wilcoxon rank-sum test) was performed on each category to test whether the median length of genes in that GO category differed from the global median length across all categories. Figure 1b shows a histogram of the Mann-Whitney P-values. The clear excess of low P-values indicates the existence of GO categories with many long or short genes. Therefore, we expect that a standard analysis of GO category enrichment among DE genes, which ignores transcript length, to be significantly affected by the transcript length bias inherent to RNA-seq. Figure 1 Length distribution of genes in Gene Ontology categories. (a) The distribution of average gene lengths in GO categories on a log10 scale. The GO category gene length is given by the median length of the genes within the category. (b) P-values for the two-sided Mann-Whitney U test comparing the median length of genes in a GO category with the overall distribution of genes for 7,873 GO categories. The excess of low P-values shows that there are many GO categories that contain a set of significantly long or short genes. The GOseq method In order to correct for selection bias in category testing, we propose the following three-step methodology. First, the genes that are significantly DE between conditions are identified. The GOseq method works with any procedure for identifying DE genes. Second, the likelihood of DE as a function of transcript length is quantified. This is obtained by fitting a monotonic function to DE versus transcript length data. Finally, the DE versus length function is incorporated into the statistical test of each category's significance. This final step takes into account the lengths of the genes that make up each category. The first step in the GOseq procedure is to determine which genes are differentially expressed. For the prostate cancer data set, a P-value for differential expression between the treated and untreated cells was obtained for each gene using a Poisson exact test, equivalent to Fisher's exact test [15-17]. These P-values were then corrected for multiple testing [18] and the false discovery rate was set at 10-4. Figure 2a shows a plot of the proportion of DE genes as a function of length. A strong trend towards a higher rate of differential expression for genes with longer transcripts is evident. Figure 2b shows a similar trend towards more differential expression for genes with a higher total number of reads. As expected, the trend is stronger for total read count compared to transcript length. Other statistical methods for determining differential expression between samples show a similar trend of increasing proportion of DE genes with gene length, even when working with data normalized by dividing by transcript length, such as RPKM transformed data [19] (Figure S1b,c in Additional file 1). Non-statistical methods for determining DE, such as using a fold-change cutoff, can show a decreasing trend in the proportion of DE as a function of gene length (Figure S1a in Additional file 1). Any trend observed in the data is modeled by the second step of the GOseq procedure. Figure 2 Differential expression as a function of gene length and read count. (a) The proportion of DE genes is plotted as a function of the transcript length. Each point represents the percentage of genes called DE in a bin of 300 genes plotted against the median gene length for the bin. The green line is the probability weighting function obtained by fitting a monotonic cubic spline with 6 knots to the binary data. A clear trend towards more detected differential expression at longer gene lengths can be seen. (b) The same, except instead of transcript length, the total number of reads for each gene was used. Again, a trend towards more DE for genes with more reads can be seen. Note the greater range of probabilities compared to (a). In the second step, a probability weighting function (PWF) is estimated from the data. The PWF quantifies how the probability of a gene selected as DE changes as a function of its transcript length. To estimate this trend function, each gene is assigned a binary value (zero or one), according to whether or not the gene is found to be DE. A monotonic spline with 6 knots is then fitted to this binary data series using the transcript length of each gene as predictor (see Materials and methods). Monotonicity is imposed as the power to detect DE using statistical tests increases monotonically with an increasing number of reads. Figure 2 also shows the resulting probability weighting function for the prostate cancer data set. The PWF then forms the null hypothesis for our enrichment test. Unlike GO analysis for microarray data, the null probability distribution does not conform to a standard distribution, precluding an analytical solution for determining the probability of a category being over-represented among DE genes. However, the P-value for each category can be computed using a resampling strategy. For the final step of the GOseq method, resampling was performed by randomly selecting a set of genes, the same size as the set of DE genes, and counting the number of genes associated with the GO category of interest. This random selection weights the chance of choosing a gene by its length or read count, from the previously fitted probability weighting function. The resampling is repeated many times and the resulting distribution of GO category membership is taken to approximate the shape of the true probability distribution. This sampling distribution allows calculation of a P-value for each GO category being over-represented in the set of DE genes while taking selection bias into account. The Wallenius approximation Although accurate, random sampling is very computationally intensive, particularly when a fine granularity in P-value is needed to distinguish between large numbers of categories. In order to make the calculation in the final step of the GOseq procedure computationally manageable, we implemented an alternative approximation technique, based on an extension of the hypergeometric distribution known as the Wallenius non-central hypergeometric distribution [20]. This distribution extends the hypergeometric distribution to the case where the probability of success and failure differ. The GOseq implementation of the approximation assumes that all genes within a category have the same probability of being chosen, but this probability is different from the probability of choosing genes outside this category. The mean of the probability weightings for each gene within/outside the category is defined as the common probability of choosing a gene within/outside that category. While the Wallenius approximation is obviously a simplification, it is significantly closer to the true distribution than the standard hypergeometric distribution. Although the accuracy lost by using the Wallenius approximation is not negligible, the gain in computational efficiency is dramatic. Furthermore, the ability to differentiate the most highly over-represented categories from one another (Additional file 1) makes the Wallenius approximation an attractive alternative, particularly when the range of the probability weighting function is moderate. Comparisons of GOseq with the standard GO analysis To compare the results from the GOseq procedure to the standard GO analysis used on microarray data, we applied both methods to the prostate cancer data set. For each method a list of GO categories ordered by significance was generated. Figure 3 compares the ranks of the GO categories between the GOseq method and hypergeometric method as a function of gene length (Figure 3a) or total read count (Figure 3b) within categories. As expected, categories with shorter than average length move up in rank (become more significant) when length bias has been taken into account with the GOseq method. Similarly, categories with longer than average genes are ranked lower by GOseq than the standard method. Furthermore, when accounting for gene length bias, of the 25 categories most significantly over-represented using GOseq and the standard method, 8 are discrepant between the methods (Tables 1 and 2). This highlights the fact that accounting for biases in detecting DE makes a significant difference to the biology identified from the results. Figure 3 Change in Gene Ontology category rank between the standard and GOseq methodologies. (a) Change in rank of GO categories going from the hypergeometric method to GOseq correcting for length bias plotted against the log of the average gene length of the category. (b) Change in rank of GO categories going from the hypergeometric method to GOseq correcting for total read count plotted against the log of the average number of counts of each gene in the category. A trend for the standard method to underestimate significance for GO categories containing short (or highly expressed) genes and overestimate significance for GO categories containing long (or underexpressed) genes can be clearly seen. Table 1 Gene Ontology categories ranked in the top 25 in the standard method but not in length bias adjusted GOseq GOID Term Ontology Rank standard Rank GOseq Average gene length in category GO:0005622 Intracellular CC 9 38 2,710 GO:0005524 ATP binding MF 12 113 3,133 GO:0008270 Zinc ion binding MF 13 145 2,817 GO:0016020 Membrane CC 16 702 2,571 GO:0046872 Metal ion binding MF 17 255 2,734 GO:0006355 Regulation of transcription, DNA-dependent BP 19 266 2,752 GO:0000139 Golgi membrane CC 22 28 2,855 GO:0016740 Transferase activity MF 25 116 2,721 The median gene length of all categories are significantly longer than average using a Mann-Whitney test. BP, biological process; CC, cellular component; MF, molecular function. Table 2 Gene Ontology categories ranked in the top 25 in length bias adjusted GOseq but not in the standard method GOID Term Ontology Rank standard Rank GOseq Average gene length in category GO:0006414 Translational elongation BP 92 8 708* GO:0000786 Nucleosome CC 64 14 1,345* GO:0006334 nucleosome assembly BP 70 17 1,362* GO:0043687 Post-translational protein modification BP 50 20 1,830 GO:0019787 Small conjugating protein ligase activity MF 65 25 1,928 GO:0016192 Vesicle-mediated transport BP 28 22 2,765 GO:0006412 Translation BP 104 23 1,472 GO:0051246 Regulation of protein metabolic process BP 66 24 1,609 Three of these categories (marked with asterisks) have genes significantly shorter than average length (Mann-Whitney P-value < 0.05). BP, biological process; CC, cellular component; MF, molecular function. The standard GOseq analysis was also compared to both the random sampling strategy and the more computationally efficient Wallenius approximation. P-values for over-representation of DE genes for each GO category were generated using random sampling with a high number of repetitions (200,000). These P-values were then compared to P-values calculated using the standard hypergeometric test and GOseq utilizing the Wallenius approximation. Comparison of the categories' P-values demonstrate a large discrepancy between the GOseq method and the hypergeometric method, as well as very little difference between GOseq using sampling or Wallenius (Figure 4a; Figure S3 in Additional file 1). Furthermore, we compared the top ranked lists of enriched GO categories between two methods by plotting the number of discrepancies between the methods for a given list size (Figure 4b). This plot shows that for the prostate cancer data set, approximately 20% of GO categories that appear using the standard analysis are not present when GOseq is used and vice versa. The high number of discrepancies for short lists shows that failure to account for length bias impairs analysis, even if we are only interested in a small number of categories. Reassuringly, the Wallenius approximation closely approximates GOseq using high repetition sampling with very few changes in P-values or rankings of categories. Similar results are seen for other RNA-seq datasets (data not shown). Figure 4 Comparison of GOseq and the standard hypergeometric methods. (a) The P-values generated with GOseq using the Wallenius approximation and the standard hypergeometric method are plotted against the P-values calculated with GOseq using random sampling (200k repeats). The Wallenius method (green crosses) shows good agreement with the high resolution (200,000 repeats) random sampling. A large discrepancy in P-values is seen between GOseq and the hypergeometric method. (a) The number of discrepancies between lists is shown for a given list size. The black line compares GOseq using high resolution sampling with the hypergeometric method. The red line compares GOseq using high resolution sampling with the Wallenius approximation. Again, GOseq using the Wallenius method shows little difference from GOseq using the random sampling method (with 200k repeats) while the hypergeometric method shows a large number (approximately 20%) of discrepancies. Measuring GOseq's accuracy by comparison with microarray data The GOseq method clearly makes a substantial difference to the categories selected when performing a GO analysis. In order to demonstrate that accounting for length bias produces more reliable results, we compared the results of GOseq and the standard test to the GO analysis from a microarray experiment that does not show any gene length bias [5]. For the comparison of RNA-seq and microarrays, a data set was used that compares the exact same liver and kidney samples on the two platforms [21] (see Materials and methods). Figure 5 plots the fraction of microarray GO categories recovered from the RNA-seq data using the hypergeometric and GOseq methods, as a function of the number of GO categories considered. It can be seen that GOseq gives categories more consistent with the microarray platform (P = 0.067), indicating that accounting for length bias gives a GO analysis with better performance. Figure 5 A comparison of Gene Ontology analysis using RNA-seq and microarrays on the same samples. The fraction of GO categories identified by RNA-seq data that overlap with the microarray GO analysis are shown as a function of the number of categories selected. RNA-seq data have been analyzed using GOseq and hypergeometric methods. The GOseq categories have a consistently higher overlap with the microarray GO categories than the standard method. Transcript length bias versus read count bias As transcript length bias is a technical effect, it is always necessary to correct for it when performing category testing on RNA-seq data. However, the bias in power to detect DE in longer genes arises from an increase in the total number of reads for each gene, where the number of reads is given by transcript length multiplied by expression level. Therefore, there may be circumstances where it is desirable to correct for the effect of expression level on power to detect DE, in addition to the contribution from transcript length, that is, total read count bias (for further discussion see Additional file 1). The GOseq method is capable of handling both types of bias. We have mainly focused on transcript length bias as it will always need to be accounted for and because the decision to account for expression level or not ultimately depends on the questions the user wishes to answer. To assess the impact of total read count bias, GOseq was used to analyze the prostate cancer data set accounting for read count bias. As expected, correcting for read count bias results in even greater differences compared to the standard hypergeometric method than just correcting for length bias alone. When read count bias is corrected for in the prostate cancer data set, more than 50% of significant GO categories are different from the list of significant GO categories obtained using the standard hypergeometric method (Figure 6). This is true even for the very top GO categories - there is only one GO category that appears in the top ten in both the standard and read count adjusted lists (Tables 3 and 4). Figure 6 Change in most significant categories when correcting for total read count bias or length bias versus the standard method. A plot of the number of discrepancies in the most significant GO categories generated using different methods. This plot compares the length bias correcting version of GOseq to the standard hypergeometric method (green line) and the total read count bias correcting version of GO-seq to the standard hypergeometric method (black line). Table 3 Top ten Gene Ontology categories using the standard method GOID Term Ontology Rank standard Rank GOseq Median read count of genes in category GO:0005515 Protein binding MF 1 11 206 GO:0005737 Cytoplasm CC 2 111 238 GO:0005634 Nucleus CC 3 4,303 228 GO:0000166 Nucleotide binding MF 4 2,546 252 GO:0005829 Cytosol CC 5 1,192 349 GO:0005783 Endoplasmic reticulum CC 6 6 197 GO:0003677 DNA binding MF 7 2,109 189 GO:0005794 Golgi apparatus CC 8 26 213 GO:0005622 Intracellular CC 9 3,592 187 GO:0016874 Ligase activity MF 10 88 353 BP, biological process; CC, cellular component; MF, molecular function. Table 4 Top ten Gene Ontology categories using GOseq adjusted for total read count bias GOID Term Ontology Rank standard Rank GOseq Median read count of genes in category GO:0016020 Membrane CC 16 1 91 GO:0005886 Plasma membrane CC 96 2 50 GO:0005887 Integral to plasma membrane CC 313 3 36 GO:0005576 Extracellular region CC 4,310 4 18 GO:0015020 Glucuronosyl transferase activity MF 388 5 10 GO:0005783 Endoplasmic reticulum CC 6 6 197 GO:0016021 Integral to membrane CC 179 7 68 GO:0006470 Protein amino acid dephosphorylation BP 31 8 224 GO:0045944 Positive regulation of transcription from RNA polymerase II promoter BP 39 9 133 GO:0007049 Cell cycle BP 14 10 202 BP, biological process; CC, cellular component; MF, molecular function. Discussion Biological relevance of selected categories To determine the effect of GOseq on the ability to draw biologically meaningful conclusions, the top ten GO categories using the standard hypergeometric method and the read count adjusted GOseq method were compared in the prostate cancer data set (Tables 3 and 4). We found that categories identified by GOseq are more consistent with previous studies looking at the relationship of androgens with prostate cancer. The role of androgens in prostate cancer is well supported, with androgen required in rodent prostate cancer induction models, and castration prior to puberty being protective against prostate cancer. Androgen is thought to be responsible for promotion of prostate cancer progression through enhancing the androgen regulated processes of growth and cellular activity. In normal prostate, androgen supports the secretary epithelial, which turns over at a rate of 1 to 2% of cells per day, and most prostate cancers are derived from these cells [22]. Based on this biological knowledge the prior expectation is that there will be an increase in cellular activity, proliferation and secretion in LNCaP cells in response to androgen. Previous microarray experiments have shown that LNCaP cells retain androgen responsiveness and that most genes upregulated are involved in the production of seminal fluid [23]. In the standard analysis, the top ten GO categories indicate a change in intracellular genes, including nuclear and DNA binding genes (Table 3). The top ten categories using GOseq with total read count adjustment indicate significant changes at membranes and extracellular space, transcriptional upregulation, and in cell cycle genes (Table 4). The categories identified as most significant by GOseq therefore better match the known biology of androgen response in prostate cancer. The genes that are significant only with the length bias adjustment (Table 2) include four categories consistent with increased translation and protein production, vesical mediated transport consistent with secretion, and two categories related to nucleosomes consistent with increased replication. The category of small conjugating protein ligase activity is supported by the previously reported up-regulation of ubiquitin ligases UBE2C and HSPC150 [23]. Possibility of true biological trends in gene length A key assumption of GOseq is that longer genes are not of biologically greater interest than shorter genes, per se. This assumption is supported by microarray data, where no systematic trend between gene length and differential expression has been observed [5]. The authors find it hard to imagine that any genuine biological process could induce a trend in differential expression versus gene length comparable in magnitude to the technical trend that is removed by GOseq. Nevertheless, users should be aware that any biological trend in DE versus gene length will be adjusted for by GOseq. Using the Wallenius approximation The reduction in computation afforded by the Wallenius approximation is predicated on the assumption that the variance in the probability weighting within categories is low. Hence, the approximation will perform better for a probability weighting function with a low range of probabilities. When the range in probabilities as a function of gene length or read count is large, random sampling performs better than the Wallenius approximation, even at a relatively small number of replicates. The probability weighting function for read count bias has a larger range in probabilities compared to the PWF for length bias in the prostate cancer data set (Figure 2) and so random sampling may be more appropriate when accounting for total read count bias. GOseq and other technologies GOseq with total read count adjustment is relevant for other tag-based next generation expression profiling technologies, such as SAGE or CAGE, which have no transcript length bias [24]. Even without length bias, statistical power to detect differential expression still depends on the expression level of each transcript and correcting for this bias will generally be desirable in this context. GO analysis of microarray expression data has so far ignored the possibility of selection bias, but such bias clearly does exist. It is well known that fold changes are more precisely estimated for microarray probes at higher intensity levels, so selection bias is likely to exist as a function of intensity. Furthermore, most microarray platforms have multiple probes for some genes. Genes with more probes will have a great chance of being selected as DE, assuming the analysis is conducted probe-wise. The methodology developed here for RNA-seq could easily be adapted to GO analysis of microarray data, and would likely yield benefits in terms of biological relevance. GOseq software In order to implement the GOseq method, we developed a set of freely available R functions, which includes functions for calculating the significance of over-representation of each GO category amongst DE genes. These functions give the user the option of selecting which type of bias they wish to compensate for (transcript length bias or total read count bias). The option of using random sampling or the Wallenius approximation is also available. The ability of the user to supply their own categories for unbiased testing is also included. The software is freely available and can be downloaded from our website [25]. Conclusions Here we have developed a statistical framework for GO analysis for use with RNA-seq data. It is mathematically indisputable that all commonly used criteria for judging DE interact with gene length and read count. This provides a well understood causal model of why length bias exists and why it needs to be accounted for. The GOseq method is able to account for such biases when performing GO analysis. We find that the new method makes a substantial difference to the categories identified as the most significant. We show that the GOseq method is able to recover well established microarray results more readily than existing methods of GO analysis of RNA-seq data. Furthermore, using an androgen treated prostate cancer data set we find that the most significant categories identified using GOseq match the known biology better than existing methods. Materials and methods All the statistical analyses were performed in R [26]. The methods are described in detail in Additional file 1 and outlined briefly here. The prostate cancer data set The LNCap cell line was treated with androgen. Mock treated and treated cell lines were sequenced using the Illumina GA 1 [13]. Raw 35-bp RNA-seq reads were provided and mapped to the human genome using Bowtie. Each mapped read was associated with an ENSEMBL gene. A Poisson exact test [15,16] was used to determine differential expression between treated and mock-treated LNCap cells. The liver versus kidney data set Genome-wide expression was measured in liver and kidney using RNA-seq on the Illumina GA I and hybridization of the same samples to Affymetrix HG-U133 Plus 2.0 arrays. The sample preparation and data analysis was designed to maximize the similarity between the microarray and RNA-seq experiments (see Marioni et al. [21]). Differential expression between kidney and liver was determined using an empirical Bayes modified t-statistic on the microarray platform and P-values for DE were downloaded from their website. For the RNA-seq experiment, the data were normalized using TMM normalization [27] and a negative binomial exact test was used to determine DE [16]. To test the GOseq method, we used the genes called DE from the microarray experiment to calculate the significance of over-representation of each GO category using the standard GO analysis methods. We also calculated P-values for each GO category being over-represented among genes that were DE in the RNA-seq data, using both the GOseq and hypergeometric methods. GOseq's ability to outperform the hypergeometric method, as measured by its ability to reproduce the results of the microarray GO analysis, was quantified by calculating a P-value for the difference in the two methods being due to chance. To do this, a NULL was chosen under which both methods were equally likely to correctly recover each microarray GO category, with this likelihood given by a binomial distribution. The probability weighting function To calculate the PWF, a cubic spline with a montonicity constraint is fitted to the binary data series where a value of 1 refers to a DE gene and 0 refers to a non-DE gene. This fit can be calculated against either the length of the gene or the read counts of a gene. We used the R function pcls in the mgcv package to generate the fit. Calculating significance of categories Random samples of genes are created by selecting a subset of genes from the experiment, with each gene weighted by the probability derived from the PWF. Each random sample contains the same number of genes as the set of DE genes. For each sample the number of genes with a given GO category is calculated. Many samples are generated in order to produce a null distribution from which the P-value for the significance of a category can be estimated. To implement the Wallenius approximation, we used the BiasedUrn package in R. The 'odds' parameter is defined as the relative probability of genes within a category to the genes outside the category. The 'odds' ratio is calculated by taking the mean of the values from the PWF for each gene in the set and dividing by the mean of the values from the PWF for genes outside the set. Abbreviations DE: differentially expressed; GO: Gene Ontology; PWF: probability weighting function. Authors' contributions AO, GKS, and MJW conceived of the idea. AO, MDY, and GKS conceived and performed the analysis. MDY wrote the software. MJW interpreted biological results. MDY, MJW, GKS, and AO wrote the paper. All authors read and approved the final manuscript. Supplementary Material Additional file 1 A Word document containing supplementary methods, discussion and results. Particular attention is given to technical details of the GOseq method. Click here for file
                Bookmark

                Author and article information

                Contributors
                Journal
                Front Vet Sci
                Front Vet Sci
                Front. Vet. Sci.
                Frontiers in Veterinary Science
                Frontiers Media S.A.
                2297-1769
                22 February 2022
                2022
                : 9
                : 813170
                Affiliations
                [1] 1Hainan Key Lab of Tropical Animal Reproduction, Animal Genetic Engineering Key Lab of Haikou, Breeding and Epidemic Disease Research, College of Animal Science and Technology, Hainan University , Haikou, China
                [2] 2Jinyu Baoling Bio-Pharmaceutical Co., Ltd. , Hohhot, China
                [3] 3Inner Mongolia University , Hohhot, China
                [4] 4Inner Mongolia Academy of Agricultural and Animal Husbandry Sciences , Hohhot, China
                [5] 5College of Animal Science, Inner Mongolia Agricultural University , Hohhot, China
                Author notes

                Edited by: Changyong Cheng, Zhejiang A & F University, China

                Reviewed by: Samira Tarashi, Pasteur Institute of Iran (PII), Iran; Chun Fang, Yangtze University, China

                *Correspondence: Fengyang Wang fywang68@ 123456163.com

                This article was submitted to Veterinary Infectious Diseases, a section of the journal Frontiers in Veterinary Science

                Article
                10.3389/fvets.2022.813170
                8902169
                35274021
                90c7a454-e093-4553-b14e-44c96b8c598a
                Copyright © 2022 Chen, Wang, Chen, Zhao, Liu, Zhao, Fu, He, Yang, Zhao, An, Zhang, Cheng, Man, Liu, Wei, Zhang, Du and Wang.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                : 11 November 2021
                : 13 January 2022
                Page count
                Figures: 9, Tables: 3, Equations: 0, References: 51, Pages: 15, Words: 8011
                Categories
                Veterinary Science
                Original Research

                brucella,sheep,lymphatic nodes,mirnas,downregulation
                brucella, sheep, lymphatic nodes, mirnas, downregulation

                Comments

                Comment on this article