16
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Doxycycline Alters Metabolism and Proliferation of Human Cell Lines

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          The tetracycline antibiotics are widely used in biomedical research as mediators of inducible gene expression systems. Despite many known effects of tetracyclines on mammalian cells–including inhibition of the mitochondrial ribosome–there have been few reports on potential off-target effects at concentrations commonly used in inducible systems. Here, we report that in human cell lines, commonly used concentrations of doxycycline change gene expression patterns and concomitantly shift metabolism towards a more glycolytic phenotype, evidenced by increased lactate secretion and reduced oxygen consumption. We also show that these concentrations are sufficient to slow proliferation. These findings suggest that researchers using doxycycline in inducible expression systems should design appropriate controls to account for potential confounding effects of the drug on cellular metabolism.

          Related collections

          Most cited references17

          • Record: found
          • Abstract: found
          • Article: not found

          Mitochondrial evolution.

          The serial endosymbiosis theory is a favored model for explaining the origin of mitochondria, a defining event in the evolution of eukaryotic cells. As usually described, this theory posits that mitochondria are the direct descendants of a bacterial endosymbiont that became established at an early stage in a nucleus-containing (but amitochondriate) host cell. Gene sequence data strongly support a monophyletic origin of the mitochondrion from a eubacterial ancestor shared with a subgroup of the alpha-Proteobacteria. However, recent studies of unicellular eukaryotes (protists), some of them little known, have provided insights that challenge the traditional serial endosymbiosis-based view of how the eukaryotic cell and its mitochondrion came to be. These data indicate that the mitochondrion arose in a common ancestor of all extant eukaryotes and raise the possibility that this organelle originated at essentially the same time as the nuclear component of the eukaryotic cell rather than in a separate, subsequent event.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Studies on the propagation in vitro of poliomyelitis viruses. IV. Viral multiplication in a stable strain of human malignant epithelial cells (strain HeLa) derived from an epidermoid carcinoma of the cervix.

            The cells of a human epithelial cancer cultivated en masse have been shown to support the multiplication of all three types of poliomyelitis virus. These cells (strain HeLa of Gey) have been maintained in vitro since their derivation from an epidermoid carcinoma of the cervix in February, 1951. As the virus multiplied it caused in from 12 to 96 hours degeneration and destruction of the cancer cells. The specific destructive effect of the virus was prevented by adding homotypic antibody to the cultures but not by adding heterotypic antibodies. Methods for the preparation of large numbers of replicate cultures with suspensions of strain HeLa cells were described. The cells in suspension were readily quantitated by direct counts in a hemocytometer. A synthetic solution that maintains cellular viability was employed for viral propagation. The experimental results demonstrate the usefulness of strain HeLa cells for (a) the quantitation of poliomyelitis virus, (b) the measurement of poliomyelitis antibodies, and (c) the production of virus.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              U87MG Decoded: The Genomic Sequence of a Cytogenetically Aberrant Human Cancer Cell Line

              Introduction Grade IV glioma, also called glioblastoma multiforme (GBM), is the most common primary malignant brain tumor with about 16,000 new diagnoses each year in the United States. While the number of cases is relatively small, comprising only 1.35% of primary malignant cancers in the US [1], GBMs have a one-year survival rate of only 29.6%, making it one of the most deadly types of cancer [2]. Recent clinical studies demonstrate improved survival with a combination of radiation and Temozolomide chemotherapy, but median survival time for GBM patients who receive therapy is only 15 months [3]. Due to its highly aggressive nature and poor therapeutic options, understanding the genetic etiology of GBM is of great interest and therefore, GBM has been selected as one of the three initial cancer types to be thoroughly studied in the TCGA program [4]. To that end, numerous cell line models of GBM have been established and used in vast numbers of studies over the years. It is well recognized that cell line models of human disorders, especially cancers, are an important resource. While these cell lines are the basis of substantial biological insight, experiments are currently performed in the absence of genome-wide mutational status as no cell line that models a human disease has yet had its genome fully sequenced. Here, we have sequenced the genome of U87MG, a long established cell line derived from a human grade IV glioma used in over 1,700 publications [5]. A wide range of biological information is known about this cell line. The U87MG cell line is known to have a highly aberrant genomic structure based on karyotyping, SKY [6], and FISH [7]. However, these methods neither provide the resolution required to visualize the precise breakpoint of a translocation event, nor are they generally capable of identifying genomic microdeletions (deletions on the order of a megabase or less in size) in a whole genome survey of structural variation. SNP genotyping microarrays can be used to detect regions of structural variation in the forms of loss of heterozygosity (LOH) and copy number (CN) based on probe intensity, but do not reveal chromosomal joins. To assess the genomic stability of U87MG, the genome was genotyped by Illumina Human 1M-Duo BeadChip microarray. In spite of being cultured independently for several years, the regions of LOH and the CN state of our U87MG genome matched exactly with data retrieved from the Sanger COSMIC database for U87MG [8], which had been assayed on an Affymetrix Genome-Wide Human SNP Array 6.0. This suggests that although U87MG bears a large number of large-scale chromosomal aberrations, it has been relatively stable for years and is not rapidly changing. This suggests that prior work on U87MG may be reinterpreted based on the whole genome sequence data presented here. The first draft of the consensus sequence of the human genome was reported in 2001 [9],[10]. The first individual human diploid sequence was sequenced using capillary-based Sanger sequencing [11]. Since then, a few additional diploid human genomes have been published utilizing a variety of massively parallel sequencing techniques to sequence human genomes to varying degrees of coverage, variant discovery, and quality typically costing well over $200,000 and several machine months of operation [12]–[16]. For the sequencing of U87MG, we utilized ABI SOLiD technology, which uses a ligation-based assay with two-base color-encoded oligonucleotides that has been demonstrated to allow highly accurate single nucleotide variant (SNV) and insertion/deletion (indel) detection [17]. Additionally, long mate-paired genomic libraries with a mean insert size of 1–2kb allowed higher clone coverage of the genome, which improved our ability to identify genomic structural variations such as interchromosomal translocations and large deletions. While longer insert sizes would improve resolution of some structural variants, during genomic shearing the highest density of large fragments occurs at 1.5kb, allowing a sufficiently complex library to be generated from only 10 micrograms of genomic DNA while still being well powered to identify structural variations. Here, we demonstrate that aligning the two-base color-encoding data with BFAST software and decoding during alignment allows for highly sensitive detection of indels, which have in the past been difficult to detect by short read massively parallel sequencing. For cancer sequencing, it is important to assess not only SNVs, but indels, structural variations and translocations, and it is preferable to extract this information from a common assay platform. A major characteristic of the U87MG cell line that differentiates it from the samples used in other whole genome sequencing projects published thus far is its highly aberrant genomic structure. Due to its heavily rearranged state, we thoroughly and accurately assessed each of these major classes of mutations and demonstrated that small indels, large microdeletions and interchromosomal translocations are actually the major categories of mutations that affect known genes in this cancer cell line. These analyses provide a model for other genome sequencing projects outside major genome centers of how to both thoroughly sequence and assess the mutational state of whole genomes. Results Data Production From ten micrograms of input genomic DNA, we performed two and a half full sequencing runs on the ABI SOLiD Sequencing System, for a total of five full slides of data [17]. Utilizing the ABI long mate-pair protocol, we produced 1,014,984,286 raw 50bp mate-paired reads (101.5Gb). In some cases the bead was recognized by the imaging software for only one read, thereby producing an additional 120,691,623 single end reads (6.0Gb). In aggregate, we generated a total of 107.5Gb of raw data (Table 1). 10.1371/journal.pgen.1000832.t001 Table 1 Genome sequencing summary. Sequencing Libraries 1 SOLiD Runs (Slides) 2.5 (5) Strategy 2×50 Mate-paired reads passing quality filter (total bases) 1,014,984,286 (101.5Gb) Single-end reads passing quality filter (total bases) 120,691,623 (6.0Gb) Mate-paired reads uniquely aligned by BFAST (bases) 390,064,184 (39.06Gb) Unpaired reads uniquely aligned by BFAST (bases) 266,635,829 (13.33Gb) Single-end reads uniquely aligned by BFAST (bases) 62,336,824 (3.12Gb) Total bases uniquely aligned by BFAST 55.51Gb We also performed an exon capture approach designed to sequence the exons of 5,253 genes (10.7Mb) annotated in the Wellcome Trust Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC) V38 [8], Cancer Gene Census, Cancer Genome Project Planned Studies and The Cancer Genome Atlas (TCGA) [4] GBM gene list using a custom-created Agilent array. This approach used the Illumina GAII sequencing system [18] to sequence captured DNA fragments using a paired end sequencing protocol. This resulted in 9,948,782 raw 76bp paired end reads (1.51Gb), and a mean base coverage of 29.5×. These reads were used to calculate concordance rates with the larger whole genome sequence dataset. The Blat-like Fast Accurate Search Tool (BFAST) [19] version 0.5.3 was used to align 107.5Gb of raw color space reads to the color space conversion of the human genome assembly hg18 from University of California, Santa Cruz (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/, based on the March 2006 NCBI build 36.1). Duplicate reads, typically from the same initial PCR fragment during genomic library construction, were inevitable and accounted for 16.4% of the total aligned data. These were removed using the alignment filtering utility in the DNAA package (http://dnaa.sourceforge.net). A total of 390,604,184 paired end reads (39.06Gb), 266,635,829 (13.33Gb) unpaired reads, and 62,336,824 (3.12Gb) single end reads were successfully mapped to a unique location in the reference genome with high confidence for a total of 55.51Gb of aligned sequence (Table 1). For the exon capture dataset, we uniquely aligned 8,142,874 paired end reads (1.2Gb) and 1,097,000 (83Mb) unpaired reads for a total of 1.32Gb of raw aligned sequence (Table 2). Using the ABI SOLiD reads, we identified small insertions and deletions (indels), single nucleotide variants (SNVs), and structural variants such as large-scale microdeletions and translocation events. The exon capture Solexa reads were used to validate SNVs identified in the SOLiD sequencing. 10.1371/journal.pgen.1000832.t002 Table 2 Exon capture sequencing summary. Sequencing Libraries 1 Illumina Runs (Lanes) 1/8 (1) Strategy 2×76 Number of bases targeted 10752923 Mate-paired reads passing quality filter (total bases) 9,948,782 (1.51Gb) Mate-paired reads uniquely aligned by BFAST (bases) 8,142,874 (1.2Gb) Unpaired reads uniquely aligned by BFAST (bases) 1,097,000 (83Mb) Total bases uniquely aligned by BFAST 1.32Gb Total targeted bases sequenced 317,017,503 Mean coverage within targeted bases 29.5× The overall pattern of base sequence coverage from the shotgun reads changes across the genome, and as expected is highly concordant with the copy number state as determined by Illumina 1M Duo and Affymetrix 6.0 SNP analysis (Figure 1). Regions of two normal copies, such as chromosome 3, showed even base sequence coverage across their entire length (12.4 reads/base, excluding centromeric and telomeric regions which are not represented accurately in hg18). Meanwhile, regions with one-copy state according to the SNP chip, such as the distal q-arm of chr11 and the distal p-arm of chr6, show about half the base sequence coverage (7.2 reads/base) as a predicted two-copy region. Likewise, predicted three-copy state regions, such as the distal q-arm of chr13, show about 1.5 times the base sequence coverage of a predicted two-copy region. A complete deletion spanning the region on chromosome 9 that includes the CDKN2A gene is also seen in both the SNP chip and ABI SOLiD base sequence coverage. These data show at a very large scale that sequence placement is generally correct and supports the copy number state calls from the array based data. 10.1371/journal.pgen.1000832.g001 Figure 1 SNP chip and base sequence coverage compared. Circos [36] was used to plot base sequence coverage from ABI sequencing as a histogram (the outermost plot, green and black) alongside dot plots of the LogR-ratio (middle plot, blue) and B-allele frequency (innermost plot, red) from the Illumina Human 1M Duo BeadChip microarray. LogR-ratio represents copy number (CN) and B-allele frequency represents loss of heterozygosity (LOH). Variant Discovery Single nucleotide variants (SNVs) and small insertions and deletions ranging from 1 to 20 bases (indels) were identified from the alignment data using the MAQ consensus model [20] as implemented in the SAMtools software suite [21]. SAMtools produced variant calls, zygosity predictions, and a Phred-scaled probability that the consensus is identical to the reference. To improve the reliability of our variant calls, variants were required to have a Phred score of at least 10 and further needed to be observed greater than or equal to 4 but less than 60 times and at least once on each strand. In total, we identified 2,384,470 SNVs meeting our filtering criteria. Of these, 2,140,848 (89.8%) were identified as exact matches to entries in dbSNP129 [22]. Exact matches had both the variant and observed alleles in the dbSNP entry, allowing for the discovery of novel alleles at known SNP locations. In total, 243,622 SNVs (10.2%) were identified as novel events not previously recorded in dbSNP 129. This rate of novel variant discovery is consistent with other whole human normal genome sequences of European ancestry relative to dbSNP [12]. These SNVs were further characterized based on zygosity predictions from the MAQ consensus model, separating SNVs into homozygous or heterozygous categories (Table 3). The observed diversity value for SNVs (θSNV, number of heterozygous SNVs/number of base pairs) across autosomal chromosomes was 4.4×10−4, which is generally consistent with the normal human genome variation rate. 10.1371/journal.pgen.1000832.t003 Table 3 SNV filtering and quantification. SNV Classification Total In dbSNP 129 Not in dbSNP 129 Variants meeting filter criteria 2,384,470 2,140,848 243,622 Synonymous or not coding region 2,375,812 2,133,226 242,586 Coding region & non-synonymous 8,658 7,622 1,036 Splice site mutations 151 132 19 Heterozygous splice site mutations 62 47 15 Homozygous splice site mutations 89 85 4 Premature stop 134 93 41 Heterozygous premature stop 82 48 34 Homozygous premature stop 52 45 7 Non-synonymous 8,538 7,518 1,020 Heterozygous non-synonymous 4,005 3,134 871 Homozygous non-synonymous 4,533 4,384 149 For small ( G) or pyrimidine for pyrimidine (C T). Previous studies have observed that two out of every three SNPs are transitions as opposed to transversions [27], and we observed that 67.4% of our SNVs were transitions, while 32.6% were transversions, a 2.07∶1 ratio. (Figure 3) However, in coding regions, there appears to be an increase in C->T/G->A transitions and a decrease in T->C/A->G transitions, whereas genome-wide these transitions were approximately equivalent. 10.1371/journal.pgen.1000832.g003 Figure 3 Base substitution frequencies. Ratio of specific nucleotide substitutions as a percent of total single nucleotide variants, comparing SNVs in coding regions (blue) to SNVs genome-wide (red). Estimation of Genomic Coverage To assess the coverage depth of the U87MG genome sequence, we followed Ley at al. [13] and required detection of both alleles at most positions in the genome. We utilized the Illumina 1M-Duo BeadChip to find reliably sequenced positions in the genome with an understanding that this may lead to bias towards more unique regions of the genome. In order to best use the SNP genotyping array data, we included only those regions that are diploid based on normal frequency of heterozygous calls and copy number assessment. This effectively permitted us to use the heterozygous calls for assessing accuracy of the short read data for variant calling (Figure 1). Only SNPs both observed to be heterozygous and that the Illumina genotyping chip called ‘high quality’ were used, which provided a total of 219,187 high quality heterozygous SNPs for comparison. 99.71% of these were sequenced at least once. After applying variant detection filtering criteria (see Materials and Methods) and assessing concordance between the sequence calls and genotyping array calls, 93.71% of the genome was sequenced at sufficient depth to call both alleles of the diploid genome. This is roughly equivalent to the likelihood of sufficient sampling of the whole genome when repeats and segmental duplications are excluded. Notably, a variant allele was observed at every position called heterozygous by SNP chip, while a reference allele was observed at 201,414 (97.94%) positions. In other words, the SNV detection algorithm uniformly miscalled the homozygous variant allele. Filtering for quality causes a bias toward identifying SNVs at sites that have higher coverage. That said, after SNV quality filtering, diploid coverage of the cytogenetically normal portions of the genome was 10.85× for each allele, which is clearly adequate for calling over 90% of the base variant positions on each allele at high accuracy. Because the positions of the genome included on SNP arrays is not a random sampling of the genome, we also assessed mapping coverage genome-wide. Of all bases in the haploid genome, 78.9% of the whole reference genome was covered by at least one reliably placed read. Of that portion of the genome, 91.9% of all bases were effectively sequenced based on passing variant calling filters (Phred>10, >4× coverage, TA transitions in coding regions (Figure 3). It is well established that the most common source of point mutations and SNPs in primates is deamination of methyl-cytosine (meC), causing transition to a thymine (T) [16],[29], and there is circumstantial evidence of that in U87MG's genome as well. The resolution of genome-wide chromosomal rearrangements is substantially improved by the mate-pair strategy, coupled with sensitive and independent alignment of the short 50-base reads (Figure 5). Based on published SKY data, we anticipated 7 interchromosomal breakpoints [6]. However, whole-genome mate-paired sequence data revealed the precise chromosomal joins of 35 interchromosomal events, which account for previously observed chromosomal abnormalities in U87MG but at additional finer scale resolution (Figure 5, Figure 6, Figure 7). The translocation events were enriched in genic regions with 32/35 (91.4%) occurring within 1kb of genes. A weaker, but still noticeable enrichment over genes occurs with microdeletions as well, which are generally missed by other experimental techniques like DNA microarrays. Thus, within the overall mutational landscape of this cancer cell line, translocations and structural variants preferentially occurred over genes, supporting a model where cancer mutations occur via structural instability rather than novel point mutations. 10.1371/journal.pgen.1000832.g007 Figure 7 Increased resolution of structural variations by sequencing. Resolution of karyotyping and SKY approaches is not high enough to see the complex nature of this translocation event between chr1 and chr16. With high-resolution whole-genome sequencing, the true structure of the translocation is revealed as mutual translocations between a small fragment of chr2 with chr1 and chr16 on either end. Delving into the functional effects of the mutations in U87MG through gene ontology and cross-referencing the literature, we found a large number of known and predicted cancer mutations present in the cell line. There is always a concern when dealing with a cancer cell line that mutations will be more related to its status as a cell line than to the cancer it was derived from. While this remains a concern, the large number of predicted and known cancer genes present in U87MG suggests other genes mutated in it have relevance to cancer as well. Using GISTIC to find regions with common deletions in glioma samples, we highlight 60 genes that are mutated in U87MG and are located in regions that are commonly deleted in GBMs that are not included within the Cancer Gene Census list as potential candidate mutational targets in GBMs (Table S5). Cancer cell lines are commonly used as laboratory resources to study basic molecular and cellular biology. It is clearly preferable to have complete genomic sequence for these valuable resources. U87MG is the most commonly studied brain cancer cell line and is highly cytogenetically aberrant. While this made the sequencing and mutational analysis more challenging, it serves as a model for future cultured cell line genomic sequencing. Through custom analyses, we found that the mutational landscape of the U87MG genome is vastly more complicated than we would have expected based on the variants discovered in previously published genomes. It is our hope that the increased genomic resolution presented here will direct researchers and clinicians in their work with this brain cancer cell line to create more effective experiments and lead to a greater ability to draw meaningful conclusions in the future. Materials and Methods Data Sources The NCBI reference genome (build 36.1, hg18, March 2006), genome annotations, and dbSNP version 129 were downloaded from the UCSC genome database located at http://genome.ucsc.edu. A local mirror of the UCSC genome database (hg18) was used for the subsequent analysis of variants using included gene models and annotations. The Watson genome variants were downloaded from Cold Spring Harbor Laboratory (http://jimwatsonsequence.cshl.edu) with bulk data files available from ftp://jimwatsonsequence.cshl.edu/jimwatsonsequence/. The YanHuang variants were downloaded from the Beijing Genomics Institute at Shenzhen (http://yh.genomics.org.cn/) with bulk data files available from http://yh.genomics.org.cn/download.jsp. Sample Preparation U87MG cells were ordered from ATCC (HTB-14) and cultured in a standard way. Genomic DNA was isolated from cultured U87MG cells using Qiagen Gentra Puregene reagents. DNA was stored at −20C until library generation. ABI Sequencing Long-Mate-Paired Library Construction: The U87MG genomic DNA 2× 50bp long mate-paired library construction was carried out using the reagents and protocol provided by Applied Biosystems (SOLiD 3 System Library Preparation Guide). A similar protocol was reported previously [17]. Briefly, 45ug of genomic DNA was fragmented by HydroShear (Digilab Genomic Solutions Inc) to 1.0–2.5kb. The fragmented DNA was repaired by the End-It DNA End-Repair Kit (Epicentre). Subsequently, the LMP CAP adaptor was ligated to the ends. DNA Fragments between 1.2–1.7kb were selected by 1.0% agarose gel to avoid concatamers and circularized with a biotinylated internal adaptor. Non-circularized DNA fragments were eliminated by Plasmid-Safe ATP-Dependent DNase (Epicentre) and 3ug of circularized DNA was recovered after purification. Original DNA nicks at the LMP CAP oligo/genomic insert border were translated into the target genomic DNA about 100bp by nick translation using E. coli DNA polymerase I. Fragments containing the target genomic DNA and adaptors were cleaved from the circularized DNA by single-strand specific S1 nuclease. P1 and P2 adaptors were ligated to the fragments and the ligated mixture was used to create two separate libraries with 10 cycles of PCR amplification. Finally, 250–300bp fragments were selected to generate mate paired sequencing libraries with average target genomic DNA on each end around 90bp by excision from PAGE gel and use as emulsion PCR template. Templated Beads Preparation: The templated beads preparation was performed using the reagents and protocol from the manufacturer (Applied Biosystems SOLiD 3 Templated Beads Preparation Guide). SOLiD 3 Sequencing: The 2×50b mate-paired sequencing was performed exactly according to the Applied Biosystems SOLiD 3 System Instrument Operation Guide and using the reagents from Applied Biosystems. Exon Pull-Down Capture Sequencing with Illumina GAII We used an array pull-down capture strategy established in our lab [30]. An Agilent custom array for capturing 5,253 “cancer-related” genes was designed through Agilent e-array system (www.agilent.com). Only the amino acid encoding regions were targeted with 60mer oligos spaced center-to-center 20–30bp. The probes were randomly distributed across two separate 244K arrays. The library for cancer gene capture sequencing was generated following the standard Illumina paired-end library preparation protocol. 5ug of genomic DNA was used for the starting material and 250–300bp fragments were size-selected during the gel-extraction step. In the last step, 18 cycles of PCR were performed in multiple tubes to yield 4ug of product and mixed with 50ug of Human Cot-1 DNA (Invitrogen), 52ul of Agilent 10× Blocking Agent, 260ul of Agilent 2× Hybridization Buffer and 10× molar concentration of unpurified Illumina paired-end primer pairs custom made according to the sequences provided by Illumina (Oligonucleotide sequences, 2008, Illumina, Inc: available on request from Illumina). The mix was then diluted with elution buffer for the final volume of 520ul and then incubated at 95°C for 3 min and 37°C for 30min. 490ul of the hybridization mix was added to the array and hybridized in the Agilent hybridization oven (Robins Scientific) for 65 hrs at 65°C, 20rpm. After hybridization, the array was washed according to the Agilent wash procedure A protocol. The second wash was extended to 5 minutes to increase the wash stringency. After washing, the array was stripped by incubating it in the Agilent hybridization oven at 95°C for 10min, 20rpm with 1.09× Titanium Taq PCR Buffer (Clonetech). After the incubation and collection of the solution, 4 tubes of PCR were performed with each tube containing 96ul of the collected solution, 1ul of dNTPs (10mM each), 1ul of Titanium Taq (Clonetech) and Solexa primers, 1ul each. 15 cycles of PCR was performed at the following condition: 30sec at 95°C, (10 sec at 95°C, 30 sec at 65°C, 30 sec at 72°C)×18 cycles, 5 min at 72°C and hold at 4°C. The amplified product was purified using QIAquick PCR Purification Kit and eluted in 30ul of EB. After confirming the size of the amplicon on 2% agarose gel and measuring the concentration, the amplicon was diluted to 10nM, the working concentration for cluster generation. The Illumina flowcell was prepared according to the manufacturer's protocol and the Genome Analyzer was run using standard manufacturer's recommended protocols. The image data produced were converted to intensity files and were processed through the Firecrest and Bustard algorithms (1.3.2) provided by Illumina to call the individual sequence reads. ABI SOLiD Sequence Alignment and Consensus Base Calling We used Blat-like Fast Accurate Search Tool version 0.5.3 (BFAST http://bfast.sourceforge.net) [19] to perform sequence alignment of the two-base encoded reads off the ABI SOLiD to the NCBI human reference genome (build 36.1). Utilizing the local alignment algorithm included in BFAST [31], we were able to simultaneously decode the short reads, while searching for color errors (encoding errors), base changes, insertions, and deletions. We found candidate alignment locations (CALs) for each end independently. We utilized ten indexes to be robust to up to six color errors, equating to a 12% per-read error rate: 1111111111111111111111 111110100111110011111111111 10111111011001100011111000111111 1111111100101111000001100011111011 111111110001111110011111111 11111011010011000011000110011111111 1111111111110011101111111 111011000011111111001111011111 1110110001011010011100101111101111 111111001000110001011100110001100011111 We also set parameters to use only informative keys when looking up reads in each index (BFAST parameter -K 8), and to ignore reads with too many CALs aggregated across all indexes (BFAST parameter -M 384). If reads mapped to greater than 384 locations, then they were categorized as ‘unmapped’. We then performed local alignment for each of the returned CALs, simultaneously decoding the read from color space searching for color errors (encoding errors), base changes, insertions, and deletions [31]. We choose the “best scoring” alignment, accepting an alignment only if it was at least the equivalent edit distance of two color errors away from the next best alignment. This is approximately similar to a ‘mapping quality’ of 20 or better from the MAQ program output, for reference. We removed duplicate reads using the alignment filtering utility found in DNAA (http://dnaa.sourceforge.net). For single-end and mate-paired reads where only one end mapped, we removed duplicates based on reads having identical stat positions. For mate-paired reads, we removed duplicates where both ends had the same start position. Illumina Genome Analyzer Sequence Alignment Illumina generated sequence was aligned to the NCBI human reference genome (build 36.1) using BFAST with the following parameters applied. Each end of the fragment library was mapped independently to identify CALs, utilizing ten indexes to be robust to errors and variants in the short (typically 36bp) reads: 1111111111111111111111 1111101110111010100101011011111 1011110101101001011000011010001111111 10111001101001100100111101010001011111 11111011011101111011111111 111111100101001000101111101110111 11110101110010100010101101010111111 111101101011011001100000101101001011101 1111011010001000110101100101100110100111 1111010010110110101110010110111011 We also set parameters to use only informative keys when looking up reads in each index (BFAST parameter -K 8), and to ignore reads with too many CALs aggregated across all indexes (BFAST parameter -M 1280). We then performed a standard local alignment for each CAL. Reads were declared mapped if a single unique best scoring alignment was identified within the genome. Duplicate reads were filtered out in the same manner as for the ABI SOLiD data. Single Nucleotide Variant and Small Insertion and Deletion Detection To find SNVs including SNPs and small indels, we assumed the MAQ consensus-calling model [20] utilizing the implementation in SAMtools [21]. We used a value of 0.0000007 for the prior of a difference between two haplotypes (-r parameter). This was chosen based on ROC analysis of a test dataset (data not shown). Structural Variation Detection Structural variations were detected using custom algorithms designed to comprehensively search for groups of mate-pair reads with aberrant paired-end insert size distributions that are consistently identifying a unique structural variant in the genome. We utilized the “dtranslocations” utility in the DNAA package (http://dnaa.sourceforge.net) for the primary structural variation candidate search. The utility first selected all pairs for which each end is uniquely mapped to a single location in the human genome and for which the mate-pair reads are not positioned in the expected size range relative to the consensus genome. Then we filter out false positives that are not consistent with a chromosomal difference on an allele. Briefly, the genome was divided into 500-base bins sequentially stepped 100-bases apart from their start positions. Each bin was then paired with other bins on the basis of containing similar ‘mismapped’ mate-pair reads. The aberrant mate-paired reads were defined as reads that were mapping less than 1000 or greater than 2000 bases apart within the reference genome sequence, which is selected based on the insert size distribution calculated from the aggregate dataset (Figure S2). These were then rank-ordered based on the number of mate-pairs meeting criteria, and the destination bin with the most reads within it was paired with a given source bin to create a ‘binset’. Binsets containing less than 4 reads were filtered out, removing 98.3% of the candidates based on having too little evidence supporting them. The resulting list of filtered binsets was then scanned for clusters of binsets. Binset clusters are groups of binsets where the source bins occur within 2000 bases of each other and the destination bins occur within 2000 bases of each other. Redundant binsets were combined and those binset clusters that contain too few (less than 9 binsets spanning at least 1000 bases) or too many binsets (greater than 29 binsets spanning at most 3000 bases—higher is impossible given our insert size distribution) were removed as artifacts. The resulting binset clusters represent the reads immediately flanking structural breakpoint events. This detection process is currently being automated as Breakway (http://breakway.sourceforge.net), but was done using custom scripts at the time of analysis. The structural variations were then separated into interchromosomal and intrachromosomal events. Intrachromosomal events of less than 1Mb are assessed for deletion status by averaging base coverage within the bounds of the event and comparing it to base coverage 200kb outside the event on both sides. Those that have average interior base coverage less than 25% of the average exterior base coverage are classified as “complete” deletions. Those with average interior base coverage between 25% and 75% that of average exterior base coverage are classified as “heterozygous deletions” (deletions of at least one copy of the region, but with at least one copy remaining). Genes Affected by Mutations in Coding Sequence Variant calls from the SAMtools pileup tool were first loaded into a SeqWare QueryEngine database and subsequently filtered to produce BED files. This filtering criteria required that a variant be seen at least 4 times and at most 60 times with an observation occurring on each strand at least once. For SNVs we further enforced the criteria that SNVs should only be called in reads lacking indels and the last 5 bases of the reads were also ignored. This reduced the likelihood that spurious mismappings were used to predict SNVs and eliminated the lowest quality bases from consideration. For small indels (<21bp) we enforced a slightly different filter by requiring that any reads supporting an indel were only allowed to contain one contiguous indel and these reads were not considered if the indel occurred on either the beginning or end of the read. These criteria, like the SNV criteria, were used to reduce the likelihood of using mismapped reads or locally misaligned reads in the variant calling algorithm. The elimination of reads with indels at the beginning or end of the read was intended to remove potential alignment artifacts caused by ambiguous gap introduction due to lack of information at the ends to guide proper alignment. Together, these filtering criteria reduced the likelihood that sequencing errors were identified as SNV or indel variants. We used scripts available in the BFAST toolset and SeqWare Pipeline to filter and annotate the variant calls. Variants passing these filters were further annotated by their overlap with dbSNP version 129. Variants were required to share the same genomic position as a dbSNP entry along with matching the allele present in the database to be considered overlapping. Mapping to dbSNP allowed us to filter out known SNPs from de novo variants. Filtered SNV and indel variants were then analyzed for their affect within the genome that is annotated with gene models. This analysis used scripts from the SeqWare Pipeline project and gene models downloaded from the UCSC hg18 human genome annotation database. Six different gene model sets from hg18 were considered: UCSC genes (knownGene), RefSeq genes (refGene, http://www.ncbi.nlm.nih.gov/RefSeq), Consensus Coding Sequence genes (ccdsGene, http://www.ncbi.nlm.nih.gov/CCDS), Mammalian Gene Collection genes (mgcGenes, http://mgc.nci.nih.gov), Vertebrate Genome Annotation genes (vegaGene, http://vega.sanger.ac.uk), and Ensembl genes (ensGene, http://www.ensembl.org). Each variant was evaluated for overlap with genes from each of the 6 gene models. If overlap was detected the variant was examined and tagged with one or more of the following terms depending on the nature of the event: “utr-mutation”, “coding-nonsynonymous”, “coding-synonymous”, “abnormal-ref-gene-model-lacking-stop-codon”, “abnormal-ref-gene-model-lacking-start-codon”, “frameshift”, “early-termination”, “inframe-indel”, “intron-splice-site-mutation”, “stop-codon-loss”, and/or “start-codon-loss”. The variant was also tagged with the gene symbol and other accessions to facilitate lookups. This information was loaded into a SeqWare QueryEngine database to allow for querying and filtering of the variants as needed. Genes affected by structural variations were assessed in two ways depending on the structural variation type. For interchromosomal translocation events, a gene was considered “affected” when either end of an interchromosomal translocation event fell in a genic region (including the entire coding region plus 1kb up- or down-stream of the gene's coding region). The same criteria were used for all intrachromosomal translocation events. For events that were classified as complete or heterozygous deletions, a gene was considered affected if all or part of a coding exon was deleted. Annotation of Relevant Mutated Genes Homozygous SNVs, small indels, large deletions, and translocation events for variants that included predicted coding sequence changes were tallied. This became a reference list of variants with serious homozygous mutations that likely completely disrupted, or “knocked out”, the normal function or synthesis of the target protein. For the SNVs and small indels, a “knockout” variant was defined as a homozygous call by the SAMtools variant caller where the variant was predicted by the SeqWare Pipeline scripts to change coding sequence with one or more of the following annotations: “early-termination”, “frameshift”, “intron-splice-site-mutation”, “start-codon-loss”, and/or “stop-codon-loss”. The “early-termination” event represented a stop codon introduced upstream of the annotated stop codon. The “frameshift” represented an indel that resulted in a shifting of the reading frame of the gene resulting in, typically, early termination and non-sense coding sequence. The “intron-splice-site-mutation” referred to a mutation in the two consensus splice site intronic bases flanking exons (GT at the 5′ splice site and AG at the 3′ splice site). Finally, “stop-codon-loss” and “start-codon-loss” simply refer to variants that interrupt the stop or start codons. We chose to not include “coding-nonsynonymous” and “inframe-indel” annotations in this list of knocked out variants because, while potentially serious as these mutations are, they are not guaranteed to result in an unexpressed or non-functional protein. However, homozygous frameshift, early termination, splice site, and stop/start codon loss mutations are very likely to interrupt a gene's expression and translation to functional protein. As described above, large microdeletions that removed all or part of an exon and interchromosomal translocation events that fell within 1kb of a gene's coding region were also classified as mutated genes. Once suspect knockout variants were identified, a mapping process was used to translate one or more variants to the gene symbol. This mapping allowed us to condense multiple variants affecting multiple gene models to a more abbreviated list of gene symbols likely to be affected by these knockout mutations. The mapping from variants to gene symbols used variants identified with gene models from the refGene and the knownGene tables in the UCSC hg18 database and mapped these variants to gene symbols using queries against the name field of the knownGene table and the alias field of the kgAlias table. The UCSC table browser was used to accomplish these queries and map the knownGene identifiers to gene symbols via the kgXref table. A similar approach was used for homozygous large-scale microdeletions and translocation events. DAVID/EASE Analysis The list of knockout genes was uploaded to the Database for Annotation, Visualization, and Integrated Discovery (DAVID, version 2008) to identify enriched Gene Ontology (GO) terms [32]–[33]. Overlap with GO terms from the biological process, cellular component, and molecular function ontologies were considered. The default parameters were used and a p-value cutoff of < = 0.01 was considered significant. Cancer Gene Census The overlap between the Cancer Gene Census genes and those identified as knockouts in U87MG were compared. The Cancer Gene Census project is an ongoing effort to catalog genes with mutations that have been implicated in cancer [34]. It is a highly curated list that includes annotations for each gene including tumor types, class of mutations, and other genetic properties. We used the gene symbol list from the September 30th, 2009 complete working list, which includes 412 gene symbols. TCGA The overlap between mutations in the Cancer Genome Atlas (TCGA) and those identified as knockouts in U87MG was analyzed. TCGA is an ongoing effort to understand the molecular basis of cancer through large-scale copy number analysis, expression profiling, genome sequencing, and methylation studies among other techniques [4]. It provides information on mutations found by Sanger sequencing on many patient samples. For glioblastoma this includes sequence data aberrations detected in 158 patient samples and 1,177 genes. Genomic Identification of Significant Targets in Cancer The Genomic Identification of Significant Targets in Cancer (GISTIC) method was used to find significant areas of deletion in 293 samples from the TCGA [24]. The GISTIC technique was designed to identify and analyze chromosomal aberrations across a set of cancer samples, based on the amplitude of the aberrations as well as the frequency across samples. This approach produced a series of commonly deleted regions across the set of TCGA GBMs. To calculate the areas of deletion, we used 293 Affymetrix SNP 6.0 samples segmented using the GLAD SNP analysis module [35]. Default parameters of GISTIC were used. GISTIC produces peak limits, wide peak limits, and in addition broader region limits. These commonly deleted broader regions were then scanned for predicted knockout genes in U87MG. Indel Size Distribution and Nucleotide Substitution Frequencies The distribution of small indel sizes was examined for both deletions and insertions. Indels classified as affecting coding-sequence by the SeqWare Pipeline (see above) were compared to those outside coding regions. Raw counts were collected, recalculated as percents of total, and compared directly. Similarly, nucleotide substitution frequency was examined for SNVs from U87MG both genome-wide and only in coding regions. Once binned appropriately, the SNV nucleotide substitutions were counted, tallied in a table, and graphed as percents of total. Individual Genome Comparison Variants from the Watson and Yan Huang genome were downloaded from each respective project from the following URLs: ftp://jimwatsonsequence.cshl.edu/jimwatsonsequence/watson-454-snp-v01.txt.gz and http://yh.genomics.org.cn/do.downServlet?file=data/snps/yhsnp_add.gff. These files contained variant calls for each genome along with annotations describing the variant as novel or occurring in dbSNP. The Watson genome only contained SNV calls so our comparison was limited to just SNVs. The Yan Huang genome also contained calls indicating heterozygous or homozygous. However, a variant was considered to match between genomes regardless of zygosity state. We compared the overlap of the U87MG genome, dbSNP and each of these genomes in turn. SNVs from U87MG that were considered for comparison had to meet our criteria; variants had to be observed at least 4 times, at most 60 times, at least once per strand, and with a minimum phred score of 10. SNVs in the three-way comparison were said to match if the position and allele matched between the genomes. If both variants matched between U87MG and the other genome and one was annotated in dbSNP, then the other was considered in dbSNP as well. If neither contained annotations from dbSNP the variant was considered novel. A similar process was carried out for variants distinct to each genome. The results were recorded as Venn diagrams showing the overlap between dbSNP, U87MG, and the Watson or Yan Huang genome. Illumina SNP Chip Genomic DNA from U87MG was submitted to the Southern California Genotyping Consortium to be run on the Illumina Human 1M-Duo BeadChip, which consists of 1,199,187 probes scattered across the human genome. The Illumina Beadstudio program was used to analyze the resulting intensity data. Loss of heterozygosity was determined by analyzing B-allele frequency as determined by the Beadstudio program. Normal two-copy regions of the genome are represented by long stretches of probes with B-allele frequencies of 0, 0.5 or 1. Regions of LOH, on the other hand, deviate from this pattern significantly. Copy number was determined by looking at probe intensity. Sanger Sequencing Validation Primers for validation were designed by targeting regions immediately flanking the event predicted by our whole genome sequence analysis using the Primer3 tool (http://frodo.wi.mit.ed/primer3/). Polymerase chain reaction was performed following standard protocols using Finnzymes Phusion Hot-Start High Fidelity polymerase. Products were run on 2% agarose gel electrophoresis and product purity and size was assessed by staining with ethidium bromide. Sanger sequencing was performed at the UCLA Genotyping and Sequencing core facility using an ABI 3730 Capillary DNA Analyzer. Sequence trace files were analyzed using Geospiza FinchTV. Validation status and PCR primers are listed in Table S1. Data Deposition and Availability Intensities, quality scores, and color space sequence for the genomic sequence of U87 SOLiD were uploaded to the Sequence Read Archive under the accession SRA009912.1/Sequence of U87 Glioblastoma Cell-line. Intensities, quality scores, and nucleotide space sequence for the exon capture U87 Illumina sequence were also uploaded to the Short Read Archive under the same accession. For both datasets, alignment files have been uploaded to the Short Read Archive as additional analysis results. Variant calls for both datasets are available via a SeqWare QueryEngine web service at http://genome.ucla.edu/U87. This tool allows for querying the variants using a variety of search criteria including coverage, mutational consequence, gene symbol, and others. SeqWare QueryEngine produces results in both BED and WIG format making it compatible with the majority of genome browsers such as the UCSC genome and table browsers. Variant data will be uploaded to SRA as metadata along with the raw sequences. For the whole genome SOLiD alignment, small indels (<21bp), SNVs, large deletions, and translocation events can be queried. For the exon capture Illumina alignment, small indels and SNVs can be queried. Software Availability Most software used for this project is open-source and freely available. We created two software projects that were instrumental in the analysis of the U87MG data: BFAST and SeqWare. The color- and nucleotide-space alignment tool BFAST can be downloaded from http://bfast.sourceforge.net and many of our alignment filtering as well as the primary step in structural variation detection can be found in the DNAA package at http://dnaa.sourceforge.net. The SeqWare software project was used throughout the analysis of variant calls. We used the SeqWare LIMS tool for sample tracking, the SeqWare Pipeline analysis programs for annotating variants with dbSNP status and mutational consequence predictions, and SeqWare QueryEngine was used to database and query variant calls and annotations. This software and documentation can be downloaded from http://seqware.sourceforge.net. Supporting Information Figure S1 Concordance between Solexa capture data and SOLiD whole genome data. The left plot displays the SNP call concordance between each experiment (Solexa capture data in blue, SOLiD whole genome data in red) with the Illumina 1M Beadchip microarray for the 8.5Mb of sequence pulled down in the capture experiment. The right plot displays concordance of the non-reference (mutant) allele calls with the array data for those regions. (0.43 MB TIF) Click here for additional data file. Figure S2 Paired end insert size distribution. Empirical paired end insert size distribution for reads where both ends aligned with duplicates removed. (0.41 MB TIF) Click here for additional data file. Figure S3 Alignment is robust against genome-wide repeat elements. Circos plot [35] of reads spanning a complete microdeletion on chromosome 2, bases 201855000–201858000, are shown in dark blue, with the normal reads in the surrounding region in light blue. The green plot shows base-coverage at each position. The outermost track shows the structure of a gene, CASP8, overlapping this region (large boxes-exons, lines-introns). The track containing black and red boxes shows genome-wide repeat elements (black-LINE, red-SINE). Note the high density of reads even over conserved LINE elements. Some SINE elements do demonstrate a drop in alignments, but these do not prevent the identification of structural variation-spanning reads. (0.21 MB TIF) Click here for additional data file. Figure S4 Commonly deleted regions in GBM according to GISTIC. This deletion plot shows significant regions of deletion in 293 GBM samples from the TCGA. The top of the plot shows the G-score and the bottom shows the q-values. G-score reflects the frequency and amplitude of the deletion. Q-values greater than 0.25 were considered significant. Overlap of genes mutated in U87 via SNVs or Indels and broad regions of deletion are considered to be likely cancer targets. This includes all or part of chromosomes 1, 6, 9, 10, 13, 14, 15, and 22. (0.43 MB TIF) Click here for additional data file. Table S1 PCR and dideoxy sequencing validation. A list of the variants that were validated by PCR and dideoxy sequencing including primers used, varient location, and validation status. (0.03 MB XLS) Click here for additional data file. Table S2 Structural variants in U87MG. All structural variants listed as regions immediately flanking the genomic breakpoint. (0.18 MB XLS) Click here for additional data file. Table S3 Genes knocked out by SNVs/Indels. List of all genes predicted to be knocked out by SNVs and Indels in U87MG. (0.20 MB XLS) Click here for additional data file. Table S4 Genes affected by structural variants. List of all genes predicted to be affected by structural variants in U87MG. (0.48 MB XLS) Click here for additional data file. Table S5 Annotation of mutated genes. Lists of genes predicted to be mutated in U87MG annotated by various cancer-related gene databases. (0.17 MB XLS) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                PLoS ONE
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, USA )
                1932-6203
                2013
                31 May 2013
                : 8
                : 5
                : e64561
                Affiliations
                [1 ]Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
                [2 ]Institute for Molecular Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
                [3 ]Crump Institute for Molecular Imaging, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
                [4 ]Jonsson Comprehensive Cancer Center, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
                [5 ]Broad Stem Cell Research Center, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
                University of Alabama at Birmingham, United States of America
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: HRC TGG WJS EA. Performed the experiments: EA WJS AGY. Analyzed the data: EA WJS AC AGY TGG HRC. Contributed reagents/materials/analysis tools: TGG SJB. Wrote the paper: WJS EA HRC.

                Article
                PONE-D-13-01196
                10.1371/journal.pone.0064561
                3669316
                23741339
                938bc553-27e7-43da-9373-dfd2b291fa08
                Copyright @ 2013

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 2 January 2013
                : 15 April 2013
                Page count
                Pages: 7
                Funding
                H.R. Christofk is a Damon Runyon–Rachleff Innovation Awardee supported (in part) by the Damon Runyon Cancer Research Foundation, the Searle Scholars Program, the National Institutes of Health Director's New Innovator Award (DP2 OD008454-01), and the Caltech/UCLA Nanosystems Biology Cancer Center (NCI U54 CA151819). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology
                Biochemistry
                Bioenergetics
                Energy-Producing Organelles
                Metabolism
                Carbohydrate Metabolism
                Metabolic Pathways
                Oxygen Metabolism
                Small Molecules
                Molecular Cell Biology
                Gene Expression
                Medicine
                Oncology
                Basic Cancer Research

                Uncategorized
                Uncategorized

                Comments

                Comment on this article