61
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Landscape of X chromosome inactivation across human tissues

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          X chromosome inactivation (XCI) silences the transcription from one of the two X chromosomes in mammalian female cells to balance expression dosage between XX females and XY males. XCI is, however, characteristically incomplete in humans: up to one third of X-chromosomal genes are expressed from both the active and inactive X chromosomes (Xa and Xi, respectively) in female cells, with the degree of “escape” from inactivation varying between genes and individuals 1,2 (Fig. 1). However, the extent to which XCI is shared between cells and tissues remains poorly characterized 3,4 , as does the degree to which incomplete XCI manifests as detectable sex differences in gene expression 5 and phenotypic traits 6 . Here we report a systematic survey of XCI integrating over 5,500 transcriptomes from 449 individuals spanning 29 tissues from GTEx (V6 release), and 940 single-cell transcriptomes, combined with genomic sequence data (Fig. 1). We show that XCI at 683 X-chromosomal genes is generally uniform across human tissues, but identify examples of heterogeneity between tissues, individuals, and cells. We show that incomplete XCI affects at least 23% of X-chromosomal genes, identify seven new escape genes supported by multiple lines of evidence, and demonstrate that escape from XCI results in sex biases in gene expression, establishing incomplete XCI as a likely mechanism introducing phenotypic diversity 6,7 . Overall, this updated catalogue of XCI across human tissues informs our understanding of the extent and impact of the incompleteness in the maintenance of XCI. Mammalian female tissues consist of two mixed cell populations, each with either the maternally or paternally inherited X chromosome marked for inactivation. To overcome this heterogeneity, assessments of human XCI have often been confined to the use of artificial cell systems 1 , or samples presenting with skewed XCI 1,2 , i.e. preferential inactivation of one of the two X chromosomes, which is common in clonal cell lines but rare in karyotypically normal, primary human tissues 8 (Supplementary Note, Extended Data Fig. 1). Others have used bias in DNA methylation 3,4,9 or in gene expression 5,10 between males and females as a proxy for XCI status. Surveys of XCI are powerful in engineered model organisms, e.g. mouse models with completely skewed XCI 11 , but the degree to which these discoveries are generalizable to human XCI remains unclear given marked differences in XCI initiation and extent of escape across species 7 . Here, we describe a systematic survey of the landscape of human XCI using three complementary RNA sequencing (RNA-seq)-based approaches (Fig. 1) that together allow an assessment of XCI from individual cells to population across a diverse range of human tissues. Given the limited accessibility of most human tissues, particularly in large sample sizes, no global investigation of the impact of incomplete XCI on X-chromosomal expression has been conducted in data sets spanning multiple tissue types. We used the Genotype Tissue Expression (GTEx) project 12 data set (V6 release), which includes high-coverage RNA-seq data from diverse human tissues, to investigate male-female differences in the expression of 681 X-chromosomal protein-coding and long non-coding RNA (lncRNA) genes in 29 adult tissues (Extended Data Table 1), hypothesizing that escape from XCI should typically result in higher female expression of these genes. Previous work 5,10,13 has indicated that some escape genes show female bias in expression, but our analysis benefits from a larger set of profiled tissues and individuals, as well as the high sensitivity of RNA-seq. To confirm that male-female expression differences reflect incomplete XCI, we assessed the enrichment of sex-biased expression in known XCI categories using 561 genes with previously assigned XCI status, defined as escape (N=82), variable escape (N=89) or inactive (N=390)(Fig. 1, Supplementary Table 1). Sex-biased expression is enriched in escape genes compared to both inactive (two-sided paired Wilcoxon P=3.73×10-9) and variable escape genes (P=3.73×10-9) (Fig. 2b, Extended Data Fig. 2), with 74% of escape genes showing significant (false discovery rate (FDR) q-value < 0.01) male-female differences in at least one tissue (Fig. 2a, Extended Data Fig. 3-4, Supplementary Table 2). In line with two active X-chromosomal copies in females, escape genes in the non-pseudoautosomal, i.e. the X-specific, region (nonPAR) predominantly show female-biased expression across tissues (52 out of 67 assessed genes, binomial P=6.46×10-6). However, genes in the pseudoautosomal region PAR1, are expressed more highly in males (14/15 genes, binomial P=9.77×10-6) (Fig. 2a), suggesting that combined Xa and Xi expression in females fails to reach the expression arising from X and Y chromosomes in males (discussed below). Sex bias of escape genes is often shared across tissues; these genes show a higher number of tissues with sex-biased expression than genes in other XCI categories (Fig. 2a, Extended Data Fig. 2c), a result not driven by differences in the breadth of expression of escape and inactive genes (Extended Data Fig. 2e).Also, the direction of sex bias across tissues is consistent (Fig. 2a,c, Extended Data Fig. 2b). Together these observations point toward global and tight control of XCI, potentially arising from early lockdown of the epigenetic marks regulating XCI. Previous reports have identified several epigenetic signatures associated with XCI escape in human and mouse 14 ; in agreement with these discoveries we find that escape genes are enriched in chromatin states related to active transcription (Fig. 2e). While sex bias on the X chromosome is broadly specific to escape genes, some genes show unexpected patterns. Eight genes with some previous evidence for inactivation show >90% concordance in effect direction and significant sex bias (Fig. 2d, Supplementary Table 3), e.g. CHM, that replicates in single-cell RNA-seq (scRNA-seq; see below), suggesting that variable escape can also have considerable population-level impact. One gene without an assigned XCI status shows a similar sex bias pattern to escape genes; RP11-706O15.3 (Fig. 2d) resides between escape and variable escape genes PRKX and NLGN4X, consistent with known clustering of escape genes 1,2 . Some escape genes show more heterogeneous sex bias, e.g., ACE2 (Fig. 2a, Supplementary Discussion). Many of such genes lie in the evolutionarily older region of the chromosome 15 , in Xq, where escape genes also show higher tissue-specificity and lower expression levels (Extended Data Fig. 5), characteristics linked with higher protein evolutionary rates 16,17 . While sex bias serves as a proxy for XCI status, it provides only an indirect measure of XCI. We identified a GTEx female donor with an unusual degree of skewing of XCI (Fig. 3a), the same copy of chrX being silenced in ∼100% of cells across all tissues, yet without any X-chromosomal abnormality detected by whole-genome sequencing (WGS) (Supplementary Note, Extended Data Fig. 6), providing an opportunity to leverage allele-specific expression (ASE) across 16 tissues to investigate XCI. This approach is analogous to previous surveys in mouse 11 or in human cell lines with skewed XCI 2 , but extends the assessment to larger number of tissues and avoids biases arising from genetic heterogeneity between tissue samples. Analysis of the X-chromosomal allelic counts (Supplementary Tables 4-6) from this GTEx donor highlights the incompleteness and consistency of XCI across tissues (Fig. 3b). Approximately 23% of the 186 X-chromosomal genes assessed show expression from both alleles, indicative of incomplete XCI, matching previous estimates of the extent of escape 1,2 . For 43% of the genes expressed biallelically in this sample, Xi expression is of similar magnitude between tissues, thus supporting the observation of general global and tight control of XCI. However, suggesting some tissue-dependence in XCI, the rest of biallelically expressed genes show variability in Xi expression, including a gene subset (5.8% of all genes) that appear biallelic in only one of the multiple tissues assayed. While tissue-specific escape is common in mouse 11 , limited evidence exists for such a pattern in human tissues beyond neurons 3,4,9 . In our data, among the genes with the strongest evidence for tissue-specific escape is KAL1 (Fig. 3f, Supplementary Table 6), the causal gene for X-linked Kallmann syndrome; here KAL1 shows biallelic expression exclusively in lung (Fig. 3f), in line with the strong female bias detected specifically in lung expression in the previous analysis (Fig. 2a), suggesting that tissue differences in escape can directly translate to tissue-specific sex biases in gene expression. Altogether, the predictions of XCI status in this sample align with previous assignments (Supplementary Table 7, e.g. TSR2, XIST and ZBED1, Fig. 3c-f), but suggest five new incompletely inactivated genes (Fig. 3g-k, Supplementary Table 5), three of which act in a tissue-specific manner. For instance,CLIC2, in previous studies called either subject 2 to or variably escaping 1 from XCI, shows considerable Xi expression only in skin tissue. Such specific patterns illustrate the need to assay multiple tissue types to fully uncover the diversity in XCI. The emergence of scRNA-seq methods 18 presents an opportunity to directly assess XCI without the complication of cellular heterogeneity in bulk tissue samples (Fig. 1), as demonstrated recently in mouse studies 19-22 , and in human fibroblasts 23 and preimplantation development 24 . To directly profile XCI in human samples, we examined scRNA-seq data in combination with deep genotype sequences from 940 immune-related cells from four females: 198 cells from LCLs sampled from three females of African (Yoruba) ancestry, and 742 blood dendritic cells from a female of Asian ancestry 25 (Fig. 1, Extended Data Table 2). We utilized ASE to distinguish the expression coming from each of the two X-chromosomal haplotypes in a given cell (Supplementary Table 4). As the inference of allele-specific phenomena in single cells is complicated by widespread monoallelic expression 20,26-28 , besides searching for X-chromosomal sites with biallelic expression (Extended Data Fig. 7), we leveraged genotype phase information to detect sites where the expressed allele was discordant with the active X chromosome in that cell. Only 129 (78%) out of the 165 assayed genes (41-98 per sample)were fully inactivated in these data while the rest showed incomplete XCI in one or more samples (Fig. 4a-b, Supplementary Tables 8-9), largely consistent with previous assignments of XCI status to these genes (Fig. 4a, Supplementary Table 10). For instance, single cell data reveal consistent expression from both X-chromosomal alleles for eleven genes in PAR1, in line with their known escape from XCI (e.g. ZBED1, Fig. 4c), and replicate the known expression of XIST exclusively from Xi (Fig. 4d). We next assessed whether our approach could extend the spectrum of escape from XCI. For seven genes previously reported as inactivated the data from single cells pointed to incomplete XCI (Fig. 4e-k, Supplementary Table 11), including FHL1 (Fig. 4e), highlighted as a candidate escape gene also in the GTEx ASE analysis, and ATP6AP2 (Fig. 4h), which displays predominantly female-biased expression across GTEx tissues. Both of these genes demonstrate significant Xi expression in only a subset of the scRNA-seq samples, a pattern consistent with variable escape 1,2 . Between-individual variability exists not only in the presence but also in the degree of expression from Xi (e.g. MSL3, Fig. 4l). Highlighting the capacity of scRNA-seq to provide information beyond bulk RNA-seq, we identify examples where Xi expression varies considerably between the two X-chromosomal haplotypes within an individual (e.g. ASMTL; Supplementary Table 12), suggesting cis-acting variation as one of the determinants for the level of Xi expression 3 . As a further layer of heterogeneity in Xi expression, we find a unique pattern at TIMP1, where the level of Xi expression across cells is not significant, but exclusive to a subset of cells that express the gene biallelically (Extended Data Fig. 7), pointing to cell-to-cell variability in escape. Leveraging the ASE estimates from the scRNA-seq and GTEx analyses to infer the magnitude of the incompleteness of XCI,we find that expression from Xi at escape genes rarely reaches levels equal to Xa. Xi expression remains on average at 33% of Xa expression, yet with wide variability along the chromosome (Supplementary Discussion, Extended Data Fig. 8a), as demonstrated previously in specific tissue types 1,2 . Balanced expression dosage between males and females in PAR1 requires full escape from XCI, yet Xi expression remains below Xa expression also in this region (mean Xi to Xa ratio ∼0.80), pointing to partial spreading of XCI beyond nonPAR. For further support for the consistent male bias in PAR1 expression (Fig. 2a) being due to the incompleteness of escape, we observe no systematic up- or downregulation of Y chromosome expression in PAR1 (Extended Data Fig. 8b, Supplementary Discussion). As another consequence of the partial Xi expression, several of the X-Y homologous genes in nonPAR 29 become male-biased when expression from the Y chromosome counterpart is accounted for (Extended Data Fig. 8c). By combining diverse types and analyses of high-throughput RNA-seq data, we have systematically assessed the incompleteness and heterogeneity in XCI across 29 human tissues (Supplementary Table 13). We establish that scRNA-seq is suitable for surveys of human XCI and present the first steps towards understanding the cellular-level variability in the maintenance of XCI. Our phasing-based approach allows for the full use of low-coverage scRNA-seq, yet as any single individual and cell type is informative for restricted number of genes, larger data sets with more diverse cell types and conditions are required to fully profile XCI. We have thus utilized the multi-tissue GTEx data set to explore XCI in a larger number of X-chromosomal genes and to assess the tissue-heterogeneity and impacts of XCI on gene expression differences between the sexes. These analyses show that incomplete XCI is largely shared between individuals and tissues, and extend previous surveys by pinpointing several examples of variability in the degree of XCI escape between cells, chromosomes, and tissues. In addition, our data demonstrate that escape from XCI results in sex-biased expression in at least 60 genes, potentially contributing to sex differences in health and disease (Supplementary Discussion). As a whole, these results highlight the between-female and male-female diversity introduced by incomplete XCI, the biological implications of which remain to be fully explored. Methods GTEx data The GTEx project 12 collected tissue samples from 554 postmortem donors (187 females, 357 males; age range 20-70), produced RNA sequencing from 8,555 tissue samples and generated genotyping data for up to 449 donors (GTEx Analysis V6 release). More details of methods can be found in Aguet et al. (Aguet et al., co-submitted, Nature). All GTEx data, including RNA, genome and exome sequencing data, used in the analyses described are available through dbGaP under accession phs000424.v6.p1, unless otherwise stated. Summary data and details on data production and processing are also available on the GTEx Portal (http://gtexportal.org). Single-cell samples For the human dendritic cells samples profiled, the healthy donor (ID: 24A) was recruited from the Boston-based PhenoGenetic project, a resource of healthy subjects that are re-contactable by genotype 30 . The donor was a female Asian individual from China, of 25 years of age at the time of blood collection. She was a non-smoker, had normal BMI (height: 168.7cm; weight: 56.45kg; BMI: 19.8), and normal blood pressure (108/74). The donor had no family history of cancer, allergies, inflammatory disease, autoimmune disease, chronic metabolic disorders or infectious disorders. She provided written informed consent for the genetic research studies and molecular testing, as previously reported 25 . Daughters of three parent-child Yoruba trios from Ibadan, Nigeria, (i.e. YRI trios) collected as part of the International HapMap Project, were chosen for single-cell profiling both to maximize heterozygosity and due to availability of parental genotypes allowing for phasing. DNA and LCLs were ordered from the NHGRI Sample Repository for Human Genetic Research (Coriell Institute for Medical Research): LCLs from B-Lymphocyte for the three daughters (catalogue numbers: GM19240, GM19199, GM18518) and DNA extracted from LCLs for all members of the three trios (catalogue numbers: DNA: NA19240, NA19238, NA19239, NA19199, NA19197, NA19198, NA18518, NA18519, NA18520). These YRI samples are referred to by their family IDs: Y014, Y035 and Y117. Clinical muscle samples To assess whether PAR1 genes are equally expressed from X and Y chromosomes, a combination of skeletal muscle RNA sequencing and trio genotyping from eight male patients with muscular dystrophy, sequenced as part of an unrelated study, was used. Patient cases with available muscle biopsies were referred from clinicians starting April 2013 through June 2016. All patients submitted for RNA-sequencing had previously available trio whole exome sequencing with one sample having additional trio whole genome sequencing. Muscle biopsies were shipped frozen from clinical centers via liquid nitrogen dry shipper and, where possible, frozen muscle was sectioned on a cryostat and stained with H&E to assess muscle quality as well as the presence of overt freeze-thaw artifact. Genotyping The GTEx V6 release includes WGS data for 148 donors, including GTEX-UPIC. WGS libraries were sequenced on the Illumina HiSeqX or Illumina HiSeq2000. WGS data was processed through a Picard-based pipeline, using base quality score recalibration and local realignment at known indels. BWA-MEM aligner was used for mapping reads to the human genome build 37 (hg19). SNPs and indels were jointly called across all 148 samples and additional reference genomes using GATK's HaplotypeCaller version 3.1. Default filters were applied to SNP and indel calls using the GATK's Variant Quality Score Recalibration (VQSR) approach. An additional hard filter InbreedingCoeff <= -0.3 was applied to remove sites that VQSR failed to filter. WGS for one of the clinical muscle samples was performed on 500 ng to 1.5 ug of genomic DNA using a PCR-Free protocol that substantially increases the uniformity of genome coverage. These libraries were sequenced on the Illumina HiSeq × 10 with 151 bp paired-end reads and a target mean coverage of >307times;, and processed similarly as above. The Y117 trio (sample IDs NA19240 (daughter), NA19238 (mother), and NA19239 (father)) was whole-genome-sequenced as part of the 1000 Genomes project as described previously 31 . The VCF file containing the WGS-based genotypes for SNPs (YRI.trio.2010_09.genotypes.vcf.gz) was downloaded from the project's FTP site. The genotype coordinates (in human genome build 36) in the original VCF were converted to hg19 using the liftover script (liftOverVCF.pl) and chain files provided as part of the GATK package. WES was performed using Illumina's capture Exome (ICE) technology (Y035, Y014, 24A) or Agilent SureSelect Human All Exon Kit v2 exome capture (clinical muscle samples) with a mean target coverage of >80×. WES data was aligned with BWA, processed with Picard, and SNPs and indels were called jointly with other samples using GATK HaplotypeCaller package version 3.1 (24A, clinical muscle samples) or version 3.4 (Y035, Y014). Default filters were applied to SNP and indel calls using the GATK's Variant Quality Score Recalibration (VQSR) approach. A modified version of the Ensembl Variant Effect Predictor was used for variant annotation for all WES and WGS data. For trio WES or WGS data the genotypes of the proband were phased using the PhaseByTransmission tool of the GATK toolkit. Single cell data preparation and sequencing For profiling of healthy DCs, peripheral blood mononuclear cells (PBMCs) were first isolated from fresh blood within 2hrs of collection, using Ficoll-Paque density gradient centrifugation as previously described 32 . Single-cell suspensions were stained per manufacturer recommendations with an antibody panel designed to enrich for all known blood DC population for single cell sorting and single cell RNA-sequencing (scRNA-seq) profiling 25 . A total of 24 single cells from four loosely gated populations were sorted per 96-well plate, with each well containing 10ul of lysis buffer. A total of eight plates were analysed by single-cell RNA-sequencing. All LCL cell lines were cultured according to Coriell's recommendation (medium: RPMI 1640, 2mM L-glutamine, 15% fetal bovine serum (all three from ThermoFisher Scientific)) in T25 tissue culture flask with 10-20 ml medium at 37°C under 5% carbon dioxide. Cells were split upon reaching cell density of approximately 300,000-400,000 viable cells/ml. All three lymphoblast cultures were split once prior to proceeding with single cell sorting. Cells were washed with 1× PBS, pellet resuspended and stained with DAPI (Biolegend) for viability according to manufacturer's recommendation. All single live cells (for both DCs and LCL cell lines) were sorted in 96-well full-skirted eppendorf plate chilled to 4°C, pre-prepared with 10μl TCL buffer (Qiagen) supplemented with 1% beta-mercaptoethanol (lysis buffer) using BD FACS Fusion instrument. Single-cell lysates were sealed, vortexed, spun down at 300g at 4°C for 1 minute, immediately placed on dry ice and transferred for storage at -80°C. The Smart-Seq2 protocol was performed on single sorted cells as described 33,34 , with some modifications as described in Villani et al. 25 (Supplementary Methods). A total of 768 single DCs isolated from healthy Asian female individual, along with 96 single cells from GM19240, 48 single cells from GM19199, and 48 single cells from GM18518 were profiled. Briefly, single-cell lysates were thawed on ice purified, and reverse-transcribed using Maxima H Minus Reverse Transcriptase. PCR was performed with KAPA HiFi HotStart ReadyMix [KAPA Biosystems] and purified with Agencourt AMPureXP SPRI beads (Beckman-Coulter). The concentration of amplified cDNA was measured on the Synergy H1 Hybrid Microplate Reader (BioTek) using High-Sensitivity Qubit reagent (Life Technologies), and the size distribution of select wells was checked on a High-Sensitivity Bioanalyzer Chip (Agilent). Expected quantification was around 0.5-2 ng/μL with size distribution sharply peaking around 2kb. Library preparation was carried out using the Nextera XT DNA Sample Kit (Illumina) with custom indexing adapters, allowing up to 384 libraries to be simultaneously generated in a 384-well PCR plate (note that DCs were processed in 384-well plate while LCL were processed in 96-well plate format). The concentration of the final pooled libraries was measured using the High-Sensitivity DNA Qubit (Life Technologies), and the size distribution measured on a High-Sensitivity Bioanalyzer Chip (Agilent). Expected concentration of the pooled libraries was 10-30 ng/μL with size distribution of 300-700bp. For the DCs, we created pools of 384 cells, while 96 LCL samples were pooled at the time. We sequenced one library pool per lane as paired-end 25 base reads on a HiSeq2500 (Illumina). Barcodes used for indexing are listed in the Supplementary Methods. RNA-seq in GTEx RNA sequencing was performed using a non-strand-specific RNA-seq protocol with poly-A selection of RNA using the Illumina TruSeq protocol with sequence coverage goal of 50M 76 bp paired-end reads as described in detail previously 12 . The RNA-seq data, except for GTEX-UPIC, was aligned with Tophat version v1.4.1 to the UCSC human genome release version hg19 using the Gencode v19 annotations as the transcriptome reference. Gene level read counts and RPKMs were derived using the RNA-SeQC tool 35 using the Gencode v19 transcriptome annotation. The transcript model was collapsed into gene model as described previously 12 . Read count and RPKM quantification include only uniquely mapped and properly paired reads contained within exon boundaries. RNA-seq alignment to personalized genomes For the four single-cell samples and for GTEX-UPIC RNA-seq data was processed using a modification of the AlleleSeq pipeline 36,37 to minimize reference allele bias in alignment. A diploid personal reference genome for each of the samples was generated with the vcf2diploid tool 36 including all heterozygous biallelic single nucleotide variants identified in WES or WGS either together with (YRI samples) or without (GTEX-UPIC, 24A) maternal and paternal genotype information. The RNA-seq reads were then aligned to both parental references using STAR 38 version 2.4.1a in a per-sample 2-pass mode (GTEX-UPIC and YRI samples) or version 2.3.0e (24A) using hg19 as the reference. The alignments were combined by comparing the quality of alignment between the two references: for reads aligning uniquely to both references the alignment with the higher alignment score was chosen and reads aligning uniquely to only one reference were kept as such. RNA-seq of clinical muscle samples Patient RNA samples derived from primary muscle were sequenced using the GTEx sequencing protocol 12 with sequence coverage of 50M or 100M 76 bp paired-end reads. RNA-seq reads were aligned using STAR 38 2-Pass version v.2.4.2a using hg19 as the reference. Junctions were filtered after first pass alignment to exclude junctions with less than 5 uniquely mapped reads supporting the event and junctions found on the mitochondrial genome. The value for unique mapping quality was assigned to 60 and duplicate reads were marked with Picard MarkDuplicates (v.1.1099). Catalogue of X-inactivation status In order to compare results from the ASE and GTEx analyses with previous observations on genic XCI status we collated findings from two earlier studies 1,2 that represent systematic expression-based surveys into XCI. Each study catalogues hundreds of X-linked genes and together the data span two tissue types. Carrel and Willard 1 surveyed in total 624 X-chromosomal transcripts expressed in primary fibroblasts in nine cell hybrids each containing a different human Xi. In order to find the gene corresponding to each transcript, the primer sequences designed to test the expression of the transcripts in the original study were aligned to reference databases based on Gencode v19 transcriptome and hg19 using an in-house software (unpublished) (Supplementary Methods). In total 553 transcripts primer pairs were successfully matched to X-chromosomal Gencode v19 reference mapping together to 470 unique chrX genes (Supplementary Methods). These 470 genes were split into three XCI status categories (escape, variable, inactive) based on the level of Xi expression (i.e. the number of cell lines expressing the gene from Xi) resulting in 75 escape, 51 variable escape and 344 inactive genes. Cotton et al 2 surveyed XCI using allelic imbalance in clonal or near-clonal female LCL and fibroblast cell lines and provided XCI statuses for 508 genes (68 escape, 146 variable escape, 294 subject genes). The data was mapped to Gencode v19 using the reported gene names and their known aliases (Supplementary Methods), resulting in a list of XCI statuses for 506 X-chromosomal genes. The results were combined by retaining the XCI status in the original study where possible (i.e. same status in both studies or gene unique to one study) and for genes where the reported XCI statuses were in conflict the following rules were applied: 1) A gene was considered “escape” if it was called escape in one study and variable in the other, 2) “variable escape” if classified as escape and inactive, and 3) “inactive” if classified as inactive in one study and variable escape in the other. The final combined list of XCI statuses consisted of 631 X-chromosomal genes including 99 escape, 101 variable escape and 431 inactive genes. Analysis of sex-biased expression Differential expression analyses were conducted to identify genes that are expressed at significantly different levels between male and female samples using 29 GTEx V6 tissues with RNA-seq and genotype data available from more than 70 individuals after excluding samples flagged in QC and sex-specific, outlier (i.e. breast tissue) and highly correlated tissues 13 . Only autosomal and X-chromosomal protein-coding or lncRNA genes in Gencode v19 were included, and further all lowly-expressed genes were removed. (Supplementary Methods and Extended Data Table 1). Differential expression analysis between male and female samples was conducted following the voom-limma pipeline 39-41 available as an R package through Bioconductor (https://bioconductor.org/packages/release/bioc/html/limma.html) using the gene-level read counts as input. The analyses were adjusted for age, three principal components inferred from genotype data using EIGENSTRAT 42 , sample ischemic time, surrogate variables 43,44 built using the sva R package 45 , and the cause of death classified into five categories based on the 4-point Hardy scale (Supplementary Methods). To control the false discovery rate (FDR), the qvalue R package was used to obtain q-values applying the adjustment separately for the differential expression results from each tissue. The null hypothesis was rejected for tests with q-values below 0.01. XY homolog analysis A list of Y-chromosomal genes with functional counterparts in the X chromosome, i.e. X-Y gene pairs, was obtained from Bellott et al 29 , which lists 19 ancestral Y chromosome genes that have been retained in the human Y chromosome. After excluding two of the genes (MXRA5Y and OFD1Y), which were annotated as pseudogenes by Bellot et al and further four genes (SRY, RBMY, TSPY, and HSFY) that according to Bellot et al have clearly diverged in function from their X-chromosomal homologs, the remaining 13 Y-chromosomal genes were matched with their X chromosome counterparts using gene pair annotations given in Bellot et al or by searching for known paralogs of the Y-chromosomal genes. To test for completeness of dosage compensation at the X-Y homologous genes, the sex bias analysis in GTEx data was repeated replacing the expression of the X-chromosomal counterpart with the combined expression of the X and Y homologs. Chromatin state analysis To study the relationship between chromatin states and XCI, we used chromatin state calls from the Roadmap Epigenomics Consortium 46 . Specifically, we used the chromatin state annotations from the core 15-state model, publicly available at http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state. We followed our previously published method 47 to calculate the covariate-corrected percentage of each gene body assigned to each chromatin state. After pre-processing, we filtered down to the 399 inactive and 86 escape genes on the X chromosome, and down to 38 female epigenomes. To compare the chromatin state profiles of the escape and inactive genes in female samples, we used the one-sided Wilcoxon rank sum test. Specifically, for each chromatin state, we averaged the chromatin state coverage across the 38 female samples for each gene, and compared that average chromatin state coverage for all 86 escape genes to the average chromatin state coverage for all 399 inactive genes. We performed both one-sided tests, to test for enrichment in escape genes, as well as enrichment in inactive genes. Next, we performed simulations to account for possible chromatin state biases, such as the fact that the escape and inactive genes are all from the X chromosome. Specifically, we generated 10,000 randomized simulations where we randomly shuffled the “escape” or “inactive” labels on the combined set of 485 genes, while retaining the sizes of each gene set. For each of these simulated “escape” and “inactive” gene sets, we calculated both one-sided Wilcoxon rank sum test p-values as described above, and then, we calculated a permutation “p-value” for the real gene sets based on these 10,000 random simulations (Supplementary Methods). Finally, we used Bonferroni multiple hypothesis correction for our significance thresholds to correct for our 30 tests, one for each of 15 chromatin states, and both possible test directions. Allele-specific expression For ASE analysis the allele counts for biallelic heterozygous variants were retrieved from RNA-seq data using GATK ASEReadCounter (v.3.6) 37 . Heterozygous variants that passed VQSR filtering were first extracted for each sample from WES or WGS VCFs using GATK SelectVariants. The analysis was restricted to biallelic SNPs due to known issues in mapping bias in RNA-seq against indels 37 . Sample-specific VCFs and RNA-seq BAMs were inputted to ASEReadCounter requiring minimum base quality of 13 in the RNA-seq data (scRNA-seq samples, GTEX-UPIC) or requiring coverage in the RNA-seq data of each variant to be at least 10 reads, with a minimum base quality of 10 and counting only reads with unique mapping quality (MQ = 60) (clinical muscle samples). For downstream processing of the scRNA-seq and GTEX-UPIC ASE data, we applied further filters to the data to focus on exonic variation only and to conservatively remove potentially spurious sites (Supplementary Methods), e.g. sites with non-unique mappability were removed, and further, after an initial analysis of the ASE data, subjected 22 of the X-chromosomal ASE sites to manual investigation. For GTEX-UPIC the X-chromosomal ASE data was limited in case of multiple ASE sites to only one site per gene, by selecting the site with coverage >7 reads in the largest number of tissues, in order to have equal representation from each gene for downstream analyses. Assessing ASE across tissues For GTEX-UPIC sample for which ASE data from up to 16 tissues per each ASE site was available, we applied the two-sided Hierarchical Grouped Tissue Model (GTM*) implemented in MAMBA 1.0.0 48,49 to ASE data. The Gibbs sampler was run for 200 iterations with a burn-in of 50 iterations. GTM* is a Bayesian hierarchical model that borrows information across tissues and across variants, and provides parameter estimates that are useful for interpreting global properties of variants. It classifies the sites into ASE states according to their tissue-wide ASE profiles and provides an estimate of the proportion of variants in each of the five different ASE states (strong ASE across all tissues (SNGASE), moderate ASE across all tissues (MODASE), no ASE across all tissues (NOASE), and heterogeneous ASE across tissues (HET1 and HET0)). To summarize the GTM* output in the context of XCI, SNGASE was considered to reflect full XCI, MODASE and NOASE together to represent partial XCI with similar effects across tissues, and HET1 and HET0 to reflect partial yet heterogeneous patterns of XCI across tissues. In order to combine estimates from two ASE states, we summed the estimated proportions in each class, and subsequently calculated the 95% confidence intervals for each remaining ASE state using Jeffreys prior. Determining XCI status in GTEX-UPIC In addition to the ASE states provided by the above MAMBA analysis, genic XCI status was assessed by comparing the allelic ratios at each X-chromosomal ASE site in each tissue individually. For each ASE site, the alleles were first mapped to Xa and Xi; the allele with lower combined relative expression across tissues was assumed the Xi allele. As an exception, at XIST the higher expressing allele was assumed the Xi allele. The significance of Xi expression at each ASE observation was tested using a one-sided binomial test, where the hypothesized probability of success was set at 0.025, i.e., the fraction of Xi expression from total expression was expected to be significantly greater than 0.025. To account for multiple testing, FDR correction was applied, using the qvalue R package, to the P-values from the binomial test for each of the 16 tissues separately. Observations with q-values < 0.01 were considered significant, i.e., indicative of incomplete XCI at the given ASE site and tissue. Biallelic expression in single cells Biallelic expression in individual cells in the X chromosome was assessed only at ASE sites covered by the minimum of eight reads. A site was considered biallelically expressed when 1) allelic expression > 0.05, and 2) one-sided binomial test indicated allelic expression to be at least nominally significantly greater than 0.025. Only genes with at least two observations of biallelic expression across all cells within a sample were counted as biallelic. Phasing scRNA-seq data We assigned each cell to either of two cell populations distinguished by the parental X-chromosome designated for inactivation utilizing genotype phasing. For the YRI samples, where parental genotype data was available, the assignment to the two parental cell populations was unambiguous for all cells where X-chromosomal sites outside PAR1 or frequently biallelic sites were expressed. For 24A no parental genotype data was available, and hence we utilized the correlation structure of the expressed X-chromosomal alleles across the 948 cells to infer the two parental haplotypes utilizing the fact that in individual cells the expressed alleles at the chrX sites subject to full inactivation (i.e. the majority chrX ASE sites), are from the X chromosome active in each cell (Supplementary Methods). In other words, while monoallelic expression in scRNA-seq in the autosomes is largely stochastic in origin, in the X chromosome the pattern of monoallelic expression is consistent across cells with the same parental X chromosome active 21 , unless a gene is expressed also from the inactive X. As such, for the phase inference calculations, we excluded all PAR1 sites and all additional sites that were frequently biallelic, to minimize the contribution of escape genes to the phase estimation. After assigning each informative cell to either of the parental cell populations, the reference and alternate allele reads for each ASE site were mapped to active and inactive allele reads within each sample using the actual or inferred parental haplotypes. The data was first combined per variant by taking the sum of active and inactive counts separately across cells, and further similarly combined per gene, if multiple SNPs per gene were available. For 24A the allele expressed at XIST was assumed the Xi allele, in line with the exclusive Xi expression in the Yoruba samples confirmed using the information on parental haplotypes. Determining XCI status from scRNA-seq ASE Before calling XCI status using the Xa and Xi read counts from the phased data aggregated across cells, we excluded all sites without fewer than five cells contributing ASE data at each gene and also all sites with coverage lower than eight reads across cells within each sample. To determine whether the observed Xi expression is significantly different from zero, hence indicative of incomplete XCI at the site / gene, we required the Xi to total expression ratio to be significantly (q-value<0.01) greater than hypothesized upper bound for error, 0.025. This threshold was determined using the proportion of miscalled alleles at XIST ASE sites (by definition, XIST should express only alleles from the inactive chrX) in the two YRI samples, which presented with fully skewed XCI, i.e., the same active X chromosome across all assessed cells. Median proportion of miscalled XIST alleles was 0, yet one site in one of the samples showed up to 2.5% of other allele calls, and hence this was chosen as the error margin. FDR correction, conducted using the qvalue R package, was applied to each sample individually. Genes where at least one of the samples showed significant Xi expression were considered partially inactivated, while the remaining were classified as subject to full XCI. Allelic dropout, which is extensive in scRNA-seq 18,27 , can lead to biases in allelic ratios in individual cells, i.e., in our case resulting in false negatives where true escape genes are classified as inactivated, this approach utilized is based on using aggregate data across several cells and hence the XCI status estimates are robust to such errors. ChrX and chrY expression in PAR1 Using the parental origin of each allele reference and alternate allele read counts at PAR1 ASE sites were assigned to X and Y chromosomes (i.e. maternally and paternally inherited alleles, respectively). For each sample, the PAR1 ASE data was summarized by gene by taking the sum of X and Y chromosome reads across all informative ASE sites within each gene. Significance of deviation from equal expression was assessed using a two-sided binomial test. Manual curation of heterozygous variants from ASE analyses Twenty-two heterozygous variants assessed in chrX ASE analysis were subjected to manual curation due to providing results in the XCI analysis that were in conflict with previous assignment of the underlying gene to be subject to full XCI. For each sample, BWA-aligned germline bams were viewed in IGV using either WGS or WES data. The presence of a number of characteristics called into question the confidence of the variant read alignments and thus the variant itself (Supplementary Methods). Allele balance that deviated significantly from 50:50 was considered suspect and often coincided with the existence of homology between the reference sequence in the region surrounding the variant and another area of the genome, as ascertained using the UCSC browser self-chain track and/or BLAT alignment of variant reads from within IGV. Other sequence-based annotations added to the VCF by HaplotypeCaller were also evaluated in the interests of examining other signatures of ambiguous mapping. The phasing of nearby variants was also considered. If phased variants occurred in the DNA sequencing data that were not assessed in the ASE analysis, those variants were considered suspect. Extended Data Extended Data Figure 1 Assessment of skew in XCI in GTEx female samples (V3 analysis release) a) Shows the estimated skew in XCI by tissue across individuals and b) shows the skew in XCI by individual across tissue samples available. Number in brackets after tissue or sample name gives the number of individuals or tissues, respectively, contributing to each boxplot. Details of the analysis is given in the Supplementary Note. Extended Data Figure 2 Comparison of expression characteristics between reported genic XCI categories in the GTEx data a) Table showing the statistics for the comparison of the proportion of significantly biased (FDR<1%) genes by reported XCI status. Distributions are illustrated in Fig. 2b. N = 29 for all comparisons. b) Table showing the statistics for the comparison of the consistency in effect sizes across tissues. Distributions are illustrated in Fig. 2c. Only genes expressed in at least five of the 29 tissues are included. c) Number of tissues showing significant sex bias (FDR<1%) per gene by reported XCI status. d) Statistics for the comparison illustrated in c). e) Number of tissues in which genes are expressed by reported XCI status. f) Statistics for the comparison illustrated in e). All P-values are from two-sided Wilcoxon tests, except for a) where paired, two-sided Wilcoxon test was applied. Only genes assessed for sex bias in at least one tissue are included unless otherwise stated. Extended Data Figure 3 Change in the proportion of discovered sex-biased genes by XCI category with varying q-value cut-offs a) The proportion of sex-biased genes across tissues. Here a gene is classified as sex-biased if the q-value for association falls below the given threshold in at least one tissue. b-f) Examples of the change in the proportion of sex-biased expression in individual tissues. The dashed black line indicates the FDR<1% cut-off applied in the analyses to determine sex-biased expression. Extended Data Figure 4 Heatmap representation of male-female expression differences in all assessed X-chromosomal genes (N=681) across 29 GTEx tissues The color scale displays the direction of sex bias with red color indicating higher female expression. Genes that were too weakly expressed in the given tissue type to be assessed in the sex bias analysis are colored grey. Dots mark the observations where sex bias was significant at FDR<1% Extended Data Figure 5 Comparison of expression characteristics between Xp and Xq, the evolutionary newer and older regions of chrX, respectively, by XCI status and for the whole chromosome a) and b) show level of median expression across GTEx tissues in log2 RPKM units, and c) and d) show the breadth of expression measured as the number of tissues (max = 29) in which genes are expressed (median expression across samples > 0.1 RPKM and expressed in more than 10 individuals at >1 counts per million). P-values are calculated using the Wilcoxon Rank Sum test. All genes expressed in at least one tissue are included in the comparisons. Extended Data Figure 6 X-chromosomal RNA-seq and WGS data in the GTEx donor with fully skewed XCI (GTEX-UPIC) a) Allelic expression in chrX in 16 RNA-sequenced tissue samples available from the donor. Dashed red lines indicate PAR1 and PAR2 boundaries. b) Allele balance and allele depth across chrX in WGS for GTEX-UPIC and randomly chosen two female and one male GTEx WGS samples. Extended Data Figure 7 Expressed alleles at biallelically expressed ASE sites in scRNA-seq a) X-chromosomal genes repeatedly biallelic in scRNA-seq (see Methods for details). b) Illustration of the relative expression from the two alleles at all X-chromosomal ASE sites that were repeatedly biallelically expressed across cells in either of the two scRNA-seq samples that showed random XCI (Y035 and 24A). Narrow white lines separate observations from individual cells. Extended Data Figure 8 Assessment of the level of Xi expression at escape genes and in different regions of the X chromosome a) The ratio of Xi-to-Xa expression in the single cell samples (left panel; each circle represents a sample) and in the skewed XCI donor from GTEx (middle panel; each circle represents a tissue), and the female-to-male ratio in expression (right panel, each circle represents a tissue) at reported escape genes. Genes are ordered according to their location in the X chromosome with genes in the pseudoautosomal region residing in the top part of the figure. Dark border around a circle indicate there was significant evidence for Xi expression greater than the baseline in the given sample or tissue (left and middle panels) or significant sex-bias in the given tissue (right panel). Given some outliers, e.g. XIST, the Xi-to-Xa ratio is capped at 1.75 and female-to-male ratio at 2.25. b) The relative expression arising from the X and Y chromosome at PAR1 genes in skeletal muscle in eight males. The allelic expression at these genes was assigned to the two chromosomes utilizing parental genotypes available for these samples (see Methods for details). The dashed line at 0.5 indicates the point where expression from X and Y chromosomes is equal. The error bars give the 95% confidence intervals for the observed read ratio. c) Heatmap representation of the change in pattern of sex-bias at 13 X-Y homologous gene pairs (see Methods for details) in nonPAR from only including the X-chromosomal expression (heatmap on the left) to accounting for the Y-chromosomal expression (heatmap on the right). The color scale displays the direction of sex-bias with red color indicating higher female expression. Genes that were too lowly expressed in the given tissue type to be assessed in the sex-bias analysis are colored grey. Dots mark the observations where sex-bias was significant at FDR<1%. The grey bars on top of the heatmaps indicate the location of the gene in the X chromosome: dark grey indicating Xp and lighter grey Xq. Extended Data Table 1 Tissues, individuals and genes in the GTEx sex-bias analysis Tissues Individuals Genes analyzed Abbreviation Full name All Females Males Mean age All Autosomes ChrX ADPSBQ Adipose - Subcutaneous 297 186 111 52.15 15,273 14,735 538 ADPVSC Adipose - Visceral (Omentum) 184 117 67 51.97 15.301 14,765 536 ADRNLG Adrenal Gland 126 70 56 50.51 14.956 14,435 521 ARTAORT Artery - Aorta 197 126 71 51.11 14.675 14,137 538 ARTCRN Artery - Coronary 118 70 48 51.7 14,881 14,350 531 ARTTBL Artery - Tibial 284 183 101 50.26 14,501 13,981 520 BRNCTXA Brain - Cortex 92 66 26 57.67 15,339 14,791 548 CLNSGM Colon - Sigmoid 114 72 42 48.28 15,045 14,524 521 CLNTRN Colon - Transverse 255 159 96 50.93 15,732 15,181 551 ESPGEJ Esophagus - Gastroesophageal Junction 124 74 50 53.52 14,770 14,245 525 ESPMCS Esophagus - Mucosa 169 97 72 48.89 15,137 14,617 520 ESPMSL Esophagus - Muscularis 126 re 48 50.74 14,879 14,356 523 FIBRBLS Cells - Transformed fibroblasts 240 150 90 50.2 13,635 13,158 477 HRTAA Heart - Atrial Appendage 218 137 81 48.62 14,662 14,145 517 HRTLV Heart - Left Ventricle 159 105 54 53.64 14,075 13,586 489 LCL Cells - EBV-transformed lymphocytes 190 123 67 50.75 13,067 12,621 446 LIVER Liver 96 63 33 53.52 14,031 13,556 475 LUNG Lung 277 181 96 52.06 16,154 15,590 564 MSCLSK Muscle - Skeletal 361 228 133 51.85 13,623 13,153 470 NERVET Nerve - Tibial 256 163 93 51.65 15,563 15,020 543 PNCREAS Pancreas 149 87 62 50.09 14,355 13,861 494 PTTARY Pituitary 86 64 22 56.37 16,068 15,489 579 SKINNS Skin - Not Sun Exposed (Suprapubic) 195 128 67 53.06 15,601 15,069 532 SKINS Skin - Sun Exposed (Lower leg) 300 188 112 52.22 15,746 15,211 535 SMINTI Small Intestine - Terminal Ileum 77 43 34 47.62 15,594 15,046 548 SPLEEN Spleen 89 50 39 48.26 14,993 14,469 524 STMACH Stomach 169 97 72 48.2 15,604 15,057 547 THYROID Thyroid 278 179 99 52.14 15,974 15,417 557 WHLBLD Whole Blood 338 213 125 51.64 13,187 12,751 436 Total 449 290 159 52.27 19,839 19,158 681 Extended Data Table 2 Single-cell RNA-seq samples ID 24A Y117 Y035 Y014 Ancestry China, Asia Yoruba / Nigeria, Africa Yoruba / Nigeria, Africa Yoruba / Nigeria, Africa Design Singleton Trio Trio Trio Genotype data WES WGS WES WES Number of cells 742 96 48 48 Cell type Dendritic cells LCL LCL LCL Sequenced read pairs (mean (range)) 1,187,000 (335-7,403,000) 2,547,000 (38,190-5,126,000) 2,571,000 (46,940-5,038,000) 2,436,000 (69,130-5,457,000) Aligned read pairs* (mean (range)) 808,600 (197-5,727,000) 1,471,000 (14,910-3,309,000) 1,459,000 (16,400-2,893,000) 1,391,000 (14,920-3,067,000) Alignment rate (mean (range)) 0.667(0.271-0.799) 0.545(0.251-0.645) 0.551 (0.266-0.615) 0.526(0.175-0.606) Skew in XCI (% maternal active : % paternal active) 54:46 (373 cells where one parental chromosome active, 315 cells where the other parental chromosome active, 54 cells uninformative for X- chromosomal phasing) 100:0 (90 cells where maternal X chromosome active, 6 cells uninformative for X- chromosomal phasing) 79:21 (37 cells where maternal X chromosome active, 8 cells where paternal X chromosome active, 2 cells uninformative for X- chromosomal phasing) 100:0 (43 cells where maternal X chromosome active, 2 cells uninformative for X- chromosomal phasing) Notes Due to the unavailability of parental genotype information, the parental origin of the inferred X- chromosomal haplotypes is unknown * uniquely aligned, properly paired, QC passed reads. Supplementary Material reporting summary supp_infoguide supp_note supp_table1 supp_table10 supp_table11 supp_table12 supp_table13 supp_table14 supp_table2 supp_table3 supp_table4 supp_table5 supp_table6 supp_table7 supp_table8 supp_table9

          Related collections

          Most cited references20

          • Record: found
          • Abstract: not found
          • Article: not found

          Highly expressed genes in yeast evolve slowly.

            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Common genetic variants modulate pathogen-sensing responses in human dendritic cells.

            Little is known about how human genetic variation affects the responses to environmental stimuli in the context of complex diseases. Experimental and computational approaches were applied to determine the effects of genetic variation on the induction of pathogen-responsive genes in human dendritic cells. We identified 121 common genetic variants associated in cis with variation in expression responses to Escherichia coli lipopolysaccharide, influenza, or interferon-β (IFN-β). We localized and validated causal variants to binding sites of pathogen-activated STAT (signal transducer and activator of transcription) and IRF (IFN-regulatory factor) transcription factors. We also identified a common variant in IRF7 that is associated in trans with type I IFN induction in response to influenza infection. Our results reveal common alleles that explain interindividual variation in pathogen sensing and provide functional annotation for genetic variants that alter susceptibility to inflammatory diseases.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              AlleleSeq: analysis of allele-specific expression and binding in a network framework

              Introduction Due to rapidly increasing throughput and decreasing costs, next-generation short read sequencing is fast replacing array-based technology for performing functional genomic assays such as mapping locations of transcription factor binding or determining transcribed sequences in the genome. The initial analyses of high-throughput functional data using ChIP-Seq (Johnson et al, 2007; Robertson et al, 2007) or RNA-Seq (Mortazavi et al, 2008; Nagalakshmi et al, 2008) yield similar results that were obtained using tiling array-based methodologies albeit with greater sensitivity and resolution, that is, binding regions or regions of transcription. Also, with the developments in sequencing technologies there have been increasingly larger studies of the amount of sequence variation across the human population (The 1000 Genomes Project Consortium, 2010). A natural area of recent focus has been looking at the degree of functional genomic differences across the human population (Gregg et al, 2010a, 2010b; Kasowski et al, 2010; McDaniell et al, 2010; Montgomery et al, 2010; Pickrell et al, 2010). However, in order to understand population effects it is first useful to characterize the effects of functional variation within a single individual such as differences of expression and binding between alleles (i.e., allele-specific differences). When comparing functional data between individuals it is necessary to worry about normalization before any comparisons are performed; however, within a single individual there is a natural control of each allele against each other. By utilizing the actual sequence composition of the functional genomic sequence reads that overlap a heterozygous SNP, it is possible to determine the sequences that originate from each allele separately (Degner et al, 2009; McDaniell et al, 2010; Montgomery et al, 2010; Pickrell et al, 2010; Lalonde et al, 2011). Thus, it is possible to determine sites where transcription or transcription factor binding is originating predominately from one allele, that is, allele-specific expression (ASE) or allele-specific binding (ASB); however, there are number of technical issues which make this analysis challenging. In each of the recently published studies that contained some level of allele-specific analysis, only one type of functional genomic assay was performed. A logical question is how these allele-specific events are coupled between assays. At first glance, we expect significant coordination between binding of different transcription factors and expression of target genes. This has been previously been studied in a more limited manner using array-based technologies (Maynard et al, 2008). Here, we address this question by analyzing a number of different functional genomic data sets using a pipeline that we have developed, AlleleSeq, for determining sites showing allele-specific behavior. For the first time, we analyze allele-specific behavior for both transcription data using a very deeply sequenced RNA-Seq data set (∼160 million mapped reads) as well as matching deeply sequenced ChIP-Seq data sets (∼30 to ∼60 million mapped reads) for a number of different transcription factors (cFos, cMyc, JunD, Max, NfκB, and CTCF) as well as polymerases (RNA Polymerase II and Polymerase III). These experiments were generated for the lymphoblastoid cell line GM12878, which has also been deeply sequenced together with both parents (as a trio) as part of the pilot II phase of The 1000 Genomes Project Consortium (2010). Thus, for these data sets we have a complete set of heterozygous variants (SNPs, indels, and structural variations (SVs)) for the individual NA12878, which can mostly be phased into maternal and paternal variants by comparing against the parents sequences. This is important for assessing the genome-wide amount of allele-specific behavior, which is severely limited by the number of identified heterozygous SNPs available (for instance, Montgomery et al (2010) and Pickrell et al (2010) used HapMap III SNP calls which are ∼10-fold fewer than those available from pilot II of the 1000 Genomes Project). Allele-specific behavior is presumably occurring also in regions devoid of heterozygous SNPs, where we cannot distinguish between the alleles. When assessing the number of comparisons of allele-specific behavior between transcription factor binding and expression, 10-fold fewer total number of heterozygous SNPs would only allow for ∼100-fold fewer comparisons between ASB and ASE SNPs to be made. There are numerous technical hurdles in determining allele-specific behavior. One might think that it is possible to simply map the sequenced reads against the reference genome in order to determine allele differences; however, this introduces reference biases. Most analyses of human genomic data use the reference genome sequence for comparison; nevertheless, when genome scale analysis of allele-specific behavior is performed we show that it is necessary to align reads against a diploid sequence for that individual. We deal with this by constructing a diploid personal genome sequence by using the variation data (both for SNPs, indels, and SVs) for NA12878 (Mills et al, 2011; The 1000 Genomes Project Consortium, 2010). While the 1000 Genomes Project has created call sets of sequence variants for each of the different genomes sequenced, they have not however assembled genome sequences (including NA12878) for each of the individuals sequenced. In the first part of our AlleleSeq pipeline, we generate a diploid genome sequence of maternal and paternal haplotypes by integrating the phased variation data (SNPs, indels, and SVs) onto the reference genome sequence. In addition, we filter out genomic sequences that are likely to correspond to copy number variants (CNVs) using read-depth analysis (Abyzov et al, 2011). Construction of individual personal reference diploid sequences, as a first step for functional genomic analysis, will likely become standard in the near future. In this paper, we show that ASE of genes as well as novel transcribed regions, that is, novel transcriptionally active regions (TARs) or transfrags (Kapranov et al, 2002; Rinn et al, 2003; Bertone et al, 2004), are coordinated with ASB of transcription factors and other DNA binding proteins located adjacent to the transcribed region. One can measure how well ASB and ASE are coordinated, by using a correlation plot of the two. However, representing the coordination between multiple allele-specific events is difficult. In order to facilitate this, we show how ASB for multiple transcription factors is coordinated with ASE of the target genes or novel TARs by visualizing this behavior using a simplifying regulatory network. We will see how certain allele-consistent regulatory motifs are enriched using network analysis. We will observe that ASB and ASE are not as coordinated as might have been naively expected and speculate on potentially complexities of allele-specific regulation. Results We start by assembling a set of sequence variants from the 1000 Genomes Project for the NA12878 individual. We then generated deeply sequenced ChIP-Seq data sets for cFos, cMyc, JunD, Max, and RNA Polymerase II for the GM12878 cell line. We also created a matching deeply sequenced RNA-Seq data set for the same cell line. We combined these data sets with previously published matching data sets for RNA Polymerase II, RNA Polymerase III, NfκB, and CTCF (Kasowski et al, 2010; McDaniell et al, 2010; Raha et al, 2010). We summarize these data sets in Table I (see Materials and methods for further details). Determining allele-specific behavior from functional genomic data alone Intuitively if one has performed a deeply sequenced functional genomic experiment such as RNA-Seq or ChIP-Seq from a single individual, it should be possible to determine allele-specific behavior solely from the sequences obtained. The first step in this approach would be to determine the SNPs and other sequence variants directly from the sequence reads obtained. This might be true for certain regions sequenced at great depth; however, since functional genomic data (e.g., reads from a ChIP-Seq experiment) cover the genome with greatly varying sequence depth due to the nature of the functional assay. Thus, the accuracy of SNP (and other variant) calling from functional genomics data will necessarily vary across the genome. Conversely, the average sequencing depth across the genome for conventional genomic DNA sequencing is nearly uniform (with some differences to repeated regions and compositional biases). We find that the accuracy of de novo SNP calling using reads from a functional genomic sequencing experiment such as RNA-Seq using a standard SNP caller package (e.g., SNVmix; Shah et al, 2009) is not as good as we would need for determining allele-specific behavior (see Supplementary Table 1 for the results of de novo SNP calling of heterozygous SNPs). Any significant amount of miscalling of heterozygous SNPs will (obviously) lead to ill determined allele-specific behavior. There are a number of possible explanations for such miscalls; the very events we would like to find, SNPs within regions showing ASE could potentially appear as homozygous using only the RNA-Seq sequence reads. If one experimentally only obtains sequences from a region that is expressed on one allele (due to ASE) then there is no way to know that any base within that region is polymorphic. Second, RNA editing could also lead to variations in RNA sequences that are not present at the DNA level. Finally, sequencing of RNA involves additional experimental steps like usage of reverse transcriptase that can increase chance of mis-sequencing. Obviously, determining short indels from sequenced functional genomic data would be even harder and SVs would be nearly impossible. Thus, while it might be possible to determine certain sequence variants from the functional genomic sequence reads, in order to generate a comprehensive set of polymorphic sites as well as other forms of sequence variation it is necessary to have an independently determined set from sequenced genomic DNA (such as from the 1000 Genomes Consortium). Building an individual diploid reference genome for NA12878 It might not seem obvious but for a number of reasons reconstruction of a diploid personal genome sequence and using it instead of the reference genome is a critical step preceding allele-specific analysis. First, using reference genome introduces biases in read mapping—reads originated from non-reference allele are more susceptible to mismapping since, when aligned to the reference allele, they contain at least one mismatch (in case of SNPs) or gap (in case of indels)—the reference bias effect, that is, both alleles are not treated equally by default. Second, expression or binding in regions of genome SV could be misinterpreted as ASE or ASB. For example, duplication of an allele in the studied genome will double binding signal for the allele while signal for the allele on another haplotype will be unchanged. Last, but not least, SNP calling in the regions of SV is likely to be less precise and contain more false positives compared with non-SV regions (The 1000 Genomes Project Consortium, 2010). Thus, we construct a personal diploid genome of NA12878 (see Materials and methods), by utilizing genomic variations (see Table II for summary statistics) determined in the framework of The 1000 Genomes Project Consortium (2010) and, additionally, SVs determined by sequencing of fosmid clones (Kidd et al, 2008). To accomplish this, we have developed a tool—vcf2diploid—for personal genome construction, which constitutes the first part of the AlelleSeq pipeline (see Figure 1A). The tool uses as input VCF files with all the SNPs, indels, and SVs available for an individual of interest and outputs fasta sequences for each allele for each chromosome, along with equivalence map files (see Figure 1 and Supplementary Figure 1) that map nucleotide positions between paternal, maternal, and reference haplotypes. It is important to be able to map annotation (i.e., genes) from the reference genome to the personal genome sequences. This is done using chain files, which facilitate the mapping of annotated regions between genomes using the liftOver tool (Rhead et al, 2010). This is particularly important for RNA-Seq where we also build maternal and paternal versions of the gene annotation (including, most importantly, splice-junction library) by mapping the GENCODE annotation (GENCODE 3c annotation is available from the UCSC Genome Browser; Harrow et al, 2006) onto the personal diploid genome. The constructed diploid genome of NA12878 was different in 3 962 637 (∼0.14%) bases from the reference for paternal and in 3 947 162 (∼0.14%) for maternal alleles. The software package to perform personal genome sequence construction (the vcf2diploid tool and associated source code), the actual diploid sequence for NA12878, splice-junction sequences and personalized gene annotation for NA12878 and corresponding equivalence maps (between the maternal and paternal sequences as well as the reference genome, NCBI36/hg18) are available from http://alleleseq.gersteinlab.org. The diploid sequence for NA12878 is a valuable resource for anyone performing any sequence-based analysis on this genome. The GM12878 cell lines are a primary tier one cell line under detailed investigation by the ENCODE Consortium. It should be also noted that a constructed personal genome is only as good and as complete as the variants used in construction. In light of this, the diploid genome of NA12878 that is presented here, is not perfect, but we believe it is the best possible sequence to date since it includes the most comprehensive set of variants. We intend to update this assembly as a resource, as sequence variants are even more accurately determined. In order to assess the effect of the differences between the maternal and paternal sequences compared with using the reference genome sequence on functional genomic data, we aligned the reads from the Pol II and CTCF ChIP-Seq data for GM12878 against each of the three sequences using BOWTIE (Langmead et al, 2009; see Supplementary Figure 2). In Table III, we compare the Pol II reads that align to each of the three genome sequences (reference, maternal, and paternal haplotypes). We observe that by allowing up to two mismatches more reads (0.3% for paternal and 0.4% for maternal) align to the correct NA12878 as compared with the reference genome sequence (NCBI36). The major difference in numbers for paternal/maternal and reference haplotypes is due to reads that map to one haplotype but not the other. Namely, only about 0.1–0.2% of reads that map to the reference cannot be mapped to paternal or maternal haplotype, while a significantly higher fraction of reads (∼0.5%) map to the paternal or maternal genome and cannot be mapped to the reference. For paternal and maternal haplotypes, unmapped reads and reads with different mapping locations contribute roughly equally to the differences in overall mapping, presumably mostly due to short indels and SVs. We also see similar results for the same analysis done to the reads for CTCF ChIP-Seq (see Supplementary Table 2). This demonstrates that it is important to use a correctly assembled personal genome for aligning reads when performing an allele specificity analysis. Similarly, transcription factor binding sites also overlapped more when aligned to the maternal and paternal genomes of NA12878, rather than the reference sequence. For this comparison, we used the set of independently mapped reads for all three genome sequences to determine binding sites using PeakSeq (Rozowsky et al, 2009), and performed a pair-wise nucleotide overlap of the binding sites between the three genome sequences (Supplementary Table 3). In addition, we observe that the differences in binding sites, among the three genomes, are greater than the underlying differences in read mapping. Determining ASE and ASB The second part of the AlleleSeq pipeline determines ASE using RNA-Seq data and ASB using ChIP-Seq data. After the maternal- and paternal-derived haploid sequences are constructed, sequenced reads are aligned against the maternal and paternal sequences separately using BOWTIE (Langmead et al, 2009). Locations of mapping are determined by selecting the best alignment to both genome sequences. Read counts for the maternal and paternal alleles are then generated at each heterozygous SNP nucleotide positions, and ASE/ASB events are reported by applying a binomial test followed by correction for multiple hypothesis testing. SNP positions that by read-depth analysis (Abyzov et al, 2011) are determined to be potentially in a CNV (the read depth of genomic DNA reads in a 1-kb window around the SNP is either 3) are excluded (see Materials and methods). We correct for multiple hypothesis testing by estimating the false-discovery rate (FDR) by explicit simulation of the number of false positives given an even null background (i.e., no allele-specific events)—see Figure 1B for a schematic of the second part of the pipeline (see Materials and methods for further technical details). We also align reads to the maternal and paternal splice-junction libraries and determine splice-junction ASE SNPs in a similar way. Results for GM12878 RNA-Seq and ChIP-Seq data We start our study of allele-specific phenomena by first focusing on analyses of individual events that occur within single experimental data set. We then analyze the coordination between binding and expression in a pair-wise manner using direct correlation. Finally, we investigate higher order coordination of ASB and ASE using a regulatory network framework that will allow us to study the agreement between multiple regulatory interactions and target expression simultaneously. We summarize the results of the AlleleSeq pipeline applied to the RNA-Seq data and ChIP-Seq data for GM12878 in Table IV. In the upper half of the table, we present the results for all the autosomes and in the lower half for chromosome X (with all the transcription factor combined). In the second column of Table IV, we list the number of genomic elements (genes, novel TARs, and binding sites) that are accessible for allele-specific behavior—that is, those that we could have detected allele-specific activity in. This is the set of genomic elements that contain at least one heterozygous SNP and are sequenced at sufficient depth in order to detect allele-specific activity, see Materials and methods for further details. The third column shows that number of genomic elements that we determine to show allele-specific behavior. The fourth column shows the fraction of genomic elements that are accessible for allele-specific behavior that do show either ASE or ASB. Finally for allele-specific events that can be phased we report those that are maternal or paternal. We observe that ∼19.4% of all autosomal GENCODE genes that are accessible for allele-specific behavior exhibit ASE. We similarly find that 21.6% of accessible heterozygous SNPs within splice junctions of genes also show ASE. Similarly, we find that 9.3% of autosomally expressed accessible novel TARs show ASE, we expect this number to be lower than genes as novel TARs correspond to exons of genes. We found that for genes that contained two or more heterozygous SNP showing allele-specific behavior, >81% of the time all the SNPs would show consistent ASE from the same allele (significantly greater than expected by chance), some of the exceptions could be due to allele-specific alternate splicing. For the transcription factors, the fraction of accessible autosomal binding sites that exhibit allele-specific behavior varies between 2% (for cMyc) and 11% (for Pol II). A possible explanation for the difference between binding and expression allele specificity is that even though these ChIP-Seq data sets have been sequence quite deeply (see Table I), in order to have comparable power with the RNA-Seq data one would need to sequence an order or magnitude or two further. The number of overlapping sequence reads in binding site peaks for ChIP-Seq data depends on the efficiency of the antibody used and for most ChIP-Seq data sets we do not have sufficient read depth within a binding site as compared with the read depth within exons of highly expressed genes. As expected when restricted to the autosomes, both ASE for genes and novel TARs and ASB for all the transcription factors and polymerases are evenly divided between the maternal and paternal alleles. In the lower half of Table IV, we present similar results for chromosome X. Since NA12878 is female there are two copies of chromosome X. We first observe that almost 80% of the accessible genes on chromosome X exhibit allele-specific behavior and 95% of these are expressed on the maternal copy. This is consistent with our knowledge of X-chromosome inactivation (Lyon, 1961; Goto and Monk, 1998) where one copy of the two copies is shut off. There are four genes on chromosome X that show ASE on the paternal copy; however, all of these are located in the pseudo-autosomal region of chromosome X which is known to escape X-chromosome inactivation (these include Xist, SLC25A6 and SFRS17A). We observe similar results on chromosome X for the allele-specific behavior of novel TARs as well as transcription factor binding where a greater fraction of sites exhibit allele-specific behavior compared with the autosomes and almost all are expressed on the maternal copy. It is interesting to note that most of the novel transcription and binding that shows paternal allele specificity are also in the pseudo-autosomal region similar to Xist and possibly have an associated functional role. We make available the complete list of SNPs that show ASB or ASE in VCF format as a resource from our website http://alleleseq.gersteinlab.org. We imagine that this file may be a useful resource for researchers interested in focusing on allele-specific sites in less deeply sequenced functional genomic experimental data sets. This might be valuable even if the functional genomic experiment was not performed on the GM12878 cell line as regions to investigate for allele-specific behavior. There are a number of technical issues associated with determining allele-specific behavior including various biases that can be introduced as part of the analysis. We investigate some of these potential effects in detail below: Reference bias: In order to assess whether our pipeline has some residual bias toward the reference allele versus the alternate, we can plot the fraction of reads from the alternative allele for each heterozygous SNP location sequenced sufficiently deeply. If there were no bias, we would expect that this distribution would be symmetric having as many reference allele-specific locations as for the alternate allele. In Figure 2, we plot the alternative allele fraction distribution for the RNA-Seq data, Pol II, and the other transcription factors combined. We first observe that the overall distributions are fairly symmetric and that the allele-specific events (indicated in blue) are also symmetric—this indicates that there is no residual bias toward or against the reference allele. In Supplementary Figure 3, we observe similar distributions for the fractions of reads from the maternal allele for each heterozygous SNP location that could be phased and that was sequenced sufficiently deeply in the appropriate assay. Correlation with SNP quality (genotype likelihood scores): SNPs determined by The 1000 Genomes Project Consortium (2010) each have a genotype likelihood score. In Supplementary Figure 4, we plot the distribution of all heterozygous SNPs and the subset of ASE SNPs versus this genotype likelihood score. We see a slight enrichment of ASE SNPs will lower scores; however, the majority of SNPs from both distributions have the highest possible score. For comparison, non-synonymous SNPs also show a similar distribution. Relation to genome duplications (effect of Degner et al, 2009): It has been reported by Degner et al (2009) that heterozygous sites showing apparent allele-specific behavior can be caused by regions in the genome that have been duplicated. Thus, even though there might be a similar number of reads coming from each allele, only one of the alleles would have uniquely mapping reads which would lead to seemingly allele-specific behavior (see Supplementary Figure 5 for a schematic comparing region without a duplication to regions that have been duplicated). In order to assess the size of this effect on our results, we used the Pol II ChIP-Seq reads mapped uniquely and independently to each of the maternal and paternal genomes. This is as opposed to the normal mapping procedure in the AlleleSeq pipeline, where we map independently to both haplotypes and then select the allele with the better mapping location. At each heterozygous SNP location determined to show ASB we computed the haplotype fraction, the fraction of reads mapped to one allele over the sum of reads mapped to both alleles (we choose the haplotype fraction corresponding to the allele that has the greater fraction, see Supplementary Figure 5). For sites that have not been duplicated the haplotype fraction should have a value close to 0.5, while for duplicated regions exhibiting the Degner effect the fraction would be close to 1 (where all the uniquely mapped reads would come from one allele). In Supplementary Figure 6, we plot the distribution of haplotype fractions for all Pol II ASB sites. We observe that only a minority of the sites ( 0.6). As valid ASB sites that contain additional proximal sequence variants (such as additional SNPs or indels) would also exhibit a fraction biased toward one haplotype, we conclude that this is an upper bound on the size of this effect and do not consider it significant. Modified binomial test: In order to assess the effect of aligning reads against the constructed diploid genome sequence for NA12878 versus using the reference genome sequence we perform the following analysis. For the RNA-Seq reads, we also aligned the reads against the reference genome and determined ASE using an even binomial distribution (threshold applied in a similar manner). As an additional comparison, we also applied the methodology of Montgomery et al (2010) where a skewed binomial distribution is used with the reads aligned against the reference genome (they composite for the reference bias induced by mapping to the reference genome by modifying the binomial distribution). Similar to Figure 2, we plotted the distribution of all expressed heterozygous SNPs (ASE SNPs in blue) for these two methods in Supplementary Figure 7. Using the naive methodology with an even binomial we see the skew of the ASE SNPs toward the reference genome which is largely removed using the modified binomial test. When comparing our set of 5862 ASE SNPs determined using the personal genome we find that only 69% are shared with those determined using the naive approach. Using the modified binomial methodology from Montgomery et al (2010), we see an improvement (75% in common); however, we still are detecting a significant number of ASE sites that were missed aligning to the reference genome and only modifying the binomial test versus using the correct diploid genome to align against. Comparison with McDaniell et al, (2010): We have also compared the ASB sites for the CTCF ChIP-Seq data from the AlleleSeq pipeline against those from McDaniell et al (2010). Restricting the comparison with common heterozygous SNP between the two analysis (McDaniell et al, 2010 used an earlier set of SNP calls for NA12878 from The 1000 Genomes Project Consortium, 2010) we find that greater using a P-value threshold of 0.01 on their results >85% of the ASB SNPs are in common. Allele-specific elements using heterozygous indels: The AlleleSeq pipeline determines allele-specific behavior for genomic elements (transcribed regions or binding sites) that contain heterozygous SNPs. It is also possible to determine allele-specific behavior for genomic elements that contain a heterozygous indel. In Supplementary Table 4, we show the results for transcribed exons and novel TARs as well binding sites for Pol II and CTCF that can be determined to show allele-specific behavior using a heterozygous indel to distinguish the haplotypes. Correlation of ASB in binding sites containing known motifs In our analysis of ASB events, heterozygous SNPs are initially only used for distinguishing between the maternal and paternal alleles (presumably allele-specific behavior also occurs in genomic regions not containing heterozygous SNPs). However, in some locations the heterozygous SNP might be the causative reason for the difference in binding between the alleles, this would most likely occur in ASB sites where the heterozygous SNP is within a known DNA binding motif for a transcription factor. In order to see how ASB is correlated with perturbations to known DNA binding motifs, we compared the allele that is bound against the allele that matches better with the known motif. Thus, we first scanned ASB sites for known motifs, position weight matrices (PWMs) obtained from TRANSFAC (Matys et al, 2006) and JASPAR (Portales-Casamar et al, 2010) (see Materials and methods for further details). We correlated the nucleotide of the allele, which is preferentially bound with the fitness score of the PWM. We observed a number of significant correlations between the PWM score for both alleles and the allele that is bound. The allele that is bound tends to correspond to the better match to the known PWM. In particular, we see this for the known cMyc motif within both cMyc and Max binding sites (see Figure 3). This is interesting as we observe a correlation between the allele-specific behavior of cMyc motifs with Max binding sites, indicating that these transcription factors tend to significantly regulate the same target genes. We also include in Supplementary Figure 8 additional examples of the correlation between motif fitness score and the allele being bound for CTCF binding sites containing CTCF motifs and cMyc motifs within Pol II binding sites. Correlation of ASE with protein structural fitness Some heterozygous SNPs within genes can result in one allele losing its ability to function as a transcript (i.e., heterozygous loss of function). Additionally, non-synonymous SNPs within the protein-coding sequence can cause a difference in the structure fitness of the protein coded from each allele. We studied the coordination between these effects and the genes that show ASE. We first investigated the overlap between genes that exhibit ASE with genes that show loss of function on one allele due to a premature stop codon, a frame-shift or a disruption of an intron–exon splice site (all caused by heterozygous SNPs). We find four cases of genes that show ASE as well as heterozygous loss of function and in all four cases the allele that is expressed is opposite to the allele that has lost its ability to code for a protein. We speculate that in some of these cases the transcript from the allele suffering from a disruption might be degraded due to non-sense-mediated decay, which, in turn, might cause the ASE from the other allele. Since some heterozygous SNPs that show ASE are within the protein-coding sequence of genes, it is natural to ask whether the allele that is expressed could track with allele-dependent structural changes (for SNPs in non-synonymous positions in the protein-coding sequence). However, it is not clear that we expect to find a correlation between structural fitness and ASE, as many of these SNPs are not selected for in the human population in any case. In order to assess whether the allele that is expressed (for genes showing ASE) is correlated with the allele containing mutations deleterious to protein structure we performed the following analysis. We compared the occurrences of ASE SNPs within genes where the SNP corresponds to a non-synonymous substitution within the protein-coding sequence of the gene. Using the tool SIFT (Ng and Henikoff, 2003), we can compare the preference of the allele that is expressed with the allele whose amino-acid sequence shows better structural fitness. We find that 20 of the 37 genes that meet these criteria show expression on the allele that has the protein sequence that has better fitness. While we find slightly more genes where the allele with better structural fitness occurs on the same allele that is expressed, this result is not significant. Correlating ASB with ASE In the upper panel of Figure 4, we present an example of the gene SKA3 on chromosome 13 which has multiple heterozygous SNPs within exons which show consistent maternal ASE which agrees with the maternal ASB of another heterozygous SNP within a Pol II binding site proximal to the 5′ end of the gene. In this example, we see coordinated maternal binding of Pol II with expression of the gene. In the lower panel of Figure 4, we present a similar example of a novel transcribed region on chromosome 4 where we see coordinated paternal binding of Pol II with the paternal expression of the novel TAR. These two examples show how ASB is coordinated with ASE for a known gene and a novel TAR. To investigate this trend, we assess to what extent allele-specific behaviors detected using heterozygous SNPs are coordinated on a genome-wide scale. In Table V, we tabulate the number of genes and novel TARs that have evidence for ASE and for proximal ASB. We also tabulate the total counts of genes that have a proximal binding site where both the gene and binding are jointly accessible for detecting allele-specific behavior. We perform a similar calculation for novel TARs and their proximal sites. In Table V, we present the tabulated counts for Pol II & Pol III and CTCF separately from all the other transcription factors combined. We find a number of genes (74 genes proximal to Pol II & Pol III sites, 29 genes proximal to CTCF binding sites, and 44 genes with proximal transcription factor sites other than CTCF) and novel TARs (55, 8, and 15 for Pol II & Pol III, CTCF, and remaining transcription factors, respectively) in which we see both ASB proximal to ASE. We separate CTCF from the remaining transcription factors because of its function as an insulator. While these numbers might seem relatively small, they reflect the low chance of having both a proximal binding site as well as an expressed gene with both of them jointly accessible for the detection of allele-specific activity. This underscores the need to sequence deeply and use a comprehensively determined set of SNPs or else we would have significantly fewer genes to be able to compare ASB and ASE. In order to assess the degree of coordinated allele-specific behavior for the 74 genes that exhibit ASE that have a proximal ASB Pol II binding site we performed the following analysis. For each gene, we tabulated the allele-specific behavior of the most significant ASE SNP versus the most significant ASB SNP (if there are more than one significant heterozygous allele-specific SNP present). In Table VI , we tabulate maternal and paternal ASB of binding sites against maternal and paternal ASE of proximal genes (we do this separately for Pol II & Pol III, CTCF, and the remaining transcription factors combined). We see that there is a statistically significant coordination between ASB of Pol II & Pol III to genes that exhibit ASE (Fisher's exact test, P-value=1.4e−3). Similarly, as seen in Table VI there is also a statistically significant correlation between the 45 genes that exhibit ASE with ASB for all the combined transcription factors excluding CTCF (Fisher's exact test, P-value=1.8e−5). We do not however, see a significant correlation of ASB with ASE for CTCF which is probably due to its role as an insulator. Further coordination of allele-specific behavior As a further way to assess the coordination of allele-specific events within genes, we combined all the ASE and ASB SNPs that occurred within a gene (from 2.5 kb upstream of the transcription start site (TSS) to the transcription termination site (TTS) including introns). Using only genes that contained >10 SNPs showing ASE or ASB we could compute the fraction of SNPs that show maternal specificity. Ideally, if all SNPs were perfectly coordinated then the fraction would be either zero or one. In the first panel of Figure 5, we see the actual distribution, most genes do show a high degree of coordination. Under a random null (where each ASE or ASB event could be maternal or paternal with equal chance) for the same genes each with the same number of SNPs, we would expect a null distribution of maternal fraction computational simulated in panel two of Figure 5. Using a Kolmolgorov–Smirnov test, we find significantly more coordination of ASB and ASE SNPs in genes than compared with the random null expectation (P-value=8.45e−8; see the third panel of Figure 5). Representing ASE and ASB behavior on a regulatory network In the previous analysis, we showed that binding and expression were correlated in a pair-wise manner. Next, we would like to investigate the coordination of allele-specific behavior between multiple factors and expression simultaneously. This is hard to represent in a two-dimensional correlation plot; thus, we have developed a simplified representation of ASE and ASB in a regulatory network framework. Looking at the occurrences of network motifs (Milo et al, 2002) in this framework allows us to measure the coordination of allele-specific behavior between multiple transcription factors and target expression. The network shown in Figure 6 represents the regulatory network of six transcription factors (cMyc, Max, cFos, JunD, NfκB, and CTCF) and two polymerases (Pol II and Pol III). The network discretizes the ASB events into allele-specific regulation of target genes and novel TARs and their ASE. The edges in the network represent ASB of the TF or polymerase to the target gene or novel TAR (red edges indicate predominantly maternal regulation, blue edges indicate paternal regulation, and gray edges indicate allele-specific regulation that could not be phased). ASE of the target genes is indicated by the color of the target gene or novel TAR (red—maternal, blue—paternal, and gray—unphased allele-specific behavior). Thus, the network contains all information on the allele specificity of the regulation and the expression of the targets. One can observe that there is a clear agreement between the allele specificity of the regulation and the expression of the target. When we tabulate the maternal/paternal regulation edges with maternal/paternal expressed genes or novel TARs (see first part of Table VII) we find that they are highly coordinated (P-value 3 (the normalization factor of 2 indicates the diploid nature of the human genome). We filtered out SNPs that are more likely to be false positives or may represent duplicated or deleted regions which would complicate calling allele-specific behavior. The allele-specific SNP processing pipeline The pipeline has four main inputs: one or more collections of unmapped reads, a set of SNP positions, a personal genome, a set of known genes, listing transcription starts and stops, and exon coordinates. The processing follows these steps. For each logical set of reads: (1) The reads are trimmed, if necessary, to remove ends that contain large numbers of errors and filtered to remove any reads containing N's. (2) SNP locations are converted to a standardized format that describes the alleles for all heterozygous SNPs in GM12878, including parental phasing, if possible. Phasing is possible for all heterozygous GM12878 SNPs except those in which both parents are also heterozygous. (3) The filtered reads are mapped, using bowtie, to the maternal and paternal genomes. Bowtie was invoked with these flags: --best --strata -v 2 -m 1, which returns only unique hits within a minimum number of mismatches, up to two. (4) The two sets of mapped reads are merged into a single set, with each read represented at most once, using the better mapping from the maternal or paternal haplotypes. If the two mappings for the same read tie in quality, one is chosen at random. (5) Using Het-SNP file and the mapped reads, allele counts are generated for each Het-SNP location. The resulting counts file contains the number of As, Cs, Gs, and Ts found in reads mapped over each SNP location. Various other values are also generated for each Het-SNP location, including reference allele, maternal/paternal allele (if determinable), major and minor allele, and a binomial P-value assuming a 50/50 probability of sampling each of two alleles. (6) In order to calculate the FDR, we perform an explicit computational simulation to correct for multiple hypothesis testing. We start with all the heterozygous SNP locations; for each SNP location, we randomly assign each mapped read in the data set to either allele. At a given P-value threshold (using the binomial test), we can determine the number of false positive allele-specific event calls (from the simulated data); and thus, we can determine the FDR as the number of false positive over the total number of observed positives. We require a FDR of 50% and at least twice as high as the second most frequent nucleotide. A double-degenerate code is used if the combined frequencies of two nucleotides are >75% but each of them is present in 0.01) and P(n) is the background frequency of allele n. In Figure 3, we plot this difference in motif scores, delta (maternal–paternal) against the fraction of maternally derived reads overlapping the same heterozygous SNP in the ASB site. In small number of cases where there are multiple SNPs in the TF motif region, the best one is chosen where if the maternal read count fraction greater than half the best is equal to the biggest delta, while if fraction is less than half the smallest delta is chosen. Building an allele-dependent regulatory network We decided to integrate the expression data for genes and TARs from the RNA-Seq experiment with the TF binding data from the (cFos, cMyc, JunD, Max, NfkB, CTCF, Pol II, and III) ChIP-Seq experiments into a regulatory network. In order to construct a regulatory network to determine the edge between a TF and a gene by assigning an ASB event to a target ASE gene if it lies within 2.5 kb upstream of the annotated TSS and the TTS. For ASE novel TARs we do not know which strand is being expressed, thus we associate ASB events that occur within 2.5 kb of either end of the novel TAR. If it is allele specific then it could be further classified into paternal, maternal, and unphased. The ‘unphased' category represents the case where the experiments show allele specificity but it cannot be phased. After constructing the edges between the TFs and gene/novel TARs in the network, we overlaid the gene/novel TAR ASE information onto the nodes. Each gene/novel TAR was categorized into three categories: paternal, maternal, or unphased ASE. After constructing the network, we performed a network motif analysis on it, the results of which are shown in Table VII. We analyzed the occurrences of MIMs where two TFs regulate the same gene/novel TAR and SIMs where a single TF regulates two different gene/novel TARs, taking into account the allele specificity of the regulation and the expression of the targets. Counting of occurrences of MIMs and SIMs was performed using Cytoscape (Cline et al, 2007). Supplementary Material Supplementary Material Supplementary Figures S1–8 and Supplementary Tables S1–4 Supplementary Dataset 1 Supplementary data files of allele-specific SNPs in vcf format. Supplementary Dataset 2 Supplementary data files of genes showing alelle-specific expression. Review Process File
                Bookmark

                Author and article information

                Journal
                0410462
                6011
                Nature
                Nature
                Nature
                0028-0836
                1476-4687
                31 October 2017
                11 October 2017
                11 April 2018
                : 550
                : 7675
                : 244-248
                Affiliations
                [1 ]Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
                [2 ]Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
                [3 ]Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Charlestown, MA 02129, USA
                [4 ]Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
                [5 ]Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
                [6 ]New York Genome Center, New York, NY 10013, USA
                [7 ]Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY 10003, USA
                [8 ]Department of Systems Biology, Columbia University, New York, NY 10032, USA
                [9 ]Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
                Author notes
                [* ]Correspondence and requests for materials should be addressed to T.T.( ttuk@ 123456broadinstitute.org ) or D.G.M. ( danmac@ 123456broadinstitute.org )
                [†]

                Lists of participants and their affiliations are listed in attached file.

                Article
                NIHMS905235
                10.1038/nature24265
                5685192
                29022598
                787616be-71a6-4902-8c00-8890e4ea3c36

                Reprints and permissions information is available at www.nature.com/reprints

                Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms

                History
                Categories
                Article

                Uncategorized
                Uncategorized

                Comments

                Comment on this article