The role of sex in the genomics of human complex traits

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Nearly all human complex traits and disease phenotypes exhibit some degree of sex differences, including differences in prevalence, age of onset, severity or disease progression. Until recently, the underlying genetic mechanisms of such sex differences have been largely unexplored. Advances in genomic technologies and analytical approaches are now enabling a deeper investigation into the effect of sex on human health traits. In this Review, we discuss recent insights into the genetic models and mechanisms that lead to sex differences in complex traits. This knowledge is critical for developing deeper insight into the fundamental biology of sex differences and disease processes, thus facilitating precision medicine.

Related collections

Most cited references 134

Record: found
Abstract: found
Article: found

Is Open Access

Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to Enhance Discovery from GWAS

Dan Nicolae, Eric R. Gamazon, Wei Zhang … (2010)

Introduction Results of genome-wide association studies (GWAS) in complex traits published to date have provided us with surprisingly little new information on the nature of the genetic component to these phenotypes, despite the large number of single nucleotide polymorphisms (SNPs) found to be reproducibly associated with such traits. In some cases, this reflects the fact that major aspects of the biological basis for disease were already well understood; results of GWAS in autoimmune disorders, for example, have reinforced the central importance of the immune system and its regulation. For a few disorders, results of GWAS have highlighted biological contributing factors to disease that had not previously been recognized as central, such as the complement system in macular degeneration [1] or autophagy in Crohn's disease [2]–[4]. While it does seem ungrateful to question the utility of GWAS when they have yielded so many more reproducible associations than we have achieved with any other approach, the fact is that our primary goal in conducting GWAS for a complex trait – achieving a comprehensive understanding of the genetic basis for that trait – remains elusive for most of the traits that have been examined. The failure to achieve this goal is made particularly acute by the recognition that the loci that have been successfully identified not only provide us with little insight into the genetic basis for the trait, but also account for little of the overall heritability [5]. Although there are important caveats to these grim statistics – most published studies have conducted discovery research in populations of European descent, and some important disorders have not even been examined using GWAS yet – the collective experience has led to sharp disagreement as to whether there is sufficient value in targeting near-term investments in genomics to GWAS, or whether such investments are better targeted to sequencing [6]–[8]. The key issue in this controversy is whether the genetic risk factors not yet discovered are largely similar in frequency and effect size to those that have been discovered using GWAS (i.e. common alleles with low risk) or are, instead, rarer alleles that would be best identified through sequencing studies. Studies we report here suggest that we have not yet exhausted the signals that can be discovered through GWAS. Annotating SNPs with information on expression can improve our ability to more easily distinguish those associations likely to be replicated, and provide us with a better understanding of the genes and mechanisms driving the associations we discover. Moreover, it appears that for at least a subset of complex disorders, there are many more common variants truly associated with disease and highly likely to be expression quantitative trait loci (eQTLs). These variants can be identified and characterized using existing GWAS and tools such as the SCAN database (SNP and Copy number ANnotation; http://www.scandb.org) [9]. Results SNPs Associated with Complex Traits Are More Likely to Be eQTLs than Frequency-Matched SNPs from GWAS Platforms The 1598 SNPs characterized in the GWAS catalog [10] as showing association with complex traits included 625 that would be classified as eQTLs with a p-value threshold of 10−4 using CEPH (Centre d'Etude du Polymorphisme Humain) samples of European descent from Utah (CEU) or 972 using the combined sample of the CEU plus the Yoruban samples from Ibadan Nigeria (YRI) data to define eQTLs, 46 (83 for the combined CEU+YRI) that would be classified as eQTLs with a p-value threshold of 10−6 and 17 (18 for the combined CEU+YRI) that would be classified as eQTLs with a p-value threshold of 10−8. As summarized in Figure 1, we observed significantly more eQTLs among these 1598 trait-associated SNPs than expected given their minor allele frequency (MAF) distribution. Given that many gene transcript levels show substantial correlations, it is, perhaps, not surprising that we also observed that SNPs associated with many transcripts (sometimes called eQTL hot spots or master regulators) were also enriched among trait-associated SNPs. Using a more stringent definition for defining trait-associated SNPs (up to 10−8) reduced their absolute numbers of such SNPs, but increased the significance of the observed enrichment of eQTLs among trait-associated SNPs. 10.1371/journal.pgen.1000888.g001 Figure 1 Trait-associated SNPs are more likely to be eQTLs. The distribution of the number of eQTLs (defined as p 0.3 and for each replicate conducted 100 simulations conditional on MAF as for Figure 1 (see Materials and Methods for simulation details). The eQTL enrichment was preserved in LD-pruned sets of cataloged variants, for eQTL definition thresholds with p 0.05) on high throughput platforms, associated with traits in the GWAS catalog, and associated with phenotypes selected to represent those for which lymphoblastoid cell lines (LCLs) might be expected to be a poor proxy for the human tissues most relevant for disease (neurological/psychiatric phenotypes, cancers) and those for which LCLs might be expected to be a good proxy for the human tissues most relevant to disease (autoimmune disorders). As is apparent from the table, neurological and psychiatric phenotypes as well as cancers (regardless of tumor site) showed levels of enrichment similar to those observed for the overall trait-associated SNP set, although as expected, autoimmune disorders showed somewhat greater levels of enrichment. Division of the GWAS catalog SNPs into those identified through studies on autoimmune disorders and those identified through studies on other complex traits reveals that the apparent enrichment of eQTLs (defined at a threshold of p .05) 1,213,906 595,285 (.490) 29,347 (.024) All Catalog SNPs 1598 972 (0.608) 83 (0.052) Autoimmune Disorders 259 165 (0.637) 21 (0.081) Cancers 93 56 (0.602) 4 (0.043) Neurological/Psychiatric Disorders 63 41 (0.651) 2 (0.032) Using eQTLs from CEU only Platform SNPs (MAF >.05) 1,213,906 345,249 (.284) 12,749 (.011) All Catalog SNPs 1598 625 (0.391) 46 (0.029) Autoimmune Disorders 259 116 (0.448) 17 (0.066) Cancers 93 30 (0.323) 3 (.032) Neurological/Psychiatric Disorders 63 20 (0.317) 1 (0.016) SNPs from the NHGRI catalog were classified according to the type of disease leading to their inclusion as a trait-associated SNP to investigate whether eQTLs identified in LCLs are more enriched in disorders for which LCLs are more likely to be an appropriate tissue match for the disease. Studies were conducted separately using eQTLs identified in the combined CEU+YRI samples (upper lines), and those identified only in the CEU (lower lines). Only SNPs with MAF >0.05 are included in these studies, and eQTL p-value thresholds of 10−4 and 10−6 are shown. Using eQTL Information Improves Ability to Identify Reproducible Signals in GWAS In order to understand the practical utility of our observation that trait-associated SNPs are more likely to be eQTLs, we examined in more detail results summaries from the Wellcome Trust Case Control Consortium (WTCCC) GWAS [11], using the data on Crohn's disease as a primary example application. If SNPs associated with Crohn's disease are no more likely than non-associated SNPs to be eQTLs, we would expect the proportion of SNPs associated with Crohn's disease at p 3. SNPs have been ordered according to their WTCCC Crohn's association p-values from the most to the least significant, and divided in groups of 10,000. For each group bin of 10,000 trait-associated SNPs, the number of SNPs with expression score larger than 3 was calculated and the results are shown in the scatterplot. The horizontal line illustrates the expectation based on the observed eQTL function scores in all SNPs in the WTCCC Crohn's dataset. The dotted lines represent estimated 95% confidence bands obtained using simulations. Analyses on other phenotypes from the WTCCC association studies [11] confirmed that our proposed annotations benefit phenotypes beyond Crohn's disease. Figure 4 illustrates that both type 1 diabetes (T1D) and rheumatoid arthritis (RA) had significantly more SNPs than expected with phenotype associations (p 3. The plots show results of eQTL enrichment analysis for the remaining Wellcome Trust phenotypes beginning with the SNPs most strongly associated with disease (similar to analyses for Crohn's disease summarized in Figure 3). For each disease, SNPs have been ordered according to their association p-values from the most to the least significant, and divided in groups of 10,000. For each group, the number of SNPs with expression score larger than 3 was calculated and the results are shown in the scatterplots. The horizontal lines illustrate the expectation based on the observed scores in all SNPs in the relevant WTCCC dataset. The dotted lines represent estimated 95% confidence bands obtained using simulations. Discussion Our results build on those of previous studies, particularly those of Schadt and colleagues [12]–[15] that focused on using transcriptome information to improve understanding of genetic signals, but the rapid accumulation of trait-associated SNPs through GWAS, coupled with the systematic efforts to catalog these variants [10],[16] has enabled us to generalize the concept that information on eQTLs will have utility for understanding the genetic component to complex traits. Our primary observations – that SNPs reproducibly associated with complex human traits are more likely to be eQTLs – are novel, although perhaps not unexpected given that few of even those SNPs most reproducibly associated with complex traits have been attributed to missense or nonsense variants. Our subsequent observations made in association studies in which we annotated SNPs with eQTL information, imply that at least some complex disorders have substantial numbers of undiscovered susceptibility loci that can be more easily discovered and characterized by annotating SNPs with information on eQTL scores. This is also a novel observation. Moreover, use of information on expression appears to benefit our understanding of not only more tractable disorders, such as Crohn's disease and autoimmune disorders, for which relatively large numbers of loci have already been identified at the cost of genotyping 10,000 or fewer samples, but also for less tractable disorders such as hypertension and bipolar disease, for which fewer associations have been characterized despite the large numbers of genotyped samples (>29,000 in GWAS and >34,000 for replication in studies of hypertension [17]). Use of expression information should move us closer to that elusive goal of developing a more comprehensive understanding of the genetic basis for complex traits in two ways: (1) by identifying and characterizing a larger number of contributing loci, we should be better able to discern the key biological functions affected by genetic risk factors and (2) SNP signals may be more accurately characterized with respect to target genes and mechanism of effect, further enhancing our ability to discern relevant biological functions. Thus, our results provide a strong additional rationale for the inclusion of eQTL information in the annotation of SNPs from GWAS. Moreover, even the analyses we have completed to characterize trait-associated SNPs as being more likely than allele frequency matched SNPs from the same platform to be eQTLs have revealed information about the potential biological rationale for some of the observed associations. For example, an intronic SNP, rs3129934, on chromosome 6 that is a cis-acting eQTL for multiple HLA transcripts with a high eQTL functional score (24.9) shows the strongest association with T1D in the MHC region in the WTCCC study [11] and is also the SNP most strongly associated with increased risk for multiple sclerosis (MS) in a pooling-based GWAS [18]. A variety of the approaches we used in conducting these studies should be of general utility in enrichment studies. It is clear, for example, that the MAF spectrum of SNPs associated with complex human traits is quite different than that of SNPs on the products used to conduct GWAS, or in the overall HapMap. The much higher minor allele frequencies observed for SNPs reproducibly associated with complex traits than for SNPs on high throughput platforms is a natural consequence of the increased power to detect associations for SNPs with higher minor allele frequencies. It is clearly critical to condition on the MAF in enrichment studies in which simulations are conducted using SNPs from high throughput platforms or from the HapMap, and to date, this has not been widely appreciated. The correlation among SNPs due to LD is a further complicating factor in conducting enrichment studies. We are confident that failure to adequately account for LD will not affect our estimate of the proportion of SNPs that are eQTLs, but rather the variance of this estimate. There are, however, more subtle biases related to LD that might have affected our studies. For example, true associations may be more easily detected in regions of the genome with high LD because these regions are more likely to have good coverage on high throughput platforms for GWAS. Although it is unclear that this would bias results of our studies, it is important to note that these types of subtle biases can affect results of enrichment studies in unpredictable ways. The challenges of conducting eQTL studies are also widely appreciated. The eQTLs identified in array-based transcriptome studies in LCLs are likely to be a limited subset of the eQTLs that will ultimately be identified using more sensitive techniques in comprehensive collections of human tissues. Moreover, a variety of studies have highlighted the potential sensitivity of eQTL studies to cell and tissue subtypes [15],[19],[20]. Although several studies have suggested that eQTLs will be largely tissue specific [21],[22], others note that substantial numbers of eQTLs are shared across tissues [15],[23]. Similarly, some have found that LCLs are prone to such high levels of non-genetic variability (growth parameters, EBV levels, ATP levels, etc.) that they might be considered unreliable for genomic studies [24], while others highlight the reproducibility of results of transcriptome studies [25],[26] and find that LCLs show little variability as a consequence of non-genetic factors [27]. To the degree that the transcriptome is a reflection of the local and temporal environment, it could be argued that eQTLs identified in generic studies will have little value for enhancing understanding of complex trait studies unless eQTLs are identified in relevant tissues of the individuals participating in the complex trait studies. Our results provide little support for such arguments, and instead highlight the utility of LCLs as a reasonable model for the study of expression. And while LCLs are clearly not the ideal tissue for expression studies in many complex disorders, we note that we plan to serve results of eQTL studies in a variety of tissues in SCAN, and consider the studies summarized in this manuscript to be an initial effort that will be enriched by subsequent studies in additional tissues. It is possible that diseases without evidence for eQTL enrichment using data from LCLs will ultimately yield such evidence when data from more relevant tissues is used; alternatively, at least a subset of those disorders without evidence for enrichment for eQTLs may turn out to derive substantial heritability from rare variants. In any case, results of our studies clearly support additional investment in eQTL identification in other tissues, as well as in the development of public access databases and software tools, such as those in SCAN (http://www.scandb.org), that allows results of such studies to be productively used by the scientific community. Materials and Methods SNPs Reproducibly Associated with Complex Human Traits The National Human Genome Research Institute (NHGRI) has collected the results of published GWA studies in a publicly available online database (http://www.genome.gov/gwastudies) [16]. This catalog is continually updated; the version we utilize for the present study was downloaded on June 29, 2009. We used the NHGRI default set of SNPs reported to be associated to complex traits with a p-value of at least 10−5 for all figures and tables in the manuscript, but we note that results show more significant enrichment for more stringent definitions for trait-associated SNPs (we tested 10−5 through 10−8). The SCAN Database We have created an online database to serve the results of our gene-expression studies in LCLs. SCAN (SNP and Copy number ANnotation database; http://www.scandb.org) can be queried through either the SNP or gene to retrieve information on the relationship between SNPs and transcript levels at user-specified p-value thresholds [9]. SCAN enables batch queries of genes to retrieve a list of SNPs that predict the expression of the genes at a user-specified threshold. SCAN also holds multi-locus disequilibrium measures [28] to summarize reported LD information among SNPs and to characterize coverage of genes by the high-throughput genotyping platforms. The expression data served in SCAN had been assayed in HapMap LCLs (87 CEU and 89 YRI) with the Affymetrix GeneChip Human Exon 1.0 ST Array [29]. The exon array measures both exon-level and gene-level expression, includes approximately 1.4 million probesets, and profiles over 17,000 gene transcripts. Genome-wide association analyses of 13,080 transcript clusters (gene-level) with reliable expression and more than 2 million SNPs in the CEU or YRI populations were conducted using the Quantitative Trait Disequilibrium Test (QTDT) [30]. A transcript cluster or probeset in the exon array is defined as reliably expressed in LCLs if the log2-transformed expression signal is greater than 6 in at least 80% of the 176 HapMap samples included analyses. Each transcript cluster includes a set of probesets (exon-level) containing all known exons and 5′- and 3′- untranslated regions (UTRs) in the genome. In contrast to other arrays (e.g., U95 and U133 series arrays), the probes on the exon array cover entire gene regions and not just 3′-UTRs. Since SNPs in probes can generate spurious eQTL signals by affecting hybridization, we downloaded SNP data from dbSNP (build 129, human genome assembly build 36) to identify probes in probesets that hybridize to regions containing SNPs. The genomic coordinates of the probes were downloaded from the Affymetrix website (http://www.affymetrix.com). We filtered these probes before conducting the expression analyses generating the results served in SCAN. Analysis of eQTL Enrichment All analyses have been conducted using all available information (eQTLs for both CEU and YRI samples) and also using only information from the CEU samples. Results are substantially similar, but because most GWAS reported to date have been conducted in samples of recent European descent, and to be conservative, summary figures use only the eQTL information from the CEU samples. We also provide details for some of the key eQTL studies in tabular form (Table 1), where we provide summary information for enrichment studies using CEU+YRI eQTL data and for enrichment studies using only the CEU eQTL data. The MAF distribution for SNPs showing reproducible associations with complex human traits is markedly different from the MAF distribution for all SNPs on the high density SNP genotyping platforms, or the minor allele frequency distribution of SNPs included on high throughput platforms used for GWAS (Figure S1). Thus, any study focused on enrichments among SNPs reproducibly associated with complex human phenotypes must be conditioned on MAF. To enable us to conduct simulations conditional on MAF, we constructed MAF bins as follows: allele frequencies were calculated from HapMap genotype data on the CEU. Only the parents' genotypes were included in the frequency calculations since the children's alleles are not independent. We used the toolkit PLINK [31] to calculate frequencies. We classified SNPs on Affymetrix Genome-Wide Human SNP Array 6.0 and Illumina's High Density Human 1M-Duo into non-overlapping minor allele frequency (MAF) bins, each of width .05, using the MAFs of the SNPs in the CEU samples, as most of the published reports summarized in the NHGRI catalog are from GWAS on individuals of recent European descent. We conducted simulations to test for an enrichment of eQTLs among the NHGRI variants associated with complex traits by generating 1000 randomized SNP sets Si each of the same size as the original NHGRI list (1598) containing variants matched on MAF distribution, sampled without replacement from the set of typed SNPs on each high-throughput platform, which had been grouped into discrete MAF bins. For each set, we determined the number of eQTLs, Qi , at a p-value threshold. We tested the robustness of the eQTL enrichment across a range of p-value thresholds (10−8 to 10−4), as shown in Figure 1. These simulations (N = 1000) yield an empirical p-value, calculated as the proportion of simulations in which the number of eQTLs exceeds the observed number, Q, in the NHGRI list. Because some GWAS report associations for SNPs that were interrogated through imputation rather than direct genotyping, we repeated the analysis described above except that we used the entire set of HapMap SNPs to generate the 1000 randomized SNP sets with MAF matched to catalog SNPs, rather than only the SNPs included on high throughput GWAS platforms. Results of these studies show an even more significant enrichment of eQTLs among catalog SNPs (Figure S2). Thus, we focus in this paper on the more conservative results obtained using the simulations generated from SNPs on high-throughput GWAS platforms. To investigate whether the eQTL enrichment we observe in the NHGRI variants is driven by linkage disequilibrium (LD), an LD-pruned SNP set was generated from the NHGRI list using r2 4 MB from (or on another chromosome than) any gene with which the SNP shows transcript level association. In order to devise a one-dimensional score that summarizes as much of the eQTL signal information as possible for prioritizing GWAS signals according to their “functionality”, we first note that neither the scale of the score nor the distribution of the score for non-functional SNPs are relevant – just the appropriate ranking of SNPs with a function in regulating transcript levels. The building blocks for the SNP score are: po which is the overall smallest eQTL p-value (it can be either cis or trans), pc which is the smallest cis p-value, No which is the overall number of transcripts used in the analysis, and Nc which is the number of cis transcripts (i.e. the number of transcripts annotated to genes in the vicinity of the SNP under investigation). The score we propose: builds on Bonferroni corrections applied separately for trans and cis signals (there are 13,080 transcripts used in the analysis). Note that the scores are truncated at zero. Utilizing Expression Information in GWAS In order to determine whether our newly derived functional score can aid in identifying associations likely to be replicated, we first examined the summary data for Crohn's disease from the WTCCC association study [11] (downloaded when results data were publicly available). After QC (less than 5% missing data, good clustering), we filtered SNPs at a 1% MAF threshold, and removed markers with no rs ID, leaving 391,878 SNPs. We matched these SNPs using the rs IDs within the SCAN database, leaving us with 386,306 SNPs. We then ordered the SNPs according to their score S from largest to smallest score. The top SNPs based on the eQTL score show an excess of small association p-values with Crohn's disease (for example, in the top ten SNPs, six have p-values smaller than 0.01). We divided the SNPs into groups of 10,000 (in the order of their eQTL scores) and counted the number of SNPs in each group with p 3, >2, and >1 among the top 1000 SNPs for Crohn's disease (Figure 3). We repeated the Crohn's disease enrichment studies in the other WTCCC phenotypes. As with Crohn's disease, we first conducted analyses checking for an excess of small association p-values in the SNPs that have high expression annotation scores (Figure 4). Finally, we ranked the SNPs according to their association p-values and looked for an enrichment of high expression scores (Figure 5). To insure that results of enrichment studies were not unduly influenced by large numbers of correlated SNPs in the MHC region with high eQTL scores, we repeated the analyses summarized in Figure 2 and Figure 4 after removing all SNPs on chromosome 6 (see Figure S5). Results were relatively unchanged. Supporting Information Figure S1 Minor allele frequency (MAF) distributions for Affymetrix 6.0 SNPs (A), Illumina 1M SNPs (B), and NHGRI Associated SNPs (C). (1.57 MB TIF) Click here for additional data file. Figure S2 (A) The distribution of the number of eQTLs (at p<10−4 observed for each of 1,000 draws of 1,598 SNPs from bins matched for minor allele frequency to the 1,598 SNPs downloaded from the NHGRI catalog (bins include all HapMap SNPs) is shown in the bar graphs, with the actual number of eQTLs observed in the 1,598 SNPs from the NHGRI catalog shown as a solid circle. (B) Identical to Figure 1, leftmost panel, with analysis as above, except that SNPs for the simulation were drawn from SNPs on high-throughput GWAS panels. (2.69 MB TIF) Click here for additional data file. Figure S3 Observed numbers of eQTLs among all trait-associated SNPs trimmed for LD are shown as solid black circles with the bar graph showing the distribution of eQTLs among SNPs chosen at random from the same allele frequency bins from among all SNPs included on high-throughput GWAS products. Enrichment of eQTLs among trait-associated SNPs is preserved even in the absence of LD among trait-associated SNPs. (2.09 MB TIF) Click here for additional data file. Figure S4 Observed numbers of master regulators among all trait-associated SNPs are indicated by the solid black circles and the distribution of the number of master regulators (defined as SNPs predicting 10 transcripts and SNPs predicting 100 transcripts, both for p<10−4) observed in 1,000 draws of 1,598 SNPs from minor-allele-frequency-matched bins (including all SNPs on Illumina 1M and Affymetrix 6.0 products) is plotted in the bar graphs. (2.14 MB TIF) Click here for additional data file. Figure S5 This is similar to Figure 2, except that all SNPs on chromosome 6 have been removed from calculations, demonstrating that the Crohn's enrichment for eQTLs is not dependent on SNPs in the HLA region. Results were similar for T1D and RA. (0.88 MB TIF) Click here for additional data file.

0 comments Cited 530 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: not found
Article: not found

Different immune cells mediate mechanical pain hypersensitivity in male and female mice.

Simon Beggs, Sarah Taves, Jessica K Alexander … (2015)

A large and rapidly increasing body of evidence indicates that microglia-to-neuron signaling is essential for chronic pain hypersensitivity. Using multiple approaches, we found that microglia are not required for mechanical pain hypersensitivity in female mice; female mice achieved similar levels of pain hypersensitivity using adaptive immune cells, likely T lymphocytes. This sexual dimorphism suggests that male mice cannot be used as proxies for females in pain research.

0 comments Cited 496 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Landscape of X chromosome inactivation across human tissues

Taru Tukiainen, Alexandra-Chloe Villani, Angela Yen … (2017)

X chromosome inactivation (XCI) silences the transcription from one of the two X chromosomes in mammalian female cells to balance expression dosage between XX females and XY males. XCI is, however, characteristically incomplete in humans: up to one third of X-chromosomal genes are expressed from both the active and inactive X chromosomes (Xa and Xi, respectively) in female cells, with the degree of “escape” from inactivation varying between genes and individuals 1,2 (Fig. 1). However, the extent to which XCI is shared between cells and tissues remains poorly characterized 3,4 , as does the degree to which incomplete XCI manifests as detectable sex differences in gene expression 5 and phenotypic traits 6 . Here we report a systematic survey of XCI integrating over 5,500 transcriptomes from 449 individuals spanning 29 tissues from GTEx (V6 release), and 940 single-cell transcriptomes, combined with genomic sequence data (Fig. 1). We show that XCI at 683 X-chromosomal genes is generally uniform across human tissues, but identify examples of heterogeneity between tissues, individuals, and cells. We show that incomplete XCI affects at least 23% of X-chromosomal genes, identify seven new escape genes supported by multiple lines of evidence, and demonstrate that escape from XCI results in sex biases in gene expression, establishing incomplete XCI as a likely mechanism introducing phenotypic diversity 6,7 . Overall, this updated catalogue of XCI across human tissues informs our understanding of the extent and impact of the incompleteness in the maintenance of XCI. Mammalian female tissues consist of two mixed cell populations, each with either the maternally or paternally inherited X chromosome marked for inactivation. To overcome this heterogeneity, assessments of human XCI have often been confined to the use of artificial cell systems 1 , or samples presenting with skewed XCI 1,2 , i.e. preferential inactivation of one of the two X chromosomes, which is common in clonal cell lines but rare in karyotypically normal, primary human tissues 8 (Supplementary Note, Extended Data Fig. 1). Others have used bias in DNA methylation 3,4,9 or in gene expression 5,10 between males and females as a proxy for XCI status. Surveys of XCI are powerful in engineered model organisms, e.g. mouse models with completely skewed XCI 11 , but the degree to which these discoveries are generalizable to human XCI remains unclear given marked differences in XCI initiation and extent of escape across species 7 . Here, we describe a systematic survey of the landscape of human XCI using three complementary RNA sequencing (RNA-seq)-based approaches (Fig. 1) that together allow an assessment of XCI from individual cells to population across a diverse range of human tissues. Given the limited accessibility of most human tissues, particularly in large sample sizes, no global investigation of the impact of incomplete XCI on X-chromosomal expression has been conducted in data sets spanning multiple tissue types. We used the Genotype Tissue Expression (GTEx) project 12 data set (V6 release), which includes high-coverage RNA-seq data from diverse human tissues, to investigate male-female differences in the expression of 681 X-chromosomal protein-coding and long non-coding RNA (lncRNA) genes in 29 adult tissues (Extended Data Table 1), hypothesizing that escape from XCI should typically result in higher female expression of these genes. Previous work 5,10,13 has indicated that some escape genes show female bias in expression, but our analysis benefits from a larger set of profiled tissues and individuals, as well as the high sensitivity of RNA-seq. To confirm that male-female expression differences reflect incomplete XCI, we assessed the enrichment of sex-biased expression in known XCI categories using 561 genes with previously assigned XCI status, defined as escape (N=82), variable escape (N=89) or inactive (N=390)(Fig. 1, Supplementary Table 1). Sex-biased expression is enriched in escape genes compared to both inactive (two-sided paired Wilcoxon P=3.73×10-9) and variable escape genes (P=3.73×10-9) (Fig. 2b, Extended Data Fig. 2), with 74% of escape genes showing significant (false discovery rate (FDR) q-value 90% concordance in effect direction and significant sex bias (Fig. 2d, Supplementary Table 3), e.g. CHM, that replicates in single-cell RNA-seq (scRNA-seq; see below), suggesting that variable escape can also have considerable population-level impact. One gene without an assigned XCI status shows a similar sex bias pattern to escape genes; RP11-706O15.3 (Fig. 2d) resides between escape and variable escape genes PRKX and NLGN4X, consistent with known clustering of escape genes 1,2 . Some escape genes show more heterogeneous sex bias, e.g., ACE2 (Fig. 2a, Supplementary Discussion). Many of such genes lie in the evolutionarily older region of the chromosome 15 , in Xq, where escape genes also show higher tissue-specificity and lower expression levels (Extended Data Fig. 5), characteristics linked with higher protein evolutionary rates 16,17 . While sex bias serves as a proxy for XCI status, it provides only an indirect measure of XCI. We identified a GTEx female donor with an unusual degree of skewing of XCI (Fig. 3a), the same copy of chrX being silenced in ∼100% of cells across all tissues, yet without any X-chromosomal abnormality detected by whole-genome sequencing (WGS) (Supplementary Note, Extended Data Fig. 6), providing an opportunity to leverage allele-specific expression (ASE) across 16 tissues to investigate XCI. This approach is analogous to previous surveys in mouse 11 or in human cell lines with skewed XCI 2 , but extends the assessment to larger number of tissues and avoids biases arising from genetic heterogeneity between tissue samples. Analysis of the X-chromosomal allelic counts (Supplementary Tables 4-6) from this GTEx donor highlights the incompleteness and consistency of XCI across tissues (Fig. 3b). Approximately 23% of the 186 X-chromosomal genes assessed show expression from both alleles, indicative of incomplete XCI, matching previous estimates of the extent of escape 1,2 . For 43% of the genes expressed biallelically in this sample, Xi expression is of similar magnitude between tissues, thus supporting the observation of general global and tight control of XCI. However, suggesting some tissue-dependence in XCI, the rest of biallelically expressed genes show variability in Xi expression, including a gene subset (5.8% of all genes) that appear biallelic in only one of the multiple tissues assayed. While tissue-specific escape is common in mouse 11 , limited evidence exists for such a pattern in human tissues beyond neurons 3,4,9 . In our data, among the genes with the strongest evidence for tissue-specific escape is KAL1 (Fig. 3f, Supplementary Table 6), the causal gene for X-linked Kallmann syndrome; here KAL1 shows biallelic expression exclusively in lung (Fig. 3f), in line with the strong female bias detected specifically in lung expression in the previous analysis (Fig. 2a), suggesting that tissue differences in escape can directly translate to tissue-specific sex biases in gene expression. Altogether, the predictions of XCI status in this sample align with previous assignments (Supplementary Table 7, e.g. TSR2, XIST and ZBED1, Fig. 3c-f), but suggest five new incompletely inactivated genes (Fig. 3g-k, Supplementary Table 5), three of which act in a tissue-specific manner. For instance,CLIC2, in previous studies called either subject 2 to or variably escaping 1 from XCI, shows considerable Xi expression only in skin tissue. Such specific patterns illustrate the need to assay multiple tissue types to fully uncover the diversity in XCI. The emergence of scRNA-seq methods 18 presents an opportunity to directly assess XCI without the complication of cellular heterogeneity in bulk tissue samples (Fig. 1), as demonstrated recently in mouse studies 19-22 , and in human fibroblasts 23 and preimplantation development 24 . To directly profile XCI in human samples, we examined scRNA-seq data in combination with deep genotype sequences from 940 immune-related cells from four females: 198 cells from LCLs sampled from three females of African (Yoruba) ancestry, and 742 blood dendritic cells from a female of Asian ancestry 25 (Fig. 1, Extended Data Table 2). We utilized ASE to distinguish the expression coming from each of the two X-chromosomal haplotypes in a given cell (Supplementary Table 4). As the inference of allele-specific phenomena in single cells is complicated by widespread monoallelic expression 20,26-28 , besides searching for X-chromosomal sites with biallelic expression (Extended Data Fig. 7), we leveraged genotype phase information to detect sites where the expressed allele was discordant with the active X chromosome in that cell. Only 129 (78%) out of the 165 assayed genes (41-98 per sample)were fully inactivated in these data while the rest showed incomplete XCI in one or more samples (Fig. 4a-b, Supplementary Tables 8-9), largely consistent with previous assignments of XCI status to these genes (Fig. 4a, Supplementary Table 10). For instance, single cell data reveal consistent expression from both X-chromosomal alleles for eleven genes in PAR1, in line with their known escape from XCI (e.g. ZBED1, Fig. 4c), and replicate the known expression of XIST exclusively from Xi (Fig. 4d). We next assessed whether our approach could extend the spectrum of escape from XCI. For seven genes previously reported as inactivated the data from single cells pointed to incomplete XCI (Fig. 4e-k, Supplementary Table 11), including FHL1 (Fig. 4e), highlighted as a candidate escape gene also in the GTEx ASE analysis, and ATP6AP2 (Fig. 4h), which displays predominantly female-biased expression across GTEx tissues. Both of these genes demonstrate significant Xi expression in only a subset of the scRNA-seq samples, a pattern consistent with variable escape 1,2 . Between-individual variability exists not only in the presence but also in the degree of expression from Xi (e.g. MSL3, Fig. 4l). Highlighting the capacity of scRNA-seq to provide information beyond bulk RNA-seq, we identify examples where Xi expression varies considerably between the two X-chromosomal haplotypes within an individual (e.g. ASMTL; Supplementary Table 12), suggesting cis-acting variation as one of the determinants for the level of Xi expression 3 . As a further layer of heterogeneity in Xi expression, we find a unique pattern at TIMP1, where the level of Xi expression across cells is not significant, but exclusive to a subset of cells that express the gene biallelically (Extended Data Fig. 7), pointing to cell-to-cell variability in escape. Leveraging the ASE estimates from the scRNA-seq and GTEx analyses to infer the magnitude of the incompleteness of XCI,we find that expression from Xi at escape genes rarely reaches levels equal to Xa. Xi expression remains on average at 33% of Xa expression, yet with wide variability along the chromosome (Supplementary Discussion, Extended Data Fig. 8a), as demonstrated previously in specific tissue types 1,2 . Balanced expression dosage between males and females in PAR1 requires full escape from XCI, yet Xi expression remains below Xa expression also in this region (mean Xi to Xa ratio ∼0.80), pointing to partial spreading of XCI beyond nonPAR. For further support for the consistent male bias in PAR1 expression (Fig. 2a) being due to the incompleteness of escape, we observe no systematic up- or downregulation of Y chromosome expression in PAR1 (Extended Data Fig. 8b, Supplementary Discussion). As another consequence of the partial Xi expression, several of the X-Y homologous genes in nonPAR 29 become male-biased when expression from the Y chromosome counterpart is accounted for (Extended Data Fig. 8c). By combining diverse types and analyses of high-throughput RNA-seq data, we have systematically assessed the incompleteness and heterogeneity in XCI across 29 human tissues (Supplementary Table 13). We establish that scRNA-seq is suitable for surveys of human XCI and present the first steps towards understanding the cellular-level variability in the maintenance of XCI. Our phasing-based approach allows for the full use of low-coverage scRNA-seq, yet as any single individual and cell type is informative for restricted number of genes, larger data sets with more diverse cell types and conditions are required to fully profile XCI. We have thus utilized the multi-tissue GTEx data set to explore XCI in a larger number of X-chromosomal genes and to assess the tissue-heterogeneity and impacts of XCI on gene expression differences between the sexes. These analyses show that incomplete XCI is largely shared between individuals and tissues, and extend previous surveys by pinpointing several examples of variability in the degree of XCI escape between cells, chromosomes, and tissues. In addition, our data demonstrate that escape from XCI results in sex-biased expression in at least 60 genes, potentially contributing to sex differences in health and disease (Supplementary Discussion). As a whole, these results highlight the between-female and male-female diversity introduced by incomplete XCI, the biological implications of which remain to be fully explored. Methods GTEx data The GTEx project 12 collected tissue samples from 554 postmortem donors (187 females, 357 males; age range 20-70), produced RNA sequencing from 8,555 tissue samples and generated genotyping data for up to 449 donors (GTEx Analysis V6 release). More details of methods can be found in Aguet et al. (Aguet et al., co-submitted, Nature). All GTEx data, including RNA, genome and exome sequencing data, used in the analyses described are available through dbGaP under accession phs000424.v6.p1, unless otherwise stated. Summary data and details on data production and processing are also available on the GTEx Portal (http://gtexportal.org). Single-cell samples For the human dendritic cells samples profiled, the healthy donor (ID: 24A) was recruited from the Boston-based PhenoGenetic project, a resource of healthy subjects that are re-contactable by genotype 30 . The donor was a female Asian individual from China, of 25 years of age at the time of blood collection. She was a non-smoker, had normal BMI (height: 168.7cm; weight: 56.45kg; BMI: 19.8), and normal blood pressure (108/74). The donor had no family history of cancer, allergies, inflammatory disease, autoimmune disease, chronic metabolic disorders or infectious disorders. She provided written informed consent for the genetic research studies and molecular testing, as previously reported 25 . Daughters of three parent-child Yoruba trios from Ibadan, Nigeria, (i.e. YRI trios) collected as part of the International HapMap Project, were chosen for single-cell profiling both to maximize heterozygosity and due to availability of parental genotypes allowing for phasing. DNA and LCLs were ordered from the NHGRI Sample Repository for Human Genetic Research (Coriell Institute for Medical Research): LCLs from B-Lymphocyte for the three daughters (catalogue numbers: GM19240, GM19199, GM18518) and DNA extracted from LCLs for all members of the three trios (catalogue numbers: DNA: NA19240, NA19238, NA19239, NA19199, NA19197, NA19198, NA18518, NA18519, NA18520). These YRI samples are referred to by their family IDs: Y014, Y035 and Y117. Clinical muscle samples To assess whether PAR1 genes are equally expressed from X and Y chromosomes, a combination of skeletal muscle RNA sequencing and trio genotyping from eight male patients with muscular dystrophy, sequenced as part of an unrelated study, was used. Patient cases with available muscle biopsies were referred from clinicians starting April 2013 through June 2016. All patients submitted for RNA-sequencing had previously available trio whole exome sequencing with one sample having additional trio whole genome sequencing. Muscle biopsies were shipped frozen from clinical centers via liquid nitrogen dry shipper and, where possible, frozen muscle was sectioned on a cryostat and stained with H&E to assess muscle quality as well as the presence of overt freeze-thaw artifact. Genotyping The GTEx V6 release includes WGS data for 148 donors, including GTEX-UPIC. WGS libraries were sequenced on the Illumina HiSeqX or Illumina HiSeq2000. WGS data was processed through a Picard-based pipeline, using base quality score recalibration and local realignment at known indels. BWA-MEM aligner was used for mapping reads to the human genome build 37 (hg19). SNPs and indels were jointly called across all 148 samples and additional reference genomes using GATK's HaplotypeCaller version 3.1. Default filters were applied to SNP and indel calls using the GATK's Variant Quality Score Recalibration (VQSR) approach. An additional hard filter InbreedingCoeff 307times;, and processed similarly as above. The Y117 trio (sample IDs NA19240 (daughter), NA19238 (mother), and NA19239 (father)) was whole-genome-sequenced as part of the 1000 Genomes project as described previously 31 . The VCF file containing the WGS-based genotypes for SNPs (YRI.trio.2010_09.genotypes.vcf.gz) was downloaded from the project's FTP site. The genotype coordinates (in human genome build 36) in the original VCF were converted to hg19 using the liftover script (liftOverVCF.pl) and chain files provided as part of the GATK package. WES was performed using Illumina's capture Exome (ICE) technology (Y035, Y014, 24A) or Agilent SureSelect Human All Exon Kit v2 exome capture (clinical muscle samples) with a mean target coverage of >80×. WES data was aligned with BWA, processed with Picard, and SNPs and indels were called jointly with other samples using GATK HaplotypeCaller package version 3.1 (24A, clinical muscle samples) or version 3.4 (Y035, Y014). Default filters were applied to SNP and indel calls using the GATK's Variant Quality Score Recalibration (VQSR) approach. A modified version of the Ensembl Variant Effect Predictor was used for variant annotation for all WES and WGS data. For trio WES or WGS data the genotypes of the proband were phased using the PhaseByTransmission tool of the GATK toolkit. Single cell data preparation and sequencing For profiling of healthy DCs, peripheral blood mononuclear cells (PBMCs) were first isolated from fresh blood within 2hrs of collection, using Ficoll-Paque density gradient centrifugation as previously described 32 . Single-cell suspensions were stained per manufacturer recommendations with an antibody panel designed to enrich for all known blood DC population for single cell sorting and single cell RNA-sequencing (scRNA-seq) profiling 25 . A total of 24 single cells from four loosely gated populations were sorted per 96-well plate, with each well containing 10ul of lysis buffer. A total of eight plates were analysed by single-cell RNA-sequencing. All LCL cell lines were cultured according to Coriell's recommendation (medium: RPMI 1640, 2mM L-glutamine, 15% fetal bovine serum (all three from ThermoFisher Scientific)) in T25 tissue culture flask with 10-20 ml medium at 37°C under 5% carbon dioxide. Cells were split upon reaching cell density of approximately 300,000-400,000 viable cells/ml. All three lymphoblast cultures were split once prior to proceeding with single cell sorting. Cells were washed with 1× PBS, pellet resuspended and stained with DAPI (Biolegend) for viability according to manufacturer's recommendation. All single live cells (for both DCs and LCL cell lines) were sorted in 96-well full-skirted eppendorf plate chilled to 4°C, pre-prepared with 10μl TCL buffer (Qiagen) supplemented with 1% beta-mercaptoethanol (lysis buffer) using BD FACS Fusion instrument. Single-cell lysates were sealed, vortexed, spun down at 300g at 4°C for 1 minute, immediately placed on dry ice and transferred for storage at -80°C. The Smart-Seq2 protocol was performed on single sorted cells as described 33,34 , with some modifications as described in Villani et al. 25 (Supplementary Methods). A total of 768 single DCs isolated from healthy Asian female individual, along with 96 single cells from GM19240, 48 single cells from GM19199, and 48 single cells from GM18518 were profiled. Briefly, single-cell lysates were thawed on ice purified, and reverse-transcribed using Maxima H Minus Reverse Transcriptase. PCR was performed with KAPA HiFi HotStart ReadyMix [KAPA Biosystems] and purified with Agencourt AMPureXP SPRI beads (Beckman-Coulter). The concentration of amplified cDNA was measured on the Synergy H1 Hybrid Microplate Reader (BioTek) using High-Sensitivity Qubit reagent (Life Technologies), and the size distribution of select wells was checked on a High-Sensitivity Bioanalyzer Chip (Agilent). Expected quantification was around 0.5-2 ng/μL with size distribution sharply peaking around 2kb. Library preparation was carried out using the Nextera XT DNA Sample Kit (Illumina) with custom indexing adapters, allowing up to 384 libraries to be simultaneously generated in a 384-well PCR plate (note that DCs were processed in 384-well plate while LCL were processed in 96-well plate format). The concentration of the final pooled libraries was measured using the High-Sensitivity DNA Qubit (Life Technologies), and the size distribution measured on a High-Sensitivity Bioanalyzer Chip (Agilent). Expected concentration of the pooled libraries was 10-30 ng/μL with size distribution of 300-700bp. For the DCs, we created pools of 384 cells, while 96 LCL samples were pooled at the time. We sequenced one library pool per lane as paired-end 25 base reads on a HiSeq2500 (Illumina). Barcodes used for indexing are listed in the Supplementary Methods. RNA-seq in GTEx RNA sequencing was performed using a non-strand-specific RNA-seq protocol with poly-A selection of RNA using the Illumina TruSeq protocol with sequence coverage goal of 50M 76 bp paired-end reads as described in detail previously 12 . The RNA-seq data, except for GTEX-UPIC, was aligned with Tophat version v1.4.1 to the UCSC human genome release version hg19 using the Gencode v19 annotations as the transcriptome reference. Gene level read counts and RPKMs were derived using the RNA-SeQC tool 35 using the Gencode v19 transcriptome annotation. The transcript model was collapsed into gene model as described previously 12 . Read count and RPKM quantification include only uniquely mapped and properly paired reads contained within exon boundaries. RNA-seq alignment to personalized genomes For the four single-cell samples and for GTEX-UPIC RNA-seq data was processed using a modification of the AlleleSeq pipeline 36,37 to minimize reference allele bias in alignment. A diploid personal reference genome for each of the samples was generated with the vcf2diploid tool 36 including all heterozygous biallelic single nucleotide variants identified in WES or WGS either together with (YRI samples) or without (GTEX-UPIC, 24A) maternal and paternal genotype information. The RNA-seq reads were then aligned to both parental references using STAR 38 version 2.4.1a in a per-sample 2-pass mode (GTEX-UPIC and YRI samples) or version 2.3.0e (24A) using hg19 as the reference. The alignments were combined by comparing the quality of alignment between the two references: for reads aligning uniquely to both references the alignment with the higher alignment score was chosen and reads aligning uniquely to only one reference were kept as such. RNA-seq of clinical muscle samples Patient RNA samples derived from primary muscle were sequenced using the GTEx sequencing protocol 12 with sequence coverage of 50M or 100M 76 bp paired-end reads. RNA-seq reads were aligned using STAR 38 2-Pass version v.2.4.2a using hg19 as the reference. Junctions were filtered after first pass alignment to exclude junctions with less than 5 uniquely mapped reads supporting the event and junctions found on the mitochondrial genome. The value for unique mapping quality was assigned to 60 and duplicate reads were marked with Picard MarkDuplicates (v.1.1099). Catalogue of X-inactivation status In order to compare results from the ASE and GTEx analyses with previous observations on genic XCI status we collated findings from two earlier studies 1,2 that represent systematic expression-based surveys into XCI. Each study catalogues hundreds of X-linked genes and together the data span two tissue types. Carrel and Willard 1 surveyed in total 624 X-chromosomal transcripts expressed in primary fibroblasts in nine cell hybrids each containing a different human Xi. In order to find the gene corresponding to each transcript, the primer sequences designed to test the expression of the transcripts in the original study were aligned to reference databases based on Gencode v19 transcriptome and hg19 using an in-house software (unpublished) (Supplementary Methods). In total 553 transcripts primer pairs were successfully matched to X-chromosomal Gencode v19 reference mapping together to 470 unique chrX genes (Supplementary Methods). These 470 genes were split into three XCI status categories (escape, variable, inactive) based on the level of Xi expression (i.e. the number of cell lines expressing the gene from Xi) resulting in 75 escape, 51 variable escape and 344 inactive genes. Cotton et al 2 surveyed XCI using allelic imbalance in clonal or near-clonal female LCL and fibroblast cell lines and provided XCI statuses for 508 genes (68 escape, 146 variable escape, 294 subject genes). The data was mapped to Gencode v19 using the reported gene names and their known aliases (Supplementary Methods), resulting in a list of XCI statuses for 506 X-chromosomal genes. The results were combined by retaining the XCI status in the original study where possible (i.e. same status in both studies or gene unique to one study) and for genes where the reported XCI statuses were in conflict the following rules were applied: 1) A gene was considered “escape” if it was called escape in one study and variable in the other, 2) “variable escape” if classified as escape and inactive, and 3) “inactive” if classified as inactive in one study and variable escape in the other. The final combined list of XCI statuses consisted of 631 X-chromosomal genes including 99 escape, 101 variable escape and 431 inactive genes. Analysis of sex-biased expression Differential expression analyses were conducted to identify genes that are expressed at significantly different levels between male and female samples using 29 GTEx V6 tissues with RNA-seq and genotype data available from more than 70 individuals after excluding samples flagged in QC and sex-specific, outlier (i.e. breast tissue) and highly correlated tissues 13 . Only autosomal and X-chromosomal protein-coding or lncRNA genes in Gencode v19 were included, and further all lowly-expressed genes were removed. (Supplementary Methods and Extended Data Table 1). Differential expression analysis between male and female samples was conducted following the voom-limma pipeline 39-41 available as an R package through Bioconductor (https://bioconductor.org/packages/release/bioc/html/limma.html) using the gene-level read counts as input. The analyses were adjusted for age, three principal components inferred from genotype data using EIGENSTRAT 42 , sample ischemic time, surrogate variables 43,44 built using the sva R package 45 , and the cause of death classified into five categories based on the 4-point Hardy scale (Supplementary Methods). To control the false discovery rate (FDR), the qvalue R package was used to obtain q-values applying the adjustment separately for the differential expression results from each tissue. The null hypothesis was rejected for tests with q-values below 0.01. XY homolog analysis A list of Y-chromosomal genes with functional counterparts in the X chromosome, i.e. X-Y gene pairs, was obtained from Bellott et al 29 , which lists 19 ancestral Y chromosome genes that have been retained in the human Y chromosome. After excluding two of the genes (MXRA5Y and OFD1Y), which were annotated as pseudogenes by Bellot et al and further four genes (SRY, RBMY, TSPY, and HSFY) that according to Bellot et al have clearly diverged in function from their X-chromosomal homologs, the remaining 13 Y-chromosomal genes were matched with their X chromosome counterparts using gene pair annotations given in Bellot et al or by searching for known paralogs of the Y-chromosomal genes. To test for completeness of dosage compensation at the X-Y homologous genes, the sex bias analysis in GTEx data was repeated replacing the expression of the X-chromosomal counterpart with the combined expression of the X and Y homologs. Chromatin state analysis To study the relationship between chromatin states and XCI, we used chromatin state calls from the Roadmap Epigenomics Consortium 46 . Specifically, we used the chromatin state annotations from the core 15-state model, publicly available at http://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15state. We followed our previously published method 47 to calculate the covariate-corrected percentage of each gene body assigned to each chromatin state. After pre-processing, we filtered down to the 399 inactive and 86 escape genes on the X chromosome, and down to 38 female epigenomes. To compare the chromatin state profiles of the escape and inactive genes in female samples, we used the one-sided Wilcoxon rank sum test. Specifically, for each chromatin state, we averaged the chromatin state coverage across the 38 female samples for each gene, and compared that average chromatin state coverage for all 86 escape genes to the average chromatin state coverage for all 399 inactive genes. We performed both one-sided tests, to test for enrichment in escape genes, as well as enrichment in inactive genes. Next, we performed simulations to account for possible chromatin state biases, such as the fact that the escape and inactive genes are all from the X chromosome. Specifically, we generated 10,000 randomized simulations where we randomly shuffled the “escape” or “inactive” labels on the combined set of 485 genes, while retaining the sizes of each gene set. For each of these simulated “escape” and “inactive” gene sets, we calculated both one-sided Wilcoxon rank sum test p-values as described above, and then, we calculated a permutation “p-value” for the real gene sets based on these 10,000 random simulations (Supplementary Methods). Finally, we used Bonferroni multiple hypothesis correction for our significance thresholds to correct for our 30 tests, one for each of 15 chromatin states, and both possible test directions. Allele-specific expression For ASE analysis the allele counts for biallelic heterozygous variants were retrieved from RNA-seq data using GATK ASEReadCounter (v.3.6) 37 . Heterozygous variants that passed VQSR filtering were first extracted for each sample from WES or WGS VCFs using GATK SelectVariants. The analysis was restricted to biallelic SNPs due to known issues in mapping bias in RNA-seq against indels 37 . Sample-specific VCFs and RNA-seq BAMs were inputted to ASEReadCounter requiring minimum base quality of 13 in the RNA-seq data (scRNA-seq samples, GTEX-UPIC) or requiring coverage in the RNA-seq data of each variant to be at least 10 reads, with a minimum base quality of 10 and counting only reads with unique mapping quality (MQ = 60) (clinical muscle samples). For downstream processing of the scRNA-seq and GTEX-UPIC ASE data, we applied further filters to the data to focus on exonic variation only and to conservatively remove potentially spurious sites (Supplementary Methods), e.g. sites with non-unique mappability were removed, and further, after an initial analysis of the ASE data, subjected 22 of the X-chromosomal ASE sites to manual investigation. For GTEX-UPIC the X-chromosomal ASE data was limited in case of multiple ASE sites to only one site per gene, by selecting the site with coverage >7 reads in the largest number of tissues, in order to have equal representation from each gene for downstream analyses. Assessing ASE across tissues For GTEX-UPIC sample for which ASE data from up to 16 tissues per each ASE site was available, we applied the two-sided Hierarchical Grouped Tissue Model (GTM*) implemented in MAMBA 1.0.0 48,49 to ASE data. The Gibbs sampler was run for 200 iterations with a burn-in of 50 iterations. GTM* is a Bayesian hierarchical model that borrows information across tissues and across variants, and provides parameter estimates that are useful for interpreting global properties of variants. It classifies the sites into ASE states according to their tissue-wide ASE profiles and provides an estimate of the proportion of variants in each of the five different ASE states (strong ASE across all tissues (SNGASE), moderate ASE across all tissues (MODASE), no ASE across all tissues (NOASE), and heterogeneous ASE across tissues (HET1 and HET0)). To summarize the GTM* output in the context of XCI, SNGASE was considered to reflect full XCI, MODASE and NOASE together to represent partial XCI with similar effects across tissues, and HET1 and HET0 to reflect partial yet heterogeneous patterns of XCI across tissues. In order to combine estimates from two ASE states, we summed the estimated proportions in each class, and subsequently calculated the 95% confidence intervals for each remaining ASE state using Jeffreys prior. Determining XCI status in GTEX-UPIC In addition to the ASE states provided by the above MAMBA analysis, genic XCI status was assessed by comparing the allelic ratios at each X-chromosomal ASE site in each tissue individually. For each ASE site, the alleles were first mapped to Xa and Xi; the allele with lower combined relative expression across tissues was assumed the Xi allele. As an exception, at XIST the higher expressing allele was assumed the Xi allele. The significance of Xi expression at each ASE observation was tested using a one-sided binomial test, where the hypothesized probability of success was set at 0.025, i.e., the fraction of Xi expression from total expression was expected to be significantly greater than 0.025. To account for multiple testing, FDR correction was applied, using the qvalue R package, to the P-values from the binomial test for each of the 16 tissues separately. Observations with q-values 0.05, and 2) one-sided binomial test indicated allelic expression to be at least nominally significantly greater than 0.025. Only genes with at least two observations of biallelic expression across all cells within a sample were counted as biallelic. Phasing scRNA-seq data We assigned each cell to either of two cell populations distinguished by the parental X-chromosome designated for inactivation utilizing genotype phasing. For the YRI samples, where parental genotype data was available, the assignment to the two parental cell populations was unambiguous for all cells where X-chromosomal sites outside PAR1 or frequently biallelic sites were expressed. For 24A no parental genotype data was available, and hence we utilized the correlation structure of the expressed X-chromosomal alleles across the 948 cells to infer the two parental haplotypes utilizing the fact that in individual cells the expressed alleles at the chrX sites subject to full inactivation (i.e. the majority chrX ASE sites), are from the X chromosome active in each cell (Supplementary Methods). In other words, while monoallelic expression in scRNA-seq in the autosomes is largely stochastic in origin, in the X chromosome the pattern of monoallelic expression is consistent across cells with the same parental X chromosome active 21 , unless a gene is expressed also from the inactive X. As such, for the phase inference calculations, we excluded all PAR1 sites and all additional sites that were frequently biallelic, to minimize the contribution of escape genes to the phase estimation. After assigning each informative cell to either of the parental cell populations, the reference and alternate allele reads for each ASE site were mapped to active and inactive allele reads within each sample using the actual or inferred parental haplotypes. The data was first combined per variant by taking the sum of active and inactive counts separately across cells, and further similarly combined per gene, if multiple SNPs per gene were available. For 24A the allele expressed at XIST was assumed the Xi allele, in line with the exclusive Xi expression in the Yoruba samples confirmed using the information on parental haplotypes. Determining XCI status from scRNA-seq ASE Before calling XCI status using the Xa and Xi read counts from the phased data aggregated across cells, we excluded all sites without fewer than five cells contributing ASE data at each gene and also all sites with coverage lower than eight reads across cells within each sample. To determine whether the observed Xi expression is significantly different from zero, hence indicative of incomplete XCI at the site / gene, we required the Xi to total expression ratio to be significantly (q-value 0.1 RPKM and expressed in more than 10 individuals at >1 counts per million). P-values are calculated using the Wilcoxon Rank Sum test. All genes expressed in at least one tissue are included in the comparisons. Extended Data Figure 6 X-chromosomal RNA-seq and WGS data in the GTEx donor with fully skewed XCI (GTEX-UPIC) a) Allelic expression in chrX in 16 RNA-sequenced tissue samples available from the donor. Dashed red lines indicate PAR1 and PAR2 boundaries. b) Allele balance and allele depth across chrX in WGS for GTEX-UPIC and randomly chosen two female and one male GTEx WGS samples. Extended Data Figure 7 Expressed alleles at biallelically expressed ASE sites in scRNA-seq a) X-chromosomal genes repeatedly biallelic in scRNA-seq (see Methods for details). b) Illustration of the relative expression from the two alleles at all X-chromosomal ASE sites that were repeatedly biallelically expressed across cells in either of the two scRNA-seq samples that showed random XCI (Y035 and 24A). Narrow white lines separate observations from individual cells. Extended Data Figure 8 Assessment of the level of Xi expression at escape genes and in different regions of the X chromosome a) The ratio of Xi-to-Xa expression in the single cell samples (left panel; each circle represents a sample) and in the skewed XCI donor from GTEx (middle panel; each circle represents a tissue), and the female-to-male ratio in expression (right panel, each circle represents a tissue) at reported escape genes. Genes are ordered according to their location in the X chromosome with genes in the pseudoautosomal region residing in the top part of the figure. Dark border around a circle indicate there was significant evidence for Xi expression greater than the baseline in the given sample or tissue (left and middle panels) or significant sex-bias in the given tissue (right panel). Given some outliers, e.g. XIST, the Xi-to-Xa ratio is capped at 1.75 and female-to-male ratio at 2.25. b) The relative expression arising from the X and Y chromosome at PAR1 genes in skeletal muscle in eight males. The allelic expression at these genes was assigned to the two chromosomes utilizing parental genotypes available for these samples (see Methods for details). The dashed line at 0.5 indicates the point where expression from X and Y chromosomes is equal. The error bars give the 95% confidence intervals for the observed read ratio. c) Heatmap representation of the change in pattern of sex-bias at 13 X-Y homologous gene pairs (see Methods for details) in nonPAR from only including the X-chromosomal expression (heatmap on the left) to accounting for the Y-chromosomal expression (heatmap on the right). The color scale displays the direction of sex-bias with red color indicating higher female expression. Genes that were too lowly expressed in the given tissue type to be assessed in the sex-bias analysis are colored grey. Dots mark the observations where sex-bias was significant at FDR<1%. The grey bars on top of the heatmaps indicate the location of the gene in the X chromosome: dark grey indicating Xp and lighter grey Xq. Extended Data Table 1 Tissues, individuals and genes in the GTEx sex-bias analysis Tissues Individuals Genes analyzed Abbreviation Full name All Females Males Mean age All Autosomes ChrX ADPSBQ Adipose - Subcutaneous 297 186 111 52.15 15,273 14,735 538 ADPVSC Adipose - Visceral (Omentum) 184 117 67 51.97 15.301 14,765 536 ADRNLG Adrenal Gland 126 70 56 50.51 14.956 14,435 521 ARTAORT Artery - Aorta 197 126 71 51.11 14.675 14,137 538 ARTCRN Artery - Coronary 118 70 48 51.7 14,881 14,350 531 ARTTBL Artery - Tibial 284 183 101 50.26 14,501 13,981 520 BRNCTXA Brain - Cortex 92 66 26 57.67 15,339 14,791 548 CLNSGM Colon - Sigmoid 114 72 42 48.28 15,045 14,524 521 CLNTRN Colon - Transverse 255 159 96 50.93 15,732 15,181 551 ESPGEJ Esophagus - Gastroesophageal Junction 124 74 50 53.52 14,770 14,245 525 ESPMCS Esophagus - Mucosa 169 97 72 48.89 15,137 14,617 520 ESPMSL Esophagus - Muscularis 126 re 48 50.74 14,879 14,356 523 FIBRBLS Cells - Transformed fibroblasts 240 150 90 50.2 13,635 13,158 477 HRTAA Heart - Atrial Appendage 218 137 81 48.62 14,662 14,145 517 HRTLV Heart - Left Ventricle 159 105 54 53.64 14,075 13,586 489 LCL Cells - EBV-transformed lymphocytes 190 123 67 50.75 13,067 12,621 446 LIVER Liver 96 63 33 53.52 14,031 13,556 475 LUNG Lung 277 181 96 52.06 16,154 15,590 564 MSCLSK Muscle - Skeletal 361 228 133 51.85 13,623 13,153 470 NERVET Nerve - Tibial 256 163 93 51.65 15,563 15,020 543 PNCREAS Pancreas 149 87 62 50.09 14,355 13,861 494 PTTARY Pituitary 86 64 22 56.37 16,068 15,489 579 SKINNS Skin - Not Sun Exposed (Suprapubic) 195 128 67 53.06 15,601 15,069 532 SKINS Skin - Sun Exposed (Lower leg) 300 188 112 52.22 15,746 15,211 535 SMINTI Small Intestine - Terminal Ileum 77 43 34 47.62 15,594 15,046 548 SPLEEN Spleen 89 50 39 48.26 14,993 14,469 524 STMACH Stomach 169 97 72 48.2 15,604 15,057 547 THYROID Thyroid 278 179 99 52.14 15,974 15,417 557 WHLBLD Whole Blood 338 213 125 51.64 13,187 12,751 436 Total 449 290 159 52.27 19,839 19,158 681 Extended Data Table 2 Single-cell RNA-seq samples ID 24A Y117 Y035 Y014 Ancestry China, Asia Yoruba / Nigeria, Africa Yoruba / Nigeria, Africa Yoruba / Nigeria, Africa Design Singleton Trio Trio Trio Genotype data WES WGS WES WES Number of cells 742 96 48 48 Cell type Dendritic cells LCL LCL LCL Sequenced read pairs (mean (range)) 1,187,000 (335-7,403,000) 2,547,000 (38,190-5,126,000) 2,571,000 (46,940-5,038,000) 2,436,000 (69,130-5,457,000) Aligned read pairs* (mean (range)) 808,600 (197-5,727,000) 1,471,000 (14,910-3,309,000) 1,459,000 (16,400-2,893,000) 1,391,000 (14,920-3,067,000) Alignment rate (mean (range)) 0.667(0.271-0.799) 0.545(0.251-0.645) 0.551 (0.266-0.615) 0.526(0.175-0.606) Skew in XCI (% maternal active : % paternal active) 54:46 (373 cells where one parental chromosome active, 315 cells where the other parental chromosome active, 54 cells uninformative for X- chromosomal phasing) 100:0 (90 cells where maternal X chromosome active, 6 cells uninformative for X- chromosomal phasing) 79:21 (37 cells where maternal X chromosome active, 8 cells where paternal X chromosome active, 2 cells uninformative for X- chromosomal phasing) 100:0 (43 cells where maternal X chromosome active, 2 cells uninformative for X- chromosomal phasing) Notes Due to the unavailability of parental genotype information, the parental origin of the inferred X- chromosomal haplotypes is unknown * uniquely aligned, properly paired, QC passed reads. Supplementary Material reporting summary supp_infoguide supp_note supp_table1 supp_table10 supp_table11 supp_table12 supp_table13 supp_table14 supp_table2 supp_table3 supp_table4 supp_table5 supp_table6 supp_table7 supp_table8 supp_table9