Single-cell expression profiling by RNA-Seq promises to exploit cell-to-cell variation
in gene expression to reveal regulatory circuitry governing cell differentiation and
other biological processes. Here, we describe Monocle, a novel unsupervised algorithm
for ordering cells by progress through differentiation that dramatically increases
temporal resolution of expression measurements in a model of skeletal muscle differentiation.
This reordering unmasks switch-like changes in expression of key regulatory factors,
reveals sequentially organized waves of gene regulation, and exposes novel regulators
of cell differentiation. A loss-of function screen revealed that many of these inhibitors
act through regulatory elements also used by pro-myogenic factors to activate downstream
genes. This study demonstrates that single-cell expression analysis by Monocle can
uncover novel regulatory interactions governing differentiation.
Cell differentiation is governed by a vast and complex gene regulatory program. During
differentiation, each cell makes fate decisions independently by integrating a wide
array of signals from other cells, executing a complex choreography of gene regulatory
changes. Recently, several studies carried out at single-cell resolution have revealed
high cell-to-cell variation in most genes during differentiation
1–5
, even among key developmental regulators. Although high variability complicates analysis
of such experiments
6
, it might define biological progression between cellular states, revealing regulatory
modules of genes that co-vary in expression across individual cells
7
.
Prior studies have used approaches from computational geometry
8,9
and supervised machine learning
10
to order bulk cell populations from time-series microarray experiments by progress
through a biological process. Applying this concept to order individual cells could
expose fine-grained gene expression dynamics as they differentiate. We have developed
Monocle, an algorithm that harnesses single cell variation to sort cells in “pseudo
time” according to progress through differentiation. Applying Monocle to the classic
model of myogenesis unveiled dynamics at unprecedented resolution and exposed novel
regulatory factors.
Skeletal myoblasts undergo a well-characterized sequence of morphological and transcriptional
changes during differentiation
11
. Global expression and epigenetic profiles have reinforced the view that a small
cohort of transcription factors (e.g. MYOD, MYOG, MRF4, and MYF5) orchestrates these
changes
12
. However, efforts to expand this set of factors and map the broader myogenic regulatory
network have been hampered by the temporal resolution of global expression measurements,
with thousands of genes following a limited number of coarse kinetic trends
13
.
Single-cell measurements of markers of myogenesis have made clear that cells do not
progress through differentiation in synchrony. A population of cells captured at the
same time may thus cover a range of distinct intermediate differentiation states.
Drawing conclusions from a group of individuals based on the properties of their average
is a hazardous practice because the average can mask important trends among the individuals,
resulting in phenomena such as Simpson's paradox
14
. Experimental synchronization or stringent isolation of myogenic precursors is often
challenging and dramatically alters differentiation kinetics.
We hypothesized that capturing complete expression profiles of individual cells might
avoid these problems and dramatically increase temporal resolution in global transcriptome
dynamics. In essence, a single-cell RNA-Seq experiment might constitute a time-series,
with each cell representing a distinct time point along a continuum.
To test this hypothesis we investigated the single cell transcriptome dynamics during
myogenesis. We expanded primary human myoblasts under high mitogen conditions (GM),
and then induced differentiation by switching to low-mitogen media (DM). We then captured
50–100 cells at each of four time points following serum switch using the Fluidigm
C1 microfluidic system. RNA from each cell was isolated and used to construct mRNA-Seq
libraries, which were then sequenced to a depth of ~4 million reads per library, resulting
in a complete gene expression profile for each cell (Fig 1a, S1).
Averaging expression profiles of cells collected at the same time correlated well
with the corresponding bulk RNA-Seq libraries, and moderately expressed genes were
detectable (≥ 1 FPKM) in a majority of individual cells (Fig 1b, S2, S3). However,
markers of mature myocytes were present at all time points following serum switch,
and many other genes showed similar temporal heterogeneity (Fig 1c) We speculated
that the high variability in cell-to-cell gene expression levels was due to unsynchronized
differentiation, with myoblasts, intermediate myocytes, and mature myotubes residing
in the same well concurrently. Indeed, large, multinucleated MYH2+ cells were abundant
after 72 hours in DM, but these cells were present at lower frequency even at 24 hours
(Fig 1c).
We reasoned that informatically ordering the cells by their progress through differentiation,
rather than by the time they were collected, would distinguish genes activated early
in differentiation from those activated later. To this end, we developed a novel unsupervised
algorithm, Monocle, which re-ordered the cells to maximize the transcriptional similarity
between successive pairs (Fig 2a). The algorithm first represents the expression profile
of each cell as a point in a high-dimensional Euclidean space, with one dimension
for each gene. Second, it reduces the dimensionality of this space using Independent
Component Analysis
15
. Third, Monocle constructs a minimum spanning tree (MST) on the cells, an approach
now commonly used in other single-cell settings, such as flow or mass cytometry
16,17
. Fourth, the algorithm finds the longest path through the MST, corresponding to the
longest sequence of transcriptionally similar cells. Finally, Monocle uses this sequence
to produce a “trajectory” of an individual cell's progress through differentiation.
Progress along a differentiation trajectory is measured in “pseudo-time”: the total
transcriptional change a cell undergoes as it differentiates. This strategy is derived
from a prior algorithm for temporally ordering microarray samples
8
, but extends it to allow for multiple cell fates stemming from a single progenitor
cell type. As cells progress, they may diverge along two or more separate paths. After
Monocle finds the longest sequence of similar cells, it examines cells not along this
path to find alternative trajectories through the MST. These sub-trajectories are
ordered and connected to the main trajectory, and each cell is annotated with both
a trajectory and a pseudo-time value. Monocle thus orders cells by progress through
differentiation and can reconstruct branched biological processes, which might arise
when a precursor cell makes cell fate decisions that govern the generation of multiple
subsequent lineages. Importantly, Monocle is unsupervised and needs no prior knowledge
of specific genes that distinguish cell fates, and is thus suitable for studying a
wide array of dynamic biological processes.
Monocle decomposed myoblast differentiation into a two-phase trajectory and isolated
a branch of non-differentiating cells (Fig 2b). The first phase of the trajectory
was primarily composed of cells collected under high-mitogen conditions and which
expressed markers of actively proliferating cells such as CDK1, while the second mainly
consisted of cells collected at 24, 48, or 72 hours following serum switch. Cells
in the second phase were positive for markers of muscle differentiation such as MYOG
(Fig S4). A tightly grouped third population of cells branched from the trajectory
near the transition between phases. These cells lacked myogenic markers but expressed
PDGFRA and SPHK1, suggesting that they are contaminating interstitial mesenchymal
cells and did not arise from the myoblasts. Such cells were recently shown to stimulate
muscle differentiation
18
. Monocle's estimates of the frequency and proliferative status of these cells were
consistent with estimates derived from immunofluorescent stains against ANPEP/CD13
and nuclear phosphorylated H3-Ser10 (Fig S4). Monocle thus enabled analysis of the
myoblast differentiation trajectory without subtracting these cells by immunopurification,
maintaining in vitro differentiation kinetics that resemble physiological cell crosstalk
occurring in the in vivo niche.
To find genes that were dynamically regulated as the cells progressed through differentiation,
we modeled each gene's expression as a nonlinear function of pseudo-time. A total
of 1,061 genes were dynamically regulated during differentiation (FDR < 5%) (Fig 2c).
Cells positive for MEF2C and MYH2, early and late markers of differentiation (respectively)
were present at expected frequencies as assayed by both immunofluorescence and RNA-Seq.
Moreover, the pseudo-time ordering of cells shows an increase in MEF2C+ cell counts
prior to the increase in MYH2+ cells. Importantly, genes that play active roles at
the early and late stages of muscle differentiation showed pseudo-temporal kinetics
that were highly consistent with expectations, with cell-cycle regulators active early
in pseudo-time, and sarcomere components active later, confirming the accuracy of
the ordering (Fig S5).
We next examined the pseudo-temporal kinetics of a set of genes whose mouse orthologs
are targeted by Myod, Myog, or Mef2 proteins in C2C12 myoblasts
19
(Fig S6). The kinetics of these genes during differentiation were highly consistent
with changes observed during murine myogenesis, with nearly all significantly dynamically
regulated genes also differentially expressed during murine myogenesis and vice versa.
In contrast to the high resolution of pseudotime ordering, simply comparing gene expression
levels between groups of cells collected on different days masked changes in key transcriptional
regulators of myogenesis. For example, the pseudo-time reordering of the cells shows
switch-like inactivation of ID1, which is a critical event in muscle differentiation
and leads to the activation of MYOG
12
(Fig 2e,f). Thus, Monocle's ordering of cells by progress increases temporal resolution
of transcriptional dynamics and pinpoints key regulatory events that govern differentiation.
We further assessed Monocle's robustness over different experimental designs by simulating
experiments with fewer captured cells. Monocle placed subsets as small as 50 cells
in pseudo-temporal order highly similar (spearman >= 0.8) to their relative order
within the full data set. The algorithm retained the ability to detect dynamically
regulated genes with high precision (>= 95%) over all designs and with increasing
recall as more of the cells were included. (Fig S7)
We next grouped genes with similar trends in expression, reasoning that such groups
might share common biological functions and regulators. Clustering of genes according
to direction and timing revealed six distinct trends (Fig 3). Genes downregulated
early or upregulated late in pseudo-time were highly enriched for biological processes
central to myogenesis, including cell-cycle exit and activation of muscle-specific
structural proteins. However, the other clusters included many genes with broad roles
in development, including mediators of cell-cell signaling, RNA export and translational
control, and remodeling of cell morphology (Fig S8).
A timeseries analysis of myoblast differentiation with bulk RNA-Seq identified up
and down-regulated genes, but did not identify the transient clusters or distinguish
the early from late regulation visible with pseudo-temporally ordered single cells
(Fig S9). Furthermore, dynamic range of expression was compressed for most genes,
confirming that failure to account for variability in progress through differentiation
leads directly to the effects associated with Simpson's paradox. Pseudo-temporal cell
ordering thus decomposes the coarse kinetic trends produced by conventional RNA-Seq
into distinct, sequential waves of transcriptional reconfiguration.
To identify factors driving myoblast differentiation, we performed a cis-regulatory
analysis on genes in each pseudo-temporal cluster. Cis regulatory elements were first
identified based on DNaseI hypersensitive sites in HSMM cells and HSMM-derived myotubes
20
, classified according to function according to histone marks
21
, and finally annotated with conserved transcription factor binding sites. While downregulated
genes were enriched at near significant levels with binding sites for genes that play
roles in proliferation (e.g. MAX, E2F, and NMYC), nearly all significantly enriched
motifs fell near upregulated genes. These genes were highly enriched for regulatory
elements containing binding motifs for 175 transcription factors, including numerous
well-known regulators of myogenesis, such as MYOD, MYOG, PBX1, MEIS1, and the MEF2
family (Fig S10). Some, but not all, of these factors were revealed by a regulatory
element analysis performed using bulk RNA-Seq data, underscoring the power of increased
(pseudo) temporal resolution of single-cell analysis (Fig S11). A similar analysis
of microRNA target sites identified miR-1, miR-206, miR-133, and numerous others as
regulators of genes activated during myogenesis (Fig S12). Of these, only miR-1/206
target sites were significantly enriched among genes found to be transiently upregulated
and then sharply downregulated. This may suggest that miR-1 and miR-206, which are
expressed at an intermediate stage of myoblast differentiation, may act to strongly
suppress a subset of genes activated earlier.
Many of the transcription factors implicated by our cis regulatory analysis to govern
differentiation had no previously appreciated role in muscle development. To test
potential roles of these factors we performed an RNAi mediated loss of function screen
for 11 candidates. Briefly, we virally expressed proliferating myoblasts with one
of 44 distinct shRNAs targeting either one of these factors or a mock (non-targeting)
control, followed by serum-induced differentiation for five days. We then measured
the frequency and size of myosin heavy chain 2 (MYH2)-positive cells with a high-throughput
immunofluorescence pipeline. Of the targets we tested, MZF1, ZIC1, XBP1, and USF1
showed significantly altered differentiation kinetics (Fig 4a,b, Fig S13) when depleted
with two or more independent hairpins (FDR < 5%).
Knockdown of XBP1, USF1, ZIC1, and MZF1 enhanced myotube formation, with larger myotubes
containing a higher fraction of total nuclei than mock shRNA controls. Depletion of
CUX1, ARID5B, POU2F1, and AHR also increased differentiation efficiency, albeit less
significantly. Importantly, whole-well nuclei counts were similar between knockdowns
and mock controls, indicating that enhanced differentiation was not simply a result
of higher initial cell counts or increased proliferation. With the exception of ZIC1,
forced overexpression did not substantially alter differentiation kinetics (data not
shown).
Notably, several of these factors have binding motifs that are highly enriched in
promoters and enhancers that also have motifs for known muscle regulators (Fig 4c).
For example, USF1 motifs are enriched in enhancers that also have MYOD motifs. Together,
these results confirm that the transcription factors identified as possible regulators
in fact play a role in myoblast differentiation, and demonstrate the power of Monocle
for identifying key differentiation genes.
Here, we report that individual myoblasts progress through differentiation in an unsynchronized
manner, but that they can be reordered according to progress through differentiation.
This pseudo-time ordering pinpoints key events in differentiation that are masked
both by conventional bulk cell expression profiling, and by single-cell expression
profiles ordered by time collected. The reordering resolves sequentially activated
transcriptional sub-programs that are regulated by common factors. The temporal resolution
offered by hundreds of ordered cells might enable future efforts to computationally
infer novel gene-regulatory modules. For example, the enrichment of transiently upregulated
genes for common microRNA target sites raises the question of whether those microRNAs
are expressed later, curtailing what would have been higher levels of expression.
Sequencing-based measurements of small RNAs and mRNAs from the same cell will provide
answers to such systems-level questions. Moreover, single-cell analysis distinguishes
cells of interest from contaminating cell types such as interstitial mesenchymal cells
without experimental isolation that might disrupt cell-cell interactions important
in the in vivo niche.
We identified eight previously unappreciated transcription factors that dramatically
influence the course of myoblast differentiation, thus proving the principle of pseudo-temporal
analysis and expanding the catalog of regulators in this well-studied system. Several
of the eight factors reported here may normally repress differentiation by competing
with pro-myogenic factors for these regulatory elements. Alternatively, these inhibitors
may co-occupy regulatory elements with pro-myogenic factors, preventing transactivation
of their targets (Fig. 4d). Previous studies in other contexts provide mechanistic
data supporting both of these models. USF1 inhibits MyoD autoactivation in Xenopus
by competing with MyoD at its promoter through an alternative E-box
22
. Our results suggest that USF1 may repress a broad array of targets via E-box competition.
CUX1 represses targets in several developmental contexts through binding site competition
23
. XBP1 was recently reported to inhibit myoblast differentiation in mice, potentially
through the mechanisms proposed here
24
. Further experiments in these HSMM cells and myoblasts from other anatomic depots
will be needed to confirm the mechanism of these factors.
While the positive regulators of myogenesis have been well characterized, only a handful
of inhibitors have been identified. The eight inhibitors reported here may shed light
on how the balance of proliferation and differentiation is maintained during development
and regeneration. Ordering the expression profiles of individual cells by biological
progress is thus a powerful new tool for studying cell differentiation, and could
in principle be used to map regulatory networks that govern a much wider array of
biological processes.
Supplementary Material
1