Introduction
Functional genomics has demonstrated considerable success in inferring the inner working
of a cell through analysis of its response to various perturbations. In recent years
several technological advances have pushed gene perturbation screens to the forefront
of functional genomics. Most importantly, modern technologies make it possible to
probe gene function on a genome-wide scale in many model organisms and human. For
example, large collections of knock-out mutants play a prominent role in the study
of Saccharomyces cerevisiae [1], and RNA interference (RNAi) has become a widely used
high-throughput method to knock-down target genes in a wide range of organisms, including
Drosophila melanogaster, Caenorhabditis elegans, and human [2]–[4].
Another major advance is the development of rich phenotypic descriptions by imaging
or measuring molecular features globally. Observed phenotypes can reveal which genes
are essential for an organism, or work in a particular pathway, or have a specific
cellular function. Combining high-throughput screening techniques with rich phenotypes
enables researchers to observe detailed reactions to experimental perturbations on
a genome-wide scale. This makes gene perturbation screens one of the most promising
tools in functional genomics.
Advances in the design and analysis of gene perturbation screens may have an immediate
impact on many areas of biological and medical research. New screening and phenotyping
techniques often directly translate into new insights in gene and protein functions.
Results of perturbation screens can also reveal unexploited areas of potential therapeutic
intervention. For example, a recent RNAi screen showed that some of the most critical
protein kinases for the proliferation and survival of cancer cell lines are also the
least studied [5].
A goal becoming more and more prominent in both experimental as well as computational
research is to leverage gene perturbation screens to the identification of molecular
interactions, cellular pathways, and regulatory mechanisms. Research focus is shifting
from understanding the phenotypes of single proteins to understanding how proteins
fulfill their function, what other proteins they interact with, and where they act
in a pathway. Novel ideas on how to use perturbation screens to uncover cellular wiring
diagrams can lead to a better understanding of how cellular networks are deregulated
in diseases like cancer. This knowledge is indispensable for finding new drug targets
to attack the drivers of a disease and not only the symptoms.
This review surveys the current state-of-the-art in analyzing single gene perturbation
screens from a network point of view. We describe approaches to make the step from
the parts list to the wiring diagram by using phenotypes for network inference and
integrating them with complementary data sources.
Phenotypes
A phenotype can be any observable characteristic of an organism. Analysis strategies
strongly depend on how rich and informative phenotype descriptors are. We will call
phenotypes resulting from a single reporter (or a small number of reporters) low-dimensional
phenotypes and the genes showing significant results hits [6],[7]. Examples of such
low-dimensional phenotypes are cell viability versus cell death [1], growth rates
[8], or the activity of reporter constructs, e.g., a luciferase, downstream of a pathway
of interest [9]. Low-dimensional phenotyping screens can identify candidate genes
on a genome-wide scale and are often used as a first step for follow-up analysis.
We will discuss methods to functionally interpret hits from low-dimensional phenotyping
screens and to place them in the context of cellular networks in the first part of
this review.
The second part will be devoted to high-dimensional phenotyping screens, which evaluate
a large number of cellular features at the same time. Observing system-wide changes
promises key insights into cellular mechanisms and pathways that can not be supplied
by low-dimensional screens. For example, high-dimensional phenotypes can include changes
in cell morphology [10]–[13], or growth rates under a wide range of conditions [14],
or transcriptional changes measured on microarrays [15]–[18], or changes in the metabolome
and proteome [19] measured by mass spectrometry [20] or flow cytometry [21],[22].
Morphological and growth phenotypes can be obtained on a genome-wide scale [13],[14],
while transcriptional and proteomic phenotypes are often restricted to individual
pathways or processes [16],[17],[21].
The distinction between low- and high-dimensional phenotypes may sound technical,
but it is crucial for choosing potential analysis methods. The central difference
is that high-dimensional phenotypes allow one to compute correlations and other similarity
measures, which are not applicable for low-dimensional phenotypes. Another important
distinction is between static phenotypes, providing a “snapshot” of a cell's reaction
to a gene perturbation, and dynamic phenotypes showing a cell's reaction over time.
We expect more and more studies in the future to produce dynamic output and in the
following note explicitly which methods can be applied to dynamic phenotypes. For
the biological interpretation of screening results it is very important to keep in
mind which level of “cellular granularity” a phenotype describes: growth rates or
cell morphologies are much more “high-level” features of the cell than gene or protein
expressions. As soon as more studies produce dynamic phenotypes on many different
cellular levels, integrative analysis of interconnected phenotypes [23] will become
more important. In the following, however, we concentrate on the current state-of-the
art, which almost always uses a single type of readout in a perturbation screen.
Preprocessing Pipeline
In this review we focus on single gene perturbations by knockouts [1] or RNAi [4]
that allow targeting of individual genes or combinations of genes. Before network
analysis, the raw data needs to pass an initial analysis and quality control pipeline
specific to the perturbation and phenotyping technologies used. Low-dimensional screens
are mostly performed in multiple-well-plates and a typical analysis pipeline [4] includes
data preprocessing, removal of spatial biases per plate, normalization between plates,
and finally detection of significant hits [6],[7],[24]. In vertebrates, genes need
to be targeted with multiple siRNAs to ensure effective down-regulation [4], and the
multiple phenotypes per gene can afterwards be integrated into a statistical score
[25]. High-dimensional morphological screens depend on computational analysis like
image segmentation [26],[27] and phenotype discovery [28]–[30] for rapid and consistent
phenotyping. Molecular high-dimensional phenotypes need preprocessing depending on
their platform and different approaches exist, e.g., for flow-cytometry data [31]
or microarrays [32].
From Phenotypes to Cellular Networks
The phenotypes we have discussed above allow only an indirect view on how different
genes in the same process interact to achieve a particular phenotype. Cell morphology
or sensitivity to stresses, for example, are global features of the cell and hard
to relate directly to how individual genes contribute to them (see Figure 1A). Gene
expression phenotypes show transcriptional changes in the genes downstream of a perturbed
pathway but offer only an indirect view of pathway structure because of the high number
of nontranscriptional regulatory events like protein modifications [33]. For example,
different protein activation states by phosphorylation may not be visible by changes
in mRNA concentrations (see Figure 1B).
10.1371/journal.pcbi.1000655.g001
Figure 1
Cellular networks underlying observable phenotypes.
(A) Phenotypes are the response of the cell to external signals mediated by cellular
networks and pathways. The goal of computation is to reconstruct these networks from
the observed phenotypes. (B) Global molecular phenotypes like gene expression allow
a view inside the cell but also have limitations. This is exemplified here in a cartoon
pathway adapted from [61] showing a cascade of five genes/proteins (1–5). Proteins
1–3 form a kinase cascade, 4 is a transcription factor acting on 5. Up-regulation
of 1 starts information flow in the cascade and results in 5 being turned on. In gene
expression data this is visible as a correlation between 1 and 5 (represented as an
undirected edge in the model). Experimentally perturbing a gene, say 3, removes the
corresponding protein from the cascade, breaks the information flow, and results in
an expression change at 5 (represented as an arrow in the model). However, the different
phosphorylation and activation states of proteins 2–4 will most probably not be visible
as changes in gene expression. Thus, because of the pathway mostly acting on the protein
level most parts of the cascade (dashed arrows in the model) can not be inferred from
gene expression data directly.
This gap between observed phenotypes and underlying cellular networks is the main
problem in the analysis of perturbation screens and applies to both low- and high-dimensional
screens. The goal of computational analysis is to bridge this gap by inferring gene
function and recovering pathways and mechanisms from observed phenotypes. The following
methods address the challenge in different ways, mostly by integrating the perturbation
effects and phenotypes with additional sources of information like collections of
functionally related gene sets or protein-interaction networks.
Network Analysis of Low-Dimensional Phenotypes
Global Overview by Enrichment Analysis
A simple way to link phenotypes to gene function is to test whether pathways or functional
groups of genes (e.g., defined by Gene Ontology terms [34] or MSigDB [35]) are enriched
in the list of hits. Most methods use a hypergeometric test statistic (see Figure
2A) and many can be used online [36]–[38] or as Bioconductor packages [39]. An alternative
global functional annotation method tests whether functional groups show a trend towards
especially strong or weak phenotypes without using a cutoff to define hits (see Figure
2B) [35]. Enrichment analysis can also be very useful to analyze high-dimensional
phenotypes, for example when functionally annotating the results of a clustering method.
10.1371/journal.pcbi.1000655.g002
Figure 2
Functional annotation of hits by enrichment analysis.
(A) In the first approach [38] a cutoff is applied to select the hits with strongest
phenotypes. A hyper-geometric test then evaluates if the overlap between the hits
and a given gene set is surprisingly large (or small) compared to the overlap with
a random set. (B) A second approach [35] does not need a cutoff. It maps the gene
set (black bars) onto the observed phenotypes and quantifies if there is a significant
trend or if the genes are spread out uniformly over the whole range.
Enrichment analysis results in a list of p-values describing how significantly each
gene set was represented in the hits. Enrichment analysis reduces complexity and improves
interpretability of results by moving from single genes to functionally related gene
sets. This type of analysis is often called “unbiased” and “hypothesis-free” and is
ideal for a comprehensive first overview. However, enrichment analysis loses its value
for complexity reduction if the number of gene sets becomes too big. Also, overlap
and dependencies between gene lists that could potentially bias the results have so
far only been addressed for the gene ontology (GO) graph [38],[39] but not for more
general collections of gene lists like MSigDB [35].
Good data analysis asks specific questions. A hypothesis-free method can only be the
very first starting point for a deeper exploration of the data. For example, all enrichment
methods rely on known gene sets and cannot uncover new pathways or components. Enrichment
methods treat pathways as bags of unconnected genes without considering connections
within and between pathways. Thus, enrichment methods can only deliver a very crude
picture of the cell. In the following we will discuss approaches to overcome some
of the limitations of enrichment analysis by integrating the observed phenotypes with
complementary sources of information.
Mapping Phenotypes to Networks
Another valuable source of information to interpret RNAi hits are gene and protein
networks obtained either experimentally [40],[41] or computationally by literature
mining [42], or integrating heterogeneous genomic data [43]–[45]. All computational
networks are available online on supplementary Web pages and the experimental networks
can be obtained from databases like STRING [46] or BioGRID [47].
Using these complementary data sources can improve hit identification [48]–[50] and
even provide a more refined view of the pathways the hits contribute to. One strategy
is to search for subnetworks containing a surprisingly large number of hits (see Figure
3A). While this strategy is already useful when evaluating interesting subnetworks
by eye [51],[52], its true power comes from the availability of efficient search algorithms
to find subnetworks enriched for RNAi hits and assess their significance [53]–[57].
An additional application of mapping hits to a network is that known phenotypes can
be used to predict phenotypes of genes not included in the screen, e.g., by assuming
that a gene connected to many hits should also show a strong phenotype [51]. The success
of all network-mapping strategies strongly depends on the quality and coverage of
both the screen and the linkage in the network.
10.1371/journal.pcbi.1000655.g003
Figure 3
Extracting rich subnetworks.
Different patterns in the graph point to a common cellular mechanism causing a phenotype:
(A) hits in a low-dimensional screen (red nodes) clustering in highly connected subnetworks,
and (B) high correlation between high-dimensional phenotypes of target genes connected
in the background network. The black graph represents any type of background network.
Gene Prioritization
Other approaches complement genomic data with biological prior knowledge showing how
“interesting” hits look. Gene prioritization [49],[58] ranks genes according to how
promising they would be for follow-up studies. Because it uses prior knowledge to
fine-tune the algorithm, gene prioritization can be more focussed than a global uninformed
search for enriched subnetworks.
Network Analysis of High-Dimensional Phenotypes
Global Overview by Clustering and Ranking
Most state-of-the-art analysis techniques rely on a “guilt-by-association” paradigm:
genes with similar phenotypes will most probably have a similar biological function.
This explains the prevalence of clustering techniques in analyzing high-dimensional
phenotyping screens [10],[13],[14],[17]. Clustering is a convenient first analysis
and visualization step that can highlight strong trends and patterns in the data and
can thus yield a global first impression of functional units. Another analysis strategy
relying on guilt-by-association is to rank genes by their phenotypic similarity compared
to a gene of interest [11]. Clustering and ranking can be combined with enrichment
analysis (as discussed above) for functional interpretation.
Graph Methods Linking Causes to Effects
Another useful data visualization especially for transcriptional phenotypes is to
build a directed (not necessarily acyclic) graph by drawing an arrow between two genes
if perturbing one results in a significant expression change at the other [59]. This
graph representation can be then used as a starting point for further analysis, for
example by using graph-theoretic methods of transitive reduction [60] to distinguish
between direct and indirect effects of a perturbation [61],[62].
Probabilistic Graphical Models
Most approaches to infer pathway structure from experimental data rely on probabilistic
graphical models. For low-dimensional phenotypes they often suffer from nonuniqueness
and unidentifiability issues [63], but can be applied very successfully in high-dimensional
settings. A prominent approach are (static or dynamic) Bayesian networks, which describe
probabilistically how a gene is controlled by its regulators [64],[65]. To model experimental
perturbations most approaches rely on the concept of “ideal interventions” [66], which
deterministically fix a target gene to a particular state (e.g., “0” for a gene knockout).
Ideal interventions were applied in Bayesian networks [21],[67],[68], factor graphs
[69], and dependency networks [70]. In simulations [71],[72] and on real data [21],[71]
it was found that interventions are critical for effective inference.
The model of ideal interventions contains a number of idealizations (hence the name),
most importantly that manipulations only affect single genes and that perturbation
strength can be controlled deterministically. The first assumption may not be true
if there are off-target or compensatory effects involving other genes. The second
assumption may also not hold true in realistic biological scenarios; in particular
for RNAi screens where experimentalists often lack knowledge about the exact knock-down
efficiency. Probabilistic generalizations of ideal interventions can be used to cope
with this uncertainty [73].
Probabilistic Data Integration
High-dimensional phenotypic profiles can be mapped to given graphs and networks by
finding subgraphs that are connected in the background network and at the same time
show high similarity of phenotypic profiles. These approaches already exist for mapping
gene expression data onto protein interaction networks [74] and the same algorithms
could easily be applied to any other kind of high-dimensional phenotypic profiles
(see Figure 3B). Other approaches use data integration to construct potential pathways
from protein interactions and transcription factor binding data to relate perturbed
genes to the observed downstream effects [75]–[77].
Multiple Input - Multiple Output (MIMO) Models
Many of the approaches discussed so far—like clustering or graphical models—can be
applied to both static “snapshots” as well as dynamic time-course measurements. Another
approach to model specifically the dynamics of networks comes from a branch of control
theory called “systems identification” [78] and uses so called Multiple Input - Multiple
Output (MIMO) models. MIMO models represent the evolution of a perturbed cell over
time by linear differential equations [79]–[83] and can represent nonlinear effects
by transfer functions [84]. The models can be inferred by regression techniques in
the linear case [80] or Monte Carlo stochastic search in the nonlinear case [84].
The framework is very flexible and can incorporate single as well as combinatorial
perturbations.
Nested Effects Models (NEMs)
One of the key problems in analyzing perturbation screens is that the observed phenotypes
are downstream of the perturbed pathway and may not show the direct influence of one
pathway component on another. A class of models explicitly addressing this problem
are Nested Effects Models (NEMS) [33],[85]. They reconstruct pathway structure from
subset relations on the basis of the following rationale: Perturbing some genes may
have an influence on a global process, while perturbing others affects subprocesses
of it. Imagine, for example, a signaling pathway activating several transcription
factors. Blocking the entire pathway will most probably affect all targets of all
transcription factors, while perturbing a single transcription factor will only affect
its direct targets, which are a subset of the phenotype obtained by blocking the complete
pathway. Given high-dimensional phenotypes showing a subset structure, NEMs find the
most likely pathway topology explaining the data. They differ from other statistical
approaches like Bayesian networks by encoding subset relations instead of correlations
or other similarity measures. The theory of NEMs has been applied and extended in
several studies [86]–[89]. An implementation is available as an R/Bioconductor package
[90]. Other extensions to the NEM framework distinguish between activating and inhibiting
regulation [91] or include dynamic information from time-series measurements [92].
Discussion and Outlook
In this review we have discussed two main approaches to describe the reaction of a
cell to an experimental gene perturbation: low-dimensional phenotypes measure individual
reporters for cell viability or pathway activation, while high-dimensional phenotypes
show global effects on cell morphology, transcriptome, or proteome. Table 1 lists
examples of freely available software implementing some of these approaches. All of
them can be directly applied to gene perturbation screens, even though some of them
have been introduced in different contexts. While this review has focused on single
gene knock-outs and knock-downs, similar approaches can be applied to gene over-expression
screens [22],[83],[93],[94], drug treatment [84], environmental stresses changing
many genes [95],[96], or even natural genetic variation [97].
10.1371/journal.pcbi.1000655.t001
Table 1
Examples of software for network analysis of gene perturbation screens.
Method Name
Description with Reference
Web Page
General data analysis and network visualization
Bioconductor
Software environment for the analysis of genomic data featuring hundreds of contributed
packages [112]
www.bioconductor.org
Cytoscape
Software platform for visualizing molecular interaction networks and integrating them
with other data types [113]
www.cytoscape.org
Setting up data for network analysis
cellHTS2
End-to-end analysis of cell-based screens: from raw intensity readings to the annotated
hit list [6]
www.bioconductor.org
RNAither
Analysis of cell-based RNAi screens, includes quality assessment and customizable
normalization [7]
www.bioconductor.org
EBImage
Cell image analysis and feature extraction [27]
www.bioconductor.org
CellProfiler
Cell image analysis and feature extraction [26]
www.cellprofiler.org
Enrichment analysis
DAVID
Tools for data annotation, visualization, and integration [36]
david.abcc.ncifcrf.gov
GOLEM
Enrichment analysis and visualization of GO graph (Figure 2A) [37]
function.princeton.edu/GOLEM
Ontologizer
Enrichment analysis with dependencies between GO nodes (Figure 2A) [38]
compbio.charite.de/ontologizer
GSEA
Gene set enrichment analysis (Figure 2B) [35]
www.broadinstitute.org/gsea/
Clustering and ranking
Cell Profiler Analyst
Interactive exploration and analysis of multidimensional data from image-based experiments
[28]
www.cellprofiler.org
PhenoBlast
Ranking of phenotype profiles according to similarity with given profile [11]
www.rnai.org
Endeavour
Prioritizes hits for further analysis [58]
www.esat.kuleuven.be/endeavour/
Finding rich subnetworks
heinz
Finds optimal subnetworks rich in hits (Figure 3A) [55]
www.planet-lisa.net
jActiveModules
Finds heuristic subnetworks rich in hits (Figure 3A) [53]
www.cytoscape.org
Matisse
Finds subnetworks with high phenotypic similarity (Figure 3B) [74]
acgt.cs.tau.ac.il/matisse/
Network reconstruction
nem
NEMs reconstruct pathway features from subset relations in high-dim phenotypes [90]
www.bioconductor.org
copia
Copia uses MIMO models to reconstruct networks from perturbations [84]
cbio.mskcc.org/copia/
This list is far from comprehensive, but hopefully provides a starting point even
for noncoding experimentalists.
Predicting Phenotypes from Metabolic Networks
The focus of this review is on functionally annotating hits in a network context and
reconstructing networks from high-dimensional phenotypes. In a complementary direction
of research, genome-wide reconstructions of metabolic networks [98],[99] are used
to predict effects of gene perturbations. Instead of predicting networks from phenotypes,
these approaches predict phenotypes from networks. For example, in S. cerevisiae and
Escherichia coli computational models very accurately predict fitness effects of gene
knock-outs [100],[101] as well as compensatory rescue effects [102]. However, recent
developments in metabolic network modeling have led to linear programming algorithms
to extract relevant context-specific subnetworks of activity from a genome-wide network
[103],[104]. In the same way as the probabilistic data integration methods discussed
above, e.g., [74], these algorithms could be used in the future to find metabolic
subnetworks active under certain gene perturbations.
From Single to Combinatorial Perturbations
While single gene perturbation screens have been immensely successful in extending
our knowledge of pathway components and interactions, an important limitation can
be caused by compensatory effects, genetic buffering, and redundancy of cellular mechanisms
and pathways [105],[106]. This limitation can only be overcome by perturbing several
genes at the same time. The number of possible combinations grows rapidly and thus
current approaches are mainly limited to perturbing pairs of genes and observing low-dimensional
phenotypes like fitness estimates [107]. The analysis of combinatorial perturbations
is outside the scope of this review.
The End of the Screen is the Beginning of the Experiment
Global phenotyping and pathway screening can be combined in the same study. For example,
a first genome-wide screen identifies key genes representative for pathways and cellular
mechanisms involved in the phenotype. In a second step the hits of the first screen
could be assayed for high-dimensional molecular phenotypes to infer a pathway diagram
using NEMs or other statistical approaches.
In a further step preliminary pathway models could be used to plan an additional round
of experimentation. Different modeling frameworks propose future experiments to most
effectively refine a pathway hypothesis, e.g., Bayesian networks [108],[109], physical
network models [76], logical models [110], Boolean networks [111], and dynamical modeling
[79].
Iteratively integrating experimentation and computation may lead to a virtuous circle
and is one of the most promising approaches to refine our understanding of the inner
working of the cell.