iSeg: an efficient algorithm for segmentation of genomic and epigenomic data


Abstract

Background

Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems.

Results

We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first uses dynamic programming to identify candidate segments and test them for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during the searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. Because its objective function is based on the p-values of the segments, the algorithm can serve as a general computational framework that can be combined with different assumptions about the distribution of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values, we evaluate iSeg on multiple simulated and experimental datasets of different types and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also computationally efficient and well suited to large numbers of input profiles and to very long sequences.
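The idea of scoring candidate segments by p-value can be illustrated with a minimal Python sketch. It is not the iSeg implementation: the function names are hypothetical, a normal approximation stands in for the t-test, only a fixed set of window lengths is scanned, and a simple greedy pass replaces the dynamic programming and the coupled balanced binary trees described above. It assumes a profile whose background mean is zero.

```python
import math
from itertools import accumulate


def prefix_sums(values):
    """Cumulative sums with a leading 0, so any segment sum costs O(1)."""
    return [0.0] + list(accumulate(values))


def segment_p_value(prefix, start, end, sigma):
    """Two-sided p-value for values[start:end] having mean 0.

    Assumption: a normal approximation with a profile-wide noise
    estimate sigma is used in place of a proper t-distribution.
    """
    n = end - start
    mean = (prefix[end] - prefix[start]) / n
    z = mean * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))


def significant_segments(values, window_lengths, alpha):
    """Return non-overlapping (start, end, p) segments with p < alpha."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation of the whole profile (hypothetical noise estimate).
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    prefix = prefix_sums(values)

    # Score every window of each allowed length.
    candidates = []
    for length in window_lengths:
        for start in range(0, n - length + 1):
            p = segment_p_value(prefix, start, start + length, sigma)
            if p < alpha:
                candidates.append((p, start, start + length))

    # Greedy selection: most significant first, skipping any candidate
    # that overlaps a segment already kept.
    kept = []
    for p, start, end in sorted(candidates):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, p))
    return sorted(kept)


if __name__ == "__main__":
    background = [0.1 if i % 2 == 0 else -0.1 for i in range(20)]
    profile = background + [2.0] * 10 + background
    for start, end, p in significant_segments(profile, [5, 10], 1e-3):
        print(f"segment [{start}, {end}) p = {p:.2e}")
```

The linear overlap check in the greedy pass is the step that a balanced-tree index would speed up on long profiles; the sketch keeps it simple for readability.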

Conclusions

We have developed an efficient general-purpose segmentation tool and shown that its results are comparable to or more accurate than those of many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.

Electronic supplementary material

The online version of this article (10.1186/s12859-018-2140-3) contains supplementary material, which is available to authorized users.


Author and article information

Contributors

Jinfeng Zhang: jinfeng@stat.fsu.edu

Journal

Journal ID (nlm-ta, iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 11 April 2018

Publication date (PMC release): 11 April 2018

Publication date (Collection): 2018

Volume: 19

Electronic Location Identifier: 131

Affiliations

[1] ISNI 0000 0001 0647 2963, GRID grid.255962.f, Department of Mathematics, Florida Gulf Coast University, Fort Myers, FL, USA

[2] ISNI 0000 0004 0472 0419, GRID grid.255986.5, Department of Statistics, Florida State University, Tallahassee, FL, USA

[3] ISNI 0000 0004 0472 0419, GRID grid.255986.5, Center for Genomics and Personalized Medicine, Florida State University, Tallahassee, FL, USA

[4] ISNI 0000 0004 0472 0419, GRID grid.255986.5, Department of Biological Science, Florida State University, Tallahassee, FL, USA

Article

Publisher ID: 2140

DOI: 10.1186/s12859-018-2140-3

PMC ID: 5896135

PubMed ID: 29642840

SO-VID: 35dae74c-fdbf-415c-a951-ef61ec6db5de

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 5 September 2017

Date accepted: 26 March 2018

Funding

Funded by: FundRef http://dx.doi.org/10.13039/100000001, National Science Foundation

Award ID: IOS Award 1444532

Funded by: FundRef http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences

Award ID: R01GM126558

Award Recipient: Jinfeng Zhang

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

