      excluderanges: exclusion sets for T2T-CHM13, GRCm39, and other genome assemblies

      brief-report

          Abstract

          Summary

          Exclusion regions are sections of reference genomes with abnormal pileups of short sequencing reads. Removing reads that overlap them improves biological signal, and the benefit is most pronounced in differential analysis settings. Several labs have created exclusion region sets, available primarily through ENCODE and GitHub. However, the variety of exclusion sets creates uncertainty about which sets to use. Furthermore, gap regions (e.g. centromeres, telomeres, short arms) create additional considerations in generating exclusion sets. We generated exclusion sets for the latest human T2T-CHM13 and mouse GRCm39 genomes and systematically assembled and annotated these and other sets in the excluderanges R/Bioconductor data package, also accessible via the BEDbase.org API. The package provides unified access to 82 GenomicRanges objects covering six organisms, multiple genome assemblies, and types of exclusion regions. For the human hg38 genome assembly, we recommend hg38.Kundaje.GRCh38_unified_blacklist as the most well-curated and annotated set; for other organisms, we recommend sets generated by the Blacklist tool.
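The filtering step the summary describes, dropping reads that overlap an exclusion region, can be illustrated with a minimal sketch. This is not the excluderanges API (the package is an R/Bioconductor data package). It is a plain-Python illustration that assumes BED-style, 0-based half-open coordinates, and every coordinate, chromosome name, and function name below is hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(exclusions):
    """Group exclusion intervals by chromosome, sort them, and merge overlaps.

    Intervals are (chrom, start, end) tuples, assumed 0-based and half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exclusions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:  # overlapping or adjacent: extend the last one
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def overlaps_exclusion(index, chrom, start, end):
    """Return True if the half-open read [start, end) overlaps any exclusion."""
    intervals = index.get(chrom)
    if not intervals:
        return False
    # Find the last merged interval that starts before the read ends.
    i = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][1] > start


# Hypothetical exclusion regions and read alignments (illustrative values only).
exclusions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
reads = [
    ("chr1", 90, 100),   # ends exactly where the exclusion starts: kept
    ("chr1", 95, 101),   # overlaps by one base: removed
    ("chr1", 299, 350),  # overlaps the last excluded base: removed
    ("chr1", 300, 350),  # starts exactly where the exclusion ends: kept
    ("chr2", 10, 20),    # same chromosome, no overlap: kept
    ("chr3", 55, 58),    # chromosome with no exclusions: kept
]

index = build_index(exclusions)
kept = [r for r in reads if not overlaps_exclusion(index, *r)]
print(kept)
```

Merging the exclusion intervals first keeps each lookup to a single binary search per read. In practice, an interval library such as GenomicRanges in R would typically replace this hand-rolled index.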


                Author and article information

                Contributors
                Role: Associate Editor
                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                April 2023
                17 April 2023
                Volume: 39
                Issue: 4
                Article: btad198
                Affiliations
                Department of Biostatistics, Virginia Commonwealth University , Richmond, VA 23298, United States
                Department of Biostatistics, University of North Carolina-Chapel Hill , Chapel Hill, NC 27514, United States
                Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599, United States
                Center for Public Health Genomics, University of Virginia , Charlottesville, VA 22908, United States
                Department of Pathology, Virginia Commonwealth University , Richmond, VA 23284, United States
                Massey Cancer Center, Virginia Commonwealth University , Richmond, VA 23220, United States
                Center for Public Health Genomics, University of Virginia , Charlottesville, VA 22908, United States
                Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599, United States
                Thurston Arthritis Research Center, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599, United States
                Department of Cell Biology and Physiology, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599, United States
                Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599, United States
                Curriculum in Genetics and Molecular Biology, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599, United States
                Department of Biostatistics, University of North Carolina-Chapel Hill , Chapel Hill, NC 27514, United States
                Department of Genetics, University of North Carolina-Chapel Hill , Chapel Hill, NC 27514, United States
                Department of Biostatistics, Virginia Commonwealth University , Richmond, VA 23298, United States
                Department of Pathology, Virginia Commonwealth University , Richmond, VA 23284, United States
                Author notes
                Corresponding author. Department of Biostatistics, Virginia Commonwealth University, 830 East Main Street, Richmond, VA 23219, United States. E-mail: mdozmorov@vcu.edu
                Author information
                https://orcid.org/0000-0003-4051-3217
                https://orcid.org/0000-0003-3541-8418
                https://orcid.org/0000-0003-2123-0051
                https://orcid.org/0000-0001-8401-0545
                https://orcid.org/0000-0002-0086-8358
                Article
                btad198
                10.1093/bioinformatics/btad198
                10126321
                37067481
                a98557a9-7caf-4df1-961d-eaf8a7a3c379
                © The Author(s) 2023. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 26 November 2022
                : 16 February 2023
                : 12 April 2023
                : 24 April 2023
                Page count
                Pages: 3
                Funding
                Funded by: George and Lavinia Blick Research Scholarship;
                Categories
                Applications Note
                Databases and Ontologies
                AcademicSubjects/SCI01060

                Bioinformatics & Computational biology
