0
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Towards the genomic sequence code of DNA fragility for machine learning

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Genomic DNA breakages and the subsequent insertion and deletion mutations are important contributors to genome instability and linked diseases. Unlike the research in point mutations, the relationship between DNA sequence context and the propensity for strand breaks remains elusive. Here, by analyzing the differences and commonalities across myriads of genomic breakage datasets, we extract the sequence-linked rules and patterns behind DNA fragility. We show the overall deconvolution of the sequence influence into short-, mid- and long-range effects, and the stressor-dependent differences in defining the range and compositional effects on DNA fragility. We summarize and release our feature compendium as a library that can be seamlessly incorporated into genomic machine learning procedures, where DNA fragility is of concern, and train a generalized DNA fragility model on cancer-associated breakages. Structural variants (SVs) tend to stabilize regions in which they emerge, with the effect most pronounced for pathogenic SVs. In contrast, the effects of chromothripsis are seen across regions less prone to breakages. We find that viral integration may bring genome fragility, particularly for cancer-associated viruses. Overall, this work offers novel insights into the genomic sequence basis of DNA fragility and presents a powerful machine learning resource to further enhance our understanding of genome (in)stability and evolution.

          Graphical Abstract

          Graphical Abstract

          Related collections

          Most cited references85

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Highly accurate protein structure prediction with AlphaFold

          Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort 1 – 4 , the structures of around 100,000 unique proteins have been determined 5 , but this represents a small fraction of the billions of known protein sequences 6 , 7 . Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ 8 —has been an important open research problem for more than 50 years 9 . Despite recent progress 10 – 14 , existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) 15 , demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            The Human Transcription Factors

            Transcription factors (TFs) recognize specific DNA sequences to control chromatin and transcription, forming a complex system that guides expression of the genome. Despite keen interest in understanding how TFs control gene expression, it remains challenging to determine how the precise genomic binding sites of TFs are specified and how TF binding ultimately relates to regulation of transcription. This review considers how TFs are identified and functionally characterized, principally through the lens of a catalog of over 1,600 likely human TFs and binding motifs for two-thirds of them. Major classes of human TFs differ markedly in their evolutionary trajectories and expression patterns, underscoring distinct functions. TFs likewise underlie many different aspects of human physiology, disease, and variation, highlighting the importance of continued effort to understand TF-mediated gene regulation.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              The complete sequence of a human genome*

              Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion base pair (bp) sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million bp of sequence containing 1,956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies. Twenty years after the initial drafts, a truly complete sequence of a human genome reveals what has been missing.
                Bookmark

                Author and article information

                Contributors
                Journal
                Nucleic Acids Res
                Nucleic Acids Res
                nar
                Nucleic Acids Research
                Oxford University Press
                0305-1048
                1362-4962
                27 November 2024
                23 October 2024
                23 October 2024
                : 52
                : 21
                : 12798-12816
                Affiliations
                MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford , Oxford, OX3 9DS, UK
                MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford , Oxford, OX3 9DS, UK
                MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford , Oxford, OX3 9DS, UK
                MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine, University of Oxford , Oxford, OX3 9DS, UK
                Author notes
                To whom correspondence should be addressed. Tel: +44 1865 222407; Email: aleksandr.sahakyan@ 123456imm.ox.ac.uk
                Author information
                https://orcid.org/0000-0002-0847-6391
                https://orcid.org/0000-0002-0468-2091
                https://orcid.org/0000-0001-8499-1101
                https://orcid.org/0000-0002-8343-3594
                Article
                gkae914
                10.1093/nar/gkae914
                11602142
                39441076
                e268c1c9-5d1a-4488-bb6d-9a10e4c465bb
                © The Author(s) 2024. Published by Oxford University Press on behalf of Nucleic Acids Research.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 2 October 2024
                : 20 September 2024
                : 18 March 2024
                Page count
                Pages: 19
                Funding
                Funded by: UK Medical Research Council, DOI 10.13039/501100000265;
                Award ID: MC_UU_12025
                Funded by: Oxford University, DOI 10.13039/501100000769;
                Categories
                AcademicSubjects/SCI00010
                Computational Biology

                Genetics
                Genetics

                Comments

                Comment on this article