      GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

      The International Journal of High Performance Computing Applications
      SAGE Publications

          Abstract

          We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole-genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.

                Author and article information

                Journal
                The International Journal of High Performance Computing Applications
                SAGE Publications
                ISSN: 1094-3420, 1741-2846
                November 2023
                October 27, 2023
                Volume: 37
                Issue: 6
                Pages: 683-705
                Affiliations
                [1] Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, USA
                [2] Department of Computer Science, University of Chicago, Hyde Park, IL, USA
                [3] NVIDIA Inc., Santa Clara, CA, USA
                [4] Harvard University, Cambridge, MA, USA
                [5] Cerebras Inc., San Jose, CA, USA
                [6] Computer Science Department, Northern Illinois University, DeKalb, IL, USA
                [7] New York University, New York, NY, USA
                [8] Department of Biochemistry, University of Illinois-Urbana Champaign, Champaign, IL, USA
                [9] Argonne Leadership Computing Facility, Argonne National Laboratory, Lemont, IL, USA
                [10] Computer Science Department, Technical University of Munich, Munich, Germany
                [11] Computer Science Department, University of Illinois Chicago, Chicago, IL, USA
                [12] Computing, Environment and Life Sciences Directorate, Argonne National Laboratory, Lemont, IL, USA
                [13] Computer Science Department, California Institute of Technology, Pasadena, CA, USA
                Article
                DOI: 10.1177/10943420231201154
                © 2023

                http://journals.sagepub.com/page/policies/text-and-data-mining-license
