
      Comparing methods for constructing and representing human pangenome graphs

      research-article


          Abstract

          Background

          As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs.

          Results

          In this work, we collect all publicly available high-quality human haplotypes and construct the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variation between input sequences, both in terms of overall graph structure and in the representation of specific genetic loci.
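          As a rough illustration of what it means for a graph to represent variation between input sequences, the sketch below builds a plain, uncompacted de Bruijn graph from two toy haplotypes that differ by one base. It is not the method of any tool named above (those use far more elaborate constructions), and the function names, sequences and value of k are hypothetical choices made only for this example.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any input."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer contributes one edge: its prefix node -> its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. points where the inputs diverge."""
    return sorted(node for node, succs in graph.items() if len(succs) > 1)


# Two toy haplotypes differing by a single base (G vs T); not real data.
haplotypes = ["ACGTGCATG", "ACGTTCATG"]
graph = build_de_bruijn(haplotypes, k=4)
branches = branching_nodes(graph)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
print("branching nodes:", branches)
```

          In this toy graph the two haplotypes share a path up to one node, split into two alternative paths around the differing base, and rejoin afterwards. That split-and-rejoin pattern is the kind of local structure in which graph representations can differ.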

          Conclusion

          This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.

          Supplementary Information

          The online version contains supplementary material available at 10.1186/s13059-023-03098-2.


                Author and article information

                Contributors
                francesco.andreace@pasteur.fr
                Journal
                Genome Biology (Genome Biol)
                Publisher: BioMed Central (London)
                ISSN: 1474-7596 (print), 1474-760X (electronic)
                Published: 30 November 2023
                Volume: 24
                Article number: 274
                Affiliations
                [1] Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015 France
                [2] Sorbonne Université, Collège doctoral (https://ror.org/02en5vm52), F-75005 Paris, France
                [3] Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, F-75015 Paris, France
                Author information
                http://orcid.org/0009-0008-0566-200X
                Article
                Publisher article ID: 3098
                DOI: 10.1186/s13059-023-03098-2
                PMC ID: 10691155
                PubMed ID: 38037131
                ScienceOpen record ID: 34c48411-ccf3-4bab-bad6-7bab1c97bafb
                © The Author(s) 2023

                Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                Received: 2 January 2023
                Accepted: 26 October 2023
                Funding
                Funded by: ANR Full-RNA
                Award ID: ANR-22-CE45-0007
                Funded by: SeqDigger
                Award ID: ANR-19-CE45-0008
                Funded by: Inception
                Award ID: PIA/ANR16-CONV-0005
                Funded by: PRAIRIE
                Award ID: ANR-19-P3IA-0001
                Funded by: H2020 Marie Skłodowska-Curie Actions (FundRef http://dx.doi.org/10.13039/100010665)
                Award ID: 956229
                Award ID: 872539
                Categories
                Research
                Custom metadata
                © BioMed Central Ltd., part of Springer Nature 2023

                Genetics
                pangenomics, de Bruijn graphs, variation graphs, sequence analysis, algorithms
