      Benchmarking atlas-level data integration in single-cell genomics

      research-article

          Abstract

          Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.

          Abstract

          This benchmarking study compares 16 methods for integrating complex single-cell RNA and ATAC datasets and provides a guide to method choice.

                Author and article information

                Contributors
                maria.colome@bmc.med.lmu.de
                fabian.theis@helmholtz-muenchen.de
                Journal
                Nature Methods (Nat Methods)
                Nature Publishing Group US (New York)
                ISSN: 1548-7091, 1548-7105
                Published: 23 December 2021
                2022; 19(1): 41-50
                Affiliations
                [1] GRID grid.4567.0, ISNI 0000 0004 0483 2525, Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
                [2] GRID grid.5949.1, ISNI 0000 0001 2172 9288, Institute of Medical Informatics, University of Münster, Münster, Germany
                [3] GRID grid.6936.a, ISNI 0000 0001 2322 2966, Department of Mathematics, Technische Universität München, Garching bei München, München, Germany
                [4] GRID grid.5253.1, ISNI 0000 0001 0328 4908, Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany
                [5] GRID grid.6936.a, ISNI 0000 0001 2322 2966, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
                [6] GRID grid.5252.0, ISNI 0000 0004 1936 973X, Biomedical Center (BMC), Physiological Chemistry, Faculty of Medicine, Ludwig Maximilian University of Munich, Planegg-Martinsried, Germany
                Author information
                http://orcid.org/0000-0001-7464-7921
                http://orcid.org/0000-0002-6189-3792
                http://orcid.org/0000-0002-8123-3409
                http://orcid.org/0000-0002-2419-1943
                Article
                Article number: 1336
                DOI: 10.1038/s41592-021-01336-8
                PMCID: PMC8748196
                PMID: 34949812
                f3c3acaf-9db6-4356-a6ed-958f4dd14f36
                © The Author(s) 2021

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 3 June 2020
                Accepted: 1 November 2021
                Funding
                Funded by: FundRef https://doi.org/10.13039/501100009318, Helmholtz Association
                Award ID: ExNet-0041-Phase2-3 (SyNergy-HMGU)
                Award ID: ZT-I-0007 sparse2big
                Award ID: Initiative and Network Fund
                Funded by: FundRef https://doi.org/10.13039/100010661, EC | Horizon 2020 Framework Programme (EU Framework Programme for Research and Innovation H2020)
                Award ID: 874656
                Funded by: FundRef https://doi.org/10.13039/100000923, Silicon Valley Community Foundation (SVCF)
                Award ID: 182835
                Funded by: FundRef https://doi.org/10.13039/100004440, Wellcome Trust (Wellcome)
                Award ID: 108413/A/15/D
                Funded by: Chan Zuckerberg foundation (grant #2019-002438, Human Lung Cell Atlas 1.0); Bavarian Ministry of Science and the Arts in the framework of the Bavarian Research Association “ForInter”
                Categories
                Analysis
                Custom metadata
                © The Author(s), under exclusive licence to Springer Nature America, Inc. 2022

                Life sciences
                machine learning, data integration, software, transcriptomics