      Data leakage inflates prediction performance in connectome-based machine learning models

      research-article


          Abstract

          Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage (involving feature selection, covariate correction, and dependence between subjects) on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
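
          To make feature-selection leakage concrete, here is a minimal sketch; it is not the authors' pipeline. It assumes synthetic pure-noise data (so any apparent predictive performance is spurious), a simple correlation-based feature filter, and closed-form ridge regression. All function names, sample sizes, and parameter values are hypothetical choices for illustration.

```python
import numpy as np

def top_k_features(X, y, k):
    """Indices of the k features with the largest absolute correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))[:k]

def ridge_fit_predict(X_train, y_train, X_test, alpha=10.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu_x
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y_train - mu_y))
    return (X_test - mu_x) @ w + mu_y

def cv_correlation(X, y, k, n_folds=5, leaky=False, seed=0):
    """Pearson r between observed and cross-validated predicted y."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_pred = np.empty_like(y)
    # Leaky variant: features are chosen once using ALL subjects,
    # so the held-out subjects influence which features are kept.
    global_idx = top_k_features(X, y, k) if leaky else None
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        # Proper variant: features are chosen from the training fold only.
        idx = global_idx if leaky else top_k_features(X[train], y[train], k)
        y_pred[test] = ridge_fit_predict(X[train][:, idx], y[train], X[test][:, idx])
    return np.corrcoef(y, y_pred)[0, 1]

# Hypothetical noise-only data: 100 "subjects", 5000 "edges", unrelated target.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5000))
y = rng.standard_normal(100)

r_leaky = cv_correlation(X, y, k=50, leaky=True)
r_proper = cv_correlation(X, y, k=50, leaky=False)
print(f"leaky selection:  r = {r_leaky:.2f}")
print(f"nested selection: r = {r_proper:.2f}")
```

          Because features and target are independent noise here, the nested variant should report a correlation near zero, while the leaky variant typically reports a clearly positive one. The only difference between the two is whether feature selection sees the held-out subjects.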

          Abstract

          The effects of data leakage on predictive models in neuroimaging studies are not well understood. Here, the authors show that data leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have more minor effects.
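
          Leakage through repeated subjects can be avoided by splitting on subject identity rather than on individual scans. The sketch below is one assumed way to do this with NumPy; the helper name `grouped_kfold` and the toy subject IDs are hypothetical, and libraries such as scikit-learn offer a comparable splitter (`GroupKFold`).

```python
import numpy as np

def grouped_kfold(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) with every subject's rows kept in one fold."""
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    shuffled_subjects = rng.permutation(np.unique(subject_ids))
    for fold_subjects in np.array_split(shuffled_subjects, n_folds):
        test_mask = np.isin(subject_ids, fold_subjects)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical dataset: 20 subjects, each contributing two scans (40 rows).
subject_ids = np.repeat(np.arange(20), 2)
splits = list(grouped_kfold(subject_ids))

for train_idx, test_idx in splits:
    shared = set(subject_ids[train_idx]) & set(subject_ids[test_idx])
    print(f"test rows: {len(test_idx):2d}, subjects in both sets: {len(shared)}")
```

          The same grouping idea applies to other forms of dependence between subjects, for example using a family ID in place of the subject ID.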


                Author and article information

                Contributors
                matthew.rosenblatt@yale.edu
                Journal
                Nat Commun
                Nature Communications
                Nature Publishing Group UK (London)
                2041-1723
                28 February 2024
                2024
                Volume: 15
                Article number: 1829
                Affiliations
                [1] Department of Biomedical Engineering, Yale University (https://ror.org/03v76x132), New Haven, CT, USA
                [2] Interdepartmental Neuroscience Program, Yale University (https://ror.org/03v76x132), New Haven, CT, USA
                [3] Department of Radiology & Biomedical Imaging, Yale School of Medicine (GRID grid.47100.32, ISNI 0000000419368710), New Haven, CT, USA
                [4] Department of Bioengineering, Northeastern University (https://ror.org/04t5xt781), Boston, MA, USA
                [5] Department of Psychology, Northeastern University (https://ror.org/04t5xt781), Boston, MA, USA
                [6] Child Study Center, Yale School of Medicine (GRID grid.47100.32, ISNI 0000000419368710), New Haven, CT, USA
                [7] Department of Statistics & Data Science, Yale University (https://ror.org/03v76x132), New Haven, CT, USA
                Author information
                http://orcid.org/0000-0002-3894-6198
                http://orcid.org/0000-0001-5643-9133
                http://orcid.org/0000-0003-4657-0079
                http://orcid.org/0000-0002-4804-5553
                http://orcid.org/0000-0002-6301-1167
                Article
                46150
                10.1038/s41467-024-46150-w
                10901797
                38418819
                e97fc34b-0f89-4430-8b47-368b512666c6
                © The Author(s) 2024

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 3 August 2023
                Accepted: 15 February 2024
                Funding
                Funded by: FundRef https://doi.org/10.13039/100000001, National Science Foundation (NSF);
                Award ID: DGE2139841
                Funded by: Gruber Science Fellowship
                Funded by: FundRef https://doi.org/10.13039/100000025, U.S. Department of Health & Human Services | NIH | National Institute of Mental Health (NIMH);
                Award ID: R00MH130894
                Award ID: R01MH121095
                Funded by: U.S. Department of Health & Human Services | NIH | National Institute of Mental Health (NIMH)
                Categories
                Article
                Custom metadata
                © Springer Nature Limited 2024

                Uncategorized
                cognitive neuroscience,neuroscience,computational science