3
views
0
recommends
+1 Recommend
1 collections
    0
    shares

      Submit your digital health research with an established publisher
      - celebrating 25 years of open access

      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Large Language Models for Epidemiological Research via Automated Machine Learning: Case Study Using Data From the British National Child Development Study

      research-article

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          Large language models have had a huge impact on natural language processing (NLP) in recent years. However, their application in epidemiological research is still limited to the analysis of electronic health records and social media data.

          Objectives

          To demonstrate the potential of NLP beyond these domains, we aimed to develop prediction models based on texts collected from an epidemiological cohort and compare their performance to classical regression methods.

          Methods

          We used data from the British National Child Development Study, where 10,567 children aged 11 years wrote essays about how they imagined themselves as 25-year-olds. Overall, 15% of the data set was set aside as a test set for performance evaluation. Pretrained language models were fine-tuned using AutoTrain (Hugging Face) to predict current reading comprehension score (range: 0-35) and future BMI and physical activity (active vs inactive) at the age of 33 years. We then compared their predictive performance (accuracy or discrimination) with linear and logistic regression models, including demographic and lifestyle factors of the parents and children from birth to the age of 11 years as predictors.

          Results

          NLP clearly outperformed linear regression when predicting reading comprehension scores (root mean square error: 3.89, 95% CI 3.74-4.05 for NLP vs 4.14, 95% CI 3.98-4.30 and 5.41, 95% CI 5.23-5.58 for regression models with and without general ability score as a predictor, respectively). Predictive performance for physical activity was similarly poor for the 2 methods (area under the receiver operating characteristic curve: 0.55, 95% CI 0.52-0.60 for both) but was slightly better than random assignment, whereas linear regression clearly outperformed the NLP approach when predicting BMI (root mean square error: 4.38, 95% CI 4.02-4.74 for NLP vs 3.85, 95% CI 3.54-4.16 for regression). The NLP approach did not perform better than simply assigning the mean BMI from the training set as a predictor.

          Conclusions

          Our study demonstrated the potential of using large language models on text collected from epidemiological studies. The performance of the approach appeared to depend on how directly the topic of the text was related to the outcome. Open-ended questions specifically designed to capture certain health concepts and lived experiences in combination with NLP methods should receive more attention in future epidemiological studies.

          Related collections

          Most cited references34

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          The FAIR Guiding Principles for scientific data management and stewardship

          There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.
            Bookmark
            • Record: found
            • Abstract: not found
            • Conference Proceedings: not found

            "Why Should I Trust You?"

              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models

              The objective of this study was to compare performance of logistic regression (LR) with machine learning (ML) for clinical prediction modeling in the literature.
                Bookmark

                Author and article information

                Journal
                JMIR Med Inform
                JMIR Med Inform
                medinform
                7
                JMIR Medical Informatics
                JMIR Publications Inc
                2291-9694
                2023
                19 September 2023
                : 11
                : e43638
                Affiliations
                [1] Steno Diabetes Center Copenhagen , Herlev, Denmark
                [2] Department of Public Health, Aarhus University , Aarhus, Denmark
                [3] Steno Diabetes Center Aarhus, Aarhus University Hospital , Aarhus, Denmark
                Author notes
                Correspondence to Adam Hulman, PhD ADAHUL@ 123456rm.dk
                Article
                43638
                10.2196/43638
                10547934
                37787655
                b044da7d-8bf1-43c4-813b-b1a8bfc06377
                © Rasmus Wibaek, Gregers Stig Andersen, Christina C Dahm, Daniel R Witte, Adam Hulman. Originally published in JMIR Medical Informatics ( https://medinform.jmir.org), 19.9.2023.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

                History
                : 21 October 2022
                : 29 June 2023
                : 22 July 2023
                Categories
                Original Paper

                natural language processing,deep learning,language model,epidemiology,cohort study,prediction,nlp,prediction model,child development,regression model,large language model,llm

                Comments

                Comment on this article