
      An innovative approach of determining the sample data size for machine learning models: a case study on health and safety management for infrastructure workers

      Electronic Research Archive
      American Institute of Mathematical Sciences (AIMS)


          Abstract

          Numerical experiments are an essential part of academic studies in transportation management. Choosing an appropriate sample size for such experiments saves both data collection cost and computing time, yet few studies have examined how to determine it. In this research, we apply four typical machine learning regression models to a dataset collected from transport infrastructure workers to explore the appropriate sample size. From 12 learning curves, we conclude that a sample size of 250 balances model performance against the cost of data collection. Our study can serve as a reference for deciding in advance how many samples to collect.
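
          The learning-curve idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes synthetic one-feature data, swaps in ordinary least squares for the four regression models, and uses a hypothetical 5% tolerance rule to pick a size. All function names and parameters below are illustrative.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b (stand-in for the paper's models)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return (sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def learning_curve(train, test, sizes, repeats=30, seed=0):
    """Mean held-out RMSE for each training size, averaged over random subsamples."""
    rng = random.Random(seed)
    test_x, test_y = zip(*test)
    curve = []
    for n in sizes:
        errors = []
        for _ in range(repeats):
            xs, ys = zip(*rng.sample(train, n))
            errors.append(rmse(fit_line(xs, ys), test_x, test_y))
        curve.append((n, statistics.fmean(errors)))
    return curve


def smallest_adequate_size(curve, tolerance=0.05):
    """Smallest size whose error is within `tolerance` (relative) of the best
    error on the curve. The 5% default is an assumption, not from the paper."""
    best = min(err for _, err in curve)
    for n, err in curve:
        if err <= best * (1 + tolerance):
            return n


def make_data(n, rng):
    """Synthetic data: y = 2x + 1 plus Gaussian noise (purely illustrative)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)))
    return data


rng = random.Random(42)
train, test = make_data(1000, rng), make_data(500, rng)
sizes = [10, 25, 50, 100, 250, 500, 1000]
curve = learning_curve(train, test, sizes)
for n, err in curve:
    print(f"n={n:5d}  mean test RMSE={err:.4f}")
print("smallest adequate size:", smallest_adequate_size(curve))
```

          On real data the curve would be computed once per model and per error metric, and the chosen size depends on the tolerance and on the cost of collecting each sample; the synthetic example is not expected to reproduce the size reported in the abstract.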


          Most cited references (45)


          Practical selection of SVM parameters and noise estimation for SVM regression.

          We investigate practical selection of hyper-parameters for support vector machines (SVM) regression (that is, epsilon-insensitive zone and regularization parameter C). The proposed methodology advocates analytic parameter selection directly from the training data, rather than re-sampling approaches commonly used in SVM applications. In particular, we describe a new analytical prescription for setting the value of insensitive zone epsilon, as a function of training sample size. Good generalization performance of the proposed parameter selection is demonstrated empirically using several low- and high-dimensional regression problems. Further, we point out the importance of Vapnik's epsilon-insensitive loss for regression problems with finite samples. To this end, we compare generalization performance of SVM regression (using proposed selection of epsilon-values) with regression using 'least-modulus' loss (epsilon=0) and standard squared loss. These comparisons indicate superior generalization performance of SVM regression under sparse sample settings, for various types of additive noise.

            Sample Size Justification

            An important step when designing an empirical study is to justify the sample size that will be collected. The key aim of a sample size justification for such studies is to explain how the collected data is expected to provide valuable information given the inferential goals of the researcher. In this overview article six approaches are discussed to justify the sample size in a quantitative empirical study: 1) collecting data from (almost) the entire population, 2) choosing a sample size based on resource constraints, 3) performing an a-priori power analysis, 4) planning for a desired accuracy, 5) using heuristics, or 6) explicitly acknowledging the absence of a justification. An important question to consider when justifying sample sizes is which effect sizes are deemed interesting, and the extent to which the data that is collected informs inferences about these effect sizes. Depending on the sample size justification chosen, researchers could consider 1) what the smallest effect size of interest is, 2) which minimal effect size will be statistically significant, 3) which effect sizes they expect (and what they base these expectations on), 4) which effect sizes would be rejected based on a confidence interval around the effect size, 5) which ranges of effects a study has sufficient power to detect based on a sensitivity power analysis, and 6) which effect sizes are expected in a specific research area. Researchers can use the guidelines presented in this article, for example by using the interactive form in the accompanying online Shiny app, to improve their sample size justification, and hopefully, align the informational value of a study with their inferential goals.

              The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features


                Author and article information

                Journal
                Electronic Research Archive
                American Institute of Mathematical Sciences (AIMS)
                ISSN: 2688-1594
                Year: 2022
                Volume: 30
                Issue: 9
                Pages: 3452-3462
                Affiliations
                [1] Faculty of Business, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
                [2] Department of Building and Real Estate, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
                Article
                DOI: 10.3934/era.2022176
                © 2022