128
views
0
recommends
+1 Recommend
1 collections
    12
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Complexity curve: a graphical measure of data complexity and classifier performance

      research-article

      Read this article at

      ScienceOpenPublisher
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of graphical plot, which we call complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing to significantly speed up the learning process without compromising classification accuracy. The associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).

          Most cited references30

          • Record: found
          • Abstract: found
          • Article: not found

          Statistical challenges of high-dimensional data.

          Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Complexity measures of supervised classification problems

            Tin Ho, M Basu (2002)
              Bookmark
              • Record: found
              • Abstract: not found
              • Article: not found

              The tail of the hypergeometric distribution

                Bookmark

                Author and article information

                Contributors
                Journal
                peerj-cs
                peerj-cs
                PeerJ Comput. Sci.
                PeerJ Computer Science
                PeerJ Comput. Sci.
                PeerJ Inc. (San Francisco, USA )
                2376-5992
                8 August 2016
                : 2
                : e76
                Affiliations
                [1 ]Centre of New Technologies, University of Warsaw , Warsaw, Poland
                [2 ]Institute of Computer Science, Polish Academy of Sciences , Warsaw, Poland
                Article
                cs-76
                10.7717/peerj-cs.76
                957d48e9-aff9-430b-ac2c-bd678de13c0f
                ©2016 Zubek and Plewczynski

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

                History
                : 9 March 2016
                : 4 July 2016
                Funding
                Funded by: European Union
                Award ID: UDA-POKL.04.01.01-00-051/10-00
                Funded by: Polish National Science Centre
                Award ID: 2015/16/T/ST6/00493
                Award ID: 2014/15/B/ST6/05082
                Award ID: 2013/09/B/NZ2/00121
                Funded by: EU COST
                Award ID: BM1405, BM1408
                The study is cofounded by the European Union from resources of the European Social Fund. Project PO KL “Information technologies: Research and their interdisciplinary applications,” Agreement UDA-POKL.04.01.01-00-051/10-00; Polish National Science Centre (grant numbers: 2015/16/T/ST6/00493 to JZ, 2014/15/B/ST6/05082 and 2013/09/B/NZ2/00121 to JZ and DP); EU COST BM1405 and BM1408 actions. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Algorithms and Analysis of Algorithms
                Artificial Intelligence
                Data Mining and Machine Learning

                Computer science
                Learning curves,Data complexity,Data pruning,Hellinger distance,Bias-variance decomposition,Performance measures

                Comments

                Comment on this article