
      Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

      Preprint

          Abstract

          It is widely observed that deep learning models with learned parameters generalize well, even when the number of model parameters far exceeds the number of training samples. We systematically investigate why deep neural networks often generalize well, and reveal the difference between minima (with the same training error) that generalize well and those that do not. We show that it is the characteristics of the loss landscape that explain the good generalization capability. In the loss landscape of deep networks, the volume of the basin of attraction of good minima dominates that of poor minima, which guarantees that optimization methods with random initialization converge to good minima. We justify our findings theoretically by analyzing 2-layer neural networks, and show that low-complexity solutions have a small norm of the Hessian matrix with respect to the model parameters. For deeper networks, extensive numerical evidence supports our arguments.
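          The link the abstract draws between good minima and a small Hessian norm is easy to probe numerically. Below is a minimal sketch, not the authors' code: the toy data, the architecture, and the helper name hessian_top_eigenvalue are illustrative. It trains a small 2-layer network and then estimates the sharpness of the resulting minimum as the largest Hessian eigenvalue of the training loss, obtained by power iteration on Hessian-vector products.

```python
# Hedged sketch (not the authors' code): estimate the sharpness of a trained
# minimum of a small 2-layer network as the top Hessian eigenvalue of the loss.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and a small 2-layer network (the setting the paper analyzes in theory).
X = torch.randn(256, 10)
y = torch.randn(256, 1)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def hessian_top_eigenvalue(model, loss, n_iter=50):
    """Power iteration on the Hessian of `loss` w.r.t. the model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iter):
        # Normalize the probe vector.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v approaches the largest eigenvalue.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig

# Train to a (near-)minimum, then probe its sharpness.
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()

loss = loss_fn(net(X), y)
print("top Hessian eigenvalue at the found minimum:", hessian_top_eigenvalue(net, loss))
```

          Comparing this estimate across minima found from different random initializations is one simple way to check whether optimization indeed lands in flat (small-Hessian-norm) basins.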


          Most cited references (3)


          Flat Minima


            The Importance of Complexity in Model Selection.

            Model selection should be based not solely on goodness-of-fit, but must also consider model complexity. While the goal of mathematical modeling in cognitive psychology is to select one model from a set of competing models that best captures the underlying mental process, choosing the model that best fits a particular set of data will not achieve this goal. This is because a highly complex model can provide a good fit without necessarily bearing any interpretable relationship with the underlying process. It is shown that model selection based solely on the fit to observed data will result in the choice of an unnecessarily complex model that overfits the data, and thus generalizes poorly. The effect of over-fitting must be properly offset by model selection methods. An application example of selection methods using artificial data is also presented. Copyright 2000 Academic Press.
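            As a hedged illustration of this point (not taken from the cited paper, which treats selection methods more generally), the snippet below fits polynomials of increasing degree to noisy data and scores them with AIC, a standard criterion that offsets goodness-of-fit with a complexity penalty; the data and degrees are made up for the example.

```python
# Hedged illustration (not from the cited paper): complexity-penalized model
# selection. Raw fit keeps improving with model complexity, AIC does not.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(x.size)   # true signal + noise

def aic(y_true, y_pred, k):
    """AIC for Gaussian errors: n*log(RSS/n) + 2k, with k fitted parameters."""
    n = y_true.size
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * k

for degree in range(1, 12):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    rss = np.sum((y - y_hat) ** 2)
    print(f"degree={degree:2d}  RSS={rss:6.3f}  AIC={aic(y, y_hat, degree + 1):7.2f}")

# RSS keeps falling as the degree grows (better fit), while AIC bottoms out at a
# moderate degree: the extra complexity of high-degree fits is not worth the gain.
```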

              Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

              In artificial neural networks, learning from data is a computationally demanding task in which a large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic processes over a cost-function. It is not well understood how learning occurs in these systems, in particular how they avoid getting trapped in configurations with poor computational performance. Here we study the difficult case of networks with discrete weights, where the optimization landscape is very rough even for simple architectures, and provide theoretical and numerical evidence of the existence of rare - but extremely dense and accessible - regions of configurations in the network weight space. We define a novel measure, which we call the "robust ensemble" (RE), which suppresses trapping by isolated configurations and amplifies the role of these dense regions. We analytically compute the RE in some exactly solvable models, and also provide a general algorithmic scheme which is straightforward to implement: define a cost-function given by a sum of a finite number of replicas of the original cost-function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful new algorithms, ranging from Markov Chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. The weak dependence on the number of precision bits of the weights leads us to conjecture that very similar reasoning applies to more conventional neural networks. Analogous algorithmic schemes can also be applied to other optimization problems.
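              The replicated cost function described above can be sketched in a few lines. The following is a hedged illustration, not the paper's reference implementation: R replica parameter vectors are trained on a simple least-squares loss while an elastic coupling pulls them toward a shared center ("driving") configuration; the loss, the coupling strength gamma, and the data are all illustrative choices.

```python
# Hedged sketch of a replicated cost function with a centering constraint.
# Objective: sum_r [ L(w_r) + (gamma/2) * ||w_r - center||^2 ], minimized jointly
# over the replicas w_r and the center by plain gradient descent.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)

def loss_grad(w):
    """Gradient of the original cost 0.5 * ||A w - b||^2."""
    return A.T @ (A @ w - b)

R, gamma, lr = 5, 0.1, 1e-3
replicas = [rng.standard_normal(20) for _ in range(R)]
center = np.mean(replicas, axis=0)

for step in range(2000):
    # Each replica follows its own loss gradient plus an attraction to the center.
    for r in range(R):
        replicas[r] -= lr * (loss_grad(replicas[r]) + gamma * (replicas[r] - center))
    # The center moves toward the mean of the replicas.
    center += lr * gamma * sum(w - center for w in replicas)

print("spread of replicas around the center:",
      float(np.mean([np.linalg.norm(w - center) for w in replicas])))
```

              The coupling term is what biases the search toward wide, dense regions of good configurations rather than isolated minima, which is the effect the robust ensemble measure is designed to amplify.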

                Author and article information

                Date: 2017-06-30
                Article: arXiv 1706.10239
                Record ID: 28753e06-9cd8-4651-8398-546e4b865f37
                License: http://creativecommons.org/publicdomain/zero/1.0/

                Categories: cs.LG, cs.AI, stat.ML
                Keywords: Machine learning, Artificial intelligence
