
      Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

      Preprint

          Abstract

          It is widely observed that deep learning models with learned parameters generalize well, even when the number of model parameters far exceeds the number of training samples. We systematically investigate why deep neural networks often generalize well, and reveal the difference between minima (with the same training error) that generalize well and those that do not. We show that it is the characteristics of the loss landscape that explain the good generalization capability. In the loss landscape of deep networks, the volume of the basin of attraction of good minima dominates that of poor minima, which guarantees that optimization methods with random initialization converge to good minima. We justify our findings theoretically by analyzing 2-layer neural networks, and show that low-complexity solutions have a small norm of the Hessian matrix with respect to the model parameters. For deeper networks, extensive numerical evidence supports our arguments.
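          The link the abstract draws between good minima and a small Hessian norm is easy to probe numerically. Below is a minimal sketch, not the authors' code: the toy data, the architecture, and the helper name hessian_top_eigenvalue are illustrative. It trains a small 2-layer network and then estimates the sharpness of the resulting minimum as the largest Hessian eigenvalue of the training loss, obtained by power iteration on Hessian-vector products.

```python
# Hedged sketch (not the authors' code): estimate the sharpness of a trained
# minimum of a small 2-layer network as the top Hessian eigenvalue of the loss.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and a small 2-layer network (the setting the paper analyzes in theory).
X = torch.randn(256, 10)
y = torch.randn(256, 1)
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def hessian_top_eigenvalue(model, loss, n_iter=50):
    """Power iteration on the Hessian of `loss` w.r.t. the model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iter):
        # Normalize the probe vector.
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v approaches the largest eigenvalue.
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig

# Train to a (near-)minimum, then probe its sharpness.
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()

loss = loss_fn(net(X), y)
print("top Hessian eigenvalue at the found minimum:", hessian_top_eigenvalue(net, loss))
```

          Comparing this estimate across minima found from different random initializations is one simple way to check whether optimization indeed lands in flat (small-Hessian-norm) basins.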


          Most cited references (3)


          Flat Minima


            The Importance of Complexity in Model Selection.

            Model selection should be based not solely on goodness-of-fit, but must also consider model complexity. While the goal of mathematical modeling in cognitive psychology is to select one model from a set of competing models that best captures the underlying mental process, choosing the model that best fits a particular set of data will not achieve this goal. This is because a highly complex model can provide a good fit without necessarily bearing any interpretable relationship with the underlying process. It is shown that model selection based solely on the fit to observed data will result in the choice of an unnecessarily complex model that overfits the data, and thus generalizes poorly. The effect of over-fitting must be properly offset by model selection methods. An application example of selection methods using artificial data is also presented. Copyright 2000 Academic Press.
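            As a hedged illustration of this point (not taken from the cited paper, which treats selection methods more generally), the snippet below fits polynomials of increasing degree to noisy data and scores them with AIC, a standard criterion that offsets goodness-of-fit with a complexity penalty; the data and degrees are made up for the example.

```python
# Hedged illustration (not from the cited paper): complexity-penalized model
# selection. Raw fit keeps improving with model complexity, AIC does not.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(x.size)   # true signal + noise

def aic(y_true, y_pred, k):
    """AIC for Gaussian errors: n*log(RSS/n) + 2k, with k fitted parameters."""
    n = y_true.size
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * k

for degree in range(1, 12):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    rss = np.sum((y - y_hat) ** 2)
    print(f"degree={degree:2d}  RSS={rss:6.3f}  AIC={aic(y, y_hat, degree + 1):7.2f}")

# RSS keeps falling as the degree grows (better fit), while AIC bottoms out at a
# moderate degree: the extra complexity of high-degree fits is not worth the gain.
```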

              Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes

              In artificial neural networks, learning from data is a computationally demanding task in which a large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic processes over a cost-function. It is not well understood how learning occurs in these systems, in particular how they avoid getting trapped in configurations with poor computational performance. Here we study the difficult case of networks with discrete weights, where the optimization landscape is very rough even for simple architectures, and provide theoretical and numerical evidence of the existence of rare - but extremely dense and accessible - regions of configurations in the network weight space. We define a novel measure, which we call the "robust ensemble" (RE), which suppresses trapping by isolated configurations and amplifies the role of these dense regions. We analytically compute the RE in some exactly solvable models, and also provide a general algorithmic scheme which is straightforward to implement: define a cost-function given by a sum of a finite number of replicas of the original cost-function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful new algorithms, ranging from Markov Chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. The weak dependence on the number of precision bits of the weights leads us to conjecture that very similar reasoning applies to more conventional neural networks. Analogous algorithmic schemes can also be applied to other optimization problems.
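              The replicated cost function described above can be sketched in a few lines. The following is a hedged illustration, not the paper's reference implementation: R replica parameter vectors are trained on a simple least-squares loss while an elastic coupling pulls them toward a shared center ("driving") configuration; the loss, the coupling strength gamma, and the data are all illustrative choices.

```python
# Hedged sketch of a replicated cost function with a centering constraint.
# Objective: sum_r [ L(w_r) + (gamma/2) * ||w_r - center||^2 ], minimized jointly
# over the replicas w_r and the center by plain gradient descent.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)

def loss_grad(w):
    """Gradient of the original cost 0.5 * ||A w - b||^2."""
    return A.T @ (A @ w - b)

R, gamma, lr = 5, 0.1, 1e-3
replicas = [rng.standard_normal(20) for _ in range(R)]
center = np.mean(replicas, axis=0)

for step in range(2000):
    # Each replica follows its own loss gradient plus an attraction to the center.
    for r in range(R):
        replicas[r] -= lr * (loss_grad(replicas[r]) + gamma * (replicas[r] - center))
    # The center moves toward the mean of the replicas.
    center += lr * gamma * sum(w - center for w in replicas)

print("spread of replicas around the center:",
      float(np.mean([np.linalg.norm(w - center) for w in replicas])))
```

              The coupling term is what biases the search toward wide, dense regions of good configurations rather than isolated minima, which is the effect the robust ensemble measure is designed to amplify.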

                Author and article information

                Date: 2017-06-30
                Article: arXiv 1706.10239
                Record ID: 28753e06-9cd8-4651-8398-546e4b865f37
                License: http://creativecommons.org/publicdomain/zero/1.0/

                Categories: cs.LG, cs.AI, stat.ML
                Keywords: Machine learning, Artificial intelligence
