
      Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing and Neural Networks with Quadratic Activations

      Preprint


          Abstract

          Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters. Several empirical studies have shown that pruning an overparameterized model and then fine-tuning it generalizes well to new samples. Although this pipeline, which we refer to as pruning + fine-tuning, has been extremely successful in lowering the complexity of trained models, very little is known about the theory behind its success. In this paper we address this issue by investigating the pruning + fine-tuning framework on the overparameterized matrix sensing problem, with the ground truth denoted \(U_\star \in \mathbb{R}^{d \times r}\) and the overparameterized model \(U \in \mathbb{R}^{d \times k}\) with \(k \gg r\). We study the approximate local minima of the empirical mean square error augmented with a smooth version of a group Lasso regularizer, \(\sum_{i=1}^k \| U e_i \|_2\), and show that pruning the low \(\ell_2\)-norm columns results in a solution \(U_{\text{prune}}\) which has the minimum number of columns, \(r\), yet is close to the ground truth in training loss. Initializing the subsequent fine-tuning phase from \(U_{\text{prune}}\), the resulting solution converges linearly to a generalization error of \(O(\sqrt{rd/n})\), ignoring lower-order terms, which is statistically optimal. While our analysis provides insights into the role of regularization in pruning, we also show that running gradient descent in the absence of regularization results in models which are not suitable for greedy pruning, i.e., many columns could have \(\ell_2\) norm comparable to that of the maximum. Lastly, we extend our results to the training and pruning of two-layer neural networks with quadratic activation functions. Our results provide the first rigorous insights into why greedy pruning + fine-tuning leads to smaller models which also generalize well.
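
          As an illustration of the pipeline the abstract describes, the sketch below runs the two phases on synthetic, noiseless matrix sensing data in NumPy: regularized training of an overparameterized factor, greedy pruning of the low \(\ell_2\)-norm columns, and fine-tuning from the pruned factor. The smoothing \(\sqrt{\|U e_i\|_2^2 + \varepsilon}\) stands in for the paper's "smooth version" of the group Lasso term, and the regularization weight, step size, iteration counts, and the 10%-of-max pruning threshold are illustrative assumptions rather than the paper's prescribed choices.

          # Hedged sketch of pruning + fine-tuning for overparameterized matrix sensing.
          # Hyperparameters and the pruning threshold are illustrative, not taken from the paper.
          import numpy as np

          rng = np.random.default_rng(0)
          d, r, k, n = 15, 2, 6, 600          # ambient dim d, true rank r, overparameterized width k, samples n

          U_star = rng.normal(size=(d, r)) / np.sqrt(d)   # ground truth U_star in R^{d x r}
          A = rng.normal(size=(n, d * d))                 # Gaussian sensing matrices A_i, stored as flattened rows
          y = A @ (U_star @ U_star.T).ravel()             # y_i = <A_i, U_star U_star^T> (noiseless)

          def grad(U, lam, eps=1e-3):
              """Gradient of the empirical MSE plus the smoothed group-Lasso penalty
              lam * sum_i sqrt(||U e_i||_2^2 + eps)."""
              resid = A @ (U @ U.T).ravel() - y           # <A_i, U U^T> - y_i
              S = (resid @ A).reshape(d, d)               # sum_i resid_i * A_i
              g_mse = (2.0 / n) * (S + S.T) @ U
              g_reg = lam * U / np.sqrt((U ** 2).sum(axis=0) + eps)
              return g_mse + g_reg

          # Phase 1: train the overparameterized factor U in R^{d x k} with the regularizer on.
          U = 0.1 * rng.normal(size=(d, k))
          for _ in range(3000):
              U -= 0.02 * grad(U, lam=0.05)

          # Greedy pruning: keep only the columns whose l2 norm is non-negligible.
          norms = np.linalg.norm(U, axis=0)
          U_prune = U[:, norms > 0.1 * norms.max()]
          print("columns kept after pruning:", U_prune.shape[1], "(true rank r =", r, ")")

          # Phase 2: fine-tune from U_prune with the regularizer switched off.
          for _ in range(3000):
              U_prune -= 0.02 * grad(U_prune, lam=0.0)

          M_star = U_star @ U_star.T
          rel_err = np.linalg.norm(U_prune @ U_prune.T - M_star) / np.linalg.norm(M_star)
          print("relative recovery error:", rel_err)

          Consistent with the abstract's negative result, rerunning phase 1 with lam = 0.0 tends to leave many columns with comparable \(\ell_2\) norms, so the same norm-based threshold would no longer isolate \(r\) columns.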


          Author and article information

          Date: 20 March 2023
          arXiv: 2303.11453
          Record ID: 553c07a5-9530-4b52-8bae-8e81bb4edd10

          License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          60 pages, 2 figures
          Subject classes: cs.LG, stat.ML

          Keywords: Machine learning, Artificial intelligence
