Machine Learning Center Seminar Series | SGD and Weight Decay Secretly Compress Your Neural Network
Featuring Dr. Tomer Galanti, Massachusetts Institute of Technology
Abstract: Several empirical results have shown that replacing weight matrices with low-rank approximations results in only a small drop in accuracy, suggesting that the weight matrices at convergence may be close to low-rank matrices.
In this talk, we will study the origins of the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training Leaky ReLU neural networks. Our results show that training neural networks with SGD and weight decay induces a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices, and it applies to a wide range of neural network architectures of any width or depth. Finally, we will discuss the relationship between our analysis and other related properties, such as sparsity, neural collapse, implicit regularization, generalization, and compression.
Joint work with Zachary Siegel, Aparna Gupte, and Tomaso Poggio.
Bio: Tomer Galanti is a Postdoctoral Associate in Prof. Poggio's lab at MIT, where he works on theoretical and algorithmic aspects of deep learning. He previously interned as a Research Scientist with DeepMind's Foundations team. He received his Ph.D. from Tel Aviv University and was the university's youngest Ph.D. graduate in 2022. In 2018 he received the Deutch Annual Prize in Computer Science for his Ph.D. work. He has published numerous papers at top-tier venues, including NeurIPS, ICLR, ICML, ICCV, ECCV, and JMLR, among them an oral presentation at NeurIPS 2020.