PhD Defense by Chen Liang
Title: On Parameter Efficiency of Neural Language Models
Date: Nov 27th, 2023
Time: 11am - 1pm ET
Location: Groseclose 226 or https://gatech.zoom.us/j/97505299826
Machine Learning PhD Student
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology
Dr. Tuo Zhao, School of Industrial and Systems Engineering, Georgia Institute of Technology (Advisor)
Dr. Chao Zhang, School of Computational Science and Engineering, Georgia Institute of Technology
Dr. Diyi Yang, Computer Science Department, Stanford University
Dr. Aditya Prakash, School of Computational Science and Engineering, Georgia Institute of Technology
Dr. Yingyan (Celine) Lin, School of Computer Science, Georgia Institute of Technology
Pre-trained neural language models have achieved remarkable capabilities across various natural language understanding and generation tasks. However, the trend of scaling these models to billions of parameters, while enhancing adaptability and emergent capabilities, has brought significant deployment challenges. These challenges include constraints on model storage and inference latency for real-world deployment, intensive time and computational costs for task adaptation, and the presence of substantial redundant parameters that affect task adaptability. Motivated by these challenges, this talk will cover methods we have developed to enhance the parameter efficiency of these models, seeking to minimize storage requirements, accelerate inference and adaptation, and enhance generalizability. The content of the talk is organized as follows:
In the first section, we investigate the largely unexplored relationship between parameter redundancy and model generalizability. Observing that removing redundant parameters improves generalizability, we propose an adaptive optimization algorithm for fine-tuning to improve the utilization of the redundant parameters. Experimental results validate increased generalization across various downstream tasks.
In the second section, we propose model compression strategies, including weight pruning and knowledge distillation, aimed at reducing model storage and accelerating inference. We first develop a reliable iterative pruning method that accounts for uncertainties in training dynamics. We then turn to knowledge distillation, addressing the large teacher-student "knowledge gap" that often hampers the student's performance. To tackle this, we offer solutions for producing students for specific tasks by selectively distilling task-relevant knowledge. In scenarios demanding student adaptability across diverse tasks, we propose to reduce the knowledge gap by combining iterative pruning with distillation. Our approaches significantly surpass conventional distillation methods at similar compression ratios.
In the last section, we explore cost-effective task adaptation alternatives to expensive fine-tuning. We specifically focus on the hypernetwork approach, which uses an auxiliary hypernetwork to rapidly generate task-specific weights from few-shot demonstrations. We enhance the sample efficiency of the generation process by leveraging weight structure as an inductive bias, yielding superior performance on unseen tasks compared to existing methods.