PhD Defense by Chen Liang

Title: On Parameter Efficiency of Neural Language Models

Date: Nov 27th, 2023

Time: 11am - 1pm ET

Location: Groseclose 226 or https://gatech.zoom.us/j/97505299826

Chen Liang

Machine Learning PhD Student

H. Milton Stewart School of Industrial and Systems Engineering

Georgia Institute of Technology

Committee

Dr. Tuo Zhao, School of Industrial and Systems Engineering, Georgia Institute of Technology (Advisor)

Dr. Chao Zhang, School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Diyi Yang, Computer Science Department, Stanford University

Dr. Aditya Prakash, School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Yingyan (Celine) Lin, School of Computer Science, Georgia Institute of Technology

Abstract

Pre-trained neural language models have achieved remarkable capabilities across various natural language understanding and generation tasks. However, the trend of scaling these models to encompass billions of parameters, while enhancing adaptability and emergent capabilities, has brought forth significant deployment challenges. These challenges include constraints in model storage and inference latency for real-world deployment, intensive time and computational costs for task adaptation, and the presence of substantial redundant parameters that affect task adaptability. Inspired by these challenges, this talk will cover methods we have developed to enhance the parameter efficiency of these models, seeking to minimize storage requirements, accelerate inference and adaptation, and enhance generalizability. The content of the talk is organized as follows:

In the first section, we investigate the largely unexplored relationship between parameter redundancy and model generalizability. Observing that removing redundant parameters improves generalizability, we propose an adaptive optimization algorithm for fine-tuning to improve the utilization of the redundant parameters. Experimental results validate increased generalization across various downstream tasks.

In the second section, we propose model compression strategies, such as weight pruning and knowledge distillation, aimed at reducing model storage and accelerating inference. We first develop a reliable iterative pruning method that accounts for uncertainties in training dynamics. We then turn to knowledge distillation, addressing the large teacher-student "knowledge gap" that often hampers the student's performance. To tackle this, we offer solutions for producing students for specific tasks by selectively distilling task-relevant knowledge. In scenarios demanding student adaptability across diverse tasks, we propose to reduce the knowledge gap by combining iterative pruning with distillation. Our approaches significantly surpass conventional distillation methods at similar compression ratios.

In the last section, we explore cost-effective task adaptation alternatives to expensive fine-tuning. We specifically focus on the hypernetwork approach, which uses an auxiliary hypernetwork to rapidly generate task-specific weights from few-shot demonstrations. We enhance the sample efficiency of the generation process by leveraging weight structure as an inductive bias, yielding superior performance on unseen tasks compared to existing methods.

Groups

Graduate Studies

Status

Workflow Status: Published
Created By: Tatianna Richardson
Created: 11/27/2023
Modified By: Tatianna Richardson
Modified: 11/27/2023
