PhD Defense | Robust and Flexible Reward Modeling for LLM Alignment

Title: Robust and Flexible Reward Modeling for LLM Alignment

Date: April 21st, 2025

Time: 11:00 am – 1:00 pm (EST)

Location: ISyE Main 224 

Zoom link: https://gatech.zoom.us/j/91835542508

 

Alexander Bukharin

Machine Learning PhD Candidate 

H. Milton Stewart School of Industrial and Systems Engineering

Georgia Institute of Technology 

 

Committee

1. Dr. Tuo Zhao (ISYE, Georgia Tech) (Advisor) 

2. Dr. Chao Zhang (CSE, Georgia Tech) 

3. Dr. Bo Dai (CSE, Georgia Tech) 

4. Dr. Sen Na (ISYE, Georgia Tech)

5. Dr. Olivier Delalleau (NVIDIA)

 

Abstract

As large language models grow increasingly capable, ensuring their alignment with human values is of utmost importance. One of the most promising ways to align language models is to design a reward function that measures alignment with human values and to train the language model to maximize this reward. In this thesis, we focus on two approaches to reward design: reward design from external feedback signals and reward learning from human-annotated datasets. In the first chapter, we develop a reward design framework, HERON, that eases reward function design by exploiting hierarchical relationships between feedback signals. In the second chapter, we propose an algorithm to learn reward functions from datasets with corrupted human annotations. In the last chapter, we develop an adversarial attack approach that automatically discovers flaws in state-of-the-art reward functions, and then uses these attacks to train more robust reward models. Altogether, these contributions advance the scalability and robustness of reward modeling.
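
For readers unfamiliar with reward learning from human-annotated datasets, the sketch below illustrates the standard setup the abstract refers to: fitting a reward model on pairwise preference annotations with a Bradley-Terry loss. It is a generic, minimal example, not the specific algorithms developed in the thesis; the tensor names and batch size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: push the reward of the human-preferred
    response above the reward of the rejected response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Dummy scalar rewards for a batch of 8 annotated preference pairs
# (in practice these come from a reward model scoring prompt/response pairs).
reward_chosen = torch.randn(8, requires_grad=True)    # r(x, y_chosen)
reward_rejected = torch.randn(8, requires_grad=True)  # r(x, y_rejected)

loss = pairwise_reward_loss(reward_chosen, reward_rejected)
loss.backward()  # gradients would update the reward model's parameters
```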
