PhD Proposal by Ruohao Guo
Title: Rethinking Safety of Language Models in Interaction
Date: Friday, April 3rd 2026
Time: 9:00 AM – 11:00 AM ET
Location: Coda C1108 Brookhaven
Zoom: https://gatech.zoom.us/j/93471413440
Ruohao Guo
Ph.D. Student
School of Interactive Computing
Georgia Institute of Technology
Committee members
Dr. Alan Ritter (advisor): School of Interactive Computing, Georgia Institute of Technology
Dr. Wei Xu: School of Interactive Computing, Georgia Institute of Technology
Dr. Polo Chau: School of Computational Science & Engineering, Georgia Institute of Technology
Dr. Dan Roth: Department of Computer and Information Science, University of Pennsylvania; Chief AI Scientist at Oracle
Abstract
The rapid advancement of large language models (LLMs) has brought transformative capabilities but has also introduced critical safety concerns. Prior efforts in AI safety have focused on explicit and direct threats, such as overtly false claims or single-turn attacks. This thesis demonstrates that real-world safety challenges are far more subtle and dynamic, and that current safety mechanisms are inadequate against them. First, I will present our work on how LLMs handle implicit misinformation, i.e., false claims embedded as unchallenged premises in user queries. We show that LLMs can reinforce users' misinformed beliefs through interaction, and that possessing the relevant factual knowledge alone does not suffice for effective mitigation. Second, I will introduce DialTree, an on-policy reinforcement learning framework that discovers LLM safety vulnerabilities in multi-turn interactive scenarios. We show that even the most safety-aligned frontier models can be jailbroken by our adaptive and strategic attacks. Third, I will present a meta-tuning approach for generalizable language style understanding, which can strengthen a foundational capability for safety-relevant tasks such as bias detection and manipulation recognition. Finally, I will briefly discuss my ongoing work on improving safety in multi-turn settings by monitoring evolving conversation trajectories.
Status
- Workflow status: Published
- Created by: Tatianna Richardson
- Created: 03/31/2026
- Modified By: Tatianna Richardson
- Modified: 03/31/2026