event

PhD Proposal by Ruohao Guo

Title: Rethinking Safety of Language Models in Interaction

Date: Friday, April 3, 2026

Time: 9:00 AM – 11:00 AM ET

Location: Coda C1108 Brookhaven

Zoom: https://gatech.zoom.us/j/93471413440

 

Ruohao Guo

Ph.D. Student

School of Interactive Computing

Georgia Institute of Technology

 

Committee members

Dr. Alan Ritter (advisor): School of Interactive Computing, Georgia Institute of Technology

Dr. Wei Xu: School of Interactive Computing, Georgia Institute of Technology

Dr. Polo Chau: School of Computational Science & Engineering, Georgia Institute of Technology

Dr. Dan Roth: Department of Computer and Information Science, University of Pennsylvania; Chief AI Scientist at Oracle

 

Abstract

The rapid advancement of large language models (LLMs) has brought transformative capabilities but has also introduced critical safety concerns. Prior efforts in AI safety have focused on explicit and direct threats, such as overtly false claims or single-turn attacks. This thesis demonstrates that real-world safety challenges are far more subtle and dynamic, and that current safety mechanisms are inadequate against them. First, I will present our work on how LLMs handle implicit misinformation, i.e., false claims embedded as unchallenged premises in user queries. We reveal that LLMs can reinforce users' misinformed beliefs through interaction, and that possessing the relevant factual knowledge alone does not suffice for effective mitigation. Second, I will introduce DialTree, an on-policy reinforcement learning framework that discovers LLM safety vulnerabilities in multi-turn interactive scenarios. We show that even the most safety-aligned frontier models can be jailbroken by our adaptive and strategic attacks. Third, I will describe a meta-tuning approach for generalizable language style understanding, which can strengthen the foundational capabilities needed for safety-relevant tasks such as bias detection and manipulation recognition. Finally, I will briefly discuss my ongoing work on improving safety in multi-turn settings by monitoring evolving conversation trajectories.

Status

  • Workflow status: Published
  • Created by: Tatianna Richardson
  • Created: 03/31/2026
  • Modified by: Tatianna Richardson
  • Modified: 03/31/2026
