PhD Proposal by Yao Dou
Title: Scalable and Structured Evaluation of Large Language Models
Date: Tuesday, December 16, 2025
Time: 1:00-2:30 PM EST
Location: Coda C1115 (Druid Hills)
Zoom: https://gatech.zoom.us/j/2123115504?pwd=eTJFZWR1dXN6ZlZ3WGtwRlMzQmZNQT09
Yao Dou
Ph.D. Student
School of Interactive Computing
Georgia Institute of Technology
Committee members
Dr. Wei Xu (advisor): School of Interactive Computing, Georgia Institute of Technology
Dr. Alan Ritter: School of Interactive Computing, Georgia Institute of Technology
Dr. Polo Chau: School of Computational Science and Engineering, Georgia Institute of Technology
Dr. Michel Galley: Senior Principal Research Manager at Microsoft Research
Dr. Dipanjan Das: Senior Director of Research at Google DeepMind
Abstract
As large language models (LLMs) move into real-world, open-ended applications, evaluating them becomes both more important and more difficult. My work focuses on developing evaluation methods that go beyond multiple-choice accuracy to handle multi-turn interaction, long-context tasks, and fine-grained text quality. In this thesis proposal, I will first present our work on evaluating multi-turn assistants with user simulators: building SimulatorArena to assess how closely simulators behave like humans and how well their evaluations of assistants align with human judgments, and introducing detailed user profiles that capture user background and message style to improve simulator quality. I will then introduce Gavel, an evaluation framework for long-context legal summarization, where case documents often exceed 75K words and summaries run over 800 words. Gavel extracts a checklist of key legal items (e.g., parties, filings, decrees) from model summaries and compares them to human-written references, turning a holistic judgment into a fine-grained, more accurate, and more interpretable evaluation. It also includes a novel agent scaffold that lets LLMs navigate case documents and extract checklist items directly, achieving competitive performance with far fewer tokens than end-to-end prompting. I will also discuss my work on fine-grained evaluation for other tasks, including SALSA, an edit-based framework for text simplification, and models that detect and abstract self-disclosure spans in social media posts to measure and reduce privacy risk. In addition, I will present LENS, a small learnable metric for text simplification trained directly on human ratings. Finally, I will outline how to further improve agent-based evaluators using reinforcement learning and multi-agent collaboration.