PhD Proposal by Yao Dou

Title: Scalable and Structured Evaluation of Large Language Models

Date: Tuesday, December 16, 2025

Time: 1:00-2:30 PM EST

Location: Coda C1115 Druid Hills

Zoom: https://gatech.zoom.us/j/2123115504?pwd=eTJFZWR1dXN6ZlZ3WGtwRlMzQmZNQT09

 

Yao Dou

Ph.D. Student

School of Interactive Computing

Georgia Institute of Technology

 

Committee members

Dr. Wei Xu (advisor): School of Interactive Computing, Georgia Institute of Technology

Dr. Alan Ritter: School of Interactive Computing, Georgia Institute of Technology

Dr. Polo Chau: School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Michel Galley: Senior Principal Research Manager at Microsoft Research

Dr. Dipanjan Das: Senior Director of Research at Google DeepMind

 

Abstract

As large language models (LLMs) move into real-world, open-ended applications, evaluating them becomes both more important and more difficult. My work develops evaluation methods that go beyond multiple-choice accuracy to handle multi-turn interaction, long-context tasks, and fine-grained text quality. In this thesis proposal, I will first present our work on evaluating multi-turn assistants with user simulators: we build SimulatorArena to assess how closely simulators behave like humans and how well their evaluations of assistants align with human judgments, and we introduce detailed user profiles that capture a user's background and message style to improve simulator quality. I will next introduce Gavel, an evaluation framework for long-context legal summarization, where case documents often exceed 75K words and summaries run over 800 words. Gavel extracts a checklist of key legal items (e.g., parties, filings, decrees) from model summaries and compares them to human-written references, turning a single holistic judgment into a fine-grained evaluation that is more accurate and more interpretable. It also includes a novel agent scaffold that lets LLMs navigate case documents and extract checklist items directly, achieving competitive performance with far fewer tokens than end-to-end prompting. I will also discuss my work on fine-grained evaluation for other tasks, including SALSA, an edit-based framework for text simplification, and models that detect and abstract self-disclosure spans in social media posts to measure and reduce privacy risk. In addition, I will present LENS, a small learnable metric trained directly on human ratings of text simplification. Finally, I will outline how to further improve agent-based evaluators using reinforcement learning and multi-agent collaboration.
