PhD Defense by Gaurav Verma

Title: Robust, Efficient, and Adaptable Multimodal AI for Vertical Applications

Date: Thursday, April 10, 2025
Time: 2:30–4:30 PM Eastern Time (US)
Location: Coda 1315 (Grant Park)
Virtual Meeting: Zoom

Gaurav Verma
https://gaurav22verma.github.io/
Computer Science PhD Candidate
School of Computational Science and Engineering
Georgia Institute of Technology

Committee:
Dr. Srijan Kumar - Advisor, Georgia Tech, Computational Science & Engineering
Dr. Munmun De Choudhury - Georgia Tech, School of Interactive Computing
Dr. Duen Horng (Polo) Chau - Georgia Tech, Computational Science & Engineering
Dr. Chao Zhang - Georgia Tech, Computational Science & Engineering
Dr. Ani Nenkova - Adobe Research, Document Intelligence Lab

Abstract:
Large artificial intelligence (AI) models have garnered attention for their impressive, sometimes superhuman, performance on benchmarks, yet their practical adoption in verticals like web safety and well-being presents many challenges. Issues such as brittleness to realistic input variations, sensitivity to prompt formatting in large language models (LLMs), performance degradation in specialized settings, and limited effectiveness among certain user groups significantly limit the real-world utility of large AI models.

To systematically address these challenges, this thesis introduces a framework for transforming foundational large AI models into real-world solutions by advancing their vertical-agnostic properties and overcoming challenges in vertical-specific applications. Focusing first on vertical-agnostic properties, the thesis advances multimodal AI models—those integrating vision and language—by tackling three critical areas: robustness to realistic data variations, efficient cross-modal mapping, and adaptability to novel tasks. Key contributions include evaluating model robustness to grounded multimodal variations, proposing a method to quantify text visualness for efficient cross-modal retrieval and generation, and developing techniques for rapidly adapting multimodal agents to custom workflows.

Building upon these foundational contributions, the thesis then targets vertical applications, demonstrating the necessity of tailored data, modeling, and evaluation approaches. In collaboration with domain experts in web safety and well-being, it characterizes and detects violence-provoking speech and leverages LLMs to uncover actionable mental well-being insights. Further, it shows how insights from vertical-specific studies can loop back to improve large AI models, notably by addressing inequities across languages through multimodal learning. The thesis also highlights user interaction with the emerging capabilities of large AI models as a vital and open area for future exploration.