{"689625":{"#nid":"689625","#data":{"type":"event","title":"Ph.D. Dissertation Defense - Yiye Chen ","body":[{"value":"\u003Cp\u003E\u003Cstrong\u003ETitle: \u0026nbsp;\u003C\/strong\u003E\u003Cem\u003ELeveraging Pretrained Vision and Language Models for Robotic Manipulation\u003C\/em\u003E\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ECommittee:\u0026nbsp;\u003C\/strong\u003E\u003C\/p\u003E\u003Cp\u003EDr. Patricio Vela, ECE, Advisor\u0026nbsp; \u0026nbsp;\u003C\/p\u003E\u003Cp\u003EDr. Ghassan Alregib, ECE, Reading Committee\u003C\/p\u003E\u003Cp\u003EDr. Zsolt Kira, IC, Reading Committee\u003C\/p\u003E\u003Cdiv\u003EDr. Lu Gan, AE, Additional member\u003C\/div\u003E\u003Cdiv\u003E\u0026nbsp;\u003C\/div\u003E\u003Cp\u003EDr. Sriram Vishwanath, ECE, Additional member\u003C\/p\u003E","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cdiv\u003E\u003Cp\u003EThis thesis investigates how pretrained vision and language models can be effectively integrated into robotic manipulation systems to enhance generalization and scalability. While recent advances in data-driven robotic policies have enabled impressive performance across a wide range of tasks, they remain limited by the high cost of data collection, poor generalization to new environments, and the complexity of integrating perception, planning, and control into unified systems. In contrast, large vision and language models trained on internet-scale data exhibit strong generalization and reasoning capabilities, motivating their incorporation into robotics. To this end, this thesis presents a series of methods that leverage pretrained models across multiple levels of robotic functionality. We first propose a language-conditioned grasping framework that enables task-aware grasp selection through multimodal fusion of visual and linguistic inputs. Next, we introduce a keypoint-based representation for 6-DoF grasp generation, allowing effective transfer of pretrained 2D visual knowledge to 3D manipulation. For high-level reasoning, we develop a schema-guided multi-agent framework that utilizes large language models and scene graphs for spatial understanding and task planning. Finally, we analyze Vision-Language-Action models through the lens of visual conditioning, and propose a training framework that improves visual grounding and downstream task performance.\u003C\/p\u003E\u003C\/div\u003E\u003Cp\u003E\u003Cbr\u003E\u0026nbsp;\u003C\/p\u003E","format":"limited_html"}],"field_summary_sentence":[{"value":"Leveraging Pretrained Vision and Language Models for Robotic Manipulation"}],"uid":"36804","created_gmt":"2026-04-10 17:15:48","changed_gmt":"2026-04-10 17:20:14","author":"jjones779","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2026-04-15T15:30:00-04:00","event_time_end":"2026-04-15T17:30:00-04:00","event_time_end_last":"2026-04-15T17:30:00-04:00","gmt_time_start":"2026-04-15 19:30:00","gmt_time_end":"2026-04-15 21:30:00","gmt_time_end_last":"2026-04-15 21:30:00","rrule":null,"timezone":"America\/New_York"},"location":"Room 523A, TSRB","extras":[],"related_links":[{"url":"https:\/\/teams.microsoft.com\/l\/meetup-join\/19%3ameeting_NzE0NDU0MzAtYTUxNy00Mzc2LWI5ZWMtNWQ3ODViMDA0ZjJk%40thread.v2\/0?context=%7b%22Tid%22%3a%22482198bb-ae7b-4b25-8b7a-6d7f32faa083%22%2c%22Oid%22%3a%221edac2f2-2811-4bf7-9428-883761609d8e%22%7d","title":"Microsoft Teams Link "}],"groups":[{"id":"434381","name":"ECE Ph.D. 
Dissertation Defenses"}],"categories":[],"keywords":[{"id":"100811","name":"Phd Defense"},{"id":"1808","name":"graduate students"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1788","name":"Other\/Miscellaneous"}],"invited_audience":[{"id":"78771","name":"Public"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[],"email":[],"slides":[],"orientation":[],"userdata":""}}}