{"680341":{"#nid":"680341","#data":{"type":"event","title":"CSE Faculty Candidate Seminar - Yizhong Wang","body":[{"value":"\u003Cp\u003E\u003Cstrong\u003EName: \u003C\/strong\u003EYizhong Wang, Ph.D. candidate from the University of Washington\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003EDate:\u0026nbsp;\u003C\/strong\u003EThursday, February 13, 2025 at 11:00 am\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ELocation:\u003C\/strong\u003E\u0026nbsp;Coda Building, Room 230 (\u003Ca href=\u0022https:\/\/maps.app.goo.gl\/m7KfgEEVWV581bxD9\u0022\u003EGoogle Maps link\u003C\/a\u003E)\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ELink:\u0026nbsp;\u003C\/strong\u003EThe recording of this in-person seminar will be uploaded to\u0026nbsp;\u003Ca href=\u0022https:\/\/mediaspace.gatech.edu\/channel\/School%2Bof%2BComputational%2BScience%2Band%2BEngineering\/259332602\u0022 target=\u0022_blank\u0022\u003ECSE\u0027s MediaSpace\u003C\/a\u003E\u003C\/p\u003E\u003Cp\u003E\u003Cem\u003ECoffee, drinks, and snacks provided!\u003C\/em\u003E\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ETitle:\u0026nbsp;\u003C\/strong\u003EBuilding a Sustainable Data Foundation for AI\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003EAbstract:\u003C\/strong\u003E The rapid expansion of AI is consuming data at an unprecedented scale. However, as pretraining on raw Internet data reaches diminishing returns and downstream applications grow increasingly complex, the question arises: What data paradigm will sustain the next generation of more powerful AI systems? This requires a systematic rethinking of how we structure, create, and share data. In this talk, I will present my research on building language models that go beyond pretraining while improving their generalization through a data-centric perspective. First, I will introduce how unifying NLP tasks through task instructions enables broader generalization. 
The resulting technique, instruction tuning, redefines how task data should be structured and has transformed how people interact with language models. Second, I will explore synthetic data creation, where models are employed in the data production process. This led to Self-Instruct, the first framework to use language models to create diverse tasks and to demonstrate model self-improvement. Finally, I will discuss the development of T\u00fclu and OLMo, two representative open models, highlighting the central role of data curation and open collaboration in advancing AI research. Together, these efforts\u2014unifying task representations, leveraging synthetic data, and fostering open data sharing\u2014have shaped the current AI research and application landscape. As these trends continue to evolve, they hold the potential to establish a scalable and sustainable data foundation for the future of AI.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003EBio: \u003C\/strong\u003EYizhong Wang is a Ph.D. candidate at the University of Washington, advised by Hannaneh Hajishirzi and Noah Smith. He has also been a student researcher at the Allen Institute for Artificial Intelligence (AI2) for the past two years, co-leading the post-training efforts in building fully open language models (OLMo). His research focuses on the fundamental data challenges in AI development and algorithms centered around data, particularly for building more general-purpose models. His work, such as Super-NaturalInstructions, Self-Instruct, and T\u00fclu, has been widely used in building large language models today. He has won multiple paper awards, including ACL 2024 Best Theme Paper, CCL 2020 Best Paper, and ACL 2017 Outstanding Paper. 
He also serves on the program committees of top NLP and ML conferences and was an area chair for EMNLP 2024.\u003C\/p\u003E","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cp\u003E\u003Cstrong\u003EName: \u003C\/strong\u003EYizhong Wang, Ph.D. candidate from the University of Washington\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003EDate:\u0026nbsp;\u003C\/strong\u003EThursday, February 13, 2025 at 11:00 am\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ELocation:\u003C\/strong\u003E\u0026nbsp;Coda Building, Room 230 (\u003Ca href=\u0022https:\/\/maps.app.goo.gl\/m7KfgEEVWV581bxD9\u0022\u003EGoogle Maps link\u003C\/a\u003E)\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ELink:\u0026nbsp;\u003C\/strong\u003EThe recording of this in-person seminar will be uploaded to\u0026nbsp;\u003Ca href=\u0022https:\/\/mediaspace.gatech.edu\/channel\/School%2Bof%2BComputational%2BScience%2Band%2BEngineering\/259332602\u0022 target=\u0022_blank\u0022\u003ECSE\u0027s MediaSpace\u003C\/a\u003E\u003C\/p\u003E\u003Cp\u003E\u003Cstrong\u003ETitle:\u0026nbsp;\u003C\/strong\u003EBuilding a Sustainable Data Foundation for AI\u003C\/p\u003E","format":"limited_html"}],"field_summary_sentence":[{"value":"Seminar Title: Building a Sustainable Data Foundation for AI"}],"uid":"36319","created_gmt":"2025-02-10 13:16:24","changed_gmt":"2025-02-13 15:33:08","author":"Bryant Wine","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2025-02-13T11:00:00-05:00","event_time_end":"2025-02-13T12:00:00-05:00","event_time_end_last":"2025-02-13T12:00:00-05:00","gmt_time_start":"2025-02-13 16:00:00","gmt_time_end":"2025-02-13 17:00:00","gmt_time_end_last":"2025-02-13 17:00:00","rrule":null,"timezone":"America\/New_York"},"location":"Coda, Room 230","extras":["free_food"],"hg_media":{"676243":{"id":"676243","type":"image","title":"Yizhong Wang.jpg","body":null,"created":"1739193500","gmt_created":"2025-02-10 
13:18:20","changed":"1739193500","gmt_changed":"2025-02-10 13:18:20","alt":"CSE Faculty Candidate Seminar Yizhong Wang","file":{"fid":"259986","name":"Yizhong Wang.jpg","image_path":"\/sites\/default\/files\/2025\/02\/10\/Yizhong%20Wang.jpg","image_full_path":"http:\/\/hg.gatech.edu\/\/sites\/default\/files\/2025\/02\/10\/Yizhong%20Wang.jpg","mime":"image\/jpeg","size":4501573,"path_740":"http:\/\/hg.gatech.edu\/sites\/default\/files\/styles\/740xx_scale\/public\/2025\/02\/10\/Yizhong%20Wang.jpg?itok=dagUTxP1"}}},"media_ids":["676243"],"groups":[{"id":"47223","name":"College of Computing"},{"id":"50877","name":"School of Computational Science and Engineering"}],"categories":[],"keywords":[{"id":"166983","name":"School of Computational Science and Engineering"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1795","name":"Seminar\/Lecture\/Colloquium"}],"invited_audience":[{"id":"78761","name":"Faculty\/Staff"},{"id":"177814","name":"Postdoc"},{"id":"78771","name":"Public"},{"id":"174045","name":"Graduate students"},{"id":"78751","name":"Undergraduate students"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[{"value":"\u003Cp\u003EMary High\u003Cbr\u003Emhigh7@gatech.edu\u003C\/p\u003E","format":"limited_html"}],"email":[],"slides":[],"orientation":[],"userdata":""}}}