
PhD Defense by Annabel Rothschild


Title: SAFE FROM THE START: DEVELOPING PRO-SOCIAL AI TRAINING DATASETS THROUGH DATA WORKERS’ CRITICAL PERSPECTIVES

Date: Monday, March 24, 2025

Time: 10:00-13:00 ET

In-person Location: TSRB 223 (Spark Studio)

Zoom link: https://gatech.zoom.us/j/91909341662?pwd=7YtWXcSPZXuiXlwgkiJkuObAjrOeWz.1

 

Annabel Rothschild

Ph.D. Candidate, Human-Centered Computing

School of Interactive Computing

College of Computing

Georgia Institute of Technology

 

Committee: 

Dr. Betsy DiSalvo (advisor), College of Computing, Georgia Institute of Technology

Dr. Carl DiSalvo (co-advisor), College of Computing, Georgia Institute of Technology

Dr. Shaowen Bardzell, College of Computing, Georgia Institute of Technology

Dr. Ellen Zegura, College of Computing, Georgia Institute of Technology

Dr. Richmond Wong, Ivan Allen College of Liberal Arts, Georgia Institute of Technology

Dr. Lauren Klein, Department of Quantitative Theory & Methods and English, Emory University

Dr. Ding Wang, Google Research

 

Summary:

AI and ML systems are increasingly ubiquitous, with recent advances in LLMs and image generators, such as OpenAI’s ChatGPT and DALL·E, creating new urgency in future-of-work conversations [1, 2, 3, 4, 5]. My work explores how the massive datasets used to train these systems, collected and curated by a global workforce of data workers, come into being. Specifically, I examine what the perspective and lived experience of a data worker contributes to the data labors they perform.

      The perspectives of data workers who build the datasets for data-intensive systems, such as AI and ML systems, frequently go unappreciated. Data workers have a unique on-the-ground view of a dataset and how it has been designed and developed, given that they are the executors of this work. Many of the problems we see with “biased” AI and ML systems can be traced back to issues with the dataset on which the system was trained. Consider the case of ImageNet, one of the most impactful computer vision (CV) benchmarking datasets ever developed, facilitated by the labor of Amazon Mechanical Turk (AMT) workers (Turkers) [6]. The labels Turkers used to annotate images were based on WordNet [7], which has been in wide circulation since 2011. As Prabhu & Birhane demonstrated, these labels included terms that are offensive and not safe for work (NSFW), along with a host of nonconsensual pornographic terms [8]. Did the Turkers who annotated ImageNet’s entries come across these terms? Could they have alerted ImageNet’s designers to problems with the use of WordNet labels before ImageNet became a critical benchmark dataset for CV systems?

      My work is motivated by the role that data worker perspective can play when data workers are empowered to practice critical data literacy (CDL), as I observed during my ethnographic fieldwork with DataWorks, a combined work-training program, data services provider, and research platform [9]. CDL goes a step beyond regular data literacy, which refers to a skillset for reading and understanding data statistics and data visualizations [10]. In addition to those skills, practicing CDL requires developing a critical consciousness [11], in the tradition of Paulo Freire [12], which means being able to question how these data summaries were arrived at, what might be behind the motivation for their creation, and whom they benefit. Practicing CDL also requires a workplace that supports this critical practice, namely by encouraging workers to speak up and out about problems or concerns they have with dataset development. Having seen the role that data workers equipped with CDL can play in positively shaping datasets, both in technical detail and sociocultural premise, I believe that building healthier, more pro-social AI and ML systems begins with intellectual partnership with data workers in dataset creation and development.

      My overarching research question (RQ) is: What is the role of perspective in data work, and how can we incorporate the perspective of data workers as partners in dataset contextualization? My dissertation answers three subquestions:

 

• RQ1: Why do we need better contextualization practices in data work, and what is the current state of data work annotation practices?

• RQ2: What is the relationship between critical data literacy and properly localized AI and ML systems?

• RQ3: How can we collect and integrate more varied perspectives to relocalize our AI and ML systems?

 

Contributions: My work facilitates the development of healthier, more pro-social AI and ML systems. Situated within critical data studies, the work described in this dissertation builds out approaches for integrating worker perspectives into datasets at large-scale dataset development sites.

 

