{"678471":{"#nid":"678471","#data":{"type":"news","title":"Minority English Dialects Vulnerable to Automatic Speech Recognition Inaccuracy","body":[{"value":"\u003Cp\u003EThe Automatic Speech Recognition (ASR) models that power voice assistants like Amazon Alexa may have difficulty transcribing English speakers with minority dialects.\u003C\/p\u003E\u003Cp\u003EA study by Georgia Tech and Stanford researchers compared the transcription performance of leading ASR models for people using Standard American English (SAE) and three minority dialects \u2014 African American Vernacular English (AAVE), Spanglish, and Chicano English.\u003C\/p\u003E\u003Cp\u003EInteractive Computing Ph.D. student \u003Ca href=\u0022https:\/\/camille2019.github.io\/\u0022\u003E\u003Cstrong\u003ECamille Harris\u003C\/strong\u003E\u003C\/a\u003E is the lead author of a paper accepted to the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) this week in Miami.\u003C\/p\u003E\u003Cp\u003EHarris recruited people who spoke each dialect and had them read from a Spotify podcast dataset, which includes podcast audio and metadata. Harris then used three ASR models \u2014 wav2vec 2.0, HuBERT, and Whisper \u2014 to transcribe the audio and compare their performance.\u003C\/p\u003E\u003Cp\u003EFor each model, Harris found SAE transcription significantly outperformed each minority dialect. The models more accurately transcribed men who spoke SAE than women who spoke SAE. Participants who spoke Spanglish and Chicano English had the least accurate transcriptions of the test groups.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003EWhile the models transcribed SAE-speaking women less accurately than their male counterparts, that did not hold true across minority dialects. 
Minority men had the most inaccurate transcriptions of all demographics in the study.\u003C\/p\u003E\u003Cp\u003E\u201cI think people would expect if women generally perform worse and minority dialects perform worse, then the combination of the two must also perform worse,\u201d Harris said. \u201cThat\u2019s not what we observed.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003E\u201cSometimes minority dialect women performed better than Standard American English. We found a consistent pattern that men of color, particularly Black and Latino men, could be at the highest risk for these performance errors.\u201d\u003C\/p\u003E\u003Ch4\u003E\u003Cstrong\u003EAddressing underrepresentation\u003C\/strong\u003E\u003C\/h4\u003E\u003Cp\u003EHarris said the cause of that outcome starts with the training data used to build these models. Model performance reflected the underrepresentation of minority dialects in the datasets.\u003C\/p\u003E\u003Cp\u003EAAVE performed best under the Whisper model, which Harris said had the training data most inclusive of minority dialects.\u003C\/p\u003E\u003Cp\u003EHarris also looked at whether her findings mirrored existing systems of oppression. Black men have high incarceration rates and are among the demographic groups most targeted by police. Harris said there could be a correlation between that and the low rate of Black men enrolled in universities, which leads to less representation in technology spaces.\u003C\/p\u003E\u003Cp\u003E\u201cMinority men performing worse than minority women doesn\u2019t necessarily mean minority men are more oppressed,\u201d she said. \u201cThey may be less represented than minority women in computing and the professional sector that develops these AI systems.\u201d\u003C\/p\u003E\u003Cp\u003EHarris also had to account for several variables within AAVE, including code-switching and regional subdialects.\u003C\/p\u003E\u003Cp\u003EHarris noted in her study there were cases of code-switching to SAE. 
Speakers who code-switched performed better than speakers who did not.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003EHarris also sought to include speakers from different regions.\u003C\/p\u003E\u003Cp\u003E\u201cIt\u2019s interesting from a linguistic and history perspective if you look at migration patterns of Black folks \u2014 perhaps people moving from a southern state to a northern state over time creates different linguistic variations,\u201d she said. \u201cThere are also generational variations in that older Black Americans may speak differently from younger folks. I think the variation was well represented in our data. We wanted to be sure to include that for robustness.\u201d\u003C\/p\u003E\u003Ch4\u003E\u003Cstrong\u003ETikTok barriers\u003C\/strong\u003E\u003C\/h4\u003E\u003Cp\u003EHarris said she built her study on a paper she authored that examined user-design barriers and biases faced by Black content creators on TikTok. She presented that paper at the Association for Computing Machinery\u2019s (ACM) 2023 Conference on Computer-Supported Cooperative Work and Social Computing.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003EThose content creators depended on TikTok for a significant portion of their income. When providing captions for videos grew in popularity, those creators noticed the ASR tool built into the app inaccurately transcribed them. 
That forced the creators to manually input their captions, while SAE speakers could use the ASR feature to their benefit.\u003C\/p\u003E\u003Cp\u003E\u201cMinority users of these technologies will have to be more aware and keep in mind that they\u2019ll probably have to do a lot more customization because things won\u2019t be tailored to them,\u201d Harris said.\u003C\/p\u003E\u003Cp\u003EHarris said there are ways that designers of ASR tools could work toward being more inclusive of minority dialects, but cultural challenges could arise.\u003C\/p\u003E\u003Cp\u003E\u201cIt could be difficult to collect more minority speech data, and you have to consider consent with that,\u201d she said. \u201cDevelopers need to be more community-engaged to think about the implications of their models and whether it\u2019s something the community would find helpful.\u201d\u003C\/p\u003E","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cp\u003EInteractive Computing Ph.D. student \u003Ca href=\u0022https:\/\/camille2019.github.io\/\u0022\u003E\u003Cstrong\u003ECamille Harris\u003C\/strong\u003E\u003C\/a\u003E is the lead author of a paper accepted to the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) this week in Miami.\u003C\/p\u003E\u003Cp\u003EHarris recruited people who spoke each dialect and had them read from a Spotify podcast dataset, which includes podcast audio and metadata. Harris then used three ASR models \u2014 wav2vec 2.0, HuBERT, and Whisper \u2014 to transcribe the audio and compare their performance.\u003C\/p\u003E\u003Cp\u003EFor each model, Harris found SAE transcription significantly outperformed each minority dialect. The models more accurately transcribed men who spoke SAE than women who spoke SAE. 
Participants who spoke Spanglish and Chicano English had the least accurate transcriptions of the test groups.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003EWhile the models transcribed SAE-speaking women less accurately than their male counterparts, that did not hold true across minority dialects. Minority men had the most inaccurate transcriptions of all demographics in the study.\u003C\/p\u003E","format":"limited_html"}],"field_summary_sentence":[{"value":"The Automatic Speech Recognition (ASR) models that power voice assistants like Amazon Alexa may have difficulty transcribing English speakers with minority dialects."}],"uid":"36530","created_gmt":"2024-11-15 18:59:54","changed_gmt":"2024-12-02 16:39:44","author":"Nathan Deen","boilerplate_text":"","field_publication":"","field_article_url":"","location":"Atlanta, GA","dateline":{"date":"2024-11-15T00:00:00-05:00","iso_date":"2024-11-15T00:00:00-05:00","tz":"America\/New_York"},"extras":[],"hg_media":{"675652":{"id":"675652","type":"image","title":"Summit on Responsible Computing, AI, and Society_86A9696-Enhanced-NR.jpg","body":null,"created":"1731697203","gmt_created":"2024-11-15 19:00:03","changed":"1731697203","gmt_changed":"2024-11-15 19:00:03","alt":"Camille Harris","file":{"fid":"259300","name":"Summit on Responsible Computing, AI, and Society_86A9696-Enhanced-NR.jpg","image_path":"\/sites\/default\/files\/2024\/11\/15\/Summit%20on%20Responsible%20Computing%2C%20AI%2C%20and%20Society_86A9696-Enhanced-NR.jpg","image_full_path":"http:\/\/hg.gatech.edu\/\/sites\/default\/files\/2024\/11\/15\/Summit%20on%20Responsible%20Computing%2C%20AI%2C%20and%20Society_86A9696-Enhanced-NR.jpg","mime":"image\/jpeg","size":67965,"path_740":"http:\/\/hg.gatech.edu\/sites\/default\/files\/styles\/740xx_scale\/public\/2024\/11\/15\/Summit%20on%20Responsible%20Computing%2C%20AI%2C%20and%20Society_86A9696-Enhanced-NR.jpg?itok=p5e1wYY6"}}},"media_ids":["675652"],"groups":[{"id":"47223","name":"College of 
Computing"},{"id":"1188","name":"Research Horizons"},{"id":"50876","name":"School of Interactive Computing"}],"categories":[{"id":"153","name":"Computer Science\/Information Technology and Security"}],"keywords":[{"id":"177001","name":"speech recognition"},{"id":"134041","name":"bias"},{"id":"9153","name":"Research Horizons"},{"id":"188776","name":"go-research"},{"id":"187915","name":"go-researchnews"},{"id":"192863","name":"go-ai"},{"id":"193860","name":"Artificial Intelligence"},{"id":"99601","name":"inequality"}],"core_research_areas":[{"id":"193655","name":"Artificial Intelligence at Georgia Tech"},{"id":"39501","name":"People and Technology"}],"news_room_topics":[{"id":"71901","name":"Society and Culture"}],"event_categories":[],"invited_audience":[],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[{"value":"\u003Cp\u003ENathan Deen\u003C\/p\u003E\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\u003Cp\u003ECommunications Officer\u003C\/p\u003E\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\u003Cp\u003ESchool of Interactive Computing\u003C\/p\u003E","format":"limited_html"}],"email":["ndeen6@gatech.edu"],"slides":[],"orientation":[],"userdata":""}}}