{"675288":{"#nid":"675288","#data":{"type":"news","title":"Episode of \u0027Friends\u0027 Inspires New Tool that Provides Human-like Perception to MLLMs","body":[{"value":"\u003Cp\u003EFor Jitesh Jain, conducting a simple experiment while watching one of his favorite TV series became the genesis of a paper accepted into a prestigious computer vision conference.\u003C\/p\u003E\u003Cp\u003EJain is the creator of VCoder, a new tool that enhances the visual perception capabilities of multimodal large language models (MLLMs). Jain said MLLMs like GPT-4 with vision (GPT-4V) are prone to miss obscure objects that blend in with other objects in an image.\u003C\/p\u003E\u003Cp\u003EJain paused his TV as he watched \u003Cem\u003EThe One with the Halloween Party\u0026nbsp;\u003C\/em\u003Eepisode of the popular TV Series \u003Cem\u003EFriend\u003C\/em\u003Es.\u0026nbsp;\u003C\/p\u003E\u003Cp\u003EChandler stood out the most in a pink bunny costume while holding hands with Ross in a potato costume. As the two prepared for an arm-wrestling match with Joey and Phoebe, multiple groups socialized behind them.\u003C\/p\u003E\u003Cp\u003EJain wondered how accurate GPT-4V would be if he prompted itto describe everything happening in this image.\u003C\/p\u003E\u003Cp\u003E\u201cI watch a lot of TV series, so I frequently think about ways to leverage or include some aspects of those into my work,\u201d said Jain, a Ph.D. student in the School of Interactive Computing. \u201cThe scene was very cluttered, so I thought, what questions could I ask GPT-4 about this show.\u201d\u003C\/p\u003E\u003Cp\u003EOn the surface, the generative AI chatbot knew much about the image. It knew which show and episode it was from and recognized the man in the bunny costume as Chandler. 
It knew the main characters were hosting a Halloween party.\u003C\/p\u003E\u003Cp\u003EBut when Jain prompted the chatbot to count the number of people in the image, he discovered that GPT-4V and its open-source counterparts fell short of performing the simplest task.\u003C\/p\u003E\u003Cp\u003EIt answered 10 when the correct answer was 14. In the right corner of the image, there is a group of people standing in front of a dark curtain that GPT-4V had missed.\u003C\/p\u003E\u003Ch4\u003E\u003Cstrong\u003EAI Paradox\u003C\/strong\u003E\u003C\/h4\u003E\u003Cp\u003EJain had a theory \u2014 the MLLMs had not been trained for the object perception task and did not have the necessary information to perceive the objects in the foreground and background.\u003C\/p\u003E\u003Cp\u003E\u201cWe started testing it with different pictures, and GPT-4V kept underperforming,\u201d Jain said. \u201cThe key takeaway is that it couldn\u2019t do a simple task such as counting the people in the scene, but it knew complex information such as what was happening and who the characters were. This phenomenon is Moravec\u2019s Paradox in Perception \u2014 the MLLMs visually reason about complex questions but fail at simple object perception tasks like counting.\u201d\u003C\/p\u003E\u003Cp\u003EJain said he has worked on image segmentation tools for the past two years, including as a research intern at Picsart AI under Humphrey Shi, now his Ph.D. advisor and an associate professor in the School of Interactive Computing.\u003C\/p\u003E\u003Cp\u003EThe core idea behind VCoder is to act as a perceptive eye for the MLLM, using segmentation and depth maps obtained through established computer vision frameworks with minimal training costs. 
Jain\u2019s paper also proposes evaluation metrics for object perception tasks like counting and ordering.\u003C\/p\u003E\u003Cp\u003EIts training and evaluation set consists of images from Common Objects in Context (COCO), a widely used object perception dataset. Associate Professor James Hays from the School of Interactive Computing was one of the academic collaborators who worked with Microsoft to create COCO.\u003C\/p\u003E\u003Ch4\u003E\u003Cstrong\u003EImproving MLLMs\u003C\/strong\u003E\u003C\/h4\u003E\u003Cp\u003EThough VCoder didn\u2019t know which show the image was from, it accurately described everything, including the number of people. It showed as much as 10% more accuracy than its nearest competitor.\u003C\/p\u003E\u003Cp\u003EIt could also identify the order of objects in a scene.\u003C\/p\u003E\u003Cp\u003EJain designed VCoder to integrate easily with existing MLLMs. He said augmenting an MLLM with VCoder yields a model with sound general reasoning and object perception capabilities.\u003C\/p\u003E\u003Cp\u003EHowever, he added he was unsure if integration would happen because companies like OpenAI, which created GPT-4V, may overlook it.\u003C\/p\u003E\u003Cp\u003E\u201cThere\u2019s no way to know if they will integrate since GPT-4V is a closed model, and their main motivation is to make a product useful to consumers in general,\u201d he said. \u201cThey often ignore these small issues.\u201d\u003C\/p\u003E\u003Cp\u003EJain\u2019s paper was accepted into the Institute of Electrical and Electronics Engineers\u2019 2024 Conference on Computer Vision and Pattern Recognition (CVPR), held June 17-21 in Seattle. 
CVPR is the highest-ranked conference in computer vision according to Google Scholar.\u003C\/p\u003E","summary":"","format":"limited_html"}],"field_subtitle":"","field_summary":[{"value":"\u003Cp\u003EFor Jitesh Jain, conducting a simple experiment while watching one of his favorite TV series became the genesis of a paper accepted into a prestigious computer vision conference.\u003C\/p\u003E\u003Cp\u003EJain is the creator of VCoder, a new tool that enhances the visual perception capabilities of multimodal large language models (MLLMs). Jain said MLLMs like GPT-4 with vision (GPT-4V) are prone to miss obscure objects that blend in with other objects in an image.\u003C\/p\u003E","format":"limited_html"}],"field_summary_sentence":[{"value":"Jitesh Jain is the creator of VCoder, a new tool that enhances the visual perception capabilities of multimodal large language models (MLLMs)"}],"uid":"36530","created_gmt":"2024-07-01 18:36:09","changed_gmt":"2024-07-01 18:37:57","author":"Nathan Deen","boilerplate_text":"","field_publication":"","field_article_url":"","dateline":{"date":"2024-06-18T00:00:00-04:00","iso_date":"2024-06-18T00:00:00-04:00","tz":"America\/New_York"},"extras":[],"hg_media":{"674279":{"id":"674279","type":"image","title":"2X6A9720.jpg","body":null,"created":"1719858982","gmt_created":"2024-07-01 18:36:22","changed":"1719858982","gmt_changed":"2024-07-01 18:36:22","alt":"Jitesh Jain and Humphrey Shi","file":{"fid":"257775","name":"2X6A9720.jpg","image_path":"\/sites\/default\/files\/2024\/07\/01\/2X6A9720.jpg","image_full_path":"http:\/\/hg.gatech.edu\/\/sites\/default\/files\/2024\/07\/01\/2X6A9720.jpg","mime":"image\/jpeg","size":3563310,"path_740":"http:\/\/hg.gatech.edu\/sites\/default\/files\/styles\/740xx_scale\/public\/2024\/07\/01\/2X6A9720.jpg?itok=RwAeH0kF"}}},"media_ids":["674279"],"groups":[{"id":"47223","name":"College of Computing"},{"id":"50876","name":"School of Interactive Computing"}],"categories":[{"id":"153","name":"Computer 
Science\/Information Technology and Security"}],"keywords":[],"core_research_areas":[{"id":"193655","name":"Artificial Intelligence at Georgia Tech"}],"news_room_topics":[],"event_categories":[],"invited_audience":[],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[{"value":"\u003Cp\u003ENathan Deen\u003C\/p\u003E\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\u003Cp\u003ECommunications Officer\u003C\/p\u003E\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\u003Cp\u003ESchool of Interactive Computing\u003C\/p\u003E","format":"limited_html"}],"email":[],"slides":[],"orientation":[],"userdata":""}}}