PhD Proposal by Harsh Agrawal

Title: Towards multi-modal AI systems with 'open-world' cognition.

Date: Friday, September 16, 2022

Time: 11:30 am - 1:00 pm EDT

Location (virtual): https://gatech.zoom.us/j/92418212103


Harsh Agrawal

PhD Student in Computer Science

College of Computing
Georgia Institute of Technology


Dr. Dhruv Batra (Advisor, School of Interactive Computing, Georgia Institute of Technology)

Dr. Devi Parikh (School of Interactive Computing, Georgia Institute of Technology)

Dr. James Hays (School of Interactive Computing, Georgia Institute of Technology)

Dr. Alexander Schwing (Department of Electrical and Computer Engineering)

Dr. Peter Anderson (Google)

Dr. Felix Hill (DeepMind)


A long-term goal in AI research is to build intelligent systems with 'open-world' cognition. When deployed in the wild, AI systems should generalize to novel concepts and instructions. Such an agent would need to perceive both familiar and unfamiliar concepts in its environment, combine the capabilities of models trained on different modalities, and incrementally acquire new skills to adapt continuously to an evolving world. In this thesis, we examine how complementary multi-modal knowledge can be combined with suitable forms of reasoning to enable novel concept learning. In Part 1, we show that agents can infer unfamiliar concepts in the presence of familiar ones by combining multi-modal knowledge with deductive reasoning. Furthermore, agents can use newly inferred concepts to update their vocabulary of known concepts and incrementally infer additional novel concepts. In Part 2, we study two realistic tasks that require understanding novel concepts. First, we present a benchmark to evaluate an AI system's ability to describe novel objects present in an image. We argue that models that disentangle 'how to recognize an object' from 'how to talk about it' generalize better to novel objects than traditional methods trained on paired image-caption data. Second, we study how embodied agents can combine perception with common-sense knowledge to perform household chores, such as tidying up the house, without explicit human instruction, even in the presence of unseen objects in unseen environments. Finally, in the proposed work, we will show that by combining the complementary knowledge stored in foundation models trained on different domains (vision-only, language-only, and vision-language), agents can perform zero-shot novel instruction following and continuously adapt to the open world by learning new skills incrementally.

