
PhD Defense by Unaiza Ahsan


Title: Leveraging Mid-Level Representations for Complex Activity Recognition

Unaiza Ahsan
Computer Science Ph.D. Student

School of Interactive Computing
College of Computing
Georgia Institute of Technology

 

Date: Tuesday, Nov 27, 2018
Time: 10:00 AM to 12:00 PM (EST)
Location: College of Computing Building (CCB) 345

Committee:

---------------

Dr. Irfan Essa (Advisor), School of Interactive Computing, Georgia Institute of Technology

Dr. James Hays, School of Interactive Computing, Georgia Institute of Technology
Dr. Devi Parikh, School of Interactive Computing, Georgia Institute of Technology
Dr. Munmun De Choudhury, School of Interactive Computing, Georgia Institute of Technology

Dr. Zsolt Kira, School of Interactive Computing, Georgia Institute of Technology
Dr. Chen Sun, Google

 

Summary:

---------------

Dynamic scene understanding requires learning representations of the components of a scene, including objects, environments, actions, and events. Complex activity recognition from images and videos requires annotating large datasets with action labels, which is a tedious and expensive task. Thus, there is a need for a mid-level, or intermediate, feature representation that does not require millions of labels yet generalizes to semantic-level recognition of activities in visual data. This thesis makes three contributions in this regard.

First, we propose an event concept-based intermediate representation that learns concepts from the Web and uses this representation to identify events even from a single labeled example. To demonstrate the strength of the proposed approaches, we contribute two diverse social event datasets to the community. We then present a use case of event concepts as a mid-level representation that generalizes to sentiment recognition in diverse social event images.
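To illustrate the idea of recognizing an event from a single labeled example, the following is a minimal sketch (not the thesis's actual pipeline): images are assumed to be already represented as vectors of concept-detector scores, and a query is assigned the label of the most similar one-shot exemplar by cosine similarity. The concept names and scores are entirely hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify_one_shot(query, exemplars):
    # exemplars maps each event label to the concept-score vector of its
    # single labeled example; pick the label with the most similar exemplar.
    return max(exemplars, key=lambda label: cosine(query, exemplars[label]))

# Toy concept-score vectors, e.g. scores for ["cake", "balloons", "jersey", "ball"].
exemplars = {
    "birthday": np.array([0.9, 0.8, 0.05, 0.1]),
    "soccer_match": np.array([0.05, 0.1, 0.9, 0.95]),
}
query = np.array([0.85, 0.7, 0.1, 0.2])
print(classify_one_shot(query, exemplars))  # → birthday
```

Because the concept scores are already semantic (each dimension means something like "contains cake"), even a single exemplar per event can anchor a useful decision boundary.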

Second, we propose to train Generative Adversarial Networks (GANs) on video frames (which requires no labels), use the trained GAN discriminator as an intermediate representation, and fine-tune it on a smaller labeled video activity dataset to recognize actions in videos. This unsupervised pre-training step avoids manual feature engineering, video frame encoding, and searching for the best video frame sampling technique.
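The discriminator-reuse step can be sketched as follows. This is a hypothetical PyTorch illustration, not the thesis's architecture: a tiny convolutional backbone stands in for the GAN discriminator (which, during pre-training, would be trained to separate real frames from generated ones), and its real/fake head is swapped for an action-classification head that is fine-tuned on labeled data. The layer sizes, input resolution, and the 10 action classes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical discriminator backbone. In the GAN pre-training stage it would
# learn to tell real video frames from generated ones, with no action labels.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 4, stride=2, padding=1),   # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Flatten(),
)

feat_dim = 32 * 16 * 16
real_fake_head = nn.Linear(feat_dim, 1)  # used only during GAN training

# Fine-tuning stage: replace the real/fake head with an action classifier
# and train on the smaller labeled dataset (10 hypothetical action classes).
action_head = nn.Linear(feat_dim, 10)
frames = torch.randn(4, 3, 64, 64)       # a batch of 4 video frames
logits = action_head(backbone(frames))
print(logits.shape)  # torch.Size([4, 10])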

Our third contribution is a self-supervised learning approach for videos that exploits both spatial and temporal coherency to learn feature representations from video data without any supervision. We demonstrate the transfer learning capability of this model on smaller labeled datasets. We present a comprehensive experimental analysis of the self-supervised model to provide insights into the unsupervised pre-training paradigm and how it helps with activity recognition on target datasets that the model has never seen during training.
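One common way temporal coherency yields free supervision is a frame-order pretext task: sample a short run of frames, sometimes shuffle it, and have the model predict whether the order is intact. The sketch below generates such (sample, label) pairs; it is a generic illustration of the paradigm under stated assumptions, not the thesis's specific pretext task.

```python
import random

def make_order_sample(video, rng):
    """Pretext-task sample (hypothetical sketch): take three consecutive
    frames and return them either in temporal order (label 1) or with the
    first two swapped (label 0). No human annotation is needed."""
    i = rng.randrange(len(video) - 2)
    triple = [video[i], video[i + 1], video[i + 2]]
    if rng.random() < 0.5:
        return triple, 1                              # correct temporal order
    return [triple[1], triple[0], triple[2]], 0       # broken order

rng = random.Random(0)
video = list(range(8))  # stand-in for a video of 8 frames
pairs = [make_order_sample(video, rng) for _ in range(4)]
print(pairs)
```

A network trained to predict these labels must pick up on motion and appearance continuity, and the resulting features can then be transferred to a labeled action-recognition dataset.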

 

Status

  • Workflow Status: Published
  • Created By: Tatianna Richardson
  • Created: 11/20/2018
  • Modified By: Tatianna Richardson
  • Modified: 11/20/2018
