Egocentric Video Understanding

Egocentric video understanding aims to computationally interpret video recorded from a first-person perspective, recovering the wearer's actions, interactions, and surrounding environment. Current research focuses on robust multimodal models, typically built on transformer architectures that fuse complementary modalities (e.g., RGB, depth, audio, IMU) to improve accuracy and efficiency on tasks such as action recognition, question answering, and scene understanding. These advances matter for assistive robotics, human-computer interaction, and artificial intelligence more broadly, enabling more natural and intuitive interaction between humans and machines.
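
A common design pattern behind such multimodal models is token-level fusion: each modality's features are projected into a shared embedding space, tagged with a learned modality embedding, and processed jointly by a transformer encoder. The sketch below illustrates this pattern for clip-level action recognition; the class name, modality list, and all dimensions are illustrative assumptions, not any specific paper's architecture.

```python
# A minimal sketch of token-level multimodal fusion with a transformer.
# All names, modalities, and dimensions here are hypothetical examples.
import torch
import torch.nn as nn

class MultimodalFusionTransformer(nn.Module):
    """Fuses per-modality token sequences (e.g., RGB, audio, IMU) with a
    shared transformer encoder and classifies the clip-level action."""

    def __init__(self, modality_dims, d_model=256, nhead=8,
                 num_layers=4, num_classes=100):
        super().__init__()
        # One linear projection per modality maps raw features to d_model.
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, d_model)
            for name, dim in modality_dims.items()
        })
        # Learned embeddings that mark which modality a token came from.
        self.modality_emb = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(1, 1, d_model))
            for name in modality_dims
        })
        # A [CLS]-style token aggregates clip-level information.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, inputs):
        # inputs: dict of modality name -> (batch, seq_len, feat_dim)
        batch = next(iter(inputs.values())).shape[0]
        tokens = [self.cls_token.expand(batch, -1, -1)]
        for name, feats in inputs.items():
            tokens.append(self.proj[name](feats) + self.modality_emb[name])
        x = torch.cat(tokens, dim=1)   # concatenate along the token axis
        x = self.encoder(x)            # self-attention across all modalities
        return self.head(x[:, 0])      # classify from the [CLS] token

# Usage: fuse hypothetical RGB, audio, and IMU clip features.
model = MultimodalFusionTransformer({"rgb": 768, "audio": 128, "imu": 32})
feats = {
    "rgb": torch.randn(2, 16, 768),    # 16 frame tokens per clip
    "audio": torch.randn(2, 8, 128),   # 8 audio tokens
    "imu": torch.randn(2, 32, 32),     # 32 IMU tokens
}
logits = model(feats)                  # shape: (2, num_classes)
```

Concatenating tokens across modalities lets self-attention model cross-modal dependencies directly, at the cost of attention that scales quadratically with the combined sequence length.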

Papers