Audio Description

Audio description (AD) research focuses on automatically generating textual or spoken descriptions of the visual content of videos, primarily to make that content accessible to visually impaired viewers. Current work combines large language models (LLMs) and vision-language models (VLMs) with architectures such as transformers and convolutional neural networks to generate contextually rich, character-aware descriptions from video data. The work is significant because it addresses a critical need for inclusive media access: advances in AD have the potential to improve quality of life for visually impaired audiences while also advancing multimodal learning and natural language generation.

Papers