Audio Driven
Audio-driven research focuses on understanding and generating audio signals, often in conjunction with other modalities such as text and video. Current efforts concentrate on developing robust models for tasks such as audio-visual representation learning, talking-head synthesis (using diffusion models and autoencoders), and audio-to-text/text-to-audio generation (leveraging large language models and neural codecs). These advances have significant implications for fields including filmmaking, virtual reality, assistive technologies, and multimedia forensics, enabling more realistic and interactive audio-visual experiences and improving the analysis of audio-visual data.
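As a concrete illustration of one of these directions, the sketch below shows a minimal contrastive audio-visual representation learning setup in PyTorch: paired audio and video clips are embedded by separate encoders and aligned with a symmetric InfoNCE-style loss. The encoders, dimensions, and dummy data are hypothetical placeholders for illustration only and do not reproduce the method of any paper listed here.

# Minimal sketch of contrastive audio-visual representation learning
# (CLIP-style alignment between audio and video clip embeddings).
# All architectures and shapes below are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Toy encoder: maps a log-mel spectrogram (B, 1, 64, T) to an embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.proj(self.conv(x).flatten(1))

class VideoEncoder(nn.Module):
    """Toy encoder: averages per-frame features (B, T, D_in) and projects them."""
    def __init__(self, in_dim=512, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        return self.proj(x.mean(dim=1))

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE: matching audio/video pairs attract, mismatched pairs repel."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    audio_enc, video_enc = AudioEncoder(), VideoEncoder()
    spectrograms = torch.randn(8, 1, 64, 100)     # dummy audio batch
    frame_feats = torch.randn(8, 16, 512)         # dummy precomputed video features
    loss = contrastive_loss(audio_enc(spectrograms), video_enc(frame_feats))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")

The same pairing idea underlies many audio-visual pipelines: once the two modalities share an embedding space, the representations can be reused for retrieval, interaction recognition, or as conditioning signals for generation.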
Papers
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu
Team AcieLee: Technical Report for EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023
Yuqi Li, Yizhi Luo, Xiaoshuai Hao, Chuanguang Yang, Zhulin An, Dantong Song, Wei Yi
Towards Interpretability in Audio and Visual Affective Machine Learning: A Review
David S. Johnson, Olya Hakobyan, Hanna Drimalla