Text Modality
Text modality research studies how text can be combined with other data modalities (e.g., images, audio, video) to improve the performance and capabilities of AI models. Current work focuses on multimodal models built on transformer and diffusion architectures, often using techniques such as prompt tuning and meta-learning to improve controllability and generalization. This research enables AI systems that understand and generate information across data types, with applications ranging from medical diagnosis to more realistic virtual environments.
Papers
RespLLM: Unifying Audio and Text with Multimodal LLMs for Generalized Respiratory Health Prediction
Yuwei Zhang, Tong Xia, Aaqib Saeed, Cecilia Mascolo
Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang
Generalizable Prompt Tuning for Vision-Language Models
Qian Zhang
Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation
Sen Fang, Yalin Feng, Sizhou Chen, Xiaofeng Zhang, Teik Toe Teoh
Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
Grant Wardle, Teo Susnjak
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
Zixin Guo, Jian Zhang
Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs
Yang Yuhang, Peng Yizhou, Eng Siong Chng, Xionghu Zhong
ImPoster: Text and Frequency Guidance for Subject Driven Action Personalization using Diffusion Models
Divya Kothandaraman, Kuldeep Kulkarni, Sumit Shekhar, Balaji Vasan Srinivasan, Dinesh Manocha