Multimodal Content
Multimodal content research focuses on understanding and generating content that combines multiple data modalities, such as text, images, audio, and video, with the aim of improving AI systems' ability to process an increasingly complex information landscape. Current work emphasizes building robust multimodal models, often transformer-based and trained with techniques such as contrastive learning and retrieval-augmented generation (RAG), to tackle tasks like misinformation detection, sentiment analysis, and cross-modal understanding. The field matters because of its potential applications, including better search engines, more effective social media moderation, and the creation of more engaging and informative multimedia content.
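To make the contrastive-learning idea mentioned above concrete, here is a minimal sketch of a CLIP-style image-text alignment loss. It is illustrative only and not taken from any of the papers listed below; the function name, temperature value, and embedding dimensions are assumptions, and the random tensors stand in for outputs of real image and text encoders.

```python
# Minimal sketch of a symmetric contrastive (InfoNCE) loss for image-text
# alignment. In practice the embeddings would come from transformer-based
# image and text encoders; here they are placeholders.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matched image-text pairs together, push mismatched pairs apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))           # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy usage: a batch of 8 placeholder 512-dimensional embeddings.
    torch.manual_seed(0)
    image_emb = torch.randn(8, 512)
    text_emb = torch.randn(8, 512)
    print(contrastive_loss(image_emb, text_emb))
```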
Papers
MMInA: Benchmarking Multihop Multimodal Internet Agents
Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration
Chenwei Lin, Hanjia Lyu, Jiebo Luo, Xian Xu