MLLM Agent

MLLM (multimodal large language model) agents are AI systems that combine large language models with multimodal capabilities, such as image processing, to carry out complex tasks, particularly those that involve interacting with real-world environments like mobile devices and graphical user interfaces (GUIs). Current research focuses on improving agent navigation and decision-making through multi-agent architectures and enhanced cognitive abilities, addressing challenges such as long sequences and error correction. The work matters because it extends AI's ability to understand and act on dynamic, real-world information, with applications ranging from automated assistance to stronger security for complex AI systems.

Papers

December 5, 2024

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu
Text to Video Generation Multi Agent Framework LLM Based Multi Agent MLLM Agent

November 3, 2024

Integration of Large Vision Language Models for Efficient Post-disaster Damage Assessment and Reporting
Zhaohui Chen, Elyas Asadi Shamsabadi, Sheng Jiang, Luming Shen, Daniel Dias-da-Costa
Large Vision Language Model Disaster Response Bethesda Report MLLM Agent

June 3, 2024

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
Multi Agent Mobile Device Single Agent Multi Agent Collaboration Mobile Agent Efficient Navigation MLLM Agent

April 8, 2024

Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security
Yihe Fan, Yuxin Cao, Ziyu Zhao, Ziyao Liu, Shaofeng Li
Large Language Model Timely Survey Multimodal Large Language Model Threat Model Input Image MLLM Security MLLM Agent

February 20, 2024

The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative
Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Yu Kong, Tianlong Chen, Huan Liu
Artificial General Intelligence MLLM Training MLLM Attention MLLM Security MLLM Agent

February 19, 2024

CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
Xinbei Ma, Zhuosheng Zhang, Hai Zhao
Action Prediction LLM Based Multi Agent V Coco Sg Graphical User Interface Automation Autonomous Language Agent MLLM Agent

January 19, 2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang
Multimodal Large Language Model Comprehensive Benchmark Image Sequence Visual Language Task MLLM Agent
