Agent Evaluation
Agent evaluation assesses the performance and capabilities of autonomous AI agents, particularly those powered by large language models (LLMs). Current research emphasizes robust, scalable evaluation frameworks, including new metrics and benchmarks that account for dynamic environments and multi-turn interactions, often drawing on techniques such as direct preference optimization and population-based comparisons. This work is crucial for building reliable and effective AI agents, with impact ranging from scientific research (e.g., peer review simulation) to practical applications such as mobile UI automation and collaborative robotics. Personalized agent tuning methods and the exploration of agent personalities are also significant areas of focus.
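To make the idea of multi-turn agent benchmarking concrete, the sketch below shows a minimal evaluation harness that runs an agent against a set of tasks and aggregates a task success rate and average episode length. It is an illustrative outline only, not the protocol of any specific benchmark listed here; the `Agent` and `Environment` interfaces, the `evaluate_agent` function, and the reported metrics are assumptions chosen for clarity.

```python
import statistics
from typing import Protocol

class Agent(Protocol):
    # Hypothetical agent interface: maps an observation to an action string.
    def act(self, observation: str) -> str: ...

class Environment(Protocol):
    # Hypothetical environment interface: reset to a task, then step on actions.
    def reset(self, task: str) -> str: ...
    # Returns (next observation, episode done, task succeeded).
    def step(self, action: str) -> tuple[str, bool, bool]: ...

def evaluate_agent(agent: Agent,
                   env: Environment,
                   tasks: list[str],
                   max_turns: int = 10) -> dict[str, float]:
    """Run each task for up to max_turns interaction turns and
    aggregate task success rate and average number of turns used."""
    successes: list[float] = []
    lengths: list[int] = []
    for task in tasks:
        observation = env.reset(task)
        success = False
        turns_used = 0
        for turn in range(1, max_turns + 1):
            action = agent.act(observation)
            observation, done, success = env.step(action)
            turns_used = turn
            if done:
                break
        successes.append(1.0 if success else 0.0)
        lengths.append(turns_used)
    return {
        "success_rate": statistics.mean(successes),
        "avg_turns": statistics.mean(lengths),
    }
```

In practice, frameworks of this kind extend the loop with richer signals, for example per-step progress checks or preference-based comparisons between agents on the same task population, rather than relying on a single end-of-episode success flag.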
Papers
AgentReview: Exploring Peer Review Dynamics with LLM Agents
Yiqiao Jin, Qinlin Zhao, Yiyang Wang, Hao Chen, Kaijie Zhu, Yijia Xiao, Jindong Wang
WebCanvas: Benchmarking Web Agents in Online Environments
Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu