User Interface Agent
User interface (UI) agents use artificial intelligence to automate interactions with graphical user interfaces (GUIs), with current work focused on improving the accuracy and robustness of the actions they perform across desktop, web, and mobile environments. Research emphasizes visual grounding: large vision-language models (VLMs) and multimodal large language models (MLLMs) interpret visual information directly from screenshots, rather than relying solely on text-based representations of the GUI such as HTML or accessibility trees. This shift toward visual perception is driven by the need for agents that remain reliable and adaptable in complex, dynamic, and diverse GUI environments, with applications in accessibility, automation, and human-computer interaction.
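To make the visual-grounding loop concrete, here is a minimal Python sketch of the perceive-ground-act cycle such agents implement. The `ground_instruction` function is a hypothetical stand-in for a call to a grounding model and does not reflect any specific paper's API; `pyautogui` is a real library used here only for screenshot capture and input synthesis.

```python
"""Minimal sketch of a visually grounded UI agent step.

Assumption: `ground_instruction` stands in for a VLM/MLLM call that
maps (screenshot, instruction) to screen coordinates; real systems
would query a GUI-finetuned grounding model here.
"""
import pyautogui  # screenshot capture and input synthesis


def ground_instruction(screenshot, instruction: str) -> tuple[int, int]:
    """Hypothetical grounding call: return the (x, y) pixel location of
    the UI element referenced by `instruction` in `screenshot`.

    A production agent would send the image and text to a VLM/MLLM and
    parse its predicted coordinates; a fixed point is returned here so
    the sketch runs end to end.
    """
    return (100, 200)  # placeholder prediction


def act(instruction: str) -> None:
    """One perceive-ground-act step: capture raw pixels, let the model
    localize the target visually (no HTML or accessibility tree), then
    execute the click at the predicted location."""
    screenshot = pyautogui.screenshot()                 # perceive
    x, y = ground_instruction(screenshot, instruction)  # ground
    pyautogui.click(x, y)                               # act


if __name__ == "__main__":
    act("Click the 'Submit' button")
```

The design point the sketch illustrates is that the agent's only input is the rendered screen, which is what makes visually grounded agents portable across GUIs that expose no structured text representation.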
Papers
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
Falcon-UI: Understanding GUI Before Following User Instructions
Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji