User Interface Agent

User interface (UI) agents use artificial intelligence to automate interactions with graphical user interfaces (GUIs), with a primary focus on improving the accuracy and robustness of the actions they perform across digital environments. Current research emphasizes visual grounding: large vision-language models (VLMs) and multimodal large language models (MLLMs) interpret screenshots directly, rather than relying solely on text-based representations of the GUI. This shift toward visual perception is driven by the need for more reliable and adaptable agents that can handle complex, dynamic, and diverse GUI environments, with implications for fields such as accessibility, automation, and human-computer interaction.

Papers