The Qwen team at Alibaba recently introduced two new releases, Mobile-Agent-v3 and GUI-Owl, aimed at long-standing challenges in graphical user interface (GUI) automation.
Graphical user interfaces are ubiquitous on modern computing devices, yet earlier automation methods often relied on complex scripts and hand-written rules and delivered less-than-ideal results. GUI-Owl, a new multimodal agent model, is built on Qwen2.5-VL and further trained on a large amount of GUI interaction data to improve task understanding and execution.
GUI-Owl is designed to handle the diversity and dynamics of real-world GUI environments. It integrates perception, reasoning, planning, and execution in a single unified policy network. This lets it make multi-turn decisions in complex tasks while keeping its reasoning explicit and adapting to changes encountered in practical applications.
To supply high-quality training data, the team developed a self-evolving data production pipeline. The pipeline generates realistic application navigation workflows and validates them through human annotations, so that the generated data is authentic and usable. The team also used several data synthesis strategies to broaden what the model learns, making it more adaptable and flexible during task execution.
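The generate, validate, and retrain cycle described above can be pictured with a toy loop. This is only an illustrative sketch, not the team's pipeline: the names generate_trajectory, is_valid, and self_evolving_rounds are hypothetical, the "model" is reduced to a single skill number, and the validator is a stand-in for whatever checking the real pipeline performs.

```python
import random


def generate_trajectory(rng, skill, steps=3):
    """Hypothetical rollout: each step is correct with probability `skill`."""
    return [rng.random() < skill for _ in range(steps)]


def is_valid(trajectory):
    """Stand-in validator: accept a trajectory only if every step is correct."""
    return all(trajectory)


def self_evolving_rounds(rounds=3, per_round=200, skill=0.5, seed=0):
    """Toy loop: generate data, keep validated samples, 'retrain' the model."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(rounds):
        batch = [generate_trajectory(rng, skill) for _ in range(per_round)]
        accepted = [t for t in batch if is_valid(t)]
        dataset.extend(accepted)
        # Assumption: retraining on accepted data raises skill a little,
        # which in turn improves the next round of generated trajectories.
        accept_rate = len(accepted) / per_round
        skill = min(0.95, skill + 0.05 + 0.1 * accept_rate)
    return dataset, skill


dataset, final_skill = self_evolving_rounds()
print(len(dataset), round(final_skill, 3))
```

The point of the sketch is the feedback edge: only validated trajectories enter the dataset, and the improved model produces the next batch.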
The Mobile-Agent-v3 framework focuses on multi-agent collaboration: it breaks complex tasks into sub-goals and dynamically updates the plan in response to execution feedback. Four specialized agents within the framework (the manager agent, the worker agent, the reflection agent, and the note agent) each play a distinct role, improving the efficiency and success rate of task execution. In testing and evaluation, GUI-Owl and Mobile-Agent-v3 performed strongly on multiple GUI automation benchmarks, particularly in cross-platform task completion.
Together, these tools mark a significant advance for Alibaba in general-purpose GUI automation and are expected to provide stronger technical support for a wider range of application scenarios.
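To make the division of labor concrete, here is a minimal sketch of how four such roles might cooperate in a plan, act, reflect, re-plan loop. It is an assumption-laden illustration, not the Mobile-Agent-v3 implementation: every function name, the string-based "screen", and the flaky toy environment are hypothetical, and in the real framework each role would be backed by a model rather than a few lines of Python.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    goal: str
    plan: list = field(default_factory=list)   # pending sub-goals
    notes: list = field(default_factory=list)  # facts kept by the note agent
    log: list = field(default_factory=list)    # (sub_goal, succeeded) pairs


def manager(state, failed=None):
    """Build the initial plan, or re-queue a failed sub-goal at the front."""
    if failed is not None:
        state.plan.insert(0, failed)
    elif not state.plan and not state.log:
        state.plan = [p.strip() for p in state.goal.split(", then ")]


def worker(sub_goal, screen):
    """Choose an action for the sub-goal (here just a labelled string)."""
    return f"do:{sub_goal}"


def reflector(sub_goal, screen_after):
    """Judge success: did the new screen reflect the sub-goal?"""
    return sub_goal in screen_after


def notetaker(state, sub_goal):
    """Record progress that later steps may need."""
    state.notes.append(f"done: {sub_goal}")


def run(goal, env_step, max_steps=10):
    state = TaskState(goal)
    manager(state)
    screen = ""
    for _ in range(max_steps):
        if not state.plan:
            break
        sub_goal = state.plan.pop(0)
        screen = env_step(worker(sub_goal, screen))
        ok = reflector(sub_goal, screen)
        state.log.append((sub_goal, ok))
        if ok:
            notetaker(state, sub_goal)
        else:
            manager(state, failed=sub_goal)  # feedback updates the plan
    return state


def make_flaky_env():
    """Toy environment: the first attempt at any action fails, the retry works."""
    seen = set()

    def env_step(action):
        if action not in seen:
            seen.add(action)
            return "loading"
        return f"screen after {action}"

    return env_step


state = run("open app, then tap search", make_flaky_env())
print(state.log)
print(state.notes)
```

In this toy run each sub-goal fails once and succeeds on the retry, which shows the reflection result feeding back into the manager's plan.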
Paper: https://arxiv.org/abs/2508.15144
GitHub: https://github.com/X-PLUG/MobileAgent
Key Points:
🌟 GUI-Owl is a multimodal agent model from Alibaba that integrates perception, reasoning, planning, and execution to adapt to complex GUI environments.
🤖 The Mobile-Agent-v3 framework implements multi-agent collaboration, improving task execution efficiency through dynamic plan updates.
📊 Both have performed strongly on GUI automation benchmarks, marking an important step forward for Alibaba in GUI automation.