Salesforce has developed a breakthrough technology called CoAct-1 in collaboration with researchers from the University of Southern California. This technology aims to significantly enhance the ability of AI agents to perform complex tasks on computers by combining the advantages of coding and graphical user interface (GUI) operations. This hybrid approach is designed to overcome the fragility of traditional GUI agents, paving the way for more powerful and scalable automation.


Pain Points of Traditional AI Agents: Long Tasks and Misclicks

Current computer-use AI agents typically rely on vision-language models (VLMs) to perceive the screen and simulate mouse and keyboard operations. While these "click-based" agents can perform a variety of tasks, they often struggle with applications that have dense menus and complex workflows, such as office productivity suites. Researchers point out that in these scenarios a single misclick or misread UI element can cause an entire task to fail.

To address this challenge, researchers have tried augmenting GUI agents with advanced planners, but this approach still falls short for operations that could be completed more directly and reliably with just a few lines of code.


CoAct-1: A Hybrid System Based on Multi-Agent Collaboration

To overcome these limitations, the CoAct-1 system was developed. Its core idea is to "combine the intuitive benefits of GUI operations with the precision, reliability, and efficiency of direct system interaction through code." The system consists of a team of three specialized agents working together to complete tasks:

  • Orchestrator: As the central planner, it is responsible for breaking down the user's overall goal into subtasks and assigning them to the most suitable agent.

  • Programmer: Responsible for writing and executing Python or Bash scripts, handling backend operations such as file management or data processing.

  • GUI Operator: Based on VLM, it specializes in front-end tasks that require clicking buttons or navigating interfaces.

This dynamic delegation mechanism allows CoAct-1 to strategically bypass inefficient GUI operations in favor of more robust and efficient code execution, while still using visual interaction where it is necessary. The entire workflow is iterative: each agent reports back to the Orchestrator after completing a subtask, and the Orchestrator then determines the next step.
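The delegation loop described above can be illustrated with a minimal Python sketch. It is not CoAct-1's actual implementation: the `Subtask`, `Report`, and `orchestrate` names are hypothetical, the two agents are stand-in functions rather than real models, and the plan is fixed up front for simplicity, whereas the article describes the Orchestrator deciding each next step dynamically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    description: str
    agent: str  # hypothetical routing label: "programmer" or "gui_operator"


@dataclass
class Report:
    subtask: Subtask
    success: bool
    summary: str


def programmer(subtask: Subtask) -> Report:
    # Stand-in for an agent that would write and run a Python or Bash script.
    return Report(subtask, True, f"script handled: {subtask.description}")


def gui_operator(subtask: Subtask) -> Report:
    # Stand-in for a VLM-based agent that would click and type on screen.
    return Report(subtask, True, f"GUI handled: {subtask.description}")


AGENTS: Dict[str, Callable[[Subtask], Report]] = {
    "programmer": programmer,
    "gui_operator": gui_operator,
}


def orchestrate(plan: List[Subtask], max_steps: int = 15) -> List[Report]:
    """Dispatch subtasks one at a time and collect each agent's report."""
    history: List[Report] = []
    for subtask in plan[:max_steps]:
        report = AGENTS[subtask.agent](subtask)
        history.append(report)
        if not report.success:
            # A real orchestrator would replan here; this sketch just stops.
            break
    return history


plan = [
    Subtask("rename all CSV files in a reports folder", "programmer"),
    Subtask("apply a chart style in the spreadsheet app", "gui_operator"),
]
for report in orchestrate(plan):
    print(report.subtask.agent, "->", report.summary)
```

The point of the sketch is the routing decision: file handling goes to the code path, while the step that only exists as a button in an interface goes to the GUI path.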


Performance Leap: Faster and More Efficient

Researchers tested CoAct-1 on the OSWorld benchmark, which includes 369 real-world tasks across browsers, IDEs, and office applications. The results showed that CoAct-1 achieved a 60.76% success rate, setting a new record.

CoAct-1 showed especially large gains on operating system-level tasks and multi-application workflows. More importantly, the system is also markedly more efficient, needing an average of only 10.15 steps to complete a task, compared with the 15.22 steps needed by other leading pure GUI agents. Researchers note that fewer steps not only speed up task completion but also reduce the chance of errors, resulting in more efficient and reliable automation.
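To put the step counts in perspective, the short calculation below works out the relative reduction, which comes to roughly one third. The second half is only an illustrative assumption and not a result from the study: it supposes every step succeeds independently with the same probability `p`, to show why a shorter chain of actions tends to fail less often.

```python
coact_steps = 10.15      # average steps per task reported for CoAct-1
gui_only_steps = 15.22   # average steps reported for leading pure GUI agents

# Relative reduction in the average number of steps.
relative_reduction = (gui_only_steps - coact_steps) / gui_only_steps
print(f"Relative reduction in steps: {relative_reduction:.1%}")


def chain_success(p: float, steps: float) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return p ** steps


p = 0.97  # assumed per-step reliability, chosen for illustration only
print(f"Shorter chain: {chain_success(p, coact_steps):.2f}")
print(f"Longer chain:  {chain_success(p, gui_only_steps):.2f}")
```

Under this simplified model the shorter chain always has the higher chance of finishing without an error, whatever value of `p` below 1 is chosen.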

From Lab to Enterprise: Potential Applications and Challenges

This technology has tremendous potential for enterprise applications. Ran Xu, Director of AI Research at Salesforce, pointed out that customer support, sales prospecting, automated bookkeeping, and marketing campaign management are perfect use cases. In these scenarios, companies need to handle a variety of tools with and without APIs, and CoAct-1 can flexibly utilize code and screens to provide comprehensive automation solutions.

However, moving CoAct-1 from the lab to enterprise environments also presents challenges, including dealing with legacy software, ensuring security, and the need for human supervision. Xu emphasized that agents need to be trained in sandbox environments to improve their adaptability, and that strong access controls and security safeguards must be established to prevent malicious code execution. Ultimately, the "human-in-the-loop" model will remain key to ensuring the safe and reliable operation of such agents for the foreseeable future.
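One way to picture a human-in-the-loop safeguard is an approval gate placed in front of any agent-generated script. The sketch below is a hypothetical illustration, not something described by Salesforce: `BLOCKED_TOKENS`, `passes_policy`, and `run_with_approval` are invented names, the substring denylist is deliberately naive, and a subprocess with a timeout is not a real sandbox.

```python
import subprocess
import sys
from typing import Callable

# Illustrative denylist only; a real deployment would rely on proper
# sandboxing and access control rather than substring checks.
BLOCKED_TOKENS = ("rm -rf", "sudo", "curl", "wget")


def passes_policy(script: str) -> bool:
    """Reject scripts that contain any blocked token."""
    return not any(token in script for token in BLOCKED_TOKENS)


def run_with_approval(
    script: str,
    approve: Callable[[str], bool],
    timeout: int = 10,
) -> str:
    """Run a Python script only if it passes policy and a reviewer approves."""
    if not passes_policy(script):
        return "rejected: policy"
    if not approve(script):
        return "rejected: reviewer"
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()


# The reviewer is modeled as a callback; here it approves everything.
print(run_with_approval("print(2 + 2)", approve=lambda s: True))
print(run_with_approval("import os; os.system('sudo ls')", approve=lambda s: True))
```

In practice the `approve` callback would present the script to a person and wait for a decision, which is where the human supervision Xu describes would sit.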