Meituan's LongCat team has officially released VitaBench, an agent evaluation benchmark targeting multi-turn interactive tasks, particularly applications in complex everyday-life scenarios. VitaBench provides important infrastructure for developing AI agents aimed at real-world use.
VitaBench focuses on high-frequency, real-world scenarios such as food delivery, in-store dining, and travel, and builds an interactive evaluation environment around 66 tools. Its tasks cover complex operations such as ticket purchasing and restaurant reservations, requiring agents to demonstrate deep reasoning, tool calling, and user interaction throughout task execution.
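To make the setup concrete, the sketch below shows the kind of agent-environment loop such an interactive, tool-calling benchmark implies. It is a minimal, hypothetical illustration: the names (ToolCall, run_episode, choose_action, step, task_succeeded) are assumptions for exposition and do not reproduce VitaBench's actual API.

```python
# Hypothetical sketch of an agent-environment interaction loop for a
# tool-using benchmark like VitaBench. All names are illustrative
# assumptions, not the benchmark's real interface.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str          # e.g. "search_restaurants", "book_ticket"
    arguments: dict    # JSON-style arguments for the tool

def run_episode(agent, env, max_turns: int = 30) -> bool:
    """Drive one task episode: the agent alternates between calling
    tools and replying to the simulated user until the task ends."""
    observation = env.reset()                      # initial user request and scenario state
    for _ in range(max_turns):
        action = agent.choose_action(observation)  # a ToolCall or a user-facing message
        observation, done = env.step(action)       # execute the tool or forward the reply
        if done:
            break
    return env.task_succeeded()                    # checked against the task's evaluation criteria
```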
Although leading reasoning models have made progress, the LongCat team's evaluation shows that agent success rates on complex cross-scenario tasks remain below 30%, revealing a significant gap between current technology and practical application needs. VitaBench was developed to address this issue and to bridge the gap between existing agent benchmarks and real-life application scenarios.
The benchmark is designed around an in-depth analysis of three dimensions: reasoning complexity, tool complexity, and interaction complexity. By quantifying these dimensions, the team systematically measures agent performance in realistic scenarios. Reasoning complexity is evaluated mainly by how much information must be integrated, the size of the observation space, and the number of reasoning steps required; tool complexity considers dependencies among tools and the length of the call chain; and interaction complexity focuses on the agent's ability to respond in multi-turn dialogues.
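One way to picture these dimensions is as a per-task difficulty profile. The sketch below is a hypothetical representation only: the field names and the example values are assumptions made for illustration, not VitaBench's actual scoring scheme.

```python
# Hypothetical profile of a task along the three complexity dimensions
# described above; field names and values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TaskComplexity:
    # Reasoning complexity: information to integrate, observation space size,
    # and number of reasoning steps required.
    info_sources: int
    observation_space: int
    reasoning_steps: int
    # Tool complexity: dependencies among tools and call-chain length.
    tool_dependencies: int
    call_chain_length: int
    # Interaction complexity: user turns the agent must handle.
    dialogue_turns: int

# Example: a cross-scenario task combining a restaurant reservation with travel booking.
example_task = TaskComplexity(
    info_sources=4, observation_space=20, reasoning_steps=6,
    tool_dependencies=3, call_chain_length=5, dialogue_turns=8,
)
```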
VitaBench is constructed in two stages: tool definitions are designed first, followed by task creation and the establishment of evaluation criteria. This process ensures task diversity and complexity while avoiding the limitations of traditional document-based approaches, allowing agents to reason and make decisions independently without relying on redundant rules.
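As an illustration of the first stage, a tool definition in this kind of environment typically specifies a name, a description, and a parameter schema. The example below follows a common function-calling schema style and is a hypothetical sketch; it does not reproduce any of VitaBench's actual 66 tool specifications.

```python
# Hypothetical tool definition in a generic function-calling schema style;
# the tool name and fields are illustrative assumptions only.
reserve_restaurant_tool = {
    "name": "reserve_restaurant",
    "description": "Book a table at a restaurant for a given time and party size.",
    "parameters": {
        "type": "object",
        "properties": {
            "restaurant_id": {"type": "string", "description": "ID returned by a search tool"},
            "datetime": {"type": "string", "description": "Reservation time, e.g. 2025-10-01 19:00"},
            "party_size": {"type": "integer", "minimum": 1},
        },
        "required": ["restaurant_id", "datetime", "party_size"],
    },
}
```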
VitaBench is now fully open source; researchers and developers can access the related resources through its official website and GitHub repository. Its release marks an important milestone in agent evaluation and is expected to promote the further application and development of agent technology in real-life scenarios.
Project homepage: https://vitabench.github.io
Paper link: https://arxiv.org/abs/2509.26490
Code repository: https://github.com/meituan-longcat/vitabench
Dataset: https://huggingface.co/datasets/meituan-longcat/VitaBench
Leaderboard: https://vitabench.github.io/#Leaderboard