Recently, with the rapid development of large language models such as OpenAI's o series, Claude 3.5 Sonnet, and DeepSeek-R1, the knowledge and reasoning capabilities of artificial intelligence have attracted widespread attention. However, many users find that these models sometimes fail to fully follow input instructions, producing outputs that are rich in content but do not meet specific format or content requirements. To study and evaluate instruction following in depth, the Meituan M17 team has released a new benchmark: Meeseeks.
Meeseeks takes an innovative perspective on evaluating how well large models follow instructions. Unlike traditional evaluations, which mainly score answer accuracy, Meeseeks examines whether a model strictly follows the user's instructions. The framework decomposes instruction following into three levels, ensuring both depth and breadth in the assessment: understanding the core intent of the task, implementing specific types of constraints, and following fine-grained rules.
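To make the constraint level concrete, here is a minimal rule-based checker in the spirit described above. The constraint names (`max_words`, `required_keywords`, `bullet_count`) are hypothetical illustrations, not the actual Meeseeks rubric:

```python
import re

def check_constraints(response: str, constraints: dict) -> dict:
    """Check a model response against fine-grained, rule-checkable
    constraints. All constraint names here are illustrative examples,
    not the real Meeseeks evaluation rules."""
    results = {}
    if "max_words" in constraints:
        # Constraint: response must not exceed a word budget.
        results["max_words"] = len(response.split()) <= constraints["max_words"]
    if "required_keywords" in constraints:
        # Constraint: every required keyword must appear verbatim.
        results["required_keywords"] = all(
            kw in response for kw in constraints["required_keywords"]
        )
    if "bullet_count" in constraints:
        # Constraint: exact number of "- " bullet lines.
        bullets = re.findall(r"^- ", response, flags=re.MULTILINE)
        results["bullet_count"] = len(bullets) == constraints["bullet_count"]
    return results

response = "- apples\n- bananas\n- cherries"
print(check_constraints(
    response,
    {"max_words": 10, "required_keywords": ["bananas"], "bullet_count": 3},
))
```

A per-constraint result dict like this lets an evaluator report which specific rule a model violated, rather than a single pass/fail score.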
In recent evaluations on Meeseeks, the reasoning model o3-mini (high) took first place by a clear margin, o3-mini (medium) came second, and Claude 3.7 Sonnet placed third. In contrast, DeepSeek-R1 and GPT-4o performed less well, ranking seventh and eighth respectively.
What sets Meeseeks apart is its broad evaluation coverage and deliberately difficult data. It also introduces a "multi-turn correction" mode, which lets a model revise its answer when the initial response fails to meet the requirements. Under this mode, the instruction-following accuracy of all participating models improved significantly after multiple rounds of feedback.
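The multi-turn correction mode described above can be sketched as a simple feedback loop: run the model, check its response, and if any constraint fails, feed the failures back and let it retry. This is a minimal sketch under assumed interfaces (`model_fn`, `checker` are hypothetical), not the actual Meeseeks harness:

```python
def multi_turn_eval(model_fn, instruction, checker, max_rounds=3):
    """Hypothetical multi-turn correction loop. `model_fn` maps a chat
    history to a response string; `checker` maps a response to a dict
    of {constraint_name: passed}. Names are assumptions for illustration."""
    history = [{"role": "user", "content": instruction}]
    response = ""
    for round_idx in range(1, max_rounds + 1):
        response = model_fn(history)
        history.append({"role": "assistant", "content": response})
        failed = [name for name, ok in checker(response).items() if not ok]
        if not failed:
            return response, round_idx  # all constraints satisfied
        # Feed the specific violations back as the next user turn.
        history.append({
            "role": "user",
            "content": "Your answer violated: " + ", ".join(failed)
                       + ". Please revise.",
        })
    return response, max_rounds  # best effort after all rounds

# Toy model: fails the word limit on the first call, then complies.
class ToyModel:
    def __init__(self):
        self.calls = 0
    def __call__(self, history):
        self.calls += 1
        return "short" if self.calls > 1 else "this answer is far too long"

checker = lambda r: {"max_words": len(r.split()) <= 2}
resp, rounds = multi_turn_eval(ToyModel(), "Answer in two words.", checker)
print(resp, rounds)  # → short 2
```

The returned round count makes it easy to measure how many feedback turns each model needs, which is the quantity the multi-turn mode is designed to surface.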
Through Meeseeks, the research team not only revealed differences in instruction-following ability across models, but also provided valuable reference points for future research on large models.
ModelScope Community: https://www.modelscope.cn/datasets/ADoubLEN/Meeseeks
GitHub: https://github.com/ADoublLEN/Meeseeks
Hugging Face: https://huggingface.co/datasets/meituan/Meeseeks