Recently, with the rapid development of large language models such as OpenAI's o series, Claude 3.5 Sonnet, and DeepSeek-R1, the knowledge and reasoning capabilities of artificial intelligence have attracted widespread attention. However, many users find that these models sometimes fail to fully follow input instructions, producing outputs that are rich in content but do not meet specific format or content requirements. To study and evaluate instruction following in depth, the Meituan M17 team has released a new benchmark: Meeseeks.
Meeseeks takes an innovative perspective on evaluating how well large models follow instructions. Unlike traditional evaluations, which mainly score answer accuracy, Meeseeks examines whether a model strictly follows the user's instructions. The framework decomposes instruction following into three levels, ensuring both depth and breadth in the assessment: understanding the core intent of the task, implementing specific types of constraints, and following fine-grained rules.
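To make the constraint level concrete, here is a minimal rule-based checker in the spirit described above. The constraint names (`max_words`, `required_keywords`, `bullet_count`) are hypothetical illustrations, not the actual Meeseeks rubric:

```python
import re

def check_constraints(response: str, constraints: dict) -> dict:
    """Check a model response against fine-grained, rule-checkable
    constraints. All constraint names here are illustrative examples,
    not the real Meeseeks evaluation rules."""
    results = {}
    if "max_words" in constraints:
        # Constraint: response must not exceed a word budget.
        results["max_words"] = len(response.split()) <= constraints["max_words"]
    if "required_keywords" in constraints:
        # Constraint: every required keyword must appear verbatim.
        results["required_keywords"] = all(
            kw in response for kw in constraints["required_keywords"]
        )
    if "bullet_count" in constraints:
        # Constraint: exact number of "- " bullet lines.
        bullets = re.findall(r"^- ", response, flags=re.MULTILINE)
        results["bullet_count"] = len(bullets) == constraints["bullet_count"]
    return results

response = "- apples\n- bananas\n- cherries"
print(check_constraints(
    response,
    {"max_words": 10, "required_keywords": ["bananas"], "bullet_count": 3},
))
```

A per-constraint result dict like this lets an evaluator report which specific rule a model violated, rather than a single pass/fail score.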
In recent evaluations on Meeseeks, the reasoning model o3-mini (high) took first place by a clear margin, o3-mini (medium) came second, and Claude 3.7 Sonnet placed third. In contrast, DeepSeek-R1 and GPT-4o performed less well, ranking seventh and eighth respectively.
What sets Meeseeks apart is its broad evaluation coverage and deliberately difficult data. It also introduces a "multi-turn correction" mode, which lets a model revise its answer when the initial response fails to meet the requirements. Under this mode, the instruction-following accuracy of all participating models improved significantly after multiple rounds of feedback.
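The multi-turn correction mode described above can be sketched as a simple feedback loop: run the model, check its response, and if any constraint fails, feed the failures back and let it retry. This is a minimal sketch under assumed interfaces (`model_fn`, `checker` are hypothetical), not the actual Meeseeks harness:

```python
def multi_turn_eval(model_fn, instruction, checker, max_rounds=3):
    """Hypothetical multi-turn correction loop. `model_fn` maps a chat
    history to a response string; `checker` maps a response to a dict
    of {constraint_name: passed}. Names are assumptions for illustration."""
    history = [{"role": "user", "content": instruction}]
    response = ""
    for round_idx in range(1, max_rounds + 1):
        response = model_fn(history)
        history.append({"role": "assistant", "content": response})
        failed = [name for name, ok in checker(response).items() if not ok]
        if not failed:
            return response, round_idx  # all constraints satisfied
        # Feed the specific violations back as the next user turn.
        history.append({
            "role": "user",
            "content": "Your answer violated: " + ", ".join(failed)
                       + ". Please revise.",
        })
    return response, max_rounds  # best effort after all rounds

# Toy model: fails the word limit on the first call, then complies.
class ToyModel:
    def __init__(self):
        self.calls = 0
    def __call__(self, history):
        self.calls += 1
        return "short" if self.calls > 1 else "this answer is far too long"

checker = lambda r: {"max_words": len(r.split()) <= 2}
resp, rounds = multi_turn_eval(ToyModel(), "Answer in two words.", checker)
print(resp, rounds)  # → short 2
```

The returned round count makes it easy to measure how many feedback turns each model needs, which is the quantity the multi-turn mode is designed to surface.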
Through Meeseeks, the research team not only revealed differences in instruction-following ability across models, but also provided valuable reference points for future research on large models.
ModelScope Community: https://www.modelscope.cn/datasets/ADoubLEN/Meeseeks
GitHub: https://github.com/ADoublLEN/Meeseeks
Hugging Face: https://huggingface.co/datasets/meituan/Meeseeks