Recently, GitTaskBench, jointly developed by multiple renowned academic institutions including the Chinese Academy of Sciences, Peking University, and the Hong Kong University of Science and Technology, has been officially released, aiming to set a practical deployment standard for code agents.

Existing evaluation systems often focus on code generation and closed-ended questions, which fail to reflect the many challenges developers face in real work, such as environment configuration, dependency management, and integrating resources across repositories. GitTaskBench therefore goes beyond code generation to bring the entire development process into scope, achieving for the first time a full-cycle evaluation spanning repository understanding, environment configuration, incremental development, and project-level delivery.
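To make the full-cycle framing concrete, here is a minimal sketch of what one end-to-end evaluation run could look like. This is not GitTaskBench's actual API: the `run_agent` and `evaluate` helpers and the cloned repository are stand-ins invented for illustration.

```python
# Hypothetical sketch of a full-cycle evaluation run (not GitTaskBench's real API):
# the harness hands the agent a real repository and a task, then scores the artifact.
import pathlib
import subprocess
import tempfile

def run_agent(repo_dir: pathlib.Path, instruction: str) -> pathlib.Path:
    """Placeholder: the agent under test reads the repo, sets up the
    environment, writes code, and returns the path of its output artifact."""
    out = repo_dir / "output.txt"
    out.write_text("agent-produced result")
    return out

def evaluate(artifact: pathlib.Path) -> bool:
    """Placeholder task-specific check (e.g., format or content validation)."""
    return artifact.exists() and artifact.stat().st_size > 0

with tempfile.TemporaryDirectory() as tmp:
    repo_dir = pathlib.Path(tmp) / "repo"
    # Stage 1: repository understanding starts from a real GitHub repo.
    subprocess.run(["git", "clone", "--depth=1",
                    "https://github.com/octocat/Hello-World", str(repo_dir)],
                   check=True)
    # Stages 2-4: environment setup, incremental development, and delivery
    # are all delegated to the agent under test.
    artifact = run_agent(repo_dir, "Process the repo and produce output.txt")
    print("task passed:", evaluate(artifact))
```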


At its core, the benchmark assesses the economic benefit of each "framework × model" combination, offering insights for academia and industry alike and practical guidance for entrepreneurs. The open-source release covers 7 modalities, 7 domains, 24 sub-domains, and 54 real tasks, built on real GitHub repositories. Each task comes with detailed natural-language instructions, specified input-output formats, and a task-specific automated evaluation mechanism to keep evaluation efficient and accurate.
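The following sketch shows what such a task definition and its automated check might look like. The field names and the sample table-extraction task are invented for illustration; only the ideas (natural-language instruction, fixed input-output formats, a task-specific evaluator) come from the benchmark's description.

```python
# Hypothetical example of a task definition and its automated check;
# field names and the sample task are invented for illustration.
import csv
import io

task = {
    "task_id": "table-extraction-demo",  # invented ID
    "instruction": "Extract the data table from input.pdf into a CSV file.",
    "input_format": "PDF file (input.pdf)",
    "output_format": "CSV with header row: name,value",
}

def automated_check(output_csv_text: str) -> bool:
    """Task-specific evaluator: verifies the produced CSV parses and has
    the required header, judging the artifact rather than the code."""
    rows = list(csv.reader(io.StringIO(output_csv_text)))
    return bool(rows) and rows[0] == ["name", "value"]

print(automated_check("name,value\nfoo,42\n"))  # True
print(automated_check("wrong,header\n"))        # False
```

Checking the delivered artifact, rather than the code that produced it, is what lets the evaluation stay automated across very different modalities and domains.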

GitTaskBench's evaluation framework systematically analyzes agents along three dimensions: overall coding ability, task-oriented execution, and autonomous environment configuration. This new evaluation system raises the bar for assessing code agents and provides a valuable reference for future research.

The most exciting aspect is that GitTaskBench introduces a "cost-effectiveness" metric, quantifying the economic benefit of completing tasks. By combining task completion rate, market value, and a quality coefficient, researchers can more accurately assess the actual value of code agents in different fields. This innovation paves the way for future applications of code agents, demonstrating their great potential for cost savings and efficiency gains.
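As a rough illustration of the idea, one could net the value of each completed task against what the agent run cost. The exact formula in the paper may differ; the form below and all the numbers are assumptions made for the example.

```python
# Illustrative sketch of a cost-effectiveness ("economic benefit") score.
# Assumed form (not necessarily the paper's exact formula):
#   benefit = market_value * quality if the task completed, minus run cost.

def economic_benefit(completed: bool, market_value: float,
                     quality: float, run_cost: float) -> float:
    """Value the agent delivers on one task, net of what running it cost."""
    return (market_value * quality if completed else 0.0) - run_cost

tasks = [
    # (completed, market value in $, quality coefficient in [0, 1], run cost in $)
    (True,  120.0, 0.9, 1.50),
    (False,  80.0, 0.0, 0.80),
    (True,  200.0, 0.7, 2.10),
]

total = sum(economic_benefit(*t) for t in tasks)
print(f"net economic benefit across tasks: ${total:.2f}")
```

Aggregating such per-task values across a framework × model pairing is what turns raw pass rates into a comparison of real-world return on investment.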

The release of GitTaskBench opens a new chapter for the evaluation and application of code agents, enabling them to play a greater role in real-world work.

Paper link: https://arxiv.org/pdf/2508.18993