An international research team that includes Shanghai Jiao Tong University today officially launched a new benchmark called SWE-Explore. By decoupling code search from the actual repair phase, the benchmark for the first time quantifies a significant shortcoming of current AI coding agents: line-level accuracy. The study moves beyond earlier evaluations that relied solely on the "final repair rate," providing a new standard for directly measuring the quality of an agent's upstream search and pushing AI software engineering evaluation toward finer-grained analysis.

Traditional benchmarks such as SWE-bench focus only on end-to-end results, so they often mask agents' real weaknesses in reading and understanding code. To address this, the research team started from the successful trajectories of mainstream large models such as GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, and Kimi K2.6, and extracted the code segments on which multiple independent solution paths agree. The resulting dataset contains 848 defect tasks spanning 10 programming languages and 203 open-source projects.
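
The consensus step described above can be illustrated with a minimal sketch. It is not the team's actual pipeline: it assumes each successful trajectory is reduced to a set of (file, line) pairs the model viewed, and that a line counts as consensus once at least `min_votes` trajectories include it. The function name, the vote threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def consensus_lines(trajectories, min_votes):
    """Return the (file, line) pairs seen in at least `min_votes` trajectories.

    Assumption: a trajectory is a collection of (file, line) pairs that a
    model viewed on its way to a successful fix.
    """
    votes = Counter()
    for viewed in trajectories:
        # Use a set so one trajectory contributes at most one vote per line.
        votes.update(set(viewed))
    return {loc for loc, count in votes.items() if count >= min_votes}


# Hypothetical trajectories from three independent successful runs.
trajectories = [
    {("app.py", 10), ("app.py", 11), ("util.py", 3)},
    {("app.py", 10), ("app.py", 11)},
    {("app.py", 11), ("util.py", 7)},
]

core = consensus_lines(trajectories, min_votes=2)
print(sorted(core))
```

Lines that only a single run happened to read, such as the two `util.py` entries here, fall out of the consensus set, which is the intuition behind using agreement across independent solution paths.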


The evaluation results show that general-purpose coding agents such as Claude Code and OpenHands perform well at file-level localization, but their coverage of the core regions drops sharply to between 14% and 19% when measured at the level of specific lines of code. Ablation experiments further confirmed a "minimum context threshold" effect: when less than 50% of the key core regions are visible, the model's repairs generally fail, but once visibility crosses into the 50% to 75% range, the repair success rate rises dramatically.
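
The gap between file-level and line-level results is easier to see with a small example. The sketch below is an illustration only, assuming coverage is simply the fraction of core items that the agent's retrieved context contains; the benchmark's exact metric definition is not given here, and the function names and data are hypothetical.

```python
def line_coverage(core, retrieved):
    """Fraction of core (file, line) pairs present in the retrieved context."""
    if not core:
        return 0.0
    return len(core & retrieved) / len(core)


def file_coverage(core, retrieved):
    """Fraction of files containing core lines that the agent touched at all."""
    core_files = {path for path, _ in core}
    hit_files = {path for path, _ in retrieved}
    if not core_files:
        return 0.0
    return len(core_files & hit_files) / len(core_files)


# Hypothetical task: ten core lines in app.py and one in util.py.
core = {("app.py", n) for n in range(10, 20)} | {("util.py", 3)}

# Hypothetical agent context: both files were opened, but mostly the wrong lines.
retrieved = {("app.py", 10), ("app.py", 11), ("util.py", 40)}

file_score = file_coverage(core, retrieved)
line_score = line_coverage(core, retrieved)
print(f"file-level: {file_score:.0%}, line-level: {line_score:.0%}")
```

In this made-up case the agent finds every relevant file yet sees only 2 of the 11 core lines, which places it well below the 50% visibility level that the ablation experiments associate with failed repairs.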

The results indicate that the current bottleneck for AI agents lies not only in patch-writing ability but also in accurately filtering and capturing the critical context. At a time when project managers in the industry reject half of automated adoption proposals, the "less filtering, more reading" direction proposed by SWE-Explore not only points the way for optimizing the architecture of next-generation specialized code localization systems (such as CoSIL), but also accelerates the shift in automated software engineering from "brute-force generation" to "precise retrieval."