The first round of ICLR 2025 reviews has just ended, and an Apple paper claiming that "small models surpass GPT-5" in visual reasoning has been publicly called out. Lei Yang, a researcher at Jiechu Star, found while replicating the results that the official code omitted the image inputs; after he fixed the issue, accuracy dropped sharply. He then spot-checked 20 randomly selected questions and found that 6 had incorrect Ground Truth (GT) labels, putting the estimated overall GT error rate at around 30%.

Lei Yang submitted an issue on GitHub, but it was closed after only two replies. He then wrote a long post to alert the reviewers. The post spread quickly, and the author team admitted the next day that there were "defects in the data generation process," urgently uploaded a corrected benchmark, and promised to re-run the experiments and update the results. The incident sparked heated discussion in the academic community: in the era of large models, automatically generated datasets that skip manual quality checks can trip up even the most reputable companies. Lei Yang's reminder to his peers: before replicating a result, run a small-sample "checkup" first; don't let wrong GT waste compute and all-nighters.
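For readers who want to apply that advice, here is a minimal sketch of such a spot-check, assuming a hypothetical benchmark stored as a list of records with "question" and "gt" fields (the field names and the `spot_check` helper are illustrative, not from the original report). It also computes a Wilson score interval, which shows how much uncertainty a 20-item sample actually leaves around a 30% point estimate.

```python
import math
import random

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def spot_check(benchmark: list[dict], sample_size: int = 20, seed: int = 0) -> None:
    """Manually review a small random sample of GT labels before burning GPU hours.

    `benchmark` is a hypothetical list of {"question": ..., "gt": ...} records;
    adapt the field names to whatever the dataset actually uses.
    """
    rng = random.Random(seed)
    sample = rng.sample(benchmark, sample_size)
    bad = 0
    for item in sample:
        print(item["question"], "| GT:", item["gt"])
        if input("GT correct? [y/n] ").strip().lower() == "n":
            bad += 1
    lo, hi = wilson_interval(bad, sample_size)
    print(f"{bad}/{sample_size} bad -> point estimate {bad / sample_size:.0%}, "
          f"95% CI [{lo:.1%}, {hi:.1%}]")

# With 6 bad labels out of 20, the point estimate is 30%, but the 95%
# interval is roughly [14.5%, 51.9%] -- wide, yet clearly incompatible
# with a clean dataset.
print(wilson_interval(6, 20))
```

Even with the wide interval a 20-item sample leaves, the lower bound is far above any tolerable label-noise level, which is exactly why such a cheap checkup is worth running before any large-scale replication.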
Reference: https://x.com/diyerxx/status/1994042370376032701


