
This study tested four classic logic puzzles: the Tower of Hanoi, checker jumping, river crossing, and blocks world. These puzzles allow precise control over task complexity, making them ideal scenarios for measuring the reasoning abilities of language models. The results showed that standard LLMs were more accurate and efficient on simple tasks, while reasoning models fared somewhat better as complexity increased, yet both ultimately collapsed under high complexity.
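The Tower of Hanoi illustrates why these puzzles make complexity easy to control: a single parameter, the number of disks n, fixes the optimal solution length at exactly 2^n - 1 moves. The sketch below is only an illustration of that scaling, not the study's actual evaluation harness, and the function name `hanoi_moves` is our own choice.

```python
# Illustrative sketch (not the study's test harness): difficulty scales with a
# single parameter, the number of disks n, and the optimal solution length is
# exactly 2**n - 1 moves.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # move the top n-1 disks out of the way
        + [(src, dst)]                       # move the largest disk to the target peg
        + hanoi_moves(n - 1, aux, src, dst)  # stack the n-1 disks back on top of it
    )

if __name__ == "__main__":
    for n in range(1, 11):
        moves = hanoi_moves(n)
        # Optimal length grows exponentially: 1, 3, 7, 15, ...
        assert len(moves) == 2**n - 1
        print(f"n={n:2d} disks -> {len(moves):4d} moves")
```

Because the required move count grows exponentially while the rules stay fixed, researchers can dial task difficulty without changing the nature of the problem, which is what makes failures at high complexity so telling.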
Even more surprising, on the most complex tasks these models not only saw their accuracy drop to zero but also used fewer reasoning tokens. In other words, just when more effort was needed, their willingness and ability to "think" decreased.

The research team mapped the models' reasoning trajectories at different levels of complexity and identified two typical failure modes. In overthinking, models working on simple problems keep generating incorrect alternatives even after finding the correct answer. In thinking collapse, the reasoning process on highly complex problems halts abruptly, failing even to generate candidate solution paths.
Although reasoning models, with mechanisms such as "chains of thought" and "self-reflection," are often seen as a step toward artificial general intelligence (AGI), Apple's research indicates that these mechanisms have fundamental limitations in how they scale. Current reasoning models cannot formulate strategies that generalize across problems; their "thinking" is closer to statistical generation than to true logical deduction.
