Recently, Apple published a controversial paper arguing that current large language models (LLMs) have significant defects in their reasoning abilities. The claim quickly sparked heated discussion on social media. Among the critics was Sean Goedecke, a senior software engineer at GitHub, who strongly disputed the conclusion, arguing that Apple's findings were overly simplistic and did not fully reflect the capabilities of reasoning models.

Apple's paper highlighted that LLMs perform inconsistently on benchmarks such as mathematics and programming. The research team used the classic Tower of Hanoi puzzle to analyze how reasoning models behave across different levels of complexity. The study found that the models handled simple puzzles well but often abandoned further reasoning when the task became more complex.


For example, when tackling the ten-disk Tower of Hanoi, the model judged that manually listing every step was practically infeasible and tried to find a "shortcut" instead, but ultimately failed to produce the correct answer. This suggests that reasoning models sometimes do not lack ability; rather, they recognize the scale of the task and choose to give up.
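To put the scale in perspective: the Tower of Hanoi has a well-known recursive solution, and an n-disk puzzle requires 2^n - 1 moves, so ten disks already take 1,023 steps. The minimal Python sketch below (not from the paper, just an illustration of why enumerating every move quickly becomes tedious) computes that count:

```python
def hanoi(n, source, target, spare, moves):
    """Append the sequence of moves needed to shift n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # move the top n-1 disks out of the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # restack the n-1 disks on top of it

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023, i.e. 2**10 - 1 moves for ten disks
```

The move count doubles with every added disk, which is why a model asked to write out the full solution step by step faces an exponentially growing transcript rather than a harder reasoning problem.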

However, Sean Goedecke questioned this claim, arguing that the Tower of Hanoi is not a good test of reasoning ability and that a model's complexity threshold may not be fixed. He also pointed out that reasoning models were designed to handle reasoning tasks, not to grind through thousands of repetitive steps. Using the Tower of Hanoi to test reasoning, he argued, is like saying "if a model cannot write complex poetry, it lacks language ability," which is unfair.

Although Apple's research revealed some limitations of LLMs in reasoning, it does not mean these models are incapable of reasoning altogether. The real challenge lies in designing and evaluating them in ways that unlock their full potential.