Recently, DeepMind proposed a groundbreaking concept in its latest paper: "Chain of Frames" (CoF), marking another significant step forward in the development of video generation models. The concept is analogous to "Chain of Thought" (CoT), which enabled language models to perform symbolic reasoning step by step. In the same way, "Chain of Frames" lets video models reason across both time and space, as if endowing video generation models with the ability to think for themselves.
In the paper, the DeepMind research team poses a bold question: can video generation models acquire general visual understanding, much as today's large language models (LLMs) have, and handle a wide range of visual tasks without task-specific training? At present, machine vision still largely follows the traditional paradigm: different tasks, such as object segmentation and object detection, require different models, and each new task requires retraining or fine-tuning.
To validate this idea, the research team used a straightforward method: they gave the model only an initial image and a text instruction, had it generate a 720p, eight-second video, and checked whether the output solved the task. This mirrors how large language models perform tasks through prompts alone, and it tests the model's native, general-purpose capabilities.
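The protocol above can be sketched as a minimal, runnable stub. Note that `generate_video` below is a hypothetical placeholder standing in for a video model such as Veo 3; the article does not describe a real API, so the stub only illustrates the interface shape: (initial image, text instruction) in, a sequence of frames out, with the task's answer read off the final frame.

```python
# Hypothetical sketch of the zero-shot evaluation protocol: one image plus
# one text instruction, no task-specific training. `generate_video` is a
# stand-in, NOT a real Veo 3 API.

def generate_video(first_frame, instruction, seconds=8, fps=24):
    """Placeholder for a video model: returns seconds * fps frames.

    A real model would evolve `first_frame` according to `instruction`;
    here we simply repeat the input frame so the sketch is runnable.
    """
    return [first_frame for _ in range(seconds * fps)]

# One "task" = one image plus one instruction, exactly like prompting an LLM.
initial_image = [[0] * 1280 for _ in range(720)]  # a 720p frame as a 2-D grid
instruction = "Segment the foreground object and fill it with green."

frames = generate_video(initial_image, instruction)

# The task's answer is read off the generated frames, typically the last one.
answer_frame = frames[-1]
print(len(frames))        # 192 frames for an 8-second clip at 24 fps
print(len(answer_frame))  # 720 rows, i.e. 720p resolution
```

The point of the sketch is the shape of the interface, not the model: every task, from segmentation to maze solving, is expressed through the same two inputs, which is what makes the evaluation "zero-shot".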
The results showed that DeepMind's Veo 3 model performed well on a range of classic visual tasks, demonstrating abilities in perception, modeling, and manipulation. More surprisingly, it also exhibited visual reasoning across time and space, successfully planning step-by-step paths and thereby solving complex visual challenges.
In summary, the DeepMind team drew three core conclusions:
Strong general adaptability: Veo 3 can solve many tasks it was never specifically trained for, demonstrating strong general-purpose capability.
Early signs of visual reasoning: analysis of the generated videos shows Veo 3 reasoning frame by frame in a "Chain of Frames" manner, gradually building up an understanding of the visual world.
A clear, rapid improvement trend: although task-specific models still perform better today, Veo 3's capabilities are improving quickly, suggesting that far more powerful general visual models may emerge in the future.
Looking ahead, DeepMind believes that general video models may replace specialized ones, much as GPT-3 showed that a single general language model could displace task-specific NLP systems. As costs gradually fall, wide deployment of video generation models draws near, heralding a new era for machine vision.
Paper address: https://papers-pdfs.assets.alphaxiv.org/2509.20328v1.pdf