ComfyAI announced that its self-developed O1 video large model was fully opened to the public at midnight today. The model adopts an MVL (Multimodal Vision Language) unified interaction architecture that accepts text, images, and video in a single input box, and introduces a Chain-of-Thought reasoning pathway for the first time. The company calls it "the world's first unified multimodal video large model."

Unlike the step-by-step workflows conventional in the industry, O1 can complete text-to-video, image-to-video, local editing, and shot extension in a single pass, without requiring users to switch interfaces. A product director at ComfyAI said the model uses multi-viewpoint subject construction technology to lock onto the features of people and objects, addressing the "feature drift" that occurs during camera transitions and preserving continuity in multi-subject scenes.

The O1 model is currently available for trial on ComfyApp and the official website, with freely adjustable clip durations of 3–10 seconds, targeting short-video creators, advertising teams, and individual users. The company said it will later open an API for third-party platforms to integrate. Industry analysts believe the launch of O1 may further lower the barrier to AI video production, but whether it can balance generation quality against cost efficiency remains to be tested by the market.
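
ComfyAI has not yet published API documentation, so the following is a minimal sketch of what a third-party text-to-video call might look like once the API opens. Every specific here, including the endpoint URL, the `o1-video` model identifier, the `duration_seconds` parameter, and the `video_url` response field, is a hypothetical assumption for illustration, not a documented interface; only the 3–10 second duration range comes from the announcement.

```python
# Hypothetical sketch of a third-party integration with the (not yet released)
# O1 API. Endpoint, parameter names, and response fields are all assumptions;
# ComfyAI has not published an API specification.
import requests

API_URL = "https://api.example-comfyai.com/v1/video/generate"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"  # placeholder credential


def generate_video(prompt: str, duration_seconds: int = 5) -> str:
    """Request a text-to-video generation and return a URL to the result.

    `duration_seconds` reflects the announced 3-10 second range.
    """
    if not 3 <= duration_seconds <= 10:
        raise ValueError("O1 reportedly supports durations of 3-10 seconds")

    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "o1-video",          # hypothetical model identifier
            "prompt": prompt,             # text input; images or video clips
                                          # could presumably be attached in the
                                          # same unified request
            "duration_seconds": duration_seconds,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["video_url"]   # assumed response field


if __name__ == "__main__":
    url = generate_video("A cat leaps across rooftops at sunset", duration_seconds=6)
    print("Generated video:", url)
```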