Zhipu AI has dropped a bombshell: Qingying 2.0 is here, generating 1080P high-definition video directly from text, with clips up to 10 seconds long. The model controls motion range, camera language, and visual style. According to official tests, video quality and stability approach OpenAI's Sora, while the model understands Chinese prompts more accurately and generates faster.

The new version is built on Zhipu's self-developed CogVideoX large model. It supports generating multiple videos at once, lets users freely specify camera movements, and can even "direct" the visual style: cyber neon, Chinese watercolor, or film retro, all from a single sentence. Zhipu also released the CogSound audio model, which automatically matches ambient and action sounds after video generation, closing the loop on "audio-visual integrated" AI creation.
Qingying 2.0 has been integrated into the Zhipu Qingyan App, where ordinary users can try it for free; the enterprise version offers APIs and on-premise deployment, letting industries such as finance, e-commerce, advertising, and film customize their own video models. Zhipu revealed that over a million videos were generated in Qingying's first month. This upgrade also cuts inference cost by a further 30%, bringing the "DALL·E of video" to everyday users.
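To make the enterprise API mention concrete, here is a minimal sketch of how a text-to-video request for such a service might be assembled. The endpoint URL, parameter names, and payload shape below are assumptions for illustration only, not Zhipu's documented API.

```python
import json

# Assumed endpoint for illustration; not taken from official documentation.
API_URL = "https://open.bigmodel.cn/api/paas/v4/videos/generations"

def build_video_request(prompt: str, duration_s: int = 10,
                        resolution: str = "1080p", style: str = "") -> dict:
    """Assemble a hypothetical request payload for a CogVideoX-style
    text-to-video generation call. Field names are assumptions."""
    payload = {
        "model": "cogvideox",    # model family named in the article
        "prompt": prompt,
        "duration": duration_s,  # article: clips up to 10 seconds
        "resolution": resolution,
    }
    if style:
        # e.g. "cyber neon", "Chinese watercolor", "film retro"
        payload["style"] = style
    return payload

payload = build_video_request("a panda painting calligraphy",
                              style="Chinese watercolor")
print(json.dumps(payload, ensure_ascii=False))
```

In a real integration, this payload would be POSTed to the provider's endpoint with an API key; the sketch stops at payload construction since the actual request schema is not given in the article.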
Project address: https://yimingli-page.github.io/
