Recently, Google DeepMind teamed up with Brown University to develop a new technique called "force prompting." It generates realistic motion effects without 3D models or physics engines, marking a significant step forward in AI video generation.
With this technique, users can control AI-generated video content simply by specifying the direction and strength of a force. Force prompting supports both global forces (e.g., wind acting on the whole scene) and local forces (e.g., an impact at a specific point). The forces enter the system as vector fields, which the model converts into natural, fluid motion, greatly enhancing the realism and dynamism of the generated videos.
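The article does not spell out the exact encoding, but a force prompt can be pictured as a dense vector field over the video's spatial grid. Below is a minimal sketch in Python/NumPy under that assumption; the function `make_force_prompt`, the Gaussian falloff, and the first-frame-only local force are illustrative guesses, not the authors' actual parameterization.

```python
import numpy as np

def make_force_prompt(height, width, frames, mode="global",
                      angle_deg=0.0, strength=1.0, point=None, radius=0.1):
    """Build a per-frame 2D vector field encoding a force prompt.

    Hypothetical encoding for illustration only.
    """
    field = np.zeros((frames, height, width, 2), dtype=np.float32)
    direction = strength * np.array(
        [np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))]
    )
    if mode == "global":
        # Global force (e.g., wind): the same vector at every pixel and frame.
        field[...] = direction
    elif mode == "local":
        # Local force (e.g., a point impact): a vector concentrated around a
        # point, with Gaussian falloff, applied on the first frame only.
        ys, xs = np.mgrid[0:height, 0:width]
        cy, cx = point[0] * height, point[1] * width
        dist2 = ((ys - cy) / height) ** 2 + ((xs - cx) / width) ** 2
        field[0] = np.exp(-dist2 / (2 * radius**2))[..., None] * direction
    return field

# A steady 30-degree wind across all 49 frames, and a poke near the center.
wind = make_force_prompt(60, 90, 49, mode="global", angle_deg=30, strength=0.8)
poke = make_force_prompt(60, 90, 49, mode="local", point=(0.5, 0.4))
```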
The research team built on the CogVideoX-5B-I2V video model, adding a ControlNet module to handle the physical control data. The control signal is processed by the model's Transformer architecture, and each generated video consists of 49 frames. Training required only four Nvidia A100 GPUs and took just one day.
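As a rough illustration of the ControlNet pattern, i.e. a trainable side branch whose zero-initialized outputs are added to the activations of a frozen backbone, here is a sketch in PyTorch. The class `ForceControlNet`, the block count, and the dimensions are hypothetical and do not reproduce the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ForceControlNet(nn.Module):
    """ControlNet-style adapter: transformer blocks consume the force field
    and emit per-block residuals for a frozen video backbone."""

    def __init__(self, hidden_dim=1024, force_channels=2, n_blocks=4):
        super().__init__()
        # Embed the raw force vectors into the backbone's hidden size.
        self.force_embed = nn.Linear(force_channels, hidden_dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
            for _ in range(n_blocks)
        )
        # Zero-initialized projections, so conditioning starts as a no-op.
        self.zero_projs = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(n_blocks)
        )
        for proj in self.zero_projs:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, force_tokens):
        # force_tokens: (batch, seq_len, force_channels), a flattened field.
        h = self.force_embed(force_tokens)
        residuals = []
        for block, proj in zip(self.blocks, self.zero_projs):
            h = block(h)
            residuals.append(proj(h))  # added to the backbone's activations
        return residuals

net = ForceControlNet()
residuals = net(torch.randn(1, 256, 2))  # all zeros before any training
```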
Notably, the training data was entirely synthetic: 15,000 videos of flags waving under different wind conditions, 12,000 videos of rolling spheres, and 11,000 videos of flowers reacting to impacts. These synthetic datasets were enough for the model to learn correct force-motion relationships and to tie them to physical terms such as "wind" or "bubbles" in text prompts.
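One way to picture how such a dataset could be organized: each synthetic clip is rendered under a randomly sampled force condition. The sampler below is a sketch; only the clip counts come from the article, while `CATEGORIES`, `sample_condition`, and the parameter ranges are invented for illustration.

```python
import random

# Hypothetical manifest of the synthetic training categories.
CATEGORIES = {
    "flag_wind":     {"count": 15_000, "force": "global"},
    "rolling_ball":  {"count": 12_000, "force": "local"},
    "flower_impact": {"count": 11_000, "force": "local"},
}

def sample_condition(category, rng=random):
    """Draw a random force condition for one synthetic clip."""
    cond = {
        "category": category,
        "angle_deg": rng.uniform(0, 360),   # force direction
        "strength": rng.uniform(0.1, 1.0),  # normalized force magnitude
    }
    if CATEGORIES[category]["force"] == "local":
        # Local forces also need a point of application (relative coords).
        cond["point"] = (rng.random(), rng.random())
    return cond

manifest = [
    sample_condition(category)
    for category in CATEGORIES
    for _ in range(3)  # a few samples per category for illustration
]
```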
Despite the relatively limited amount of training data, the model demonstrated strong generalization, adapting to new objects, materials, and scenes, and even picking up simple physical rules, such as lighter objects moving farther than heavier ones under the same force.
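That rule follows directly from Newton's second law, a = F/m: under the same constant force, a smaller mass yields a larger acceleration, and hence a larger displacement over the same time. A quick check of the arithmetic:

```python
def displacement(force, mass, t):
    # Constant force from rest: x = 0.5 * (F / m) * t**2.
    return 0.5 * (force / mass) * t**2

# Same force and time, different masses: the lighter object travels farther.
print(displacement(2.0, 0.5, 1.0))  # 2.0 units
print(displacement(2.0, 2.0, 1.0))  # 0.5 units
```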
In user tests, force prompting outperformed baseline models that rely solely on text or motion-path control in both motion matching and realism, and it surpassed PhysDreamer, which is based on real physics simulation. Shortcomings remain in complex scenes, however: smoke sometimes fails to respond correctly to wind, and human arms occasionally move with an unnaturally light, fabric-like quality.
Demis Hassabis, CEO of Google DeepMind, has stated that the next generation of AI video models (such as Veo 3) is gradually coming to understand physical rules, moving beyond text and image processing to begin representing the physical structure of the world. This is seen as an important step toward more general AI, in which future systems could continuously improve through learning from experience in simulated environments.
Project page: https://force-prompting.github.io/
Key points:
🌟 The new technology "force prompting" can generate realistic motion videos without 3D models or physics engines.
⚙️ Users can produce natural, fluid motion simply by specifying the direction and strength of a force.
📈 The model demonstrates strong generalization capabilities, able to adapt to new scenes and objects.