On April 2, Google announced major upgrades to its enterprise video creation app Vids, which now integrates the Veo 3.1 video generation model and natural-language interaction, marking a shift from static generation to dynamic, instruction-driven control. The centerpiece of the update is more interactive AI avatars: users can issue simple text prompts that direct an avatar to interact with products, props, or equipment in a scene, while the character remains visually consistent across the dynamic output.

Vids has also deepened its multimodal capabilities. Building on the recently added Lyria 3 family of audio models, the Veo 3.1 integration supports generating 8-second video clips, with monthly quotas ranging from 10 generations for regular users to 1,000 for premium enterprise accounts.
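Vids itself is driven through its UI rather than a public API, but readers who want to experiment with the underlying Veo model family programmatically can reach it through the Gemini API. Below is a minimal sketch assuming the google-genai Python SDK; the model identifier and the example prompt are illustrative assumptions, not details from the announcement.

```python
# Minimal sketch: generating a short clip with a Veo model via the
# public Gemini API. The Vids app has no public API; this only
# illustrates the underlying model family.
import time

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Kick off an asynchronous video-generation job.
# "veo-3.1-generate-preview" is an assumed model identifier; the exact
# name may vary by account and release channel.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt=(
        "A presenter holds up a pair of wireless earbuds, opens the case, "
        "and points to the charging indicator, keeping the same appearance "
        "throughout the shot."
    ),
)

# Video generation is a long-running operation; poll until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download and save the resulting clip (Veo clips are capped at 8 seconds).
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("product_demo.mp4")
```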


To close the workflow loop, Vids now offers direct export to YouTube and pairs with a new Chrome screen-recording extension, covering the full pipeline from content capture to final distribution.

Meanwhile, competition in the AI field continues to intensify. On the same day, Microsoft released three foundation models in its MAI series, covering speech transcription, audio generation, and video generation across 25 languages, aiming to challenge Google's and OpenAI's market positions with lower cost barriers.

Since launching Vids in 2024, Google has iterated rapidly, adding 3D cartoon avatars and multilingual support. Prompt-based fine-grained control marks a shift in AI video tools from simple content generation toward a more professional, automated directing phase, further reshaping the cost structure and creative boundaries of enterprise content production.