ByteDance has released Vidi2, its latest multimodal large language model: a 12-billion-parameter model designed specifically for video understanding. The model can process hours of raw footage, understand its narrative structure, and generate complete TikTok short videos or movie clips from simple prompts. It is widely seen as a major disruption to the existing video editing industry.

Breakthrough: Fine-Grained Spatio-Temporal Grounding (STG)

The key to Vidi2 lies in its video understanding capabilities. The new model introduces fine-grained spatio-temporal grounding (STG), which identifies both the timestamps and the bounding boxes of objects in a video at the same time. Given a text query, Vidi2 not only finds the matching time ranges but also marks where the queried objects appear within those ranges.

Technically:

  • Spatio-Temporal Grounding: The model returns a "tube" (a sequence of time-indexed bounding boxes) that tracks specified objects and people at one-second granularity. This directly supports editing tasks such as tracking a specific person in a crowd.

  • Technical Architecture: Vidi2 adopts Gemma-3 as its backbone network, together with a redesigned adaptive token compression technique that keeps long-video processing efficient while preserving key details.
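
The time-indexed bounding-box output described in the first bullet (often called a spatio-temporal tube) can be pictured as one box per second. The sketch below is only an illustration of how an editing tool might consume such output; the `TubeEntry` type, the normalized `(x1, y1, x2, y2)` box layout, and the `tube_segments` helper are assumptions made for this example, not Vidi2's actual output format or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class TubeEntry:
    """One hypothetical tube entry: where the object is at a given second."""
    second: int
    box: Box


def tube_segments(tube: List[TubeEntry]) -> List[Tuple[int, int]]:
    """Collapse per-second entries into inclusive (start, end) time ranges."""
    seconds = sorted({entry.second for entry in tube})
    segments: List[Tuple[int, int]] = []
    for s in seconds:
        if segments and s == segments[-1][1] + 1:
            # Extend the current range when the next second is contiguous.
            segments[-1] = (segments[-1][0], s)
        else:
            # Otherwise start a new range.
            segments.append((s, s))
    return segments


# Example: the object is visible during seconds 3-5 and again during 9-10.
example_tube = [
    TubeEntry(3, (0.10, 0.20, 0.30, 0.60)),
    TubeEntry(4, (0.12, 0.20, 0.32, 0.60)),
    TubeEntry(5, (0.15, 0.21, 0.35, 0.61)),
    TubeEntry(9, (0.50, 0.20, 0.70, 0.60)),
    TubeEntry(10, (0.52, 0.20, 0.72, 0.60)),
]
print(tube_segments(example_tube))
```

An editor could use the resulting ranges as cut points and the per-second boxes as crop or tracking regions within each cut.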

Performance Leadership: Clear Advantages in Long-Video Understanding

Vidi2 performs strongly on benchmark tests. On VUE-TR-V2, the benchmark for open-ended temporal retrieval, its overall IoU reached 48.75, and on extremely long videos (over one hour) it outperformed commercial models by 17.5 percentage points. On the spatio-temporal grounding benchmark (VUE-STG), the model also achieved the best results, with a vIoU of 32.57 and a tIoU of 53.19.
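
For readers unfamiliar with these metrics, the sketch below computes tIoU and vIoU as they are commonly defined in the spatio-temporal grounding literature: tIoU is the overlap of predicted and ground-truth time spans, and vIoU sums the per-second box IoU over the shared seconds and divides by the size of the time union. It assumes tubes are given as dictionaries mapping a second to an `(x1, y1, x2, y2)` box; the exact definitions used by VUE-STG may differ in detail, and the reported scores appear to be these ratios expressed as percentages.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
Tube = Dict[int, Box]                    # second -> box at that second


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Tube, gt: Tube) -> float:
    """Temporal IoU: shared seconds divided by all seconds covered."""
    p, g = set(pred), set(gt)
    union = p | g
    return len(p & g) / len(union) if union else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Sum of box IoU over shared seconds, divided by the time union."""
    p, g = set(pred), set(gt)
    union = p | g
    if not union:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in p & g) / len(union)


# Toy example: prediction covers seconds 1-3, ground truth covers 2-4.
pred = {1: (0, 0, 2, 2), 2: (0, 0, 2, 2), 3: (0, 0, 2, 2)}
gt = {2: (0, 0, 2, 2), 3: (1, 0, 3, 2), 4: (0, 0, 2, 2)}
print(f"tIoU = {t_iou(pred, gt):.3f}, vIoU = {v_iou(pred, gt):.3f}")
```

Because vIoU penalizes both temporal misses and spatial misalignment, it can never exceed tIoU under these definitions, which is consistent with the gap between the two reported scores.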


From Model to Product: TikTok's "Smart Editor"

Based on Vidi2's powerful capabilities, ByteDance has developed multiple practical automated editing tools, including highlight extraction, story-aware cutting, content-aware layout reconstruction, and multi-angle switching, all of which can run on consumer-grade hardware.

  • TikTok App: The technology has been applied to TikTok's Smart Split feature, which automatically edits, reconstructs, adds subtitles, and converts long videos into short clips suitable for TikTok.

  • AI Outline: This tool can transform simple prompts or trending topics into structured video titles, openings, and outlines.

Industry Impact: ByteDance's AI Flywheel Begins to Turn

AIbase comments that the release of Vidi2, combined with the data advantage of ByteDance's TikTok platform (1 billion daily active users), gives the company massive video data for training and real-time feedback for optimization, posing a significant challenge to AI-native companies. As the technical flywheel of the big platforms starts to turn, traditional AI companies may face greater competitive pressure.

Vidi2 is still in the research phase, and the team has stated that a demo will be released soon.

Link: https://www.alphaxiv.org/abs/2511.19529