ByteDance has announced the launch of its new multi-modal model, Vidi, which focuses on video understanding and editing; its initial core capability is precise temporal retrieval. According to AIbase, Vidi processes visual, audio, and text inputs and supports the analysis of ultra-long videos up to one hour in length, and its performance on temporal retrieval tasks surpasses that of mainstream models such as GPT-4 and Gemini. This groundbreaking technology has sparked heated discussion within the AI community, with details released through ByteDance's official channels and GitHub.


Core Functionality: Precise Time Retrieval and Multi-modal Collaboration

Vidi offers a novel solution for video understanding and editing with its powerful time retrieval and multi-modal processing capabilities. AIbase has summarized its main functions:

Precise Time Retrieval: Vidi can accurately locate specific segments within a video from a text prompt or multi-modal input (e.g., "find the 30-second clip of the character dancing"), achieving second-level temporal resolution and significantly improving content retrieval efficiency (a usage sketch follows this list).

Ultra-Long Video Support: Supports processing videos up to one hour long, overcoming the memory and computational bottlenecks of traditional models in understanding long-sequence videos, making it suitable for analyzing movies, live streams, or conference recordings.

Multi-modal Input Processing: Integrates visual (frame sequences), audio (speech, background sound), and text (subtitles, descriptions) to achieve cross-modal semantic understanding, such as locating video highlights based on audio emotion.

Efficient Editing Capabilities: Supports video clip editing, rearrangement, and annotation based on time retrieval, simplifying content creation and post-production workflows.
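
To make the retrieval workflow concrete, here is a minimal sketch of what a text-prompted temporal query could look like. The VidiModel class, load_video, and retrieve names are hypothetical placeholders, since the official API has not yet been published; only the overall shape (a prompt in, a list of timestamped segments out) follows the functionality described above.

```python
# Hypothetical usage sketch -- class and method names are illustrative only,
# not the official Vidi API.
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    score: float    # retrieval confidence


class VidiModel:
    """Placeholder wrapper standing in for the (unreleased) Vidi inference API."""

    def load_video(self, path: str) -> "VidiModel":
        # A real pipeline would decode frames, extract audio, and pull
        # subtitles/ASR text here for multi-modal encoding.
        self.video_path = path
        return self

    def retrieve(self, prompt: str, top_k: int = 3) -> list[Segment]:
        # The real model would return second-level timestamps for the segments
        # that best match the prompt across vision, audio, and text.
        raise NotImplementedError("Replace with the official API once released.")


if __name__ == "__main__":
    model = VidiModel().load_video("concert_1h.mp4")
    try:
        for seg in model.retrieve("find the 30-second clip of the character dancing"):
            print(f"{seg.start_s:.1f}s - {seg.end_s:.1f}s (score {seg.score:.2f})")
    except NotImplementedError as exc:
        print(exc)
```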

AIbase notes that community tests show Vidi quickly locating complex scene segments when processing the Youku-mPLUG dataset (10M video-language pairs), and surpassing GPT-4 on the ActivityNet temporal retrieval task by roughly 10% in accuracy.

Technical Architecture: Innovative Time Encoding and Multi-modal Fusion

Vidi is based on ByteDance's VeOmni framework, combining a video-specific large language model (Vid-LLM) and a time-enhanced transformer architecture. AIbase analysis indicates that its core technologies include:

Time-Enhanced Transformer: Optimizes spatiotemporal relationship modeling of long-sequence videos through temporal embedding and hierarchical attention mechanisms, ensuring high-precision time retrieval.

Multi-modal Encoder: Employs Chat-UniVi's unified visual representation, fusing video frames, audio waveforms, and text embeddings to support cross-modal semantic alignment and reduce information loss (a fusion sketch follows this list).

Efficient Inference Optimization: Utilizes ByteDance's ByteScale distributed training system, combined with 4-bit quantization and dynamic chunking, significantly reducing the computational cost of ultra-long video processing.

Dataset-Driven: Training data includes Youku-mPLUG (10M video-language pairs) and WebVid-10M, covering multiple languages and diverse scenarios to improve the model's ability to generalize.
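
As a rough illustration of the multi-modal fusion idea described above, the sketch below projects frame, audio, and text features into a shared embedding space and concatenates them along the sequence axis before a transformer encoder. The dimensions, module names, and fusion strategy are assumptions chosen for illustration; they are not Vidi's published architecture.

```python
# Illustrative multi-modal fusion sketch (PyTorch); dimensions and fusion
# strategy are assumptions, not Vidi's actual design.
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Per-modality projections into a shared embedding space.
        self.vision_proj = nn.Linear(768, d_model)   # e.g. ViT frame features
        self.audio_proj = nn.Linear(128, d_model)    # e.g. mel/ASR features
        self.text_proj = nn.Linear(1024, d_model)    # e.g. subtitle embeddings
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, frames, audio, text):
        # Project each modality, then concatenate along the sequence axis so
        # attention can align tokens across modalities.
        tokens = torch.cat(
            [self.vision_proj(frames), self.audio_proj(audio), self.text_proj(text)],
            dim=1,
        )
        return self.encoder(tokens)


if __name__ == "__main__":
    fusion = MultiModalFusion()
    out = fusion(
        torch.randn(1, 64, 768),    # 64 frame tokens
        torch.randn(1, 100, 128),   # 100 audio tokens
        torch.randn(1, 32, 1024),   # 32 text tokens
    )
    print(out.shape)  # torch.Size([1, 196, 512])
```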

AIbase believes that Vidi's time retrieval capability benefits from its innovative PHD-CSWA (chunk-wise sliding window attention) mechanism, in line with the efficient pre-training length-scaling technique ByteDance released earlier, which makes it particularly well suited to long-sequence tasks.
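
The article does not detail how Vidi implements chunk-wise sliding window attention, but the general idea behind such mechanisms is to restrict each query to its own chunk and a small window of preceding chunks, so attention cost grows roughly linearly with sequence length instead of quadratically. The mask construction below is a generic sketch of that idea, not ByteDance's PHD-CSWA code.

```python
# Generic chunk-wise sliding-window attention mask (illustration only, not
# ByteDance's PHD-CSWA implementation).
import torch


def chunk_sliding_window_mask(seq_len: int, chunk_size: int, window_chunks: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query token may attend to.

    Each token attends only to tokens in its own chunk and in the
    `window_chunks` preceding chunks, keeping cost roughly linear in seq_len.
    """
    chunk_ids = torch.arange(seq_len) // chunk_size  # chunk index per token
    q_chunk = chunk_ids.unsqueeze(1)                 # (seq_len, 1)
    k_chunk = chunk_ids.unsqueeze(0)                 # (1, seq_len)
    # Attend to the current chunk and up to `window_chunks` chunks back.
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - window_chunks)


if __name__ == "__main__":
    mask = chunk_sliding_window_mask(seq_len=12, chunk_size=4, window_chunks=1)
    print(mask.int())
    # The number of keys each query attends to depends on the window, not on
    # seq_len, which is what makes hour-long token sequences tractable.
```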

Application Scenarios: From Content Creation to Intelligent Analysis

Vidi's multi-modal capabilities and ultra-long video support open up a wide range of application scenarios. AIbase has summarized its main uses:

Content Creation and Editing: Provides video creators with precise segment location and automatic editing tools, simplifying the production of short videos, vlogs, or movie trailers, such as quickly extracting highlights from live streams.

Intelligent Video Analysis: Supports enterprises in analyzing long conference recordings or surveillance videos, automatically annotating key events (e.g., "the segment discussing the budget"), improving information retrieval efficiency.

Education and Training: Analyzes educational videos, locates specific knowledge points or interactive segments, and generates customized learning segments, suitable for online education platforms.

Entertainment and Recommendation: Optimizes video recommendation systems on platforms like TikTok, improving content matching accuracy through semantic and temporal analysis to enhance user experience.

Community feedback shows that Vidi performs exceptionally well in processing long Chinese videos (such as variety shows), and its multi-language support (covering 8 languages) further expands its global application potential. AIbase observes that Vidi seamlessly integrates with ByteDance's Doubao model ecosystem, providing a solid foundation for commercial deployment.

Getting Started: Open-Source Support, Developer-Friendly

AIbase understands that Vidi's code and pre-trained models will be open-sourced on GitHub (expected at github.com/ByteDance-Seed/Vidi), with support for PyTorch and the VeOmni framework. Developers can get started quickly with the following steps:

Clone the Vidi repository and install Python 3.9+ along with the NVIDIA CUDA dependencies;

Download the Youku-mPLUG or WebVid-10M dataset and configure the time retrieval task;

Run inference with the provided vidi.yaml configuration, supplying multi-modal prompts (e.g., "locate the part where the speaker mentions AI"), as sketched after these steps;

Export the located segments or editing results, supporting MP4 or JSON formats.
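
To ground the inference step, here is a minimal sketch of loading a configuration and issuing a retrieval prompt. The config keys, the run_inference helper, and the JSON output path are assumptions for illustration; the actual entry point will be defined by the repository once it is released.

```python
# Hypothetical inference driver -- config keys, helper names, and the output
# schema are placeholders, not the official Vidi tooling.
import json
from pathlib import Path

import yaml  # requires PyYAML (pip install pyyaml)


def run_inference(config: dict, prompt: str) -> list[dict]:
    # A real implementation would load the checkpoint named in the config,
    # encode the video's frames/audio/subtitles, and return matched segments.
    raise NotImplementedError("Swap in the official Vidi inference entry point.")


if __name__ == "__main__":
    cfg_path = Path("vidi.yaml")
    # e.g. checkpoint path, input video, device; empty dict if no config is present.
    config = yaml.safe_load(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}

    try:
        segments = run_inference(config, "locate the part where the speaker mentions AI")
        # Mirror the export step above: write located segments as JSON.
        Path("segments.json").write_text(json.dumps(segments, indent=2), encoding="utf-8")
    except NotImplementedError as exc:
        print(exc)
```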

Community-provided Docker images and Hugging Face integration simplify deployment. The recommended hardware is an NVIDIA A100 (40GB) or RTX 3090 (24GB). AIbase suggests that developers first test Vidi's time retrieval functionality on the ActivityNet or EgoSchema datasets to verify its performance advantages.

Performance Comparison: Surpassing GPT-4 and Gemini

Vidi's performance on time retrieval tasks is particularly outstanding. AIbase has compiled a comparison with mainstream models:

Time Retrieval Accuracy: On the ActivityNet dataset, Vidi's accuracy is approximately 10% higher than GPT-4's and approximately 12% higher than Gemini 1.5 Pro's, and it is particularly stable on long videos (>30 minutes).

Processing Speed: Vidi processes a one-hour video in 5-7 minutes on average (on 128 GPUs), faster than GPT-4's 8-10 minutes, thanks to its chunked attention mechanism.

Multi-modal Understanding: On the Youku-mPLUG video question-answering task, Vidi's overall score (combining visual, audio, and text) surpasses Gemini 1.5 Pro's by about 5% and is comparable to GPT-4's.

Community analysis suggests that Vidi's performance advantage stems from its optimization for the video domain rather than a general-purpose multi-modal design, giving it a particular edge in temporal perception and long-sequence processing. AIbase predicts that open-sourcing Vidi will drive further competition in the Vid-LLM field.

Project Address: https://bytedance.github.io/vidi-website/