ByteDance Open Sources VeOmni Framework: A New Tool to Improve Multimodal Training Efficiency

AIbase基地

Published in AI News · 4 min read · Aug 14, 2025

ByteDance recently announced that it has open-sourced VeOmni, an internally developed unified framework for training multimodal models. As artificial intelligence has evolved from text-only language models to multimodal models that handle text, images, and video, algorithm engineers have faced many challenges during training, most notably a fragmented training workflow. VeOmni was built to address these issues.

VeOmni was jointly developed by ByteDance's Seed team and the Volcano Machine Learning platform, with the stated goals of "unified multi-modal, unified parallel strategy, and unified computing foundation." The framework provides a unified API and integrates various hybrid parallel strategies into a single framework, supporting fast training across model types. Whether the target is a large language model, a vision-language model, or a video generation model, developers can get started easily.

The framework also includes significant performance optimizations. For example, it uses a dual optimization strategy for memory and computation, which keeps memory sufficient while minimizing additional computational overhead. In addition, VeOmni adopts a multidimensional parallel system that supports different parallel primitives, which effectively reduces peak memory usage. Together, these techniques are reported to give VeOmni a training throughput more than 40% higher than similar open-source solutions.
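The article does not say how VeOmni balances memory against computation. As a generic illustration of that trade-off only, the sketch below models activation checkpointing, a widely used technique that stores fewer activations and recomputes the rest during the backward pass. The function names and the cost model are hypothetical and are not taken from VeOmni.

```python
import math


def checkpoint_plan(num_layers: int, segment: int) -> dict:
    """Toy cost model for activation checkpointing (hypothetical helper).

    The network is split into segments of `segment` layers. Only one
    activation per segment is kept after the forward pass; the others are
    recomputed, one segment at a time, during the backward pass.
    """
    if num_layers <= 0 or segment <= 0:
        raise ValueError("num_layers and segment must be positive")
    segment = min(segment, num_layers)
    # One saved activation per segment.
    checkpoints = math.ceil(num_layers / segment)
    return {
        # Saved checkpoints plus the activations rebuilt for one segment.
        "peak_stored": checkpoints + segment,
        # Upper bound: each layer's forward pass runs one extra time.
        "extra_forward_layers": num_layers,
    }


def best_segment(num_layers: int) -> int:
    """Segment length that minimizes peak stored activations in this model."""
    return min(
        range(1, num_layers + 1),
        key=lambda s: checkpoint_plan(num_layers, s)["peak_stored"],
    )


if __name__ == "__main__":
    layers = 64
    seg = best_segment(layers)
    print(seg, checkpoint_plan(layers, seg))
```

In this toy model, a 64-layer network with segments of 8 layers keeps 16 activations at peak instead of 64, at the cost of at most one extra forward pass. Real frameworks weigh many more factors, such as activation sizes per layer and offloading.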

VeOmni also offers advantages in distillation acceleration. By integrating various cutting-edge distillation techniques, it lets users significantly reduce the number of steps and the resources required for model inference, which speeds up model deployment and application.

VeOmni has improved the efficiency of internal model training at ByteDance, and its open-source release also gives more AI researchers and developers a powerful tool, helping to advance multimodal AI technology.

Key points:
🌟 VeOmni is a unified framework developed by ByteDance specifically for multimodal model training, aimed at solving fragmentation in the training process.
⚡ The framework significantly improves training efficiency through memory and computation optimization and hybrid parallel strategies, with a training throughput increase of over 40%.
🚀 VeOmni integrates cutting-edge distillation techniques, helping users reduce model inference steps and accelerate model deployment.



This article is from AIbase Daily
