In a recent AI advance, ByteDance, working with research teams from several universities, has combined the visual language model LLaVA with the segmentation model SAM-2 to introduce a new model called Sa2VA. The model not only understands video content but can also precisely track and segment people and objects in videos according to user instructions.


LLaVA, an open-source visual language model, excels at macro-level storytelling and content understanding in videos, but it struggles with fine-grained instructions. SAM-2, on the other hand, is an excellent segmentation model that can identify and segment objects in images and video frames, but it lacks language comprehension. To address these shortcomings, Sa2VA connects the two models through a simple and efficient special-token "code" system.

The architecture of Sa2VA can be viewed as a dual-core processor: one core handles language understanding and dialogue, while the other handles video segmentation and tracking. When a user issues an instruction, Sa2VA generates dedicated segmentation tokens, which are passed to SAM-2 as prompts for precise segmentation. This design lets each module play to its strengths and learn from the other's feedback, continuously improving overall performance. A minimal sketch of this wiring is shown below.
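The following is a minimal PyTorch sketch of that two-module wiring, assuming a LISA-style mechanism in which the language model emits a special segmentation token whose hidden state is projected into a prompt for the mask decoder. All module names, shapes, and the token id here are illustrative stand-ins, not Sa2VA's actual implementation or API.

```python
import torch
import torch.nn as nn

class ToyMultimodalLLM(nn.Module):
    """Stand-in for the LLaVA-style language module (hypothetical)."""
    def __init__(self, hidden=256, vocab=1000, seg_token_id=999):
        super().__init__()
        self.seg_token_id = seg_token_id
        self.embed = nn.Embedding(vocab, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, token_ids):
        h, _ = self.backbone(self.embed(token_ids))
        return self.lm_head(h), h                  # logits and hidden states

class ToySegmentationHead(nn.Module):
    """Stand-in for the SAM-2-style segmentation module (hypothetical)."""
    def __init__(self, hidden=256, mask_hw=64):
        super().__init__()
        self.prompt_proj = nn.Linear(hidden, hidden)
        self.mask_decoder = nn.Linear(hidden, mask_hw * mask_hw)
        self.mask_hw = mask_hw

    def forward(self, image_feat, prompt_embedding):
        fused = image_feat + self.prompt_proj(prompt_embedding)
        return self.mask_decoder(fused).view(-1, self.mask_hw, self.mask_hw)

# Wiring: the LLM answers in text and, when the reply contains the special
# segmentation token, that token's hidden state is handed to the mask decoder
# as a prompt for the referenced object.
llm, seg = ToyMultimodalLLM(), ToySegmentationHead()
tokens = torch.tensor([[1, 42, 7, 999]])           # reply ending in the special token
logits, hidden = llm(tokens)
prompt = hidden[tokens == llm.seg_token_id]        # (num_seg_tokens, hidden)
image_feat = torch.randn(prompt.shape[0], 256)     # placeholder video-frame features
mask = seg(image_feat, prompt)                     # (num_seg_tokens, 64, 64) mask logits
```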


The research team also designed a multi-task joint training curriculum for Sa2VA to strengthen its image and video understanding. In public benchmarks, Sa2VA performed strongly, especially on referring video object segmentation tasks. It not only segments accurately in complex real-world scenes but also tracks target objects through videos in real time, showing strong dynamic processing capabilities.
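To make the multi-task setup concrete, here is an illustrative joint objective that combines a next-token language loss with mask supervision, a common recipe for this kind of model. The loss terms and weights are assumptions for illustration, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_labels, mask_logits, mask_targets,
               text_w=1.0, mask_w=1.0):
    """Illustrative multi-task objective: language modeling + mask supervision."""
    # Next-token prediction over the text reply (ignore padded positions).
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                              ignore_index=-100)
    # Per-pixel BCE plus a dice term on the predicted masks.
    bce = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    probs = mask_logits.sigmoid()
    inter = (probs * mask_targets).sum(dim=(-2, -1))
    union = probs.sum(dim=(-2, -1)) + mask_targets.sum(dim=(-2, -1))
    dice = 1 - (2 * inter + 1) / (union + 1)
    return text_w * lm_loss + mask_w * (bce + dice.mean())
```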

In addition, ByteDance has released multiple versions of Sa2VA along with training tools, encouraging developers to build on it for research and applications. This gives AI researchers and developers rich resources and helps advance multimodal AI technology.
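If the released checkpoints are published on Hugging Face, loading one would follow the standard `transformers` pattern below; the model identifier is a hypothetical example, so check the project's GitHub page for the names that were actually released.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint name; see the Sa2VA GitHub page for released identifiers.
model_id = "ByteDance/Sa2VA-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, torch_dtype="auto",
                                  trust_remote_code=True).eval()
```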

Project:

https://lxtgh.github.io/project/sa2va/

https://github.com/bytedance/Sa2VA

Key Points:

- 🎥 Sa2VA is a new model introduced by ByteDance that combines the strengths of LLaVA and SAM-2, enabling both understanding and segmentation of video content.

- 🔗 The model connects language understanding with image segmentation through a special-token "code" system, enhancing its interactive capabilities.

- 🌍 The openly released Sa2VA resources give developers rich tools, promoting research on and application of multimodal AI technology.