The open-source community recently welcomed LLaVA-OneVision-1.5, the latest multimodal model in the LLaVA (Large Language and Vision Assistant) series. Over roughly two years of development, the series has evolved from simple image-text alignment models into a comprehensive framework that handles images, videos, and other visual inputs.
The core philosophy of LLaVA-OneVision-1.5 is to provide an open, efficient, and reproducible training framework that lets users build high-quality vision-language models with relative ease. Training proceeds in three stages. First, in the language-image alignment pre-training stage, the model learns to project visual features into the language model's word-embedding space.
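In LLaVA-style models, this alignment stage is typically implemented by training a small projector that maps vision-encoder features into the LLM's embedding space while the encoder and the LLM stay frozen. The sketch below is a minimal, hypothetical illustration of such a projector; the two-layer MLP design and the dimensions are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Minimal two-layer MLP mapping vision-encoder patch features into the
    language model's word-embedding space (a sketch; the actual
    LLaVA-OneVision-1.5 projector may differ)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns visual "tokens" of shape (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# During stage-1 alignment, only this projector would be trained while the
# vision encoder and the LLM remain frozen.
projector = VisionProjector()
dummy_patches = torch.randn(2, 576, 1024)  # e.g. 24x24 patches per image
visual_tokens = projector(dummy_patches)
print(visual_tokens.shape)  # torch.Size([2, 576, 4096])
```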
Next, in the second stage, "high-quality knowledge learning," the model is trained on 85 million samples, injecting a large amount of visual and world knowledge and substantially strengthening its capabilities. Finally, in the visual instruction fine-tuning stage, the model is trained on a carefully curated dataset so that it can follow a wide range of complex visual instructions.
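Visual instruction data is usually organized as image-grounded conversations. The snippet below shows a hypothetical sample in the conversation style commonly used by LLaVA-family training pipelines; the field names and structure are illustrative, not the exact LLaVA-OneVision-1.5 schema.

```python
# A hypothetical visual-instruction sample in the conversation style commonly
# used by LLaVA-family pipelines (illustrative only, not the exact schema).
sample = {
    "id": "doc_chart_000123",
    "image": "images/quarterly_report_chart.png",
    "conversations": [
        {"from": "human", "value": "<image>\nSummarize the trend shown in this chart."},
        {"from": "gpt", "value": "Revenue rises steadily from Q1 to Q3 and then flattens in Q4."},
        {"from": "human", "value": "Which quarter has the highest value?"},
        {"from": "gpt", "value": "Q3 has the highest value."},
    ],
}
```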
On the efficiency side, the team adopted an offline parallel data packing method that markedly improves training throughput: across the 85 million samples, packing achieved a compression ratio of roughly 11x, and training completed in about 3.7 days. LLaVA-OneVision-1.5 also uses RICE-ViT as its visual encoder, whose region-aware visual understanding makes it particularly well suited to text in documents.
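Offline data packing concatenates many short samples into fixed-length training sequences ahead of time, so GPU steps waste far less compute on padding. The sketch below illustrates the idea with a simple greedy first-fit policy; the real pipeline's packing policy, sequence length, and parallelization are assumptions here.

```python
from typing import List

def pack_samples(sample_lengths: List[int], max_seq_len: int = 8192) -> List[List[int]]:
    """Greedy first-fit packing: group sample indices into bins whose total
    token count stays under max_seq_len. A sketch of the idea behind offline
    data packing, not the project's actual implementation."""
    bins: List[List[int]] = []
    bin_loads: List[int] = []
    # Packing longest-first tends to reduce leftover padding.
    order = sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i])
    for idx in order:
        length = sample_lengths[idx]
        for b, load in enumerate(bin_loads):
            if load + length <= max_seq_len:
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:
            bins.append([idx])
            bin_loads.append(length)
    return bins

# Example: many short samples collapse into far fewer packed sequences.
import random
lengths = [random.randint(200, 2000) for _ in range(10_000)]
packed = pack_samples(lengths)
print(f"{len(lengths)} samples -> {len(packed)} packed sequences")
```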
Data is the foundation of model capability. The pre-training dataset of LLaVA-OneVision-1.5 is broad and diverse, and it introduces a "concept-balanced" sampling strategy to keep performance even across tasks. The model does well across a wide range of benchmarks; in particular, the 8-billion-parameter version outperforms Qwen2.5-VL on 27 benchmarks.
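Concept-balanced sampling counteracts the head-heavy concept distribution of web-scale data by down-weighting over-represented concepts. The sketch below shows one simple way to do this with inverse-frequency weights; the actual concept taxonomy and weighting scheme used by the team are not detailed here, so treat this as an assumption-laden illustration.

```python
import random
from collections import Counter
from typing import Dict, List

def concept_balanced_weights(concepts: List[str], power: float = 1.0) -> List[float]:
    """Assign each sample a weight inversely proportional to the frequency of
    its concept label, so rare concepts are sampled more often.
    (Illustrative only; the released pipeline's scheme may differ.)"""
    counts: Dict[str, int] = Counter(concepts)
    return [1.0 / (counts[c] ** power) for c in concepts]

# Toy example: "person" dominates the raw data, "microscope" is rare.
concepts = ["person"] * 900 + ["chart"] * 90 + ["microscope"] * 10
weights = concept_balanced_weights(concepts)
resampled = random.choices(range(len(concepts)), weights=weights, k=3000)
print(Counter(concepts[i] for i in resampled))  # roughly balanced across concepts
```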
Project:
https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct
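For a quick local test, the Hugging Face checkpoint can likely be loaded through the standard transformers auto classes. The snippet below is a sketch under that assumption; the exact auto class, chat template, and prompt handling are not confirmed by this article and should be checked against the model card.

```python
# Sketch of loading the released checkpoint with Hugging Face transformers.
# Assumption: the repo exposes a standard processor and a causal-LM-style
# model via trust_remote_code; consult the model card for the exact usage.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")
messages = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```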
Key Points:
🌟 LLaVA-OneVision-1.5 is the latest open-source multimodal model, capable of handling multiple inputs such as images and videos.
📈 The training process is divided into three stages, aiming to efficiently enhance the model's visual and language comprehension abilities.
🏆 LLaVA-OneVision-1.5 performs excellently in benchmark tests, surpassing the Qwen2.5-VL model.