ByteDance and university partners launch Sa2VA, which integrates LLaVA for video understanding with SAM-2 for precise object segmentation, combining the two models' complementary capabilities for richer video analysis.
LLaVA-OneVision-1.5 is the latest step in a series that has evolved over two years from basic image-text alignment to unified image and video understanding. It provides an open, efficient three-stage training framework for building high-quality vision-language models.
AI highlights: 1. Alibaba's Qwen-Image excels at Chinese text rendering. 2. ChatGPT reaches 700M users and OpenAI's revenue hits $12B. 3. Anthropic tests Claude 4.1. 4. Zhipu launches Zread.ai. 5. xAI's Grok Imagine generates text-to-video. 6. Character.AI adds social features. 7. Alibaba and Nankai University release LLaVA-Scissor. 8. Beijing sets out its humanoid robot vision. 9. Eight AI models face off in Kaggle chess. 10. OpenMind launches the OM1 OS.
LLaVA-Mini is an efficient large multimodal model for image and video understanding.
A visual language model capable of step-by-step reasoning.
A vision-language model that processes both image and text inputs.
Research on video instruction tuning and synthetic data.
[Model API pricing table: Input tokens/M, Output tokens/M, and Context Length for models from Baidu, Alibaba, Google, Tencent, Deepseek, 01-ai, Baichuan, and Bytedance]
lmms-lab
LLaVA-OneVision-1.5 is a series of fully open-source large multimodal models that achieve state-of-the-art performance at lower cost by training on native-resolution images. The series performs strongly across multiple multimodal benchmarks, surpassing competitors such as Qwen2.5-VL.
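As a rough illustration of how such a checkpoint might be loaded, the sketch below goes through the generic transformers image-text-to-text interface; the repo id, the choice of Auto classes, and the chat-template call are assumptions rather than the official loading recipe.

```python
# Hedged sketch: loading a LLaVA-OneVision-1.5 checkpoint via generic
# transformers entry points. Repo id and Auto classes are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"  # hypothetical repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# One image plus one question in a single-turn chat.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # any local test image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```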
rp-yu
Dimple is the first discrete diffusion multimodal large language model (DMLLM) that combines autoregressive and diffusion training paradigms. After training on the same dataset as LLaVA-NEXT, it outperforms LLaVA-NEXT-7B by 3.9%.
Marwan02
This model is a GGUF format conversion of llava-hf/llava-1.5-7b-hf, supporting image-to-text generation tasks.
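A minimal sketch of running such a GGUF conversion locally with llama-cpp-python; the model and mmproj file names below are placeholders, since LLaVA GGUF repositories normally ship a language-model GGUF plus a separate CLIP projector (mmproj) file, and the exact files in this upload may differ.

```python
# Hedged sketch: local inference on a LLaVA-1.5 GGUF with llama-cpp-python.
# File names are placeholders for whatever the repository actually ships.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image tokens
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
)
print(response["choices"][0]["message"]["content"])
```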
rogerxi
Spatial-LLaVA-7B is a multimodal model fine-tuned from LLaVA that focuses on improving spatial-relationship reasoning, suitable for multimodal research and chatbot development.
SpursgoZmy
Table LLaVA 7B is an open-source multimodal chatbot specifically designed to understand table images and can perform various table-related tasks such as table question answering, table cell description, and structure understanding. This model is based on the LLaVA-v1.5 architecture, using CLIP-ViT-L-336px as the visual encoder and Vicuna-v1.5-7B as the base large language model.
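For readers unfamiliar with that design, the schematic sketch below shows the LLaVA-v1.5-style wiring the entry describes (vision encoder, MLP projector, LLM); the projector shape follows the published LLaVA-1.5 design, but this is an illustration, not Table LLaVA's actual code.

```python
# Schematic sketch of a LLaVA-v1.5-style model: CLIP vision encoder -> two-layer
# MLP projector -> LLM. The encoder and LLM are passed in as placeholders.
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. CLIP-ViT-L/14-336px
        self.language_model = language_model  # e.g. Vicuna-v1.5-7B
        # LLaVA-1.5 projects visual patch features into the LLM embedding
        # space with a two-layer MLP.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, text_embeds):
        # (batch, num_patches, vision_dim) patch features from the image
        visual_feats = self.vision_encoder(pixel_values)
        visual_tokens = self.projector(visual_feats)  # -> LLM embedding space
        # Prepend projected visual tokens to the text token embeddings so the
        # LLM attends over both modalities jointly.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```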
mradermacher
This project provides weighted/imatrix quantized versions of the llava-1.5-13b-hf model, in a range of quantization types to suit different usage scenarios.
tsunghanwu
SESAME is an open-source multimodal model built on LLaVA and fine-tuned on a variety of instruction-based image localization (segmentation) datasets.
This is a static quantized version of the llava-hf/llava-1.5-13b-hf model, offering multiple quantization type options to help users use this vision-language model more efficiently. The model supports image understanding and text generation tasks.
nkkbr
ViCA-7B is a vision-language model fine-tuned specifically for visual-spatial reasoning in indoor video environments. Built on the LLaVA-Video-7B-Qwen2 architecture and trained using the ViCA-322K dataset, it emphasizes structured spatial annotation and instruction-based complex reasoning tasks.
aiden200
A fine-tuned version of the lmms-lab/llava-onevision-qwen2-7b-ov model, supporting video-text-to-text tasks.
nezahatkorkmaz
Based on the Microsoft LLaVA-Med v1.5 (Mistral 7B) architecture, customized for Turkish-language medical visual question answering (VQA).
MLAdaptiveIntelligence
LLaVAction is a multimodal large language model evaluation and training framework for action recognition, based on the Qwen2 language model architecture, supporting first-person perspective video understanding.
LLaVAction is a multimodal large language model for action recognition, based on the Qwen2 language model, trained on the EPIC-KITCHENS-100-MQA dataset.
FriendliAI
LLaVA-NeXT-Video-7B-hf is a video-based multimodal model capable of processing video and text inputs to generate text outputs.
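A minimal sketch of that video-text-to-text flow through the llava-hf integration in transformers; the random frame array is only a stand-in so the example stays self-contained, and preprocessing details may differ from the model card.

```python
# Hedged sketch: video + text -> text with LLaVA-NeXT-Video via transformers.
# The dummy frames stand in for real decoded video (e.g. from PyAV or decord).
import numpy as np
import torch
from transformers import (
    LlavaNextVideoForConditionalGeneration,
    LlavaNextVideoProcessor,
)

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Eight dummy RGB frames; replace with frames sampled from a real clip.
frames = np.random.randint(0, 255, (8, 336, 336, 3), dtype=np.uint8)

messages = [{
    "role": "user",
    "content": [
        {"type": "video"},
        {"type": "text", "text": "What is happening in this video?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```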
Isotr0py
This model is based on the Transformers library; its specific purpose and functionality have not yet been documented.
YuchengShi
A multimodal foundation model fine-tuned from LLaVA-Med v1.5 (Mistral-7B), optimized for analyzing chest X-ray images and detecting pneumonia.
A multimodal foundation model fine-tuned from LLaVA-1.5-7B, optimized for detecting and interpreting plant leaf diseases.
X-iZhang
LLaVA-Med is an open-source large vision-language model optimized for biomedical applications, built on the LLaVA framework, enhanced through curriculum learning, and fine-tuned for open-ended biomedical question answering tasks.
zhibinlan
LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B, capable of embedding representations for text, images, multiple images, and videos.
LLaVE-0.5B is a 0.5-billion-parameter multimodal embedding model based on LLaVA-OneVision-0.5B, likewise capable of embedding text, images, multiple images, and videos.
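As a rough sketch of how embedding models like these are typically used for retrieval, the snippet below ranks candidates by cosine similarity in the shared embedding space; the placeholder embeddings and the embedding dimension are hypothetical, not LLaVE's actual API.

```python
# Hedged sketch: ranking image embeddings against a text-query embedding by
# cosine similarity. Random tensors stand in for the outputs of a multimodal
# embedding model such as LLaVE.
import torch
import torch.nn.functional as F

def rank_by_similarity(query_emb: torch.Tensor, candidate_embs: torch.Tensor):
    """Return candidate indices sorted by cosine similarity to the query."""
    query_emb = F.normalize(query_emb, dim=-1)            # (d,)
    candidate_embs = F.normalize(candidate_embs, dim=-1)  # (n, d)
    scores = candidate_embs @ query_emb                   # (n,)
    return torch.argsort(scores, descending=True), scores

# Hypothetical usage (dimension is illustrative):
query_emb = torch.randn(3584)        # e.g. the embedding of "a dog playing in snow"
image_embs = torch.randn(100, 3584)  # e.g. embeddings of 100 candidate images
order, scores = rank_by_similarity(query_emb, image_embs)
print(order[:5], scores[order[:5]])
```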