Researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University have open-sourced the multimodal large model LLaVA-1.5, which delivers strong results across 11 benchmarks, including visual question answering and image captioning. LLaVA-1.5 can be fully trained in about a day on just 8 A100 GPUs. The researchers also proposed adding output-format prompts during fine-tuning, allowing the model to adapt its responses to different tasks (a minimal sketch of this idea appears at the end of this section). With its strong multimodal understanding, LLaVA-1.5 directly challenges GPT-4V.
Confronting GPT-4V! Zhejiang University Alumni Open-Source Multimodal Model LLaVA-1.5: 13 Billion Parameters, Trained in One Day on 8 A100 GPUs

新智元
This article is from AIbase Daily
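To make the output-format prompting idea concrete, below is a minimal Python sketch of how a short-answer format hint can be appended to visual question answering questions when preparing fine-tuning data. The helper name `format_vqa_prompt` and the data layout are illustrative assumptions rather than LLaVA-1.5's actual code; the hint string follows the short-answer wording described in the LLaVA-1.5 paper.

```python
# Minimal sketch (not the official LLaVA-1.5 code): append a response-format
# hint to short-answer VQA questions while building instruction-tuning data.
# The helper name and usage below are illustrative assumptions.

SHORT_ANSWER_HINT = "Answer the question using a single word or phrase."

def format_vqa_prompt(question: str, short_answer: bool = True) -> str:
    """Build the text of a VQA user turn, optionally adding a format hint."""
    prompt = question.strip()
    if short_answer:
        # The hint nudges the model toward a terse answer instead of the long,
        # conversational responses typical of instruction-tuned chat models.
        prompt = f"{prompt}\n{SHORT_ANSWER_HINT}"
    return prompt

if __name__ == "__main__":
    print(format_vqa_prompt("What color is the car on the left?"))
    # What color is the car on the left?
    # Answer the question using a single word or phrase.
```

The design point is simple: the same base question can be reused for both open-ended dialogue and benchmark-style short answers, with the appended hint telling the model which output format the current task expects.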