Researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University have open-sourced the multimodal large model LLaVA-1.5, which delivers strong results across 11 benchmarks covering tasks such as visual question answering and image captioning. Training is notably lightweight: LLaVA-1.5 needs only 8 A100 GPUs and finishes in about a day. The researchers also propose adding output format prompts during fine-tuning, which helps the model adapt its responses to different tasks (a rough sketch of this idea follows below). With its strong multimodal understanding, LLaVA-1.5 poses a direct challenge to GPT-4V.
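
To make the "output format prompt" idea more concrete, here is a minimal Python sketch of what that data-preparation step could look like: a format hint is appended to the instruction when building fine-tuning samples, so the model learns when to answer briefly (e.g., for VQA-style benchmarks) and when to respond freely. The prompt strings, task names, and the build_instruction helper are illustrative assumptions, not LLaVA-1.5's actual training code.

```python
# Hypothetical sketch of appending output format prompts during fine-tuning
# data construction. Wording and helper names are assumptions for illustration.

# Format hints keyed by task type (assumed wording).
FORMAT_PROMPTS = {
    "short_vqa": "Answer the question using a single word or phrase.",
    "multiple_choice": "Answer with the option's letter from the given choices directly.",
    "open_ended": "",  # no constraint for free-form conversation data
}


def build_instruction(question: str, task_type: str) -> str:
    """Attach the task-specific output format prompt to a question."""
    hint = FORMAT_PROMPTS.get(task_type, "")
    return f"{question}\n{hint}" if hint else question


if __name__ == "__main__":
    print(build_instruction("What color is the car?", "short_vqa"))
    # -> What color is the car?
    #    Answer the question using a single word or phrase.
```

The point of the hint is to disambiguate the expected answer style at training time, so short-answer benchmarks and open-ended dialogue data can coexist in one fine-tuning mix without the model defaulting to one response length for everything.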