Researchers from Anhui Polytechnic University, Nanyang Technological University, and Lehigh University have open-sourced TinyGPT-V, a multi-modal large language model. The model rivals the performance of far larger multi-modal models, yet it requires only a 24 GB GPU for training. TinyGPT-V is built from three main components: the Phi-2 large language model, a visual encoder, and a linear projection layer that bridges the two (sketched in the code below). The researchers evaluated TinyGPT-V from multiple angles, demonstrating strong performance across a range of visual-language tasks.
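
To make the architecture concrete, here is a minimal PyTorch sketch of that layout. This is not the authors' implementation: the `TinyGPTVSketch` class, the placeholder modules, and the 1408/2560 feature dimensions are illustrative assumptions. The key point it shows is that a single trainable linear projection maps visual tokens into the language model's embedding space, where they are concatenated with the text embeddings.

```python
# A minimal sketch (not the authors' code) of TinyGPT-V's high-level layout:
# features from a frozen visual encoder are mapped by a linear projection
# into the embedding space of the Phi-2 language model.

import torch
import torch.nn as nn

class TinyGPTVSketch(nn.Module):
    def __init__(self, vision_dim: int = 1408, llm_dim: int = 2560):
        super().__init__()
        # Stand-ins for the real components: a frozen ViT-style visual
        # encoder and the Phi-2 decoder. Dimensions are illustrative
        # assumptions, not confirmed values.
        self.visual_encoder = nn.Identity()               # placeholder for the frozen encoder
        self.projection = nn.Linear(vision_dim, llm_dim)  # trainable bridge
        self.llm = nn.Identity()                          # placeholder for Phi-2

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor):
        # Project visual tokens into the LLM's embedding space and
        # prepend them to the text-token embeddings.
        vis = self.projection(self.visual_encoder(image_feats))
        inputs = torch.cat([vis, text_embeds], dim=1)
        return self.llm(inputs)

model = TinyGPTVSketch()
img = torch.randn(1, 32, 1408)    # a batch of 32 visual tokens
txt = torch.randn(1, 16, 2560)    # 16 text-token embeddings
out = model(img, txt)
print(out.shape)                  # torch.Size([1, 48, 2560])
```

Because the encoder and language model can stay frozen, only the small projection layer needs gradients, which is consistent with the modest 24 GB training footprint the researchers report.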