The CogVLM-17B multi-modal model, developed jointly by Tsinghua University and Zhipu AI, has achieved state-of-the-art performance on multiple benchmarks. Rather than the shallow alignment used by many earlier visual-language models, CogVLM achieves deep fusion of visual and language features through a trainable visual expert module, improving performance and supporting capabilities such as object detection (visual grounding) and text recognition (OCR). The article also notes the rapid development of competing multi-modal models, signaling intense competition in the multi-modal AI field, and positions CogVLM-17B as a challenger to GPT-4V's leading position.
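For readers who want to try the model, below is a minimal inference sketch in Python following the usage pattern documented on the Hugging Face model card for THUDM/cogvlm-chat-hf. The helper build_conversation_input_ids is custom code shipped with the checkpoint (loaded via trust_remote_code), so exact names and signatures may vary between releases; the image path and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# CogVLM's chat checkpoint reuses the Vicuna tokenizer, per the model card.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

# trust_remote_code pulls in the model's custom modeling code,
# including the visual expert module described in the paper.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

query = "Describe this image."          # placeholder prompt
image = Image.open("example.jpg").convert("RGB")  # placeholder image path

# Custom helper provided by the remote code: packs text + image
# into the token/feature layout the model expects.
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    # Strip the prompt tokens, keep only the generated answer.
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0]))
```

Grounding-style queries (e.g., asking for an object's bounding box) follow the same pattern but are served by the separate grounding checkpoint the authors released alongside the chat model.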