As the integration of vision and language deepens, understanding text in images has become a new frontier of multimodal research. KOSMOS-2.5 is a groundbreaking multimodal model that uses a unified Transformer framework to read text-intensive images end to end. It performs strongly on text-rich image tasks, notably document text recognition, which produces spatially-aware text blocks together with their coordinates, and Markdown generation, which captures a document's structure and style as structured text. Through joint multi-task training on these objectives, KOSMOS-2.5 strengthens its multimodal understanding and aims to interpret text in images reliably enough for practical document-processing scenarios.
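To make the joint multi-task idea concrete, the sketch below shows, in plain PyTorch, how a single shared decoder can be trained on both objectives at once: each batch is prefixed with a task-prompt token (one for text recognition, one for Markdown generation) and the two language-modeling losses are summed before a single optimizer step. All module names, token ids, and hyperparameters here are illustrative assumptions for a toy model, not KOSMOS-2.5's actual implementation.

```python
# Minimal sketch of joint multi-task training with one shared decoder.
# Batches from the two tasks differ only in their leading task-prompt token;
# both losses update the same weights. Names and sizes are hypothetical.
import torch
import torch.nn as nn

VOCAB = 1000                    # toy vocabulary size (assumption)
OCR_PROMPT, MD_PROMPT = 1, 2    # hypothetical task-prompt token ids


class ToyDecoder(nn.Module):
    """Stand-in for the shared decoder-only Transformer."""

    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)


def lm_loss(model, tokens):
    """Next-token prediction loss over a batch of token sequences."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
    )


model = ToyDecoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Dummy batches: each sequence starts with its task-prompt token, followed by
# the target tokens (text blocks + coordinates for OCR, Markdown tokens for MD).
ocr_batch = torch.randint(3, VOCAB, (4, 32)); ocr_batch[:, 0] = OCR_PROMPT
md_batch  = torch.randint(3, VOCAB, (4, 32)); md_batch[:, 0] = MD_PROMPT

# Joint objective: sum the two task losses and take one optimizer step.
opt.zero_grad()
loss = lm_loss(model, ocr_batch) + lm_loss(model, md_batch)
loss.backward()
opt.step()
```

The key design point this illustrates is that task switching is handled entirely by the input prompt rather than by separate heads, so the same decoder weights serve both the text-recognition and Markdown-generation objectives.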