A multimodal model developed by the ictnlp team that enhances performance with only one visual token. It is open source and free, suitable for scenarios requiring rapid and accurate understanding of visual content.