Griffon v2 is a high-resolution multimodal model that lets users refer to objects flexibly through either text or visual cues. To handle high-resolution input, the team uses a downsampling projector, which strengthens the model's multimodal perception. The model also features a visual-language co-referring structure, so a target object can be specified by a textual description or by a visual prompt. It performs strongly on referring expression comprehension, phrase grounding, and referring expression generation, and it reportedly outperforms expert models in object detection and object counting.
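To make the role of a downsampling projector concrete, here is a minimal sketch of the general idea: a grid of visual features is reduced to fewer tokens before being handed to a language model. The summary does not describe Griffon v2's actual projector, so the average pooling used here, the function name `downsample_tokens`, the stride of 2, and the toy 4 x 4 grid are all assumptions chosen for illustration.

```python
# Illustrative only: average pooling stands in for the model's projector,
# whose actual design is not described in the summary.

def downsample_tokens(grid, stride=2):
    """Average-pool an H x W grid of feature vectors with a stride x stride window."""
    height, width = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for top in range(0, height - stride + 1, stride):
        row = []
        for left in range(0, width - stride + 1, stride):
            # Collect the feature vectors that fall inside the current window.
            window = [grid[top + dy][left + dx]
                      for dy in range(stride) for dx in range(stride)]
            # Average each feature dimension across the window.
            row.append([sum(vec[k] for vec in window) / len(window)
                        for k in range(dim)])
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 grid of 2-dimensional visual features (16 tokens).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = downsample_tokens(grid, stride=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(pooled) * len(pooled[0])
print(tokens_before, tokens_after)
```

With a stride of 2 the token count drops by a factor of four, which is why this kind of step matters for high-resolution images: the number of visual tokens grows with the square of the image side length, while the language model's input budget stays fixed.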