Currently, AI image understanding has an underlying weakness.

When asked "What is in this picture?", it can give a detailed answer. But when asked "Where is the left hind leg of the panda in the image?", it becomes vague. This isn't a flaw of individual models, but a long-standing problem across the entire vision-language model field: strong global understanding, weak local localization.

Google DeepMind proposed TIPSv2 in a recent paper specifically to tackle this challenge.


The research team discovered a counterintuitive phenomenon: in fine-grained segmentation tasks, smaller "student models" often outperform the larger "teacher models" they were distilled from. The reason is that distillation removes the masking mechanism, forcing the student to match the teacher on every region of the image, which amounts to "full-area supervision." Inspired by this finding, TIPSv2 makes three key improvements.

The first is iBOT++. Traditional pre-training computes the loss only over masked regions of the image, leaving the visible patches unsupervised and letting their local semantics drift. iBOT++ applies precise supervision to all patches, visible ones included, effectively upgrading from a "puzzle game" to a "careful reading of the entire text." This change alone improved zero-shot segmentation performance by 14.1 percentage points.
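The difference can be sketched as a change in which patches the per-patch loss averages over. This is a minimal toy illustration with random features and a soft cross-entropy target, not the paper's actual loss or dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_loss(student_logits, teacher_probs, weight_mask):
    """Per-patch cross-entropy between student predictions and teacher
    targets, averaged over the patches selected by weight_mask."""
    log_p = student_logits - np.log(np.exp(student_logits).sum(-1, keepdims=True))
    ce = -(teacher_probs * log_p).sum(-1)          # shape: (num_patches,)
    return (ce * weight_mask).sum() / weight_mask.sum()

num_patches, dim = 16, 8
student_logits = rng.normal(size=(num_patches, dim))
teacher_probs = np.exp(rng.normal(size=(num_patches, dim)))
teacher_probs /= teacher_probs.sum(-1, keepdims=True)

# Pretend every third patch was masked out of the student's input.
masked = (np.arange(num_patches) % 3 == 0).astype(float)

# Classic iBOT: supervise only the masked patches.
loss_masked_only = patch_loss(student_logits, teacher_probs, masked)

# iBOT++-style: supervise every patch, visible ones included.
loss_all_patches = patch_loss(student_logits, teacher_probs, np.ones(num_patches))
```

The visible patches thus contribute gradient on every step instead of being skipped, which is what keeps their local features from drifting.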

The second is Head-only EMA. Traditional self-supervised training keeps two nearly identical large models in memory (the student and its EMA teacher), which is very resource-intensive. TIPSv2 found that the image-text contrastive loss is already sufficient to stabilize the backbone, so the EMA copy is needed only for the final projection head and the backbone no longer has to be duplicated. This cut the parameters held during training by about 42%, making training faster with almost no loss in performance.
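The mechanics can be sketched as follows. The parameter names and shapes are illustrative, not from the released code; the point is that only the small head gets a duplicated, EMA-updated teacher copy, while the backbone is shared:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """EMA: teacher <- momentum * teacher + (1 - momentum) * student."""
    for k in teacher_params:
        teacher_params[k] = momentum * teacher_params[k] + (1 - momentum) * student_params[k]

rng = np.random.default_rng(0)

# Backbone: used by both student and teacher, no second copy in memory.
backbone = {"w1": rng.normal(size=(4, 4))}

# Projection head: the only part duplicated for the teacher.
student_head = {"proj": rng.normal(size=(4, 2))}
teacher_head = {k: v.copy() for k, v in student_head.items()}

# One training step: pretend the optimizer moved the student head,
# then EMA-update only the tiny teacher head.
student_head["proj"] += 0.1
ema_update(teacher_head, student_head)
```

With a large backbone and a small head, dropping the backbone's teacher copy is where the roughly 42% reduction in training-time parameters comes from.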

The third is multi-granularity text pairing. During training, short web captions, medium-length detailed descriptions, and long descriptions generated by Gemini are randomly mixed and fed into the model, alternating between easy and hard targets. This prevents the model from "slacking off" on overly simple captions while ensuring no details are lost.

The final results are solid. TIPSv2 was evaluated with a frozen backbone on nine tasks across 20 standard datasets. It set a new state of the art in zero-shot semantic segmentation, outperformed comparison models with 56% more parameters on image-text retrieval and classification, and also ranked among the top performers on pure vision tasks.

The code and model weights of TIPSv2 have been fully open-sourced. For teams working on medical imaging, autonomous driving, industrial inspection, and other fields requiring high-precision image understanding, this approach is worth careful evaluation.

Paper link: https://www.alphaxiv.org/abs/2604.12012