Currently, AI image understanding has an underlying weakness.

When asked "What is in this picture?", it can give a detailed answer. But when asked "Where is the left hind leg of the panda in the image?", it becomes vague. This isn't a flaw of individual models, but a long-standing problem across the entire vision-language model field: strong global understanding, weak local localization.

Google DeepMind proposed TIPSv2 in a recent paper specifically to tackle this challenge.


The research team discovered a counterintuitive phenomenon: in fine-grained segmentation tasks, smaller "student models" often outperform the larger "teacher models" they were distilled from. The reason is that distillation removes the masking mechanism, forcing the student to match the teacher on every region of the image, which amounts to "full-area supervision." Inspired by this finding, TIPSv2 makes three key improvements.

The first is iBOT++. Traditional pre-training computes the loss only over masked regions of the image, leaving the visible patches unsupervised and letting their local semantics drift. iBOT++ applies precise supervision to all patches, visible ones included, effectively upgrading from a "puzzle game" to a "careful reading of the entire text." This change alone improved zero-shot segmentation performance by 14.1 percentage points.
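The difference can be sketched as a change in which patches the per-patch loss averages over. This is a minimal toy illustration with random features and a soft cross-entropy target, not the paper's actual loss or dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_loss(student_logits, teacher_probs, weight_mask):
    """Per-patch cross-entropy between student predictions and teacher
    targets, averaged over the patches selected by weight_mask."""
    log_p = student_logits - np.log(np.exp(student_logits).sum(-1, keepdims=True))
    ce = -(teacher_probs * log_p).sum(-1)          # shape: (num_patches,)
    return (ce * weight_mask).sum() / weight_mask.sum()

num_patches, dim = 16, 8
student_logits = rng.normal(size=(num_patches, dim))
teacher_probs = np.exp(rng.normal(size=(num_patches, dim)))
teacher_probs /= teacher_probs.sum(-1, keepdims=True)

# Pretend every third patch was masked out of the student's input.
masked = (np.arange(num_patches) % 3 == 0).astype(float)

# Classic iBOT: supervise only the masked patches.
loss_masked_only = patch_loss(student_logits, teacher_probs, masked)

# iBOT++-style: supervise every patch, visible ones included.
loss_all_patches = patch_loss(student_logits, teacher_probs, np.ones(num_patches))
```

The visible patches thus contribute gradient on every step instead of being skipped, which is what keeps their local features from drifting.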

The second is Head-only EMA. Traditional self-supervised training keeps two nearly identical large models in memory (the student and its EMA teacher), which is very resource-intensive. TIPSv2 found that the image-text contrastive loss is already sufficient to stabilize the backbone, so the EMA copy is needed only for the final projection head and the backbone no longer has to be duplicated. This cut the parameters held during training by about 42%, making training faster with almost no loss in performance.
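The mechanics can be sketched as follows. The parameter names and shapes are illustrative, not from the released code; the point is that only the small head gets a duplicated, EMA-updated teacher copy, while the backbone is shared:

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """EMA: teacher <- momentum * teacher + (1 - momentum) * student."""
    for k in teacher_params:
        teacher_params[k] = momentum * teacher_params[k] + (1 - momentum) * student_params[k]

rng = np.random.default_rng(0)

# Backbone: used by both student and teacher, no second copy in memory.
backbone = {"w1": rng.normal(size=(4, 4))}

# Projection head: the only part duplicated for the teacher.
student_head = {"proj": rng.normal(size=(4, 2))}
teacher_head = {k: v.copy() for k, v in student_head.items()}

# One training step: pretend the optimizer moved the student head,
# then EMA-update only the tiny teacher head.
student_head["proj"] += 0.1
ema_update(teacher_head, student_head)
```

With a large backbone and a small head, dropping the backbone's teacher copy is where the roughly 42% reduction in training-time parameters comes from.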

The third is multi-granularity text pairing. During training, short web captions, medium-length detailed descriptions, and long descriptions generated by Gemini are randomly mixed and fed into the model, alternating between easy and hard targets. This prevents the model from "slacking off" on overly simple captions while ensuring no details are lost.

The final results are solid. TIPSv2 was evaluated with a frozen backbone on nine tasks across 20 standard datasets. It set a new state of the art in zero-shot semantic segmentation, outperformed comparison models with 56% more parameters on image-text retrieval and classification, and also ranked among the top performers on pure vision tasks.

The code and model weights of TIPSv2 have been fully open-sourced. For teams working on medical imaging, autonomous driving, industrial inspection, and other fields requiring high-precision image understanding, this approach is worth careful evaluation.

Paper link: https://www.alphaxiv.org/abs/2604.12012