The Meta AI research team has made another breakthrough in artificial intelligence, officially releasing its new video understanding model, V-JEPA 2 (Video Joint Embedding Predictive Architecture 2), on June 11, 2025. Developed under the direction of Meta's chief AI scientist Yann LeCun, the model pairs innovative self-supervised learning with zero-shot robotic control, opening up new possibilities for video understanding and physical-world modeling. AIbase provides an in-depth analysis of this cutting-edge technology and its potential impact.
V-JEPA 2: A "World Model" for Video Understanding
V-JEPA 2 is a non-generative AI model focused on video understanding: by observing video content, it can infer what is happening and predict how a scene will unfold. Unlike traditional video analysis models, V-JEPA 2 mimics human cognition, using self-supervised learning to extract abstract representations from massive amounts of unannotated video and build an internal understanding of the physical world. This "world model" architecture enables it not only to understand object interactions in videos but also to predict object motion trajectories and scene changes.
According to Meta’s official introduction, V-JEPA 2 was trained on over 1 million hours of video spanning a wide range of scenes and interactions. This large-scale training gives the model strong generalization capabilities, allowing it to adapt to new tasks and unfamiliar environments without additional training.
Technical Innovation: Five Highlights Driving the Future of AI
V-JEPA 2's technical breakthroughs center on the following five core aspects:
Self-Supervised Learning: V-JEPA 2 does not depend on large amounts of labeled data; instead, it extracts knowledge from unlabeled videos through self-supervised learning, significantly reducing data preparation costs.
Occlusion Prediction Mechanism: During training, random regions of each video are masked and the model must predict the hidden content, much like a fill-in-the-blank exercise, which forces it to learn the deep semantics of videos (see the sketch after this list).
Abstract Representation Learning: Unlike traditional pixel-level reconstruction, V-JEPA 2 learns the abstract meaning of videos, capturing the relationships and dynamics between objects rather than memorizing visual details.
World Model Architecture: The model builds an internal understanding of the physical world, allowing it to "imagine" how objects move and interact, such as predicting the trajectory of a bouncing ball or the outcome of a collision.
Efficient Transfer Capability: Grounded in its understanding of the physical world, V-JEPA 2 adapts quickly to new tasks, demonstrating strong zero-shot learning, particularly in robotic control.
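The second and third highlights can be made concrete with a small sketch. The PyTorch code below is a minimal, hypothetical training step in the JEPA style, not Meta's released implementation: the transformer sizes, the 75% mask ratio, the EMA target encoder, and the smooth L1 loss are illustrative assumptions. The key point it demonstrates is that masked patches are predicted in embedding space, so the loss never touches raw pixels.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical masked latent-prediction step in the JEPA style.
# Patch counts, embedding size, mask ratio, and the EMA coefficient
# are illustrative choices, not Meta's actual configuration.
NUM_PATCHES, EMBED_DIM, MASK_RATIO, EMA = 256, 768, 0.75, 0.998

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True),
    num_layers=2,
)
predictor = nn.Linear(EMBED_DIM, EMBED_DIM)
target_encoder = copy.deepcopy(encoder)  # frozen copy used to produce targets
for p in target_encoder.parameters():
    p.requires_grad_(False)

def training_step(patch_tokens: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (batch, NUM_PATCHES, EMBED_DIM) video patch embeddings."""
    batch = patch_tokens.size(0)
    # Randomly mask ~75% of spatiotemporal patches ("fill in the blank").
    mask = torch.rand(batch, NUM_PATCHES) < MASK_RATIO
    visible = patch_tokens.masked_fill(mask.unsqueeze(-1), 0.0)

    # Predict embeddings of the masked patches from the visible context...
    pred = predictor(encoder(visible))
    # ...and compare against target embeddings, never against pixels.
    with torch.no_grad():
        target = target_encoder(patch_tokens)
    return nn.functional.smooth_l1_loss(pred[mask], target[mask])

loss = training_step(torch.randn(4, NUM_PATCHES, EMBED_DIM))
loss.backward()
# After each optimizer step, the target encoder's weights would be updated
# as an exponential moving average of the encoder (coefficient EMA).
```

Because the targets are abstract embeddings rather than pixels, the model is rewarded for capturing what is in the scene and how it is changing, not for reproducing visual texture.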
These innovations enable V-JEPA 2 to excel at tasks such as video classification, action recognition, and spatiotemporal action detection, outperforming traditional models while improving training efficiency by 1.5 to 6 times.
Zero-Shot Robotic Control: A Bridge Between AI and the Real World
One of V-JEPA 2's most notable applications is zero-shot robotic control. Traditional robot control pipelines, often built around task-specific perception models such as YOLO, require extensive training for each task, whereas V-JEPA 2, with its powerful transfer capabilities and understanding of the physical world, can direct robots to complete new tasks without prior specialized training. For example, a robot can interpret its environment in real time from video input and carry out operations such as moving objects or navigating unfamiliar scenes.
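Conceptually, zero-shot control with a world model can be framed as planning in embedding space: candidate action sequences are rolled out through the learned predictor, and the robot executes the first action of whichever rollout ends closest to an embedding of the goal. The sketch below illustrates that idea in simplified form; it is not Meta's control stack, and the `encode` and `world_model` stand-ins plus the random-shooting planner are assumptions for demonstration.

```python
import torch

# Illustrative embedding-space planner (random-shooting style). The encoder
# and world model below are toy stand-ins for a learned V-JEPA 2-like
# predictor, NOT Meta's released interfaces.
ACTION_DIM, EMBED_DIM, NUM_CANDIDATES, HORIZON = 7, 1024, 256, 5

def encode(observation: torch.Tensor) -> torch.Tensor:
    """Stand-in for the video encoder: observation -> embedding."""
    return observation.mean(dim=-1, keepdim=True).expand(-1, EMBED_DIM)

def world_model(state_emb: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """Stand-in for an action-conditioned predictor: next embedding."""
    return state_emb + 0.01 * action.sum(dim=-1, keepdim=True)

def plan_action(current_obs: torch.Tensor, goal_obs: torch.Tensor) -> torch.Tensor:
    """Pick the first action of the best of NUM_CANDIDATES imagined rollouts."""
    state = encode(current_obs).repeat(NUM_CANDIDATES, 1)
    goal = encode(goal_obs)
    actions = torch.randn(NUM_CANDIDATES, HORIZON, ACTION_DIM)
    for t in range(HORIZON):
        state = world_model(state, actions[:, t])  # "imagine" the future
    # Score each rollout by its distance to the goal embedding; lower is better.
    cost = (state - goal).pow(2).sum(dim=-1)
    return actions[cost.argmin(), 0]               # execute the best first action

best_action = plan_action(torch.randn(1, 64), torch.randn(1, 64))
```

The appeal of this design is that no task-specific policy is trained: the same world model scores rollouts for any goal that can be expressed as a target observation.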
Meta stated that V-JEPA 2’s "world model" capability holds great potential in robotics. For instance, by watching videos, robots can internalize physical laws such as gravity and collision, and then complete complex real-world tasks such as cooking or household assistance. This capability lays the groundwork for future intelligent robots and augmented reality (AR) devices.
Performance Comparison: A Leap in Speed and Efficiency
According to Meta’s official data, V-JEPA 2 performs strongly across multiple benchmarks, especially on action understanding and video tasks, surpassing traditional models based on ViT-L/16 and Hiera-L encoders. Compared with NVIDIA’s Cosmos model, V-JEPA 2 trains 30 times faster, a substantial efficiency advantage. The model is also particularly effective in low-data settings, reaching high accuracy with only a small amount of labeled data, further evidence of its strong generalization.
Open-Source Release: Promoting Global AI Research
In line with its open-science philosophy, Meta released V-JEPA 2 under the CC-BY-NC license, making it freely available to researchers and developers worldwide. The model code is publicly available on GitHub and can be run on platforms such as Google Colab and Kaggle. Meta also released three physical-reasoning benchmarks (MVPBench, IntPhys2, and CausalVQA), providing standardized evaluation tools for research on video understanding and robotic control.
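As a practical starting point, loading the public weights might look like the sketch below. This assumes the model is exposed through a Hugging Face transformers integration; the checkpoint identifier, the processor class, and the input layout are assumptions that should be verified against the official GitHub repository and model cards.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoVideoProcessor

# Hypothetical loading sketch: the checkpoint name and input layout below
# are assumptions to verify against the official repository
# (https://github.com/facebookresearch/vjepa2) and its model cards.
CHECKPOINT = "facebook/vjepa2-vitl-fpc64-256"  # assumed Hugging Face ID

processor = AutoVideoProcessor.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

# A dummy clip: 64 RGB frames of 256x256, as a list of HxWxC uint8 arrays.
frames = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
          for _ in range(64)]
inputs = processor(frames, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The hidden states are the abstract video representations on which
# downstream tasks (classification, action recognition, planning) build.
print(outputs.last_hidden_state.shape)
```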
Future Outlook: A Milestone Toward Universal Intelligence
The release of V-JEPA 2 is an important step in Meta's pursuit of **Advanced Machine Intelligence (AMI)**. In a video, Yann LeCun stated, “The world model will usher in a new era of robotics technology, allowing AI agents to complete real-world tasks without massive training data.” Going forward, Meta plans to extend V-JEPA 2 with audio analysis and long-video understanding, providing stronger support for applications such as AR glasses and virtual assistants.
AIbase believes that the launch of V-JEPA 2 is not only a technical breakthrough in video understanding but also marks AI’s transition from single-task processing toward universal intelligence. Its zero-shot robotic control capability opens up vast possibilities for robotics, the metaverse, and intelligent interactive devices.
AIbase Conclusion
With its innovative self-supervised learning and world model architecture, Meta’s V-JEPA 2 brings disruptive changes to video understanding and robotic control. From live-streaming e-commerce to smart homes, the model’s broad application prospects are highly anticipated.