AIbase Report: The University of Hong Kong and Kuaishou's Kling team recently published a paper titled "Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval," proposing a "Context-as-Memory" approach that tackles a core challenge of long video generation: keeping scenes consistent over time.

Innovative Concept: Treating Historical Context as a "Memory" Carrier

The core innovation of this study lies in treating historically generated context as "memory" and conditioning generation on it through in-context learning, thereby achieving strong scene consistency across a long video. The research team found that video generation models can implicitly learn 3D priors from video data without explicit 3D modeling, a finding that aligns with the design philosophy of Google's Genie 3.
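To make the idea concrete, here is a minimal Python sketch of how such a pipeline could be organized: previously generated frames and their camera poses are kept in a memory bank, and each new clip is generated conditioned on frames retrieved from that bank. All names here (MemoryBank, generate_long_video, model.generate, retrieve) are hypothetical illustrations under these assumptions, not the paper's actual API.

```python
# Sketch of the Context-as-Memory idea: historical frames are not discarded
# but kept as a memory bank, and each new clip is generated conditioned on
# frames retrieved from that bank. The model interface is a placeholder.

from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    frames: list = field(default_factory=list)  # all previously generated frames
    poses: list = field(default_factory=list)   # camera pose for each frame

    def add(self, new_frames, new_poses):
        self.frames.extend(new_frames)
        self.poses.extend(new_poses)

def generate_long_video(model, init_image, trajectory, retrieve, clip_len=16):
    """Autoregressively generate a long video, one clip at a time."""
    memory = MemoryBank()
    memory.add([init_image], [trajectory[0]])
    for start in range(1, len(trajectory), clip_len):
        target_poses = trajectory[start:start + clip_len]
        # Select only the historical frames relevant to the upcoming clip,
        # instead of conditioning on the entire (unbounded) history.
        context = retrieve(memory, target_poses)
        clip = model.generate(context=context, camera=target_poses)
        memory.add(clip, target_poses)
    return memory.frames
```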

Technical Breakthrough: FOV-Based Memory Retrieval Mechanism Significantly Improves Efficiency

To address the computational burden of a theoretically unbounded history of frames, the research team proposed a memory retrieval mechanism based on the field of view (FOV) along the camera trajectory. From all historical frames, the mechanism selects only those most relevant to the clip currently being generated to serve as memory conditions, significantly improving computational efficiency and reducing training costs.

Through this dynamic retrieval strategy, the system judges the relevance between the frames to be predicted and historical frames by how much their camera FOVs overlap, greatly reducing the number of context frames the model must attend to and substantially improving training and inference efficiency.
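As an illustration of this FOV-based selection, the sketch below ranks historical frames by a crude view-overlap heuristic and keeps the top-k as memory conditions; it plugs into the retrieve slot of the earlier sketch. The scoring function is a stand-in assumption: the paper's criterion derives overlap from camera frusta along the trajectory, whereas this proxy only combines viewing-angle difference with camera distance.

```python
import numpy as np

def fov_overlap_score(pose_a, pose_b, fov_deg=90.0):
    """Crude proxy for FOV overlap between two camera poses.

    Each pose is assumed to be (position: (3,), view_dir: unit (3,)).
    Real frustum intersection is more involved; this heuristic only
    illustrates ranking historical frames by view overlap.
    """
    pos_a, dir_a = pose_a
    pos_b, dir_b = pose_b
    angle = np.degrees(np.arccos(np.clip(np.dot(dir_a, dir_b), -1.0, 1.0)))
    dist = np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b))
    if angle > fov_deg:  # looking in sufficiently different directions
        return 0.0
    return (1.0 - angle / fov_deg) / (1.0 + dist)

def retrieve(memory, target_poses, k=8):
    """Pick the k historical frames whose views best overlap the target clip."""
    scores = [
        max(fov_overlap_score(hist_pose, tgt) for tgt in target_poses)
        for hist_pose in memory.poses
    ]
    top = np.argsort(scores)[::-1][:k]
    return [memory.frames[i] for i in sorted(top)]  # keep temporal order
```

Because only the k retrieved frames enter the context window, the cost of each generation step stays roughly constant no matter how long the video grows.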

Data Construction and Application Scenarios

The research team built a diverse long-video dataset with precise camera trajectory annotations using Unreal Engine 5, providing a solid foundation for validating the technique. At inference time, a user needs only to provide an initial image and can then freely explore the generated virtual world along a chosen camera trajectory.

Performance Exceeds Existing Methods

Experimental results show that Context-as-Memory maintains excellent static scene memory over time spans of tens of seconds and generalizes well across different scenes. Compared with existing SOTA methods, it achieves significant improvements in scene memory for long video generation and maintains memory continuity even in unseen, open-domain scenarios.

This breakthrough marks an important step for AI video generation toward longer time horizons and higher consistency, opening up new possibilities for applications such as virtual world construction and film production.