Recently, researchers from Nanyang Technological University, Peking University's Wangxuan Institute of Computer Technology, and the Shanghai Artificial Intelligence Laboratory jointly open-sourced a long-memory world model called WORLDMEM. The model targets long-term consistency in virtual environments, in particular keeping 3D scenes coherent as the viewpoint changes or time passes, which in turn markedly improves the user experience.
The core of WORLDMEM is its memory mechanism. The model maintains a memory bank of units, each storing scene information together with the state data (such as viewpoint and timestamp) under which it was observed. When the perspective or the point in time changes, the model draws on these stored observations to reconstruct the scene accurately. This breaks through the limitation of traditional methods that rely on a short context window, allowing environmental details to be preserved over long horizons.
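To make the idea concrete, here is a minimal sketch of what such a memory bank could look like, assuming each unit pairs a rendered frame with the pose and timestep at which it was observed. The names (`MemoryUnit`, `MemoryBank`) and fields are illustrative assumptions, not WORLDMEM's actual API.

```python
from dataclasses import dataclass, field

import numpy as np

# Illustrative sketch only: names and fields are assumptions,
# not the structures used in the WorldMem repository.

@dataclass
class MemoryUnit:
    frame: np.ndarray   # rendered observation of the scene
    pose: np.ndarray    # viewpoint state: position and orientation
    timestep: int       # simulation time at which it was observed

@dataclass
class MemoryBank:
    units: list[MemoryUnit] = field(default_factory=list)

    def add(self, frame: np.ndarray, pose: np.ndarray, timestep: int) -> None:
        """Store a newly observed scene together with its state data."""
        self.units.append(MemoryUnit(frame, pose, timestep))
```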
When generating new scenes, WORLDMEM retrieves the most relevant entries from this memory bank. Retrieval matches the stored states against the current time, viewpoint, and scene state, so the extracted information actually fits the frame being generated. For example, when a virtual character wanders through the environment and returns to its starting position, the model finds the earlier memory frames and keeps the regenerated scene consistent with what was there before.
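A hedged sketch of such a retrieval step, building on the `MemoryBank` above: the scoring rule here (pose distance plus a recency term) is a simple stand-in for the model's actual matching logic, which this article does not detail.

```python
def retrieve(bank: MemoryBank, query_pose: np.ndarray,
             query_timestep: int, k: int = 8) -> list[MemoryUnit]:
    """Return the k memory units whose stored state best matches the query."""
    def score(unit: MemoryUnit) -> float:
        pose_dist = float(np.linalg.norm(unit.pose - query_pose))    # viewpoint similarity
        recency = 1.0 / (1.0 + abs(query_timestep - unit.timestep))  # temporal closeness
        return -pose_dist + recency

    return sorted(bank.units, key=score, reverse=True)[:k]
```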
Additionally, WORLDMEM updates its memory dynamically: as the virtual world evolves, newly generated scenes and their states are appended to the memory bank, so the model always holds an accurate record of the latest environment and the quality of scene generation benefits accordingly. Architecturally, the model is built on a conditional diffusion transformer that integrates external action signals, enabling first-person generation in which a character can move and interact freely within the virtual world.
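The dynamic update can be pictured as a read-generate-write loop, sketched below. `model.generate` is a hypothetical stand-in for the conditional diffusion transformer's sampling call, with an assumed signature; it is not a real function from the repository.

```python
def simulation_step(model, bank: MemoryBank, pose: np.ndarray,
                    action: int, timestep: int) -> np.ndarray:
    """Retrieve relevant memories, condition generation on them plus
    the current action, then store the result back in the bank."""
    memories = retrieve(bank, query_pose=pose, query_timestep=timestep)
    # Hypothetical sampling call on the conditional diffusion transformer.
    frame = model.generate(memories=memories, pose=pose, action=action)
    bank.add(frame, pose, timestep)  # dynamic update keeps the bank current
    return frame
```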
For training, WORLDMEM uses the diffusion forcing technique, which enables long-horizon simulation over time while keeping generated scenes coherent and letting the model respond reliably to different action instructions and scene changes. Action signals are projected into the embedding space and combined with the denoising-timestep embeddings, which strengthens the model's responsiveness to those signals.
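As an illustration of that conditioning scheme, the PyTorch sketch below projects discrete action IDs into the embedding space and adds them to a denoising-timestep embedding, so the conditioning vector carries both the action and the noise level. The dimensions and layer choices are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ActionConditioning(nn.Module):
    """Sketch: fuse an action embedding with the denoising-timestep embedding."""

    def __init__(self, num_actions: int, dim: int = 512):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, action_ids: torch.Tensor, t_embed: torch.Tensor) -> torch.Tensor:
        # t_embed: precomputed (e.g. sinusoidal) timestep embedding, shape (B, dim)
        return self.action_embed(action_ids) + self.time_mlp(t_embed)

# Usage: one conditioning vector per frame in the batch.
cond = ActionConditioning(num_actions=16)(torch.tensor([3]), torch.randn(1, 512))
```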
The release of WORLDMEM marks an important advancement in virtual environment simulation technology, providing strong support for future virtual reality applications.
Open-source repository: https://github.com/xizaoqu/WorldMem
Key points:
🌍 WORLDMEM is an open-source long-memory world model aimed at improving consistency and coherence in virtual environments.
🔍 The model's core memory mechanism can effectively store and retrieve scene information, breaking through the limitations of traditional methods.
🔄 WORLDMEM has dynamic updating capabilities, continuously optimizing scene generation quality as the environment changes.