Recently, the VAE (Variational Autoencoder) has been steadily losing ground in the generative-modeling world. In a collaboration between Tsinghua University and Kuaishou's Kling team, a new generative model called SVG, a VAE-free latent diffusion model, has been introduced, reportedly delivering a 6,200% improvement in training efficiency and a 3,500% speed-up in generation.

The decline of the VAE in image generation stems largely from the "semantic entanglement" problem: when we try to change just one attribute of an image (say, a cat's fur color), other attributes (such as its size or expression) shift along with it, producing inaccurate results. To solve this, the SVG model from Tsinghua University and Kuaishou takes a different route, deliberately constructing a feature space that combines semantics with fine-grained detail.


In the SVG design, the team first uses the pre-trained DINOv3 model as a semantic extractor. Trained through large-scale self-supervised learning, DINOv3 can effectively identify and separate features of different categories, addressing the semantic confusion found in traditional VAE latents. To supplement fine detail, the team adds a specially designed lightweight residual encoder, built so that detail information does not conflict with the semantic features. A distribution-alignment mechanism then fuses the two kinds of features, keeping them on a comparable scale and ensuring high-quality generation; a rough sketch of this two-branch design is given below.
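To make the architecture concrete, here is a minimal PyTorch sketch of the two-branch idea. Everything in it is an assumption for illustration, not the paper's actual code: `DummySemanticBackbone` stands in for the real frozen DINOv3 extractor, the `ResidualDetailEncoder` layer sizes are invented, and the statistics-matching step is only a toy stand-in for the paper's distribution-alignment mechanism.

```python
import torch
import torch.nn as nn


class DummySemanticBackbone(nn.Module):
    """Stand-in for a frozen DINOv3 ViT: emits a 16x-downsampled feature map.
    (In practice you would load real DINOv3 weights; this stub just keeps
    the sketch self-contained.)"""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, x):
        return self.proj(x)  # (B, dim, H/16, W/16)


class ResidualDetailEncoder(nn.Module):
    """Lightweight conv encoder for the high-frequency detail that the
    semantic backbone discards (layer sizes are illustrative)."""
    def __init__(self, in_ch=3, width=64, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.GELU(),
            # total stride 16, matching the ViT patch grid
            nn.Conv2d(width, out_dim, 3, stride=4, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class SVGStyleLatent(nn.Module):
    """Builds a VAE-free latent: frozen semantic features + aligned detail."""
    def __init__(self, backbone, detail_dim=32):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():   # semantic branch stays frozen
            p.requires_grad = False
        self.detail = ResidualDetailEncoder(out_dim=detail_dim)

    def forward(self, img):
        with torch.no_grad():
            sem = self.backbone(img)           # semantic feature map
        det = self.detail(img)                 # residual detail map
        # Distribution alignment (simplified assumption): rescale the detail
        # branch to the semantic branch's global statistics so the
        # concatenated latent lives on one consistent scale.
        det = (det - det.mean()) / (det.std() + 1e-6)
        det = det * sem.std() + sem.mean()
        return torch.cat([sem, det], dim=1)    # (B, 768 + 32, H/16, W/16)


if __name__ == "__main__":
    model = SVGStyleLatent(DummySemanticBackbone())
    z = model(torch.randn(2, 3, 256, 256))
    print(z.shape)  # torch.Size([2, 800, 16, 16])
```

The design intuition is that the frozen backbone guarantees a semantically disentangled layout of the latent, while the small trainable branch only has to fill in texture, so the two sources of information cannot fight over the same channels.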


Experimental results show that the SVG model surpasses traditional VAE-based approaches in both generation quality and multi-task generalization. On ImageNet, SVG reached an FID (Fréchet Inception Distance, which measures how close generated images are to real ones; lower is better) of 6.57 after only 80 training epochs, well ahead of VAE-based models of similar scale. In inference efficiency, SVG also performs strongly, producing clear images with fewer sampling steps. Moreover, its feature space can be used directly for downstream visual tasks such as image classification and semantic segmentation without additional fine-tuning, greatly improving application flexibility; a toy illustration of this follows.
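As a sketch of why the "no fine-tuning" claim is plausible: if the latent is already semantically organized, a single linear layer over pooled features suffices for classification. The `LinearProbe` module and the dimensions below are hypothetical, reusing the latent shape from the earlier sketch; they are not taken from the paper.

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Hypothetical linear probe over a frozen SVG-style latent:
    global-average-pool the feature map, then one linear layer."""
    def __init__(self, latent_dim=800, num_classes=1000):  # 768 semantic + 32 detail channels
        super().__init__()
        self.head = nn.Linear(latent_dim, num_classes)

    def forward(self, latent):            # latent: (B, C, H, W)
        pooled = latent.mean(dim=(2, 3))  # (B, C)
        return self.head(pooled)          # class logits


# Only the probe's weights are trained; the encoder stays frozen,
# which is what "no additional fine-tuning" means in practice.
probe = LinearProbe()
logits = probe(torch.randn(2, 800, 16, 16))
print(logits.shape)  # torch.Size([2, 1000])
```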

This work from Tsinghua University and Kuaishou not only marks a significant shift in image generation but also shows strong potential for multimodal generation tasks.

Paper link: https://arxiv.org/pdf/2510.15301