---
language: en
license: mit
tags:
- diffusion
- autoencoder
- feature-space
- svg
references:
- https://arxiv.org/abs/2510.15301
---

# SVG: Latent Diffusion Model without Variational Autoencoder
|
|
## Model Description
|
|
SVG is a latent diffusion model framework that replaces the traditional VAE latent space with semantically structured features from self-supervised vision models (e.g., DINOv3). This design improves generative capability and downstream transferability while maintaining efficiency comparable to standard VAE-based latent diffusion models.
|
|
| Key features: |
|
|
| - Replaces low-dimensional VAE latent space with high-dimensional semantic feature space. |
| - Includes a lightweight residual encoder for refining fine-grained details. |
- Enables strong generation and perception performance.
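
To make the "low-dimensional vs. high-dimensional" contrast in the list above concrete, here is a minimal shape-arithmetic sketch. The numbers are illustrative assumptions, not values taken from this model card: a VAE with 8x downsampling and 4 latent channels, a ViT-style encoder with 16x16 patches and 384-dimensional patch features, and a hypothetical residual branch of 8 channels that is combined by channel-wise concatenation. The helper names (`LatentShape`, `grid_latent`, `RESIDUAL_CHANNELS`) are made up for this example and are not part of the SVG codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Spatial grid and channel count of a latent representation."""
    height: int
    width: int
    channels: int

    @property
    def numel(self) -> int:
        # Total number of scalar values per image.
        return self.height * self.width * self.channels


def grid_latent(image_size: int, stride: int, channels: int) -> LatentShape:
    """Shape of a latent produced by an encoder with the given spatial stride."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return LatentShape(side, side, channels)


# Assumed, illustrative configurations (not from the model card).
vae = grid_latent(image_size=256, stride=8, channels=4)
semantic = grid_latent(image_size=256, stride=16, channels=384)

# Hypothetical residual branch: same grid as the semantic features,
# a few extra channels for fine detail, concatenated along channels.
RESIDUAL_CHANNELS = 8
combined = LatentShape(
    semantic.height,
    semantic.width,
    semantic.channels + RESIDUAL_CHANNELS,
)

print("VAE latent:     ", vae, "values:", vae.numel)
print("Feature latent: ", combined, "values:", combined.numel)
```

Under these assumptions the feature-space latent has fewer spatial positions than the VAE latent but far more channels per position, which is the sense in which the feature space is high-dimensional. The actual dimensions used by SVG should be checked against the repository and paper.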
|
|
|
|
## How to Use
|
|
For code and instructions, see the GitHub repository:

[https://github.com/shiml20/SVG](https://github.com/shiml20/SVG)
|
|
|
|
Official project page:

[https://howlin-wang.github.io/svg/](https://howlin-wang.github.io/svg/)
|
|
arXiv paper:

[https://arxiv.org/abs/2510.15301](https://arxiv.org/abs/2510.15301)