Diffusion Models

Core idea: gradually add Gaussian noise to data, then learn to reverse the process
Forward diffusion process:
- Gradually adds noise to the input
Reverse denoising process:
- Learns to generate data by denoising
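For reference, the forward process is typically a fixed Markov chain that adds Gaussian noise under a variance schedule β_t; assuming the standard DDPM parameterization, x_t can also be sampled directly from x_0 in closed form:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\big),
\quad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)
```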

Training Objective
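A common choice is the simplified noise-prediction objective from DDPM: sample a timestep and noise, form x_t with the closed form above, and regress the network's output onto that noise. A minimal PyTorch sketch (the `eps_model` signature and image-shaped tensors are assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, alphas_cumprod, num_steps=1000):
    # eps_model(x_t, t) is assumed to predict the added noise;
    # alphas_cumprod holds the precomputed cumulative products
    # of (1 - beta_t), shape (num_steps,).
    b = x0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    # Closed-form forward process: x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)  # assumes (B, C, H, W) inputs
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # Train the network to predict the noise with an MSE loss.
    return F.mse_loss(eps_model(x_t, t), noise)
```

This simple MSE corresponds to a reweighted variational bound on the data likelihood.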

Network structure
Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent the denoising network
Time representation: sinusoidal positional embeddings or random Fourier features
- Time features are fed into the residual blocks by either simple spatial addition or adaptive group normalization layers (a sketch of the embedding follows below)
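A minimal sketch of the sinusoidal timestep embedding (`dim` is assumed even); in practice the embedding is usually passed through a small MLP before being added to, or modulating, the ResNet block activations:

```python
import math
import torch

def timestep_embedding(t, dim, max_period=10000):
    # Sinusoidal embeddings at geometrically spaced frequencies; half the
    # channels use cos, half use sin (as in Transformer positional encodings).
    half = dim // 2
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32) / half
    ).to(t.device)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # (B, dim)
```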

Latent Diffusion Models
Map data into a compressed latent space
Train the diffusion model efficiently in latent space (a one-step training sketch follows)
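A minimal sketch of one latent-diffusion training step under these two stages; `autoencoder` and `eps_model` are hypothetical, pretrained components (names are illustrative):

```python
import torch
import torch.nn.functional as F

def ldm_train_step(autoencoder, eps_model, x, alphas_cumprod, num_steps=1000):
    with torch.no_grad():
        z = autoencoder.encode(x)  # stage 1: compress data to latents
    b = z.shape[0]
    t = torch.randint(0, num_steps, (b,), device=z.device)
    noise = torch.randn_like(z)
    a_bar = alphas_cumprod[t].view(b, *([1] * (z.dim() - 1)))
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * noise
    # stage 2: the usual noise-prediction loss, on latents instead of pixels
    return F.mse_loss(eps_model(z_t, t), noise)
```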

Advantages
- Compressed latent space: train the diffusion model at lower resolution
  - Computationally more efficient
- Regularized, smooth / compressed latent space
  - Easier task for the diffusion model and faster sampling
- Flexibility
  - Autoencoders can be tailored to the data
Conditioning information is fed into the latent diffusion model via cross-attention (sketched below):
- Query: visual features from the U-Net
- Key and value: text features
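A minimal sketch of such a cross-attention layer using PyTorch's built-in multi-head attention; dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    # Queries come from flattened U-Net feature maps; keys and values come
    # from text-encoder token features.
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True
        )

    def forward(self, x, ctx):
        # x:   (B, H*W, dim)     visual features
        # ctx: (B, T, ctx_dim)   text features
        out, _ = self.attn(query=x, key=ctx, value=ctx)
        return out
```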
[Diagram: compression and encoding (autoencoder) followed by a diffusion model in the latent space, together forming the latent diffusion model]
Diffusion Transformers
- Replace the U-Net backbone with a Transformer over patch tokens (e.g., DiT); see the block sketch below
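A minimal sketch of a transformer block with adaptive layer-norm conditioning on the timestep embedding, in the spirit of DiT (shapes noted in comments; this is an assumption-laden sketch, not the exact published architecture):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    # Transformer block whose layer norms are modulated by scale/shift/gate
    # parameters predicted from the timestep embedding (adaptive layer norm).
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        # Predict per-channel scale, shift, and gate parameters from t_emb.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x, t_emb):
        # x: (B, N, dim) patch tokens; t_emb: (B, dim) time embedding
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)
```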
