Variational Autoencoder

First published: 2025-05-03


Variational Autoencoder (VAE)#

Mathematical Establishment#

Assume a data generation process in which $z \sim p(z)$ represents latent variables (features such as size, color, or position) and $x \sim p_{real}(x)$ represents real observed variables (e.g. images, videos, texts). Our aim is to generate an observed variable from latent variables via a generator. We bridge the latent distribution and the observed distribution with a conditional distribution $p_\theta(x|z)$, which is the mathematical description of the generator.

Thus, we obtain a man-made (model) distribution over observations

p_\theta(x) = \int_{\mathcal{Z}}p(z)p_\theta(x | z)dz

and we want this distribution to match the real observed distribution $p_{real}(x)$.

Optimization Metrics#

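To measure how closely the two distributions match, we typically use the Kullback-Leibler (KL) divergence and look for the parameters that minimize it

\theta^* = \argmin_\theta D_{KL}[p_{real}(x) \| p_\theta(x)], \qquad D_{KL}[p_{real}(x) \| p_\theta(x)] = \int_{\mathcal{X}} p_{real}(x) \log \frac{p_{real}(x)}{p_\theta(x)} dx

However, this cannot be evaluated directly, since we don't know $p_{real}(x)$ explicitly and only have samples from it. To find a feasible objective, we decompose the divergence; the entropy term $\int_{\mathcal{X}} p_{real}(x) \log p_{real}(x) dx$ does not depend on $\theta$ and is absorbed into the constant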

\begin{aligned} \theta^* &= \argmin_\theta D_{KL}[p_{real}(x) \| p_\theta(x)]\\ &= \argmin_\theta \int_{\mathcal{X}} -p_{real}(x) \log p_\theta(x) dx + \text{const}\\ &= \argmax_\theta \int_{\mathcal{X}} p_{real}(x) \log p_\theta(x) dx \\ &= \argmax_\theta \mathbb{E}_{x \sim p_{real}(x)}[\log p_\theta(x)] \end{aligned}

So instead, we could maximize the log-likelihood of our data under the model

\argmax_\theta \mathbb{E}_{x \sim p_{real}(x)}[\log p_\theta(x)]
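As a small illustration of this objective, the sketch below replaces the expectation over $p_{real}(x)$ with an average over samples. The model family $p_\theta(x) = \mathcal{N}(\theta, 1)$, the synthetic data, and the grid search are all assumptions chosen only to keep the example tiny; for this family the maximizer of the average log-likelihood is the sample mean.

```python
import math
import random

def avg_log_likelihood(theta, data):
    """Sample average of log p_theta(x) for the toy model p_theta = N(theta, 1)."""
    const = -0.5 * math.log(2 * math.pi)
    return sum(const - 0.5 * (x - theta) ** 2 for x in data) / len(data)

rng = random.Random(0)
# Stand-in for samples from p_real (here secretly N(2, 1)).
data = [rng.gauss(2.0, 1.0) for _ in range(2000)]

# Grid search over theta in [0, 4] with step 0.01.
grid = [i / 100 for i in range(401)]
best_theta = max(grid, key=lambda t: avg_log_likelihood(t, data))
sample_mean = sum(data) / len(data)

print(best_theta, sample_mean)  # the grid maximizer sits next to the sample mean
```

In a real VAE the grid search is replaced by gradient-based optimization, and the log-likelihood itself has to be approximated.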

A second problem now emerges. Computing $p_\theta(x) = \int p(z)p_\theta(x|z)dz$ is itself intractable, because the integral runs over all possible latent variables. The main reasons are:

High-dimensional integration: The latent space $z$ is typically high-dimensional, making the integral computationally expensive or impossible to evaluate analytically.
Complex posterior distribution: Even if we could compute $p_\theta(x)$ , we would still need to compute the posterior $p_\theta(z|x) = \frac{p(z)p_\theta(x|z)}{p_\theta(x)}$ for inference, which requires the same intractable integral in the denominator.
No closed-form solution: For most practical choices of $p(z)$ and $p_\theta(x|z)$ (e.g., Gaussian distributions with neural network parameterization), the integral has no closed-form solution.
Sampling inefficiency: Monte Carlo sampling methods would require an impractical number of samples to get good estimates, especially in high dimensions where most of the probability mass is concentrated in a small region.
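The sampling-inefficiency point can be explored with a toy model where the integral happens to be known. The sketch below assumes $p(z) = \mathcal{N}(0, I)$ and $p_\theta(x|z) = \mathcal{N}(z, \sigma^2 I)$, so that $p_\theta(x) = \mathcal{N}(0, (1 + \sigma^2) I)$ exactly; the function names and the chosen dimensions are illustrative. It compares the naive estimate $\frac{1}{N}\sum_i p_\theta(x|z_i)$, $z_i \sim p(z)$, with the exact value.

```python
import math
import random

def log_normal_diag(x, mean, var):
    """log N(x; mean, var * I) for equal-length lists x and mean."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2 * math.pi * var) - 0.5 * sq / var

def naive_log_marginal(x, noise_var, n_samples, rng):
    """log of (1/N) * sum_i p(x | z_i) with z_i ~ N(0, I), computed via log-sum-exp."""
    d = len(x)
    logs = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        logs.append(log_normal_diag(x, z, noise_var))
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(n_samples)

def exact_log_marginal(x, noise_var):
    """Closed form for this toy model: x ~ N(0, (1 + noise_var) * I)."""
    return log_normal_diag(x, [0.0] * len(x), 1.0 + noise_var)

rng = random.Random(0)
for d in (1, 20):
    x = [1.0] * d
    est = naive_log_marginal(x, 0.25, 20000, rng)
    print(d, est, exact_log_marginal(x, 0.25))
```

With the same sample budget, the estimate in one dimension is typically close to the exact value, while in higher dimensions few prior samples land where $p_\theta(x|z)$ is non-negligible, so the estimate tends to be much less reliable.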

Variational Inference Solution#

To address these computational challenges, VAE introduces variational inference by approximating the intractable posterior $p_\theta(z|x)$ with a tractable variational distribution $q_\phi(z|x)$ parameterized by $\phi$ (typically implemented as an encoder neural network).

The key insight is to establish an exact decomposition of the log-likelihood. Starting from the log-likelihood we want to maximize

\log p_\theta(x) = \log \int_{\mathcal{Z}} p(z)p_\theta(x|z)dz

We introduce the variational distribution $q_\phi(z|x)$ and derive

\begin{aligned} \log p_\theta(x) &= \log p_\theta(x) \int_{\mathcal{Z}} q_\phi(z|x) dz \\ &= \int_{\mathcal{Z}} q_\phi(z|x) \log p_\theta(x) dz \\ &= \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x)] \\ &= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right] \\ &= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x,z) \cdot q_\phi(z|x)}{p_\theta(z|x) \cdot q_\phi(z|x)}\right] \\ &= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\ &= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p(z) + \log p_\theta(x|z) - \log q_\phi(z|x)\right] + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] \\ &= \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)] + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] \end{aligned}

Therefore, we have the exact decomposition

\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]

where the Evidence Lower Bound (ELBO) is

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]

Key insights from this decomposition

Since $D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] \geq 0$, we have $\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)$, hence the ELBO is indeed a lower bound.
The equality $\log p_\theta(x) = \mathcal{L}(\theta, \phi; x)$ holds if and only if $q_\phi(z|x) = p_\theta(z|x)$, meaning the variational posterior perfectly matches the true posterior.
Since $\log p_\theta(x)$ does not depend on $\phi$, maximizing the ELBO w.r.t. $\phi$ is equivalent to minimizing $D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]$, making $q_\phi(z|x)$ a better approximation to the true posterior.
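The decomposition can be checked numerically on a model where every term has a closed form. The sketch below assumes a scalar toy model, $p(z) = \mathcal{N}(0, 1)$, $p_\theta(x|z) = \mathcal{N}(z, s^2)$ and $q_\phi(z|x) = \mathcal{N}(m, v)$, for which the true posterior is $\mathcal{N}\left(\frac{x}{1+s^2}, \frac{s^2}{1+s^2}\right)$; the function names and numbers are illustrative only.

```python
import math

def log_normal(x, mean, var):
    """log N(x; mean, var) for scalars."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def kl_gauss(m1, v1, m2, v2):
    """KL[N(m1, v1) || N(m2, v2)] for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v, s2):
    """ELBO for p(z)=N(0,1), p(x|z)=N(z, s2), q(z|x)=N(m, v)."""
    # E_q[log p(x|z)] in closed form, using E_q[(x - z)^2] = (x - m)^2 + v.
    recon = -0.5 * math.log(2 * math.pi * s2) - 0.5 * ((x - m) ** 2 + v) / s2
    return recon - kl_gauss(m, v, 0.0, 1.0)

def decomposition(x, m, v, s2):
    """Return (log p(x), ELBO, KL[q || true posterior])."""
    log_px = log_normal(x, 0.0, 1.0 + s2)
    post_mean, post_var = x / (1.0 + s2), s2 / (1.0 + s2)
    return log_px, elbo(x, m, v, s2), kl_gauss(m, v, post_mean, post_var)

log_px, bound, gap = decomposition(x=2.0, m=0.3, v=0.7, s2=1.0)
print(log_px, bound + gap)  # the two numbers agree up to rounding
```

Setting $(m, v)$ to the true posterior parameters makes the gap vanish, which is the equality case of the bound.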

The ELBO consists of two terms

Reconstruction term: $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$ - encourages the decoder to reconstruct the input
Regularization term: $D_{KL}[q_\phi(z|x) \| p(z)]$ - keeps the learned latent distribution close to the prior

Since maximizing $\log p_\theta(x)$ directly is intractable, we instead maximize the tractable ELBO as a surrogate objective, over both $\theta$ and $\phi$

\argmax_{\theta,\phi} \mathcal{L}(\theta, \phi; x), \qquad \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]

Final Optimization Objective#

Therefore, the final optimization objective function is:

\argmax_{\theta,\phi} \mathbb{E}_{x\sim p_{\text{real}}(x)}[\mathcal{L}(\theta, \phi; x)]

where the ELBO for each sample is:

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]

This can be equivalently written as:

\argmax_{\theta,\phi} \mathbb{E}_{x\sim p_{\text{real}}(x)}\left[\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]\right]

or in expanded form:

\argmax_{\theta,\phi} \mathbb{E}_{x\sim p_{\text{real}}(x), z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{x\sim p_{\text{real}}(x)}[D_{KL}[q_\phi(z|x) \| p(z)]]

Key points:

We optimize both $\theta$ (decoder parameters) and $\phi$ (encoder parameters) simultaneously
The KL divergence term also requires expectation over $x$ , since $q_\phi(z|x)$ depends on $x$
Important: We assume $p(z) = \mathcal{N}(0, I)$ is a fixed standard Gaussian prior. When $q_\phi(z|x)$ is also Gaussian, the KL divergence $D_{KL}[q_\phi(z|x) \| p(z)]$ can be computed in closed form, without sampling $z$
In practice, the expectations over $p_{\text{real}}(x)$ in both terms are approximated by averaging over mini-batches drawn from the training dataset

Why no sampling over $z$ in the KL term:

The KL divergence $D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)]$ has a closed-form analytical solution
For Gaussian $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x)I)$ , we get:

D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)] = -\frac{1}{2}\sum_{j=1}^d \left(1 + \log \sigma^2_j - \mu^2_j - \sigma^2_j\right)

This eliminates the need for Monte Carlo sampling in the regularization term
Only the reconstruction term $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$ requires sampling (via reparameterization trick)
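A minimal sketch of the closed-form KL term, written in the equivalent non-negative form $\frac{1}{2}\sum_{j}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$. The function name is hypothetical, and the log-variance parameterization is one common choice rather than a requirement.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)] for equal-length lists mu and logvar."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # q equals the prior, so the KL is 0.0
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # shifting one mean by 1 costs 0.5
```

No random numbers appear anywhere in this computation, which is why the regularization term needs no Monte Carlo sampling.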

VAE Architecture#

The VAE consists of two neural networks:

Encoder $q_\phi(z|x)$ : Maps input $x$ to latent distribution parameters (typically mean and variance for Gaussian)
Decoder $p_\theta(x|z)$ : Maps latent variable $z$ back to reconstruction of $x$

The training objective becomes:

\max_{\theta,\phi} \mathbb{E}_{x \sim p_{real}(x)}[\mathcal{L}(\theta, \phi; x)]

Encoder Network#

The encoder typically parameterizes a diagonal Gaussian distribution:

q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma^2_\phi(x)I)

where:

$\mu_\phi(x) \in \mathbb{R}^d$ is the mean vector output by the encoder
$\sigma^2_\phi(x) \in \mathbb{R}^d$ is the variance vector (often parameterized as $\log \sigma^2$ for numerical stability)
$d$ is the dimensionality of the latent space

Decoder Network#

The decoder defines the likelihood of the data given the latent variable

p_\theta(x|z) = \mathcal{N}(x; \mu_\theta(z), \sigma^2_\theta I) \quad \text{(for continuous data)}

or

p_\theta(x|z) = \text{Bernoulli}(x; p_\theta(z)) \quad \text{(for binary data)}
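The two likelihood choices translate into two familiar reconstruction scores. The sketch below evaluates $\log p_\theta(x|z)$ given the decoder outputs; the function names and the clipping constant `eps` are assumptions added for illustration and numerical safety.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """log N(x; mean, var * I): squared error scaled by 1 / (2 * var), plus a constant."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2 * math.pi * var) - 0.5 * sq / var

def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """Sum of log Bernoulli(x_i; p_i), i.e. the negative binary cross-entropy."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += xi * math.log(pi) + (1 - xi) * math.log(1.0 - pi)
    return total

print(gaussian_log_likelihood([0.2, 0.8], [0.25, 0.7], 1.0))
print(bernoulli_log_likelihood([1, 0], [0.9, 0.2]))
```

Here `mean` plays the role of $\mu_\theta(z)$ and `probs` the role of $p_\theta(z)$; both would come from the decoder network.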

Reparameterization Trick#

To enable backpropagation through the stochastic sampling process, VAE employs the reparameterization trick.

Instead of sampling $z \sim q_\phi(z|x)$ directly, we:

Sample noise: $\epsilon \sim \mathcal{N}(0, I)$
Transform deterministically: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$

This makes the sampling operation differentiable w.r.t. $\phi$ .
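To see why the transform helps, the sketch below estimates gradients of $\mathbb{E}[z^2]$ with respect to $\mu$ and $\sigma$ by differentiating through $z = \mu + \sigma \epsilon$ by hand. The test function $f(z) = z^2$ is an assumption picked because the exact answer is known: $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the gradients are $2\mu$ and $2\sigma$.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps elementwise with eps ~ N(0, I). Returns (z, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

def pathwise_gradients(mu, sigma, n_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2] for scalar z = mu + sigma * eps."""
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        (z,), (eps,) = reparameterize([mu], [sigma], rng)
        df_dz = 2.0 * z           # derivative of f(z) = z^2
        g_mu += df_dz * 1.0       # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps    # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

rng = random.Random(0)
g_mu, g_sigma = pathwise_gradients(1.5, 0.8, 50000, rng)
print(g_mu, g_sigma)  # estimates of 2 * mu = 3.0 and 2 * sigma = 1.6
```

Because the randomness lives entirely in $\epsilon$, the derivative of $z$ with respect to $\mu$ and $\sigma$ is well defined; an automatic differentiation framework would apply the same chain rule to the encoder parameters $\phi$.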

Training Process#

Forward pass:
- Encode: $(x) \rightarrow (\mu_\phi(x), \sigma_\phi(x))$
- Sample: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$
- Decode: $(z) \rightarrow \hat{x} = \mu_\theta(z)$
Loss computation:
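- Reconstruction loss: $-\log p_\theta(x|z) = \frac{1}{2\sigma^2_\theta}\|x - \hat{x}\|^2 + \text{const}$ (for a Gaussian decoder with fixed variance $\sigma^2_\theta$), i.e. a scaled squared error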
- KL regularization: $D_{KL}[q_\phi(z|x) \| p(z)]$
Backpropagation: Update both $\theta$ and $\phi$ using gradient descent
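The forward pass and loss computation above can be sketched for a single data point as follows. This is an illustration under stated assumptions, not a full training loop: the encoder outputs `mu` and `logvar` are passed in directly, `decode` is a hypothetical callable standing in for $\mu_\theta(z)$, the decoder variance is fixed, and additive constants of the Gaussian log-likelihood are dropped.

```python
import math
import random

def negative_elbo(x, mu, logvar, decode, rng, decoder_var=1.0):
    """Single-sample estimate of -ELBO (up to an additive constant)."""
    # Sample: z = mu + sigma * eps with sigma = exp(0.5 * logvar).
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]
    # Decode: x_hat = mu_theta(z).
    x_hat = decode(z)
    # Reconstruction loss: squared error scaled by 1 / (2 * decoder_var).
    recon = sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat)) / (2.0 * decoder_var)
    # KL regularization against N(0, I), in closed form.
    kl = 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))
    return recon + kl

rng = random.Random(0)
identity_decoder = lambda z: z  # toy decoder, for illustration only
loss = negative_elbo([0.5, -1.0], [0.4, -0.9], [-2.0, -2.0], identity_decoder, rng)
print(loss)
```

In practice this quantity is averaged over a mini-batch and minimized with respect to both $\theta$ and $\phi$ by an automatic differentiation framework.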

Practical Implementation#

The ELBO for a single sample becomes:

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)]

For a Gaussian encoder and the standard Gaussian prior, the KL term has a closed form:

D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)] = -\frac{1}{2}\sum_{j=1}^d \left(1 + \log \sigma^2_j - \mu^2_j - \sigma^2_j\right)

Conditional VAE (CVAE)#

Motivation#

A standard VAE generates samples by drawing $z \sim p(z)$ and decoding it, with no control over what is generated. Conditional VAE allows controlled generation by conditioning on additional information $c$ (e.g., class labels, attributes).

Mathematical Framework#

The conditional generative model becomes:

p_\theta(x|c) = \int p(z)p_\theta(x|z,c)dz

Both encoder and decoder are conditioned:

Conditional encoder: $q_\phi(z|x,c)$
Conditional decoder: $p_\theta(x|z,c)$

Modified ELBO#

\mathcal{L}(\theta, \phi; x, c) = \mathbb{E}_{z \sim q_\phi(z|x,c)}[\log p_\theta(x|z,c)] - D_{KL}[q_\phi(z|x,c) \| p(z)]
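One common way to realize the conditioning, assumed here since the text does not fix a mechanism, is to append a one-hot encoding of $c$ to the inputs of both networks. The helper names below are hypothetical.

```python
def one_hot(label, num_classes):
    """One-hot vector for an integer class label in [0, num_classes)."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def encoder_input(x, label, num_classes):
    """Input to q_phi(z | x, c): data vector concatenated with the condition."""
    return list(x) + one_hot(label, num_classes)

def decoder_input(z, label, num_classes):
    """Input to p_theta(x | z, c): latent vector concatenated with the condition."""
    return list(z) + one_hot(label, num_classes)

print(encoder_input([0.1, 0.9], 2, 3))  # [0.1, 0.9, 0.0, 0.0, 1.0]
print(decoder_input([0.5], 0, 3))       # [0.5, 1.0, 0.0, 0.0]
```

At generation time, $z$ is drawn from the prior and paired with the desired label before decoding, which is what makes the output controllable.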

Applications#

Class-conditional generation: Generate images of specific classes
Style transfer: Control artistic style while preserving content
Text-to-image: Generate images from textual descriptions

Vector Quantized VAE (VQ-VAE)#

Motivation#

A standard VAE can suffer from posterior collapse: a powerful decoder may learn to ignore the latent codes during generation. VQ-VAE addresses this with discrete latent representations.

Key Innovation: Vector Quantization#

Instead of continuous latent variables, VQ-VAE uses a discrete codebook:

Codebook: $\mathcal{C} = \{e_k\}_{k=1}^K$ where $e_k \in \mathbb{R}^d$
Quantization: $\text{VQ}(z) = e_k$ where $k = \argmin_j \|z - e_j\|$

Architecture Changes#

Encoder: x → z_e (continuous)
Vector Quantization: z_e → z_q (discrete)
Decoder: z_q → x̂
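The quantization step is a nearest-neighbour lookup in the codebook. The sketch below is a minimal version for a single encoder output; the toy codebook is made up, and indices are 0-based here whereas the text counts codebook entries from 1.

```python
def quantize(z_e, codebook):
    """Return (k, e_k) where e_k is the codebook vector closest to z_e."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # Minimizing the squared distance gives the same index as minimizing the distance.
    k = min(range(len(codebook)), key=lambda j: sq_dist(z_e, codebook[j]))
    return k, codebook[k]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
k, z_q = quantize([0.9, 1.2], codebook)
print(k, z_q)  # 1 [1.0, 1.0]
```

For image data the encoder usually produces a grid of vectors, and the same lookup is applied to each of them independently.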

Training Objective#

\mathcal{L} = \|x - \hat{x}\|^2 + \|\text{sg}[z_e] - e_k\|^2 + \beta\|z_e - \text{sg}[e_k]\|^2

where:

Reconstruction loss: $\|x - \hat{x}\|^2$
Codebook loss: $\|\text{sg}[z_e] - e_k\|^2$, updates codebook vectors toward encoder outputs
Commitment loss: $\beta\|z_e - \text{sg}[e_k]\|^2$, encourages encoder outputs to commit to codebook entries
$\text{sg}[\cdot]$ denotes stop-gradient operation
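The sketch below computes the value of the three loss terms for one example. In plain Python there is no automatic differentiation, so stop-gradient has no effect on the numbers: $\text{sg}[\cdot]$ only controls which of $z_e$ and $e_k$ receives a gradient from each term. The function name and the value $\beta = 0.25$ are illustrative assumptions.

```python
def vq_vae_loss(x, x_hat, z_e, e_k, beta=0.25):
    """Return (reconstruction, codebook, commitment, total) loss values."""
    def sq_norm(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    recon = sq_norm(x, x_hat)
    # Codebook term ||sg[z_e] - e_k||^2: gradient would flow only to e_k.
    codebook_term = sq_norm(z_e, e_k)
    # Commitment term beta * ||z_e - sg[e_k]||^2: gradient would flow only to z_e.
    commitment = beta * sq_norm(z_e, e_k)
    return recon, codebook_term, commitment, recon + codebook_term + commitment

print(vq_vae_loss([1.0, 2.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
# (4.0, 2.0, 0.5, 6.5)
```

The codebook and commitment terms share the same value up to the factor $\beta$; they differ only in which parameters they would update during training.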

Advantages of VQ-VAE#

No posterior collapse: Discrete codes are always used
Better reconstruction: Avoids blurry outputs common in VAE
Interpretable latents: Discrete codes often correspond to meaningful features
Hierarchical modeling: Can be stacked for multi-scale representations

VQ-VAE-2#

Extends VQ-VAE with:

Hierarchical quantization: Multiple resolution levels
PixelCNN prior: Autoregressive modeling of the quantized codes, used to sample new codes that the decoder turns into images
Better sample quality: Competitive with GANs on image generation

Comparison Summary#

| Method | Latent Space | Pros | Cons |
| --- | --- | --- | --- |
| VAE | Continuous, Gaussian | Simple, stable training | Blurry outputs, posterior collapse |
| CVAE | Continuous, conditional | Controllable generation | Still suffers from VAE limitations |
| VQ-VAE | Discrete, codebook | Sharp outputs, no collapse | More complex training, discrete optimization |

Applications and Impact#

Image generation: DALL-E tokenizes images with a discrete VAE, a close relative of VQ-VAE
Audio modeling: VQ-VAE for speech and music synthesis
Representation learning: Learning disentangled and interpretable features
Data compression: Efficient encoding of high-dimensional data