Variational Autoencoder
First published: 2025-05-03

Variational Autoencoder (VAE)#

Mathematical Establishment#

Assume a data generation process in which $z \sim p(z)$ denotes latent variables (features such as size, color, or position) and $x \sim p_{real}(x)$ denotes real observed variables (e.g. images, videos, texts). Our aim is to generate an observed variable from latent variables and a generator. We bridge the latent distribution and the observed distribution with a conditional distribution $p_\theta(x|z)$, which is the mathematical description of the generator.

Thus, we establish a man-made (model) distribution over observations

$$p_\theta(x) = \int_{\mathcal{Z}}p(z)p_\theta(x | z)dz$$

and we want this distribution to match the real observed distribution $p_{real}(x)$.

Optimization Metrics#

To measure the similarity between the two distributions, we typically use the Kullback-Leibler (KL) divergence

$$D_{KL}[p_{real}(x) \| p_\theta(x)] = \int_{\mathcal{X}} p_{real}(x) \log \frac{p_{real}(x)}{p_\theta(x)} dx$$

and look for the parameters that minimize it

$$\theta^* = \argmin_\theta D_{KL}[p_{real}(x) \| p_\theta(x)]$$

However, the divergence cannot be evaluated directly since we don't know $p_{real}(x)$ explicitly. To find a feasible method, we decompose it. The term $\int_{\mathcal{X}} p_{real}(x) \log p_{real}(x) dx$ does not depend on $\theta$, so it acts as a constant:

$$\begin{aligned}
\theta^* &= \argmin_\theta D_{KL}[p_{real}(x) \| p_\theta(x)]\\
&= \argmin_\theta \int_{\mathcal{X}} -p_{real}(x) \log p_\theta(x) dx + \text{const}\\
&= \argmax_\theta \int_{\mathcal{X}} p_{real}(x) \log p_\theta(x) dx \\
&= \argmax_\theta \mathbb{E}_{x \sim p_{real}(x)}[\log p_\theta(x)]
\end{aligned}$$

So instead, we can maximize the expected log-likelihood of the data under the model, which only requires samples from $p_{real}(x)$

$$\argmax_\theta \mathbb{E}_{x \sim p_{real}(x)}[\log p_\theta(x)]$$
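The link between the KL divergence and the log-likelihood can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-outcome distributions `p_real`, `model_a`, and `model_b` are made-up values, not anything from a real dataset.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, the negative expected log-likelihood."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL[p || q] = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "real" distribution and two candidate models over 3 outcomes.
p_real = [0.5, 0.3, 0.2]
model_a = [0.4, 0.4, 0.2]
model_b = [0.1, 0.1, 0.8]

for name, q in [("model_a", model_a), ("model_b", model_b)]:
    kl = kl_divergence(p_real, q)
    ce = cross_entropy(p_real, q)
    # KL and cross-entropy differ only by H(p_real), which is the same for every model.
    print(name, round(kl, 4), round(ce - entropy(p_real), 4))
```

Because the two quantities differ by a model-independent constant, ranking models by KL divergence and ranking them by expected log-likelihood give the same order.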

A second problem now emerges. Computing $p_\theta(x) = \int p(z)p_\theta(x|z)dz$ is also intractable, because the integral runs over all possible latent variables. In more detail:

  1. High-dimensional integration: The latent space of $z$ is typically high-dimensional, making the integral computationally expensive or impossible to evaluate analytically.

  2. Complex posterior distribution: Inference needs the posterior $p_\theta(z|x) = \frac{p(z)p_\theta(x|z)}{p_\theta(x)}$, whose denominator is the same intractable integral.

  3. No closed-form solution: For most practical choices of $p(z)$ and $p_\theta(x|z)$ (e.g., Gaussian distributions with neural network parameterization), the integral has no closed-form solution.

  4. Sampling inefficiency: Naive Monte Carlo estimation with samples from the prior would require an impractical number of samples, especially in high dimensions, where only a small region of the latent space contributes noticeably to $p_\theta(x|z)$ for a given $x$.
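To make the Monte Carlo point concrete, here is a minimal sketch of the naive estimator $p_\theta(x) \approx \frac{1}{N}\sum_n p_\theta(x|z_n)$ with $z_n \sim p(z)$. It assumes a toy one-dimensional model chosen for illustration ($p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(x; z, 1)$), for which the exact answer $p_\theta(x) = \mathcal{N}(x; 0, 2)$ is known and can be used as a reference.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, num_samples, rng):
    """Naive estimate of p(x) = E_{z ~ p(z)}[p(x|z)] using samples from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x|z) = N(x; z, 1)
    return total / num_samples

rng = random.Random(0)
x = 1.0
estimate = mc_marginal(x, 20000, rng)
exact = normal_pdf(x, 0.0, 2.0)  # in this toy model p(x) = N(x; 0, 2)
print(estimate, exact)
```

In one dimension the estimate is accurate with a modest number of samples. With a high-dimensional latent space and a neural-network decoder there is no exact reference, and most prior samples land where $p_\theta(x|z)$ is negligible, which is what makes this estimator impractical.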

Variational Inference Solution#

To address these computational challenges, VAE introduces variational inference by approximating the intractable posterior $p_\theta(z|x)$ with a tractable variational distribution $q_\phi(z|x)$ parameterized by $\phi$ (typically implemented as an encoder neural network).

The key insight is to establish an exact decomposition of the log-likelihood. Starting from the log-likelihood we want to maximize

$$\log p_\theta(x) = \log \int_{\mathcal{Z}} p(z)p_\theta(x|z)dz$$

We introduce the variational distribution $q_\phi(z|x)$ and derive

$$\begin{aligned}
\log p_\theta(x) &= \log p_\theta(x) \int_{\mathcal{Z}} q_\phi(z|x) dz \\
&= \int_{\mathcal{Z}} q_\phi(z|x) \log p_\theta(x) dz \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x)] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x,z) \cdot q_\phi(z|x)}{p_\theta(z|x) \cdot q_\phi(z|x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p(z) + \log p_\theta(x|z) - \log q_\phi(z|x)\right] + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] \\
&= \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)] + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]
\end{aligned}$$

Therefore, we have the exact decomposition

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]$$

where the Evidence Lower Bound (ELBO) is

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]$$

Key insights from this decomposition

  1. Since $D_{KL}[q_\phi(z|x) \| p_\theta(z|x)] \geq 0$, we have $\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x)$, hence the ELBO is indeed a lower bound.

  2. The equality $\log p_\theta(x) = \mathcal{L}(\theta, \phi; x)$ holds if and only if $q_\phi(z|x) = p_\theta(z|x)$, meaning the variational posterior perfectly matches the true posterior.

  3. Because $\log p_\theta(x)$ does not depend on $\phi$, maximizing the ELBO w.r.t. $\phi$ minimizes $D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]$, making $q_\phi(z|x)$ a better approximation to the true posterior.
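These three properties can be verified numerically in a case where everything is available in closed form. The sketch below assumes a toy one-dimensional model chosen for illustration: $p(z) = \mathcal{N}(0,1)$, $p_\theta(x|z) = \mathcal{N}(x; z, 1)$, and a Gaussian $q(z|x) = \mathcal{N}(m, v)$, so that the evidence is $\mathcal{N}(x; 0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def kl_gauss(m1, v1, m2, v2):
    """D_KL[N(m1, v1) || N(m2, v2)] for 1-D Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v):
    """ELBO for z ~ N(0,1), x|z ~ N(z,1), q(z|x) = N(m, v)."""
    # E_q[log N(x; z, 1)] = -0.5*log(2*pi) - 0.5*((x - m)^2 + v)
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + v)
    return recon - kl_gauss(m, v, 0.0, 1.0)

x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)   # exact log-evidence
post_mean, post_var = x / 2.0, 0.5     # exact posterior p(z|x)

candidates = [(0.3, 0.49), (0.0, 1.0), (post_mean, post_var)]
for m, v in candidates:
    gap = kl_gauss(m, v, post_mean, post_var)  # KL[q || true posterior]
    print(round(elbo(x, m, v), 4), round(gap, 4), round(elbo(x, m, v) + gap, 4), round(log_px, 4))
```

For every choice of $q$ the ELBO plus the gap reproduces the log-evidence, the ELBO never exceeds it, and the gap vanishes only when $q$ equals the true posterior.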

The ELBO consists of two terms

  1. Reconstruction term: $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$ - encourages the decoder to reconstruct the input
  2. Regularization term: $D_{KL}[q_\phi(z|x) \| p(z)]$ - keeps the learned latent distribution close to the prior

Since maximizing $\log p_\theta(x)$ directly is intractable, we instead maximize the ELBO as a tractable surrogate objective

$$\argmax_{\theta,\phi} \mathcal{L}(\theta, \phi; x), \quad \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]$$

Final Optimization Objective#

Therefore, the final optimization objective function is:

$$\argmax_{\theta,\phi} \mathbb{E}_{x\sim p_{real}(x)}[\mathcal{L}(\theta, \phi; x)]$$

where the ELBO for each sample is:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]$$

This can be equivalently written as:

$$\argmax_{\theta,\phi} \mathbb{E}_{x\sim p_{real}(x)}\left[\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| p(z)]\right]$$

or in expanded form:

$$\argmax_{\theta,\phi} \mathbb{E}_{x\sim p_{real}(x), z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{x\sim p_{real}(x)}[D_{KL}[q_\phi(z|x) \| p(z)]]$$

Key points:

  1. We optimize both $\theta$ (decoder parameters) and $\phi$ (encoder parameters) simultaneously
  2. The KL divergence term also requires an expectation over $x$, since $q_\phi(z|x)$ depends on $x$
  3. Important: We assume $p(z) = \mathcal{N}(0, I)$ is a fixed standard Gaussian prior. Since this distribution is analytically known, the KL divergence $D_{KL}[q_\phi(z|x) \| p(z)]$ can be computed in closed form for a Gaussian $q_\phi(z|x)$, without sampling $z$
  4. In practice, both expectations over $p_{real}(x)$ are approximated using mini-batch sampling from the training dataset
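Written out, the mini-batch approximation could look as follows. The batch size $B$, the number of latent samples per data point $M$, and the indices $i$, $m$ are notation introduced here for illustration only:

$$\mathbb{E}_{x\sim p_{real}(x)}[\mathcal{L}(\theta, \phi; x)] \approx \frac{1}{B}\sum_{i=1}^{B}\left[\frac{1}{M}\sum_{m=1}^{M}\log p_\theta(x_i|z_{i,m}) - D_{KL}[q_\phi(z|x_i) \| p(z)]\right], \quad z_{i,m} \sim q_\phi(z|x_i)$$

A single latent sample per data point ($M = 1$) is a common choice, since averaging over the mini-batch already reduces the variance of the estimate.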

Why no sampling over $z$ in the KL term:

  • The KL divergence $D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)]$ has a closed-form analytical solution
  • For Gaussian $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x)I)$, we get:
$$D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)] = -\frac{1}{2}\sum_{j=1}^d \left(1 + \log \sigma^2_j - \mu^2_j - \sigma^2_j\right)$$
  • This eliminates the need for Monte Carlo sampling in the regularization term
  • Only the reconstruction term $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$ requires sampling (via the reparameterization trick)
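As a sanity check, the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(0, I)$, which equals $-\frac{1}{2}\sum_j (1 + \log \sigma^2_j - \mu^2_j - \sigma^2_j)$, can be compared against a sampling-based estimate. This is a minimal sketch; the function names and the example values of `mu` and `log_var` are made up for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, num_samples, rng):
    """Sampling-based estimate of the same KL, for comparison only."""
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, log_var):
            std = math.exp(0.5 * lv)
            z = m + std * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + (z - m) ** 2 / math.exp(lv))
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / num_samples

mu, log_var = [1.0, -0.5], [0.0, math.log(0.25)]
exact = kl_to_standard_normal(mu, log_var)
approx = kl_monte_carlo(mu, log_var, 20000, random.Random(0))
print(exact, approx)
```

The closed form is exact and needs no random numbers, while the sampled estimate only approaches it as the number of samples grows.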

VAE Architecture#

The VAE consists of two neural networks:

  1. Encoder $q_\phi(z|x)$: Maps input $x$ to latent distribution parameters (typically mean and variance for a Gaussian)
  2. Decoder $p_\theta(x|z)$: Maps latent variable $z$ back to a reconstruction of $x$

The training objective becomes:

$$\max_{\theta,\phi} \mathbb{E}_{x \sim p_{real}(x)}[\mathcal{L}(\theta, \phi; x)]$$

Encoder Network#

The encoder typically parameterizes a diagonal Gaussian distribution:

$$q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma^2_\phi(x)I)$$

where:

  • $\mu_\phi(x) \in \mathbb{R}^d$ is the mean vector output by the encoder
  • $\sigma^2_\phi(x) \in \mathbb{R}^d$ is the variance vector (often parameterized as $\log \sigma^2$ for numerical stability)
  • $d$ is the dimensionality of the latent space

Decoder Network#

The decoder defines the likelihood of the data given the latent variable

$$p_\theta(x|z) = \mathcal{N}(x; \mu_\theta(z), \sigma^2_\theta I) \quad \text{(for continuous data)}$$

or

$$p_\theta(x|z) = \text{Bernoulli}(x; p_\theta(z)) \quad \text{(for binary data)}$$
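The choice of likelihood determines the reconstruction loss. A minimal sketch, assuming a fixed decoder standard deviation `sigma` for the Gaussian case and made-up example vectors: the Gaussian negative log-likelihood is a scaled squared error plus a constant, and the Bernoulli negative log-likelihood is the binary cross-entropy.

```python
import math

def gaussian_nll(x, x_hat, sigma=1.0):
    """-log N(x; x_hat, sigma^2 I): scaled squared error plus a constant."""
    d = len(x)
    sse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return sse / (2 * sigma ** 2) + 0.5 * d * math.log(2 * math.pi * sigma ** 2)

def bernoulli_nll(x, p, eps=1e-12):
    """-log Bernoulli(x; p): the binary cross-entropy, summed over dimensions."""
    return -sum(a * math.log(b + eps) + (1 - a) * math.log(1 - b + eps) for a, b in zip(x, p))

x_cont, x_hat = [0.2, 0.9, 0.4], [0.0, 1.0, 0.5]   # hypothetical continuous data
x_bin, probs = [1, 0, 1], [0.9, 0.2, 0.6]          # hypothetical binary data
print(gaussian_nll(x_cont, x_hat), bernoulli_nll(x_bin, probs))
```

The small `eps` is only a guard against `log(0)` and is an implementation detail of this sketch, not part of the model.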

Reparameterization Trick#

To enable backpropagation through the stochastic sampling process, VAE employs the reparameterization trick.

Instead of sampling $z \sim q_\phi(z|x)$ directly, we

  1. Sample noise: $\epsilon \sim \mathcal{N}(0, I)$
  2. Transform deterministically: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$

This makes the sampling operation differentiable w.r.t. $\phi$, because the randomness now enters only through $\epsilon$, which does not depend on $\phi$.
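A small numerical sketch of why this helps: once $z$ is written as $\mu + \sigma\epsilon$, the gradient of an expectation over $z$ can be estimated by differentiating through the samples. The test function $f(z) = z^2$ and the scalar values of `mu` and `sigma` are assumptions chosen so the true gradients ($2\mu$ and $2\sigma$) are known.

```python
import random

def reparam_sample(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise eps."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps, eps

def grad_expected_square(mu, sigma, num_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2] by differentiating through z = mu + sigma*eps."""
    g_mu = g_sigma = 0.0
    for _ in range(num_samples):
        z, eps = reparam_sample(mu, sigma, rng)
        df_dz = 2.0 * z          # f(z) = z^2
        g_mu += df_dz * 1.0      # dz/dmu = 1
        g_sigma += df_dz * eps   # dz/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

mu, sigma = 1.5, 0.5
g_mu, g_sigma = grad_expected_square(mu, sigma, 20000, random.Random(0))
# Analytically E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
print(g_mu, 2 * mu, g_sigma, 2 * sigma)
```

In a real VAE the chain rule is applied by an automatic differentiation framework rather than by hand, but the mechanism is the same.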

Training Process#

  1. Forward pass:

    • Encode: $(x) \rightarrow (\mu_\phi(x), \sigma_\phi(x))$
    • Sample: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$
    • Decode: $(z) \rightarrow \hat{x} = \mu_\theta(z)$
  2. Loss computation:

    • Reconstruction loss: $-\log p_\theta(x|z) = \frac{1}{2\sigma^2_\theta}\|x - \hat{x}\|^2 + \text{const}$ (for a Gaussian decoder with fixed variance $\sigma^2_\theta$), i.e. a squared error up to scale and an additive constant
    • KL regularization: $D_{KL}[q_\phi(z|x) \| p(z)]$
  3. Backpropagation: Update both $\theta$ and $\phi$ using gradient descent on the negative ELBO
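The forward pass and loss computation can be sketched for a single sample as follows. Everything here is a toy assumption made for illustration: a 2-D input, a 1-D latent, linear "networks" with hypothetical weights in `params`, and a unit-variance Gaussian decoder. The backpropagation step is omitted, since in practice it is handled by an automatic differentiation framework.

```python
import math
import random

def encode(x, params):
    """Toy linear encoder: returns (mu, log_var) of a 1-D latent."""
    mu = sum(w * xi for w, xi in zip(params["w_mu"], x))
    log_var = sum(w * xi for w, xi in zip(params["w_log_var"], x))
    return mu, log_var

def decode(z, params):
    """Toy linear decoder: returns the reconstruction mean x_hat."""
    return [w * z for w in params["w_dec"]]

def negative_elbo(x, params, rng):
    mu, log_var = encode(x, params)              # encode
    eps = rng.gauss(0.0, 1.0)
    z = mu + math.exp(0.5 * log_var) * eps       # sample via reparameterization
    x_hat = decode(z, params)                    # decode
    # Unit-variance Gaussian decoder: -log p(x|z) = 0.5*||x - x_hat||^2 + const.
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Closed-form KL[N(mu, exp(log_var)) || N(0, 1)].
    kl = -0.5 * (1.0 + log_var - mu ** 2 - math.exp(log_var))
    return recon + kl, recon, kl

# Hypothetical weights; a trained model would learn these.
params = {"w_mu": [0.5, -0.2], "w_log_var": [0.1, 0.1], "w_dec": [1.0, -0.5]}
loss, recon, kl = negative_elbo([1.0, 2.0], params, random.Random(0))
print(loss, recon, kl)
```

The loss returned here is the negative ELBO (up to the dropped constant), so minimizing it with gradient descent corresponds to maximizing the ELBO.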

Practical Implementation#

The ELBO for a single sample becomes:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)]$$

For a diagonal Gaussian encoder and the standard Gaussian prior, the KL term has a closed form:

$$D_{KL}[q_\phi(z|x) \| \mathcal{N}(0,I)] = -\frac{1}{2}\sum_{j=1}^d \left(1 + \log \sigma^2_j - \mu^2_j - \sigma^2_j\right)$$
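A short calculation gives the closed form of this KL term. It is sketched here for a single latent dimension, writing $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$ as shorthand and using $\mathbb{E}_{z \sim q}[(z-\mu)^2] = \sigma^2$ and $\mathbb{E}_{z \sim q}[z^2] = \mu^2 + \sigma^2$:

$$\begin{aligned}
D_{KL}[q \| p] &= \mathbb{E}_{z \sim q}\left[\log q(z) - \log p(z)\right] \\
&= \mathbb{E}_{z \sim q}\left[-\frac{1}{2}\log \sigma^2 - \frac{(z-\mu)^2}{2\sigma^2} + \frac{z^2}{2}\right] \\
&= -\frac{1}{2}\log \sigma^2 - \frac{1}{2} + \frac{1}{2}\left(\mu^2 + \sigma^2\right) \\
&= -\frac{1}{2}\left(1 + \log \sigma^2 - \mu^2 - \sigma^2\right)
\end{aligned}$$

Since both distributions factorize over dimensions, the $d$-dimensional result is the sum of this expression over $j = 1, \dots, d$.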

Conditional VAE (CVAE)#

Motivation#

A standard VAE generates samples by drawing $z \sim p(z)$ and decoding, which gives no control over what is generated. Conditional VAE allows controlled generation by conditioning on additional information $c$ (e.g., class labels, attributes).

Mathematical Framework#

The conditional generative model becomes:

$$p_\theta(x|c) = \int p(z)p_\theta(x|z,c)dz$$

Both encoder and decoder are conditioned:

  • Conditional encoder: $q_\phi(z|x,c)$
  • Conditional decoder: $p_\theta(x|z,c)$

Modified ELBO#

$$\mathcal{L}(\theta, \phi; x, c) = \mathbb{E}_{z \sim q_\phi(z|x,c)}[\log p_\theta(x|z,c)] - D_{KL}[q_\phi(z|x,c) \| p(z)]$$
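One simple way to condition both networks, assumed here for illustration rather than prescribed by the framework, is to concatenate a one-hot encoding of $c$ to the encoder input and to the latent vector. The helper names and example values below are hypothetical.

```python
def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def encoder_input(x, c):
    """Input to a conditional encoder q(z | x, c): x and c side by side."""
    return list(x) + list(c)

def decoder_input(z, c):
    """Input to a conditional decoder p(x | z, c): z and c side by side."""
    return list(z) + list(c)

x = [0.2, 0.7, 0.1, 0.9]        # hypothetical 4-D observation
c = one_hot(2, num_classes=3)   # hypothetical class label 2 out of 3
z = [0.3, -1.2]                 # hypothetical 2-D latent sample

print(encoder_input(x, c))  # length 4 + 3
print(decoder_input(z, c))  # length 2 + 3

# At generation time, z is drawn from the prior p(z) and c is chosen by the user.
```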

Applications#

  • Class-conditional generation: Generate images of specific classes
  • Style transfer: Control artistic style while preserving content
  • Text-to-image: Generate images from textual descriptions

Vector Quantized VAE (VQ-VAE)#

Motivation#

Standard VAEs can suffer from posterior collapse: a sufficiently powerful decoder may learn to ignore the latent codes during generation. VQ-VAE addresses this with discrete latent representations.

Key Innovation: Vector Quantization#

Instead of continuous latent variables, VQ-VAE uses a discrete codebook:

  1. Codebook: $\mathcal{C} = \{e_k\}_{k=1}^K$ where $e_k \in \mathbb{R}^d$
  2. Quantization: $\text{VQ}(z) = e_k$ where $k = \argmin_j \|z - e_j\|$
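The quantization step is a nearest-neighbour lookup in the codebook. A minimal sketch with a made-up codebook of $K = 3$ vectors in $d = 2$ dimensions (minimizing the squared distance selects the same index as minimizing the distance itself):

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z, codebook):
    """Return (k, e_k) where e_k is the codebook vector nearest to z."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z, codebook[j]))
    return k, codebook[k]

# Hypothetical codebook with K = 3 entries in d = 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
k, z_q = quantize([0.8, 1.3], codebook)
print(k, z_q)  # 1 [1.0, 1.0]
```

In an image model this lookup is applied independently to every spatial position of the encoder output.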

Architecture Changes#

Encoder: x → z_e (continuous)
Vector Quantization: z_e → z_q (discrete)
Decoder: z_q → x̂

Training Objective#

$$\mathcal{L} = \|x - \hat{x}\|^2 + \|\text{sg}[z_e] - e_k\|^2 + \beta\|z_e - \text{sg}[e_k]\|^2$$

where:

  • Reconstruction loss: $\|x - \hat{x}\|^2$
  • Codebook loss: $\|\text{sg}[z_e] - e_k\|^2$, updates codebook vectors toward encoder outputs
  • Commitment loss: $\beta\|z_e - \text{sg}[e_k]\|^2$, encourages encoder outputs to commit to codebook entries
  • $\text{sg}[\cdot]$ denotes the stop-gradient operation
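The following sketch evaluates the three terms of the VQ-VAE objective, $\|x - \hat{x}\|^2 + \|\text{sg}[z_e] - e_k\|^2 + \beta\|z_e - \text{sg}[e_k]\|^2$, for made-up vectors. Plain Python has no gradients, so the stop-gradient appears only in the comments, and the value of `beta` is a hypothetical choice.

```python
def squared_norm(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def vq_vae_loss_terms(x, x_hat, z_e, e_k, beta):
    """Numerical values of the three loss terms.

    Stop-gradient leaves values unchanged, so the codebook and commitment terms
    share the same squared distance; they differ only in which side receives
    gradients (codebook vector e_k vs. encoder output z_e) in an autodiff framework.
    """
    reconstruction = squared_norm(x, x_hat)
    codebook = squared_norm(z_e, e_k)            # ||sg[z_e] - e_k||^2
    commitment = beta * squared_norm(z_e, e_k)   # beta * ||z_e - sg[e_k]||^2
    return reconstruction, codebook, commitment

terms = vq_vae_loss_terms(
    x=[1.0, 0.0], x_hat=[0.5, 0.5], z_e=[0.8, 1.3], e_k=[1.0, 1.0], beta=0.25
)
print(terms, sum(terms))
```

In an automatic differentiation framework the stop-gradient would typically be expressed with the framework's own detach or stop-gradient primitive, so that the codebook term updates only $e_k$ and the commitment term updates only the encoder.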

Advantages of VQ-VAE#

  1. No posterior collapse: Discrete codes are always used
  2. Better reconstruction: Avoids blurry outputs common in VAE
  3. Interpretable latents: Discrete codes often correspond to meaningful features
  4. Hierarchical modeling: Can be stacked for multi-scale representations

VQ-VAE-2#

Extends VQ-VAE with:

  • Hierarchical quantization: Multiple resolution levels
  • PixelCNN prior: Autoregressive modeling of the quantized codes, used to sample new codes at generation time
  • Better sample quality: Competitive with GANs on image generation

Comparison Summary#

| Method | Latent Space | Pros | Cons |
| --- | --- | --- | --- |
| VAE | Continuous, Gaussian | Simple, stable training | Blurry outputs, posterior collapse |
| CVAE | Continuous, conditional | Controllable generation | Still suffers from VAE limitations |
| VQ-VAE | Discrete, codebook | Sharp outputs, no collapse | More complex training, discrete optimization |

Applications and Impact#

  • Image generation: DALL-E tokenizes images with a discrete VAE closely related to VQ-VAE
  • Audio modeling: VQ-VAE for speech and music synthesis
  • Representation learning: Learning disentangled and interpretable features
  • Data compression: Efficient encoding of high-dimensional data
https://adalovelemon.github.io/blog/en/posts/content/coursenotes/generativeai/vae/vae/
Author
Ada Lovelemon