Variational Autoencoder (VAE)
Mathematical Establishment
Assume a data generation process in which $z$ represents latent variables (features such as size, color, or position) and $x$ represents observed variables (e.g. images, videos, text). Our aim is to generate an observed variable from latent variables and a generator. We bridge the latent distribution $p(z)$ and the observed distribution with a conditional distribution $p_\theta(x|z)$, which is the mathematical description of the generator.
This defines a model distribution over observations
$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$
and we want this distribution to match the real data distribution $p_{\text{data}}(x)$.
Optimization Metrics
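To measure the similarity between the two distributions, we typically use the Kullback-Leibler (KL) divergence and try to minimize it
$$\min_\theta D_{KL}\big(p_{\text{data}}(x) \,\|\, p_\theta(x)\big)$$
However, the divergence cannot be evaluated directly because we don't know $p_{\text{data}}(x)$ explicitly; we only have samples from it. To find a feasible objective, decompose the divergence
$$D_{KL}\big(p_{\text{data}}(x) \,\|\, p_\theta(x)\big) = \mathbb{E}_{x \sim p_{\text{data}}}[\log p_{\text{data}}(x)] - \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$$
The first term does not depend on $\theta$, so instead we can maximize the expected log-likelihood of the data under the model
$$\max_\theta \mathbb{E}_{x \sim p_{\text{data}}}[\log p_\theta(x)]$$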
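The link between minimizing the KL divergence and maximizing the expected log-likelihood can be checked on a small discrete example. The sketch below uses made-up distributions (`p_data`, `model_a`, `model_b` are illustrative assumptions) and verifies that the KL divergence equals the cross-entropy minus the entropy of the data distribution, so ranking models by KL is the same as ranking them by expected log-likelihood.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """-E_p[log q(x)], the negative expected log-likelihood of q under p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.5, 0.3, 0.2]             # hypothetical "real" distribution
candidates = {
    "model_a": [0.4, 0.4, 0.2],      # hypothetical model distributions
    "model_b": [0.5, 0.25, 0.25],
}

for name, q in candidates.items():
    lhs = kl(p_data, q)
    rhs = cross_entropy(p_data, q) - entropy(p_data)
    print(name, round(lhs, 6), round(rhs, 6))
```

Because the entropy of `p_data` is the same for every candidate, the model with the lower cross-entropy (higher expected log-likelihood) is also the one with the lower KL divergence.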
Now another problem emerges. Computing $p_\theta(x)$ is also intractable because of the integral over all possible latent variables. The reasons are as follows:
- High-dimensional integration: The latent space is typically high-dimensional, making the integral computationally expensive or impossible to evaluate analytically.
- Intractable posterior: Inference requires the posterior $p_\theta(z|x) = p_\theta(x|z)\, p(z) / p_\theta(x)$, whose denominator is the same intractable integral.
- No closed-form solution: For most practical choices of $p(z)$ and $p_\theta(x|z)$ (e.g., a Gaussian prior with a likelihood parameterized by a neural network), the integral has no closed-form solution.
- Sampling inefficiency: Naive Monte Carlo estimation that samples $z$ from the prior would need an impractical number of samples, especially in high dimensions, where only a small region of the latent space contributes meaningfully to the integral for a given $x$.
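To make the integral concrete, here is a minimal sketch of naive Monte Carlo estimation of the marginal likelihood. It assumes a one-dimensional toy model chosen for illustration only (prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x|z \sim \mathcal{N}(z, 0.25)$), for which the exact marginal is known to be $\mathcal{N}(0, 1.25)$. The function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, noise_var, n_samples, rng):
    """Estimate p(x) = E_{z ~ N(0,1)}[ N(x; z, noise_var) ] by sampling the prior."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, z, noise_var)
    return total / n_samples

rng = random.Random(0)
x, noise_var = 1.5, 0.25
exact = normal_pdf(x, 0.0, 1.0 + noise_var)    # closed-form marginal of the toy model
estimate = mc_marginal(x, noise_var, 20000, rng)
print(f"exact={exact:.4f}  estimate={estimate:.4f}")
```

In one dimension the estimate converges quickly. The difficulty described above appears when $z$ has many dimensions and the likelihood is a neural network, since almost all prior samples then contribute negligibly to the sum.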
Variational Inference Solution
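To address these computational challenges, the VAE uses variational inference: it approximates the intractable posterior $p_\theta(z|x)$ with a tractable variational distribution $q_\phi(z|x)$ parameterized by $\phi$ (typically implemented as an encoder neural network).
The key insight is an exact decomposition of the log-likelihood. Starting from the log-likelihood we want to maximize
$$\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, dz$$
We introduce the variational distribution $q_\phi(z|x)$ and derive
$$\begin{aligned}
\log p_\theta(x) &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x)] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)
\end{aligned}$$
Therefore, we have the exact decomposition
$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$$
where the Evidence Lower Bound (ELBO) is
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$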
Key insights from this decomposition
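Since $D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) \ge 0$, we have $\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x)$, hence the ELBO $\mathcal{L}$ is indeed a lower bound.
Equality holds if and only if $q_\phi(z|x) = p_\theta(z|x)$, meaning the variational posterior perfectly matches the true posterior.
Maximizing the ELBO w.r.t. $\phi$ minimizes $D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$, because $\log p_\theta(x)$ does not depend on $\phi$. This makes $q_\phi(z|x)$ a better approximation to the true posterior.
The ELBO consists of two terms
- Reconstruction term: $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ - encourages the decoder to reconstruct the input
- Regularization term: $-D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$ - keeps the learned latent distribution close to the prior
Since maximizing $\log p_\theta(x)$ directly is intractable, we instead maximize the tractable ELBO as a surrogate objective.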
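The decomposition can be verified exactly when the latent variable is discrete, because every sum is then finite. The sketch below assumes a hypothetical toy model with three latent states and one fixed observation; all probability values are made up for illustration.

```python
import math

# Hypothetical toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
prior = [0.5, 0.3, 0.2]           # p(z)
likelihood = [0.10, 0.60, 0.30]   # p(x | z) evaluated at the observed x
q = [0.2, 0.5, 0.3]               # an arbitrary variational distribution q(z | x)

evidence = sum(pz * px for pz, px in zip(prior, likelihood))            # p(x)
posterior = [pz * px / evidence for pz, px in zip(prior, likelihood)]   # p(z | x)

def kl(a, b):
    """KL(a || b) for discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

reconstruction = sum(qi * math.log(px) for qi, px in zip(q, likelihood))
elbo = reconstruction - kl(q, prior)   # reconstruction term minus KL to the prior
gap = kl(q, posterior)                 # KL between q and the true posterior

print(math.log(evidence), elbo + gap)
```

The two printed numbers agree up to floating-point error, and the gap shrinks to zero when `q` is replaced by `posterior`.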
Final Optimization Objective
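Therefore, the final optimization objective is:
$$\max_{\theta, \phi} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\mathcal{L}(\theta, \phi; x)\big]$$
where the ELBO for each sample is:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$
This can be equivalently written as:
$$\max_{\theta, \phi} \; \mathbb{E}_{x \sim p_{\text{data}}}\Big[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)\Big]$$
or in expanded form:
$$\max_{\theta, \phi} \; \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{x \sim p_{\text{data}}}\Big[D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)\Big]$$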
Key points:
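- We optimize both $\theta$ (decoder parameters) and $\phi$ (encoder parameters) simultaneously
- The KL divergence term also requires an expectation over $x$, since $q_\phi(z|x)$ depends on $x$
- Important: We assume $p(z) = \mathcal{N}(0, I)$ is a fixed standard Gaussian prior. Since this distribution is analytically known, the KL divergence can be computed in closed form without sampling $z$, provided $q_\phi(z|x)$ is also Gaussian
- In practice, both expectations over $x$ are approximated using mini-batch sampling from the training dataset
Why no sampling over $z$ in the KL term:
- The KL divergence $D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$ between two Gaussians has a closed-form analytical solution
- For Gaussian $q_\phi(z|x) = \mathcal{N}\big(\mu, \mathrm{diag}(\sigma^2)\big)$ and $p(z) = \mathcal{N}(0, I)$, we get: $D_{KL} = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$
- This eliminates the need for Monte Carlo sampling in the regularization term
- Only the reconstruction term requires sampling (via the reparameterization trick)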
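As a sanity check on the closed form, the sketch below compares it with a Monte Carlo estimate of the same KL divergence for a diagonal Gaussian against a standard Gaussian prior. The function names and the example values of `mu` and `logvar` are illustrative assumptions.

```python
import math
import random

def kl_diag_gaussian_to_standard(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def kl_monte_carlo(mu, logvar, n_samples, rng):
    """Same quantity estimated as E_q[log q(z) - log p(z)] by sampling z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, logvar):
            std = math.exp(0.5 * lv)
            z = m + std * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + (z - m) ** 2 / math.exp(lv))
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p   # dimensions are independent, so terms add
    return total / n_samples

mu, logvar = [0.5, -1.0], [0.0, -0.7]   # hypothetical encoder outputs
rng = random.Random(0)
closed_form = kl_diag_gaussian_to_standard(mu, logvar)
sampled = kl_monte_carlo(mu, logvar, 50000, rng)
print(closed_form, sampled)
```

The closed form is exact and costs one pass over the latent dimensions, while the sampled estimate is noisy even with tens of thousands of draws. This is why the regularization term is usually computed analytically.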
VAE Architecture
The VAE consists of two neural networks:
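- Encoder $q_\phi(z|x)$: Maps input $x$ to latent distribution parameters (typically mean and variance for a Gaussian)
- Decoder $p_\theta(x|z)$: Maps latent variable $z$ back to a reconstruction of $x$
The training objective becomes:
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$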
Encoder Network
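The encoder typically parameterizes a diagonal Gaussian distribution:
$$q_\phi(z|x) = \mathcal{N}\big(z;\, \mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\big)$$
where:
- $\mu_\phi(x) \in \mathbb{R}^d$ is the mean vector output by the encoder
- $\sigma_\phi^2(x) \in \mathbb{R}^d$ is the variance vector (often parameterized as $\log \sigma_\phi^2(x)$ for numerical stability)
- $d$ is the dimensionality of the latent space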
Decoder Network
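The decoder defines the likelihood of the data given the latent variable
$$p_\theta(x|z) = \mathcal{N}\big(x;\, \mu_\theta(z),\, \sigma^2 I\big)$$
for continuous data, or
$$p_\theta(x|z) = \prod_i \mathrm{Bernoulli}\big(x_i;\, f_\theta(z)_i\big)$$
for binary data.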
Reparameterization Trick
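To enable backpropagation through the stochastic sampling step, the VAE employs the reparameterization trick.
Instead of sampling $z \sim q_\phi(z|x)$ directly, we
- Sample noise: $\epsilon \sim \mathcal{N}(0, I)$
- Transform deterministically: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$
This makes $z$ a differentiable function of $\phi$, so gradients can flow through the sampling operation.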
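The effect of the trick can be illustrated without an autodiff library. The sketch below assumes a scalar latent and a simple test function $f(z) = z^2$ (both chosen only for illustration) and estimates the gradients of $\mathbb{E}[f(z)]$ with respect to the mean and standard deviation by differentiating through $z = \mu + \sigma \epsilon$ by hand.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Return z = mu + sigma * eps and the noise eps ~ N(0, 1) that produced it."""
    eps = rng.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * logvar)
    return mu + sigma * eps, eps

def pathwise_gradients(mu, logvar, n_samples, rng):
    """Estimate d/dmu and d/dsigma of E[z^2] using dz/dmu = 1 and dz/dsigma = eps."""
    grad_mu = grad_sigma = 0.0
    for _ in range(n_samples):
        z, eps = reparameterize(mu, logvar, rng)
        df_dz = 2.0 * z              # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0       # chain rule through z = mu + sigma * eps
        grad_sigma += df_dz * eps
    return grad_mu / n_samples, grad_sigma / n_samples

rng = random.Random(0)
mu, logvar = 1.0, math.log(0.25)     # sigma = 0.5
g_mu, g_sigma = pathwise_gradients(mu, logvar, 100000, rng)
# Analytically E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

Because the randomness lives entirely in `eps`, each sample of `z` has well-defined derivatives with respect to the distribution parameters. An autodiff framework applies the same chain rule automatically.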
Training Process
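Forward pass:
- Encode: $(\mu, \log \sigma^2) = \mathrm{Encoder}_\phi(x)$
- Sample: $z = \mu + \sigma \odot \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$
- Decode: $\hat{x} = \mathrm{Decoder}_\theta(z)$
Loss computation:
- Reconstruction loss: $\|x - \hat{x}\|_2^2$ (for a Gaussian decoder with fixed variance, up to scale and an additive constant)
- KL regularization: $\frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$
Backpropagation: Update both $\theta$ and $\phi$ using gradient descent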
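The forward pass and loss computation can be sketched end to end with tiny linear maps standing in for the two networks. Everything below is an illustrative assumption: the weight matrices are made-up numbers, `encode`, `sample`, `decode`, and `negative_elbo` are hypothetical helper names, and the gradient update is omitted because it would normally be handled by an autodiff framework.

```python
import math
import random

def encode(x, w_mu, w_logvar):
    """Hypothetical linear encoder: returns mean and log-variance vectors."""
    mu = [sum(w * xi for w, xi in zip(row, x)) for row in w_mu]
    logvar = [sum(w * xi for w, xi in zip(row, x)) for row in w_logvar]
    return mu, logvar

def sample(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def decode(z, w_dec):
    """Hypothetical linear decoder: maps z back to data space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in w_dec]

def negative_elbo(x, x_hat, mu, logvar):
    """Squared-error reconstruction loss plus closed-form KL to N(0, I)."""
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    kl = 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))
    return recon + kl, recon, kl

rng = random.Random(0)
x = [1.0, 0.5, -0.5]                                # one 3-dimensional input
w_mu = [[0.2, -0.1, 0.4], [0.0, 0.3, 0.1]]          # 2 x 3, latent dimension d = 2
w_logvar = [[-0.1, 0.0, 0.2], [0.1, -0.2, 0.0]]     # 2 x 3
w_dec = [[0.5, 0.1], [-0.3, 0.8], [0.2, -0.4]]      # 3 x 2

mu, logvar = encode(x, w_mu, w_logvar)
z = sample(mu, logvar, rng)
x_hat = decode(z, w_dec)
loss, recon, kl = negative_elbo(x, x_hat, mu, logvar)
print(loss, recon, kl)
```

The returned loss is the quantity to minimize, which corresponds to maximizing a single-sample estimate of the ELBO under the stated Gaussian assumptions.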
Practical Implementation
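The ELBO for a single sample, estimated with one draw $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$, becomes:
$$\mathcal{L}(\theta, \phi; x) \approx \log p_\theta(x|z) - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$
For a Gaussian encoder and a standard Gaussian prior, the KL term has a closed form:
$$D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$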
Conditional VAE (CVAE)
Motivation
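Standard VAE generates random samples from $p_\theta(x)$, with no control over what is generated. Conditional VAE allows controlled generation by conditioning on additional information $c$ (e.g., class labels, attributes).
Mathematical Framework
The conditional generative model becomes:
$$p_\theta(x|c) = \int p_\theta(x|z, c)\, p(z|c)\, dz$$
where the conditional prior $p(z|c)$ is often simplified to $p(z)$.
Both encoder and decoder are conditioned:
- Conditional encoder: $q_\phi(z|x, c)$
- Conditional decoder: $p_\theta(x|z, c)$
Modified ELBO
$$\mathcal{L}(\theta, \phi; x, c) = \mathbb{E}_{q_\phi(z|x, c)}[\log p_\theta(x|z, c)] - D_{KL}\big(q_\phi(z|x, c) \,\|\, p(z|c)\big)$$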
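One common way to condition both networks, assumed here for illustration and not prescribed by the text, is to concatenate a one-hot encoding of the condition to the encoder and decoder inputs. The helper names below are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot condition vector c."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def encoder_input(x, c):
    """Input to a conditional encoder q(z | x, c): x and c concatenated."""
    return list(x) + list(c)

def decoder_input(z, c):
    """Input to a conditional decoder p(x | z, c): z and c concatenated."""
    return list(z) + list(c)

x = [0.1, 0.9, 0.3, 0.7]      # hypothetical 4-dimensional input
z = [0.5, -1.2]               # hypothetical 2-dimensional latent sample
c = one_hot(2, num_classes=3)

print(encoder_input(x, c))    # length 4 + 3
print(decoder_input(z, c))    # length 2 + 3
```

At generation time only the decoder path is needed: draw a latent sample from the prior, attach the desired condition vector, and decode.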
Applications
- Class-conditional generation: Generate images of specific classes
- Style transfer: Control artistic style while preserving content
- Text-to-image: Generate images from textual descriptions
Vector Quantized VAE (VQ-VAE)
Motivation
Standard VAE can suffer from posterior collapse: the decoder may ignore the latent codes during generation. VQ-VAE addresses this with discrete latent representations.
Key Innovation: Vector Quantization
Instead of continuous latent variables, VQ-VAE uses a discrete codebook:
- Codebook: $\{e_k\}_{k=1}^{K}$ where $e_k \in \mathbb{R}^D$
- Quantization: $z_q(x) = e_k$ where $k = \arg\min_j \|z_e(x) - e_j\|_2$
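The quantization step is a nearest-neighbor lookup in the codebook. Here is a minimal sketch with a made-up codebook of four two-dimensional vectors; `quantize` and `squared_distance` are hypothetical helper names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]   # K = 4, D = 2
encoder_outputs = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]        # continuous z_e

indices = [quantize(z_e, codebook)[0] for z_e in encoder_outputs]
print(indices)
```

Minimizing the squared distance selects the same entry as minimizing the Euclidean norm. The resulting integer indices are the discrete latent representation, and the decoder receives the corresponding codebook vectors.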
Architecture Changes
Encoder: x → z_e (continuous)
Vector Quantization: z_e → z_q (discrete)
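Decoder: z_q → x̂
Training Objective
$$L = \|x - \hat{x}\|_2^2 + \big\|\mathrm{sg}[z_e(x)] - e\big\|_2^2 + \beta\, \big\|z_e(x) - \mathrm{sg}[e]\big\|_2^2$$
where:
- Reconstruction loss: $\|x - \hat{x}\|_2^2$
- Codebook loss: $\|\mathrm{sg}[z_e(x)] - e\|_2^2$ - updates codebook vectors toward encoder outputs
- Commitment loss: $\beta\, \|z_e(x) - \mathrm{sg}[e]\|_2^2$ - encourages encoder outputs to commit to codebook entries
- $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation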
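The three loss terms can be computed numerically as in the sketch below. It assumes a squared-error reconstruction loss and a commitment weight `beta=0.25`, both illustrative choices, and the function name is hypothetical. Plain Python has no autodiff, so the stop-gradient is only described in comments: it changes which parameters receive gradients, not the value of each term.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def vq_vae_loss_terms(x, x_hat, z_e, e, beta=0.25):
    """Return the reconstruction, codebook, and commitment terms for one example.

    Stop-gradient affects only the backward pass, so the codebook and
    commitment terms share the same distance and differ only by beta.
    """
    reconstruction = squared_distance(x, x_hat)
    codebook = squared_distance(z_e, e)            # gradient would reach e only
    commitment = beta * squared_distance(z_e, e)   # gradient would reach z_e only
    return reconstruction, codebook, commitment

x, x_hat = [1.0, 0.0, 0.5], [0.8, 0.1, 0.5]   # hypothetical input and reconstruction
z_e, e = [0.9, 1.2], [1.0, 1.0]               # encoder output and its nearest code

recon, codebook, commit = vq_vae_loss_terms(x, x_hat, z_e, e)
total = recon + codebook + commit
print(recon, codebook, commit, total)
```

In an autodiff framework the same distance would be written twice, once with the encoder output detached and once with the codebook vector detached, so that each term trains only its intended parameters.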
Advantages of VQ-VAE
- Mitigates posterior collapse: The decoder is driven by the quantized codes, although some codebook entries may go unused in practice
- Better reconstruction: Avoids blurry outputs common in VAE
- Interpretable latents: Discrete codes often correspond to meaningful features
- Hierarchical modeling: Can be stacked for multi-scale representations
VQ-VAE-2
Extends VQ-VAE with:
- Hierarchical quantization: Multiple resolution levels
- PixelCNN prior: Autoregressive modeling of the quantized codes
- Better sample quality: Competitive with GANs on image generation
Comparison Summary
| Method | Latent Space | Pros | Cons |
|---|---|---|---|
| VAE | Continuous, Gaussian | Simple, stable training | Blurry outputs, posterior collapse |
| CVAE | Continuous, conditional | Controllable generation | Still suffers from VAE limitations |
| VQ-VAE | Discrete, codebook | Sharp outputs, no collapse | More complex training, discrete optimization |
Applications and Impact
- Image generation: DALL-E tokenizes images with a discrete VAE closely related to VQ-VAE
- Audio modeling: VQ-VAE for speech and music synthesis
- Representation learning: Learning disentangled and interpretable features
- Data compression: Efficient encoding of high-dimensional data

