Reverse KL Divergence

First published: 2025-06-25


Definitions#

The KL Divergence (KLD) is defined as

\mathbb{D}_{\text{KL}}[p(\cdot) \| q(\cdot)] = \mathbb{E}_{x\sim p(\cdot)}\left[\log\frac{p(x)}{q(x)}\right] = \int p(x) \log\frac{p(x)}{q(x)} \, dx

This is the standard form we typically use to measure how different two distributions are, which we call the forward KLD. Throughout, $p$ is the target distribution and $q$ is the approximation being fitted.

But there is another way to look at it: what if we swap the roles of $p$ and $q$?

\mathbb{D}_{\text{KL}}[q(\cdot) \| p(\cdot)] = \mathbb{E}_{x\sim q(\cdot)}\left[\log\frac{q(x)}{p(x)}\right] = \int q(x) \log\frac{q(x)}{p(x)} \, dx

This gives us what we call the reverse KLD. While they might look similar, they behave quite differently in practice.
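To make the asymmetry concrete, here is a minimal sketch that uses the closed-form KLD between two univariate Gaussians. The function name `kl_gauss` and the example parameters are illustrative assumptions, not part of the experiment described later.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL[N(m1, s1^2) || N(m2, s2^2)] in nats."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# Hypothetical example: p = N(0, 1), q = N(1, 2^2)
forward = kl_gauss(0.0, 1.0, 1.0, 2.0)  # KL[p || q], approximately 0.443
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)  # KL[q || p], approximately 1.307

print(f"forward KL = {forward:.4f}, reverse KL = {reverse:.4f}")
```

Swapping the arguments changes the value by roughly a factor of three here, even though both distributions are simple Gaussians.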

Experimental Results#

Setup#

We fitted a 2-component Gaussian mixture to approximate a 3-component target distribution with peaks at positions [-2, 0, 2] and weights [0.3, 0.4, 0.3]. The KLD values were estimated by Monte Carlo sampling with 20,000 samples rather than by numerical integration on a grid, which can suffer from discretization error.
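A Monte Carlo estimator of this kind could look roughly like the sketch below. It is not the original experiment script: the helper names (`mixture_logpdf`, `mixture_sample`, `mc_kl`), the target standard deviations, and the candidate mixture `q` are all assumptions made for illustration, since the setup above only states the target means and weights.

```python
import numpy as np

def mixture_logpdf(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture, evaluated at each x."""
    x = np.asarray(x, dtype=float)[:, None]
    w, m, s = (np.asarray(a, dtype=float) for a in (weights, means, stds))
    log_comp = np.log(w) - 0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((x - m) / s) ** 2
    return np.logaddexp.reduce(log_comp, axis=1)

def mixture_sample(rng, n, weights, means, stds):
    """Draw n samples: pick a component, then sample from its Gaussian."""
    idx = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[idx], np.asarray(stds)[idx])

def mc_kl(rng, n, a, b):
    """Monte Carlo estimate of KL[a || b] using n samples drawn from a."""
    x = mixture_sample(rng, n, *a)
    return np.mean(mixture_logpdf(x, *a) - mixture_logpdf(x, *b))

rng = np.random.default_rng(0)
# Means and weights from the setup; the stds [0.3, 0.4, 0.3] are an assumption.
target = ([0.3, 0.4, 0.3], [-2.0, 0.0, 2.0], [0.3, 0.4, 0.3])
# Hypothetical 2-component candidate, not a fitted result.
q = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])

forward_kl = mc_kl(rng, 20_000, target, q)  # samples come from p
reverse_kl = mc_kl(rng, 20_000, q, target)  # samples come from q
print(f"forward = {forward_kl:.4f}, reverse = {reverse_kl:.4f}")
```

Note that the two estimators differ in where the samples come from, which is exactly the weighting asymmetry discussed later.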


Observed Phenomena#

Forward KLD Results:

Means: [-1.98, 1.97] - positioned near outer target modes
Standard deviations: [0.41, 0.42] - wider to cover intermediate regions
Weights: [0.52, 0.48] - balanced between components
Loss: 0.0012 - very low

Reverse KLD Results:

Means: [-2.001, 0.010] - focused on the two strongest modes
Standard deviations: [0.299, 0.418] - matching target component widths
Weights: [0.426, 0.574] - proportional to target mode strengths
Loss: 0.3524 - significantly higher

Key Behavioral Differences#

Forward KLD creates a bimodal approximation that positions components to provide coverage across the target's support. Because the objective is an expectation under $p$, it penalizes every region where $p$ has mass but $q$ does not; it minimizes a $p$-weighted average mismatch rather than a worst-case error.
Reverse KLD produces a selective approximation, concentrating its probability mass on the two most prominent modes (x = -2 and x = 0) while completely ignoring the rightmost peak at x = 2.
Forward KLD uses broader components with balanced weights to ensure no region of significant target mass is left uncovered, while Reverse KLD uses component parameters that closely match the characteristics of the selected target modes.
Loss magnitude difference: Forward KLD reaches a much lower loss (0.0012) than Reverse KLD (0.3524). The two values come from different objectives and are not directly comparable, but they are consistent with Forward KLD covering the target and Reverse KLD accepting a higher loss in exchange for a concentrated, high-fidelity approximation of the selected modes.

Theoretical Explanation#

The key to understanding these different behaviors lies in examining the mathematical forms and the role of the weighting function.

Mathematical Analysis#

Let’s rewrite the two KLD formulations with emphasis on their weighting:

\text{Forward KLD:}\quad \mathbb{D}_{\text{KL}}[p \| q] = \int \underbrace{p(x)}_{\text{weight}} \log\frac{p(x)}{q(x)} \, dx

\text{Reverse KLD:}\quad \mathbb{D}_{\text{KL}}[q \| p] = \int \underbrace{q(x)}_{\text{weight}} \log\frac{q(x)}{p(x)} \, dx

The crucial insight is that the first distribution in the KLD acts as the weighting function for the expectation.
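One way to see the consequence of this weighting is to split each divergence into an entropy term and a cross term. The entropy symbol $\mathbb{H}[p] = -\int p(x) \log p(x) \, dx$ is notation introduced here for convenience; it does not appear elsewhere in this post.

```latex
\mathbb{D}_{\text{KL}}[p \| q] = -\mathbb{H}[p] - \mathbb{E}_{x\sim p}\left[\log q(x)\right]

\mathbb{D}_{\text{KL}}[q \| p] = -\mathbb{H}[q] - \mathbb{E}_{x\sim q}\left[\log p(x)\right]
```

In the forward case $\mathbb{H}[p]$ does not depend on $q$, so minimizing over $q$ amounts to maximizing the expected log-likelihood of $q$ under samples from $p$. In the reverse case both terms depend on $q$: the cross term rewards placing samples where $p$ is large, while the entropy term discourages $q$ from collapsing to a point.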

Penalty Mechanisms#

Forward KLD is weighted by $p(x)$:

High penalty when $p(x)$ is large but $q(x)$ is small (since $\log\frac{p(x)}{q(x)} \to +\infty$ as $q(x) \to 0$)
Low penalty when $p(x)$ is small, regardless of $q(x)$ (the integrand is scaled by $p(x)$)
Consequence: $q$ must “cover” all regions where $p$ has significant mass
Behavior: Zero-avoiding, mass-covering (mode-averaging)

Reverse KLD is weighted by $q(x)$:

High penalty when $q(x)$ is large but $p(x)$ is small (since $\log\frac{q(x)}{p(x)} \to +\infty$ as $p(x) \to 0$)
Low penalty when $q(x)$ is small, regardless of $p(x)$ (the integrand is scaled by $q(x)$)
Consequence: $q$ avoids placing mass where $p$ has little mass
Behavior: Zero-forcing, mode-seeking
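The two penalty mechanisms can be checked numerically with a small sketch. Everything here is a hypothetical toy setup rather than the experiment above: a bimodal target `p`, a broad single Gaussian `q_broad` that spans both modes, and a narrow Gaussian `q_narrow` that sits on one mode. Integrals are approximated by a Riemann sum on a fine grid.

```python
import numpy as np

def normal_logpdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]

# Hypothetical bimodal target: equal-weight modes at -2 and +2.
log_p = np.logaddexp(
    np.log(0.5) + normal_logpdf(x, -2.0, 0.5),
    np.log(0.5) + normal_logpdf(x, 2.0, 0.5),
)
log_q_broad = normal_logpdf(x, 0.0, np.sqrt(4.25))  # moment-matched to p
log_q_narrow = normal_logpdf(x, 2.0, 0.5)           # locks onto one mode

def kl(log_a, log_b):
    """Riemann-sum approximation of KL[a || b] from log-densities on the grid."""
    return np.sum(np.exp(log_a) * (log_a - log_b)) * dx

fwd_broad, fwd_narrow = kl(log_p, log_q_broad), kl(log_p, log_q_narrow)
rev_broad, rev_narrow = kl(log_q_broad, log_p), kl(log_q_narrow, log_p)

print(f"forward: broad = {fwd_broad:.3f}, narrow = {fwd_narrow:.3f}")
print(f"reverse: broad = {rev_broad:.3f}, narrow = {rev_narrow:.3f}")
```

In this toy case the forward KLD prefers the broad candidate, because the narrow one leaves a whole mode of `p` uncovered, while the reverse KLD prefers the narrow candidate, because the broad one puts mass in the low-density valley between the modes.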

Asymmetric Risk#

Consider what happens when $p(x) \gg q(x)$ versus $q(x) \gg p(x)$:

Scenario	Forward KLD	Reverse KLD
$p(x) \gg q(x)$	High penalty ($p(x) \log\frac{p(x)}{q(x)}$ is large)	Low penalty ($q(x)$ is small)
$q(x) \gg p(x)$	Low penalty ($p(x)$ is small)	High penalty ($q(x) \log\frac{q(x)}{p(x)}$ is large)

This asymmetry explains the observed behaviors:

Forward KLD forces $q$ to “stretch” and cover all modes of $p$
Reverse KLD forces $q$ to “concentrate” and avoid low-probability regions of $p$
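The two scenarios can be reproduced by evaluating the integrands at a single point. The density values below are made-up numbers chosen only to satisfy $p(x) \gg q(x)$ and $q(x) \gg p(x)$.

```python
import math

def forward_term(p, q):
    """Pointwise integrand of KL[p || q]."""
    return p * math.log(p / q)

def reverse_term(p, q):
    """Pointwise integrand of KL[q || p]."""
    return q * math.log(q / p)

# Hypothetical density values at one point x.
scenarios = {
    "p >> q": (0.5, 1e-6),
    "q >> p": (1e-6, 0.5),
}

for name, (p, q) in scenarios.items():
    print(f"{name}: forward = {forward_term(p, q):+.6f}, reverse = {reverse_term(p, q):+.6f}")
```

The small terms come out slightly negative. That is expected: individual integrand values can be negative, and only the full integral is guaranteed to be non-negative.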

Information-Theoretic Perspective#

From an information theory standpoint:

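Forward KLD measures the expected extra code length (bits with $\log_2$, nats with the natural logarithm) when a code optimized for $q$ is used to compress data from $p$
Reverse KLD measures the expected extra code length when a code optimized for $p$ is used to compress data from $q$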

When optimizing $q$:

Forward KLD ensures $q$ can efficiently encode all data that $p$ might generate
Reverse KLD ensures that data $q$ generates can be efficiently encoded by $p$

This fundamental asymmetry drives the dramatically different optimization behaviors observed in our experiments.
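The coding interpretation can be verified on a small discrete example, where the KLD is the cross-entropy minus the entropy. The distributions `p` and `q` below are hypothetical, and base-2 logarithms are used so the result is in bits.

```python
import math

def entropy_bits(a):
    """Average code length (bits) of an optimal code for a."""
    return -sum(ai * math.log2(ai) for ai in a)

def cross_entropy_bits(a, b):
    """Average code length (bits) when data from a is coded with a code built for b."""
    return -sum(ai * math.log2(bi) for ai, bi in zip(a, b))

def kl_bits(a, b):
    """Extra bits per symbol: cross-entropy minus entropy."""
    return cross_entropy_bits(a, b) - entropy_bits(a)

# Hypothetical two-symbol distributions.
p = [0.5, 0.5]
q = [0.25, 0.75]

extra_pq = kl_bits(p, q)  # data from p, code for q: approximately 0.2075 bits
extra_qp = kl_bits(q, p)  # data from q, code for p: approximately 0.1887 bits
print(f"KL[p || q] = {extra_pq:.4f} bits, KL[q || p] = {extra_qp:.4f} bits")
```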