Reverse KL Divergence
First published: 2025-06-25

Definitions#

The KL Divergence (KLD) is defined as

$$
\mathbb{D}_{\text{KL}}[p(\cdot) \,\|\, q(\cdot)] = \mathbb{E}_{x\sim p(\cdot)}\left[\log\frac{p(x)}{q(x)}\right] = \int p(x) \log\frac{p(x)}{q(x)} \, dx
$$

This is the standard form we typically use to measure how different two distributions are, which we call the forward KLD.

But there’s another way to look at it: what if we swap the roles of $p$ and $q$?

$$
\mathbb{D}_{\text{KL}}[q(\cdot) \,\|\, p(\cdot)] = \mathbb{E}_{x\sim q(\cdot)}\left[\log\frac{q(x)}{p(x)}\right] = \int q(x) \log\frac{q(x)}{p(x)} \, dx
$$

This gives us what we call the reverse KLD. While they might look similar, they behave quite differently in practice.
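The asymmetry is easy to check on a small discrete example. The sketch below is illustrative only: the function name `kl_divergence` and the two three-outcome distributions are assumptions made for this example, and the sum replaces the integral in the definitions above.

```python
import math

def kl_divergence(a, b):
    """D_KL[a || b] in nats for discrete distributions.

    Assumes b[i] > 0 wherever a[i] > 0; terms with a[i] == 0 contribute zero.
    """
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Hypothetical distributions, chosen only to illustrate the asymmetry.
p = [0.8, 0.1, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

forward = kl_divergence(p, q)  # expectation taken under p
reverse = kl_divergence(q, p)  # expectation taken under q

print(f"forward KLD D[p||q] = {forward:.4f}")
print(f"reverse KLD D[q||p] = {reverse:.4f}")
```

Both values are non-negative, but they differ, because each one averages the log-ratio under a different distribution.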

Experimental Results#

Setup#

We fitted a 2-component Gaussian mixture to approximate a 3-component target distribution with peaks at positions [-2, 0, 2] and weights [0.3, 0.4, 0.3]. The KLD values were estimated by Monte Carlo sampling with 20,000 samples rather than by numerical integration, which can suffer from discretization errors.
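A Monte Carlo estimate of this kind can be sketched as follows. This is not the experiment script. The helper names (`mixture_pdf`, `sample_mixture`, `mc_kl`) are hypothetical, and the target standard deviations of 0.4 are an assumption, since the post does not state them. The approximating mixture uses the reverse KLD parameters reported in the results.

```python
import math
import random

def mixture_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )

def sample_mixture(rng, weights, means, stds):
    """Draw one sample: pick a component by weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])

def mc_kl(rng, a, b, n_samples=20_000):
    """Monte Carlo estimate of D_KL[a || b]: mean of log(a(x)/b(x)) over x ~ a."""
    total = 0.0
    for _ in range(n_samples):
        x = sample_mixture(rng, *a)
        total += math.log(mixture_pdf(x, *a) / mixture_pdf(x, *b))
    return total / n_samples

rng = random.Random(0)
# Target weights and means are from the post; the stds (0.4) are ASSUMED.
target = ([0.3, 0.4, 0.3], [-2.0, 0.0, 2.0], [0.4, 0.4, 0.4])
# Two-component mixture taken from the reported reverse KLD fit.
approx = ([0.426, 0.574], [-2.001, 0.010], [0.299, 0.418])

forward_estimate = mc_kl(rng, target, approx)  # samples drawn from the target
reverse_estimate = mc_kl(rng, approx, target)  # samples drawn from the approximation

print(f"forward KLD estimate: {forward_estimate:.4f}")
print(f"reverse KLD estimate: {reverse_estimate:.4f}")
```

Each estimate samples from the first argument of the divergence, so the forward estimate needs samples from the target, while the reverse estimate only needs samples from the approximation.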


Observed Phenomena#

Forward KLD Results:

  • Means: [-1.98, 1.97] - positioned near outer target modes
  • Standard deviations: [0.41, 0.42] - wider to cover intermediate regions
  • Weights: [0.52, 0.48] - balanced between components
  • Loss: 0.0012 - very low

Reverse KLD Results:

  • Means: [-2.001, 0.010] - focused on the two strongest modes
  • Standard deviations: [0.299, 0.418] - matching target component widths
  • Weights: [0.426, 0.574] - proportional to target mode strengths
  • Loss: 0.3524 - significantly higher

Key Behavioral Differences#

  1. Forward KLD creates a bimodal approximation that positions components to provide coverage across all target modes. Because the objective averages $\log\frac{p(x)}{q(x)}$ under $p$, any region where the target has mass but the approximation has almost none is penalized heavily.

  2. Reverse KLD produces a selective approximation, concentrating its probability mass on the two most prominent modes (x = -2 and x = 0) while completely ignoring the rightmost peak at x = 2.

  3. Forward KLD uses broader components with balanced weights to ensure no region of significant target mass is left uncovered, while Reverse KLD uses component parameters that closely match the characteristics of the selected target modes.

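  4. Loss magnitude difference: Forward KLD reaches a much lower value of its objective (0.0012), indicating successful coverage of the target distribution, while Reverse KLD settles at a higher value (0.3524) in exchange for a concentrated, high-fidelity approximation of the selected modes. The two losses are different objectives, so their magnitudes are not directly comparable.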
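One hedged way to read the reverse KLD value of 0.3524 is the following idealization, which is an assumption and not something the experiment is known to compute. Suppose the target modes are well separated and $q$ reproduces the two selected components exactly, up to renormalization. The selected modes carry 0.3 + 0.4 = 0.7 of the target mass, so $q(x) \approx p(x) / 0.7$ wherever $q$ has mass, and the reverse KLD is about $-\log 0.7$.

```python
import math

# Target weights of the two modes the reverse KLD fit selected (from the post).
covered_mass = 0.3 + 0.4

# Idealized reverse KLD, ASSUMING well-separated modes and a q that matches
# the selected components exactly: E_q[log(q/p)] = log(1 / covered_mass).
idealized_reverse_kld = -math.log(covered_mass)

print(f"idealized reverse KLD: {idealized_reverse_kld:.4f}")
```

Under these assumptions the value comes out near 0.357, close to the reported loss. This suggests that most of the reverse KLD loss is the price of renormalizing onto two modes, not a poor fit of those modes.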

Theoretical Explanation#

The key to understanding these different behaviors lies in examining the mathematical forms and the role of the weighting function.

Mathematical Analysis#

Let’s rewrite the two KLD formulations with emphasis on their weighting:

$$
\text{Forward KLD:}\quad \mathbb{D}_{\text{KL}}[p \| q] = \int \underbrace{p(x)}_{\text{weight}} \log\frac{p(x)}{q(x)} \, dx
$$

$$
\text{Reverse KLD:}\quad \mathbb{D}_{\text{KL}}[q \| p] = \int \underbrace{q(x)}_{\text{weight}} \log\frac{q(x)}{p(x)} \, dx
$$

The crucial insight is that the first distribution in the KLD acts as the weighting function for the expectation.

Penalty Mechanisms#

Forward KLD is weighted by $p(x)$:

  • High penalty when $p(x)$ is large but $q(x)$ is small (since $\log\frac{p(x)}{q(x)} \to +\infty$ as $q(x) \to 0$)
  • Low penalty when $p(x)$ is small, regardless of $q(x)$
  • Consequence: $q$ must “cover” all regions where $p$ has significant mass
  • Behavior: Zero-avoiding, mode-averaging

Reverse KLD is weighted by $q(x)$:

  • High penalty when $q(x)$ is large but $p(x)$ is small (since $\log\frac{q(x)}{p(x)} \to +\infty$ as $p(x) \to 0$)
  • Low penalty when $q(x)$ is small, regardless of $p(x)$
  • Consequence: $q$ avoids placing mass where $p$ has little mass
  • Behavior: Zero-forcing, mode-seeking

Asymmetric Risk#

Consider what happens when $p(x) \gg q(x)$ versus $q(x) \gg p(x)$:

| Scenario | Forward KLD | Reverse KLD |
|---|---|---|
| $p(x) \gg q(x)$ | High penalty ($p(x) \log\frac{p(x)}{q(x)}$ large) | Low penalty ($q(x)$ small) |
| $q(x) \gg p(x)$ | Low penalty ($p(x)$ small) | High penalty ($q(x) \log\frac{q(x)}{p(x)}$ large) |
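The two scenarios can be checked numerically by evaluating the pointwise integrands at a single location. In this sketch the function names and the density values (0.3 and 1e-6) are hypothetical, chosen only to make the contrast visible.

```python
import math

def forward_term(p_x, q_x):
    """Pointwise forward KLD integrand: p(x) * log(p(x) / q(x))."""
    return p_x * math.log(p_x / q_x)

def reverse_term(p_x, q_x):
    """Pointwise reverse KLD integrand: q(x) * log(q(x) / p(x))."""
    return q_x * math.log(q_x / p_x)

# Hypothetical density values at one point x.
cases = {"p >> q": (0.3, 1e-6), "q >> p": (1e-6, 0.3)}

for label, (p_x, q_x) in cases.items():
    print(
        f"{label}: forward term = {forward_term(p_x, q_x):+.6f}, "
        f"reverse term = {reverse_term(p_x, q_x):+.6f}"
    )
```

A pointwise term can be slightly negative. Only the full integral is guaranteed to be non-negative, so "low penalty" here means a contribution of negligible magnitude.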

This asymmetry explains the observed behaviors:

  • Forward KLD forces $q$ to “stretch” and cover all modes of $p$
  • Reverse KLD forces $q$ to “concentrate” and avoid low-probability regions of $p$
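The stretch-versus-concentrate behavior shows up even in a simpler setting than the experiment above. The sketch below fits a single Gaussian to a hypothetical bimodal target (equal-weight modes at -2 and +2 with standard deviation 0.5) by brute-force grid search. The target, the grid, and all function names are assumptions made for illustration. Note that this sketch uses a Riemann sum, not the Monte Carlo estimator used in the experiment.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2 * math.pi)

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x, computed without underflow."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - LOG_SQRT_2PI

def log_target(x):
    """Hypothetical bimodal target: equal-weight Gaussians at -2 and +2, std 0.5."""
    return math.log(
        0.5 * math.exp(log_normal_pdf(x, -2.0, 0.5))
        + 0.5 * math.exp(log_normal_pdf(x, 2.0, 0.5))
    )

DX = 0.05
XS = [-6.0 + DX * i for i in range(241)]  # integration grid on [-6, 6]
LOG_P = [log_target(x) for x in XS]

def forward_kld(mu, sigma):
    """Riemann-sum approximation of D_KL[p || q] with q = N(mu, sigma^2)."""
    return DX * sum(
        math.exp(lp) * (lp - log_normal_pdf(x, mu, sigma))
        for x, lp in zip(XS, LOG_P)
    )

def reverse_kld(mu, sigma):
    """Riemann-sum approximation of D_KL[q || p] with q = N(mu, sigma^2)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)  # vanishes where q underflows to 0
    return DX * total

# Candidate (mu, sigma) pairs: mu in [-3, 3] step 0.5, sigma in [0.25, 3] step 0.25.
candidates = [(-3.0 + 0.5 * i, 0.25 * j) for i in range(13) for j in range(1, 13)]

best_forward = min(candidates, key=lambda c: forward_kld(*c))
best_reverse = min(candidates, key=lambda c: reverse_kld(*c))

print("forward KLD fit (mu, sigma):", best_forward)
print("reverse KLD fit (mu, sigma):", best_reverse)
```

On this grid the forward fit should land between the two modes with a large sigma. For a Gaussian $q$, minimizing forward KLD amounts to moment matching, which gives mean 0 and variance 4.25 for this target. The reverse fit should sit on one of the two modes with sigma near 0.5.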

Information-Theoretic Perspective#

From an information theory standpoint:

  • Forward KLD measures the extra bits needed when using code $q$ to compress data from $p$
  • Reverse KLD measures the extra bits needed when using code $p$ to compress data from $q$

When optimizing $q$:

  • Forward KLD ensures $q$ can efficiently encode all data that $p$ might generate
  • Reverse KLD ensures that data $q$ generates can be efficiently encoded by $p$
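The "extra bits" reading corresponds to the identity KLD = cross-entropy minus entropy, with logarithms taken base 2. A minimal sketch on hypothetical discrete distributions (the function names and the values of `p` and `q` are assumptions for illustration):

```python
import math

def entropy_bits(a):
    """Average code length (bits) of an optimal code for data drawn from a."""
    return -sum(ai * math.log2(ai) for ai in a if ai > 0)

def cross_entropy_bits(a, b):
    """Average code length (bits) when data from a is coded with a code built for b."""
    return -sum(ai * math.log2(bi) for ai, bi in zip(a, b) if ai > 0)

def extra_bits(a, b):
    """D_KL[a || b] in bits: the overhead of coding data from a with b's code."""
    return cross_entropy_bits(a, b) - entropy_bits(a)

# Hypothetical distributions over three symbols.
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

print(f"forward: data from p, code for q -> {extra_bits(p, q):.4f} extra bits/symbol")
print(f"reverse: data from q, code for p -> {extra_bits(q, p):.4f} extra bits/symbol")
```

The two overheads differ because each is averaged over a different data source, which is the same asymmetry discussed above.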

This fundamental asymmetry drives the dramatically different optimization behaviors observed in our experiments.

Reverse KL Divergence
https://adalovelemon.github.io/blog/posts/content/coursenotes/mathslab/reversekld/
Author: Ada Lovelemon
Published at: 2025-06-25
