Autoregressive Models
First published: 2025-05-05

Conditional Distribution Modeling#

Mathematical Foundation#

Chain Rule Decomposition#

Any joint distribution can be decomposed into a product of conditional distributions using the chain rule:

$$p(x_1, x_2, \dots, x_n) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2) \cdots p(x_n | x_1, \dots, x_{n-1})$$

This decomposition is particularly well-suited for modeling sequences of tokens, where each token $x_i$ represents a discrete element such as a word, an image patch, or an amino acid.
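The identity can be checked numerically on a small example. The sketch below builds a hypothetical joint distribution over three binary variables (the weights are arbitrary), derives each conditional from marginals, and confirms that their product recovers the joint. The helper names `marginal`, `conditional`, and `chain_rule_product` are illustrative only.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
# The weights are arbitrary; they only need to be positive and sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {xs: w / total for xs, w in zip(product([0, 1], repeat=3), weights)}


def marginal(prefix):
    """p(x_1..x_k = prefix), obtained by summing out the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)


def conditional(x, prefix):
    """p(x_k = x | x_1..x_{k-1} = prefix)."""
    return marginal(prefix + (x,)) / marginal(prefix)


def chain_rule_product(xs):
    """p(x1) * p(x2 | x1) * p(x3 | x1, x2)."""
    prob = 1.0
    for i, x in enumerate(xs):
        prob *= conditional(x, xs[:i])
    return prob


# The product of conditionals matches the joint for every assignment.
for xs, p in joint.items():
    assert abs(chain_rule_product(xs) - p) < 1e-12
```

Nothing here depends on the particular weights: the factorization is exact for any joint distribution, which is why it makes no modeling assumption by itself.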

Sequential Properties#

The chain rule decomposition holds for any multivariate distribution and for any ordering of its variables. Sequential data, however, comes with a natural choice of ordering: because the variables are ordered in time, they are not interchangeable, and conditioning each element on its predecessors follows that ordering.

Extension to Grid-Structured Data#

Unlike sequential data with natural temporal ordering, grid-structured data (images, videos, spatial data) lacks inherent sequential structure. However, autoregressive modeling can still be applied by imposing an artificial ordering on spatial elements.

For images, this is achieved by linearizing the 2D pixel grid into a 1D sequence using various strategies:

  • Raster scan ordering: left-to-right, top-to-bottom traversal
  • Spiral ordering: starting from center or corner positions
  • Random permutation: shuffling pixel positions randomly
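As a rough illustration of these strategies, the sketch below produces each ordering as a list of `(row, col)` positions and uses it to flatten a small image. The function names are hypothetical, and the spiral variant shown is only one possibility (clockwise, starting at the top-left corner and moving inward).

```python
import random


def raster_order(height, width):
    """Left-to-right, top-to-bottom traversal of (row, col) positions."""
    return [(r, c) for r in range(height) for c in range(width)]


def spiral_order(height, width):
    """Clockwise spiral starting at the top-left corner, moving inward."""
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]
        order += [(r, right) for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def random_order(height, width, seed=0):
    """A fixed random permutation of the positions (seeded for reproducibility)."""
    order = raster_order(height, width)
    random.Random(seed).shuffle(order)
    return order


def linearize(image, order):
    """Flatten a 2D list of pixels into a 1D sequence following `order`."""
    return [image[r][c] for r, c in order]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(linearize(image, raster_order(3, 3)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(linearize(image, spiral_order(3, 3)))  # [1, 2, 3, 6, 9, 8, 7, 4, 5]
```

Whatever ordering is chosen, it has to stay fixed between training and generation, since the conditionals are defined relative to it.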

After linearization, we apply the same conditional chain:

$$p(I) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2) \cdots p(x_n | x_1, \ldots, x_{n-1})$$

where $I$ represents the image and $x_i$ denotes the $i$-th pixel in the chosen ordering.

Key considerations for grid-structured data:

  • Spatial correlation: neighboring pixels exhibit stronger correlations than distant ones
  • Dimensional mismatch: natural 2D structure versus imposed 1D ordering
  • Order sensitivity: different linearization strategies can significantly impact modeling behavior

Despite these challenges, autoregressive image models such as PixelRNN and PixelCNN have successfully demonstrated the ability to capture complex image distributions and generate high-quality samples.

Conditional Distribution Parameterization#

Naive Parameterization Approach#

Each conditional distribution in the chain rule decomposition could be parameterized independently with its own parameter set $\theta_i$:

$$p_{\theta_1, \theta_2, \dots, \theta_n}(x_1, \dots, x_n) = p_{\theta_1}(x_1)\, p_{\theta_2}(x_2 | x_1) \cdots p_{\theta_n}(x_n | x_1, \dots, x_{n-1})$$

However, this approach introduces significant computational challenges:

  • Parameter explosion: Each position requires its own parameter set, so the number of parameter sets grows as $O(n)$ with the sequence length $n$
  • Memory overhead: Storing and managing separate networks for each conditional distribution
  • Training complexity: Optimizing multiple independent parameter sets simultaneously
  • Generalization issues: Limited parameter sharing reduces the model’s ability to learn common patterns

Weight Sharing Solution#

To address these limitations, modern autoregressive models employ weight sharing, where all conditional distributions share the same parameter set $\theta$:

$$p_\theta(x_1, \dots, x_n) = p_\theta(x_1)\, p_\theta(x_2 | x_1) \cdots p_\theta(x_n | x_1, \dots, x_{n-1})$$

Benefits of weight sharing:

  • Parameter efficiency: Fixed parameter count regardless of sequence length
  • Translation invariance: Model learns position-agnostic conditional patterns
  • Better generalization: Shared parameters enable learning from all positions simultaneously
  • Scalability: Enables processing of variable-length sequences without architectural changes
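To make the idea concrete, here is a minimal sketch of a weight-shared model in which a single function `cond_probs` serves every position. The model is deliberately simplified and is an assumption of this example, not a description of any particular architecture: it looks only at the last token of the prefix, and the values in `theta` are arbitrary.

```python
import math
from itertools import product

VOCAB = 3  # hypothetical vocabulary size

# Shared parameters theta: start logits plus one row of next-token logits per
# previous token. The values are arbitrary illustrative numbers.
theta = {
    "start": [0.5, -0.2, 0.1],
    "next": [[1.0, 0.0, -1.0],
             [0.3, 0.3, 0.3],
             [-0.5, 2.0, 0.0]],
}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cond_probs(theta, prefix):
    """p_theta(. | prefix). Simplifying assumption: only the last token is used."""
    if not prefix:
        return softmax(theta["start"])
    return softmax(theta["next"][prefix[-1]])


def log_prob(theta, xs):
    """log p_theta(x_1..x_n) as a sum of log conditionals, for any length n."""
    return sum(math.log(cond_probs(theta, xs[:i])[x]) for i, x in enumerate(xs))


n_params = len(theta["start"]) + sum(len(row) for row in theta["next"])
for n in (1, 2, 4):
    total = sum(math.exp(log_prob(theta, xs))
                for xs in product(range(VOCAB), repeat=n))
    # The parameter count stays fixed while the total probability over all
    # length-n sequences stays (numerically) equal to 1.
    print(n, n_params, round(total, 6))
```

The same 12 parameters define a valid distribution over sequences of every length tried, which is the parameter-efficiency and scalability point from the list above. A practical model would replace the last-token lookup with a network that summarizes the whole prefix.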

Autoregressive Models#

Definition and Method#

The term “autoregressive” combines “auto” (self) and “regression” (prediction). The model uses its own previous outputs as inputs for generating the next prediction, creating a self-referential chain where outputs become inputs.

Autoregressive models predict sequences by using previously generated elements to predict the next one, enabling generation of sequences of arbitrary length through iterative application.

Inductive Bias#

The inductive bias in autoregressive models stems from the assumption that the conditional distribution function remains stationary across all sequence positions. The same neural network architecture and parameters $\theta$ model every conditional probability $p_\theta(x_i | x_1, \ldots, x_{i-1})$ regardless of position $i$.

Key aspects of this inductive bias:

  1. Position-invariant conditional patterns: The model assumes that the mechanism for predicting the next token given previous context follows the same pattern throughout the sequence

  2. Shared representation learning: Using identical parameters across positions enables the model to learn generalizable features

  3. Implicit stationarity assumption: The conditional relationship $p(x_t | x_{<t})$ is assumed consistent across time steps

Benefits:

  • Efficient parameter utilization: Enables learning from all sequence positions simultaneously
  • Generalization to variable-length sequences: The same model handles sequences of any length
  • Transfer of learned patterns: Knowledge from one position informs predictions at other positions

Limitations:

  • Position-specific patterns: Some sequences may have position-dependent conditional structures this bias cannot capture
  • Oversimplification: Real-world sequences may exhibit non-stationary conditional dependencies

Dynamic Generation Process#

The autoregressive generation process creates a dynamic dependency chain:

  • Step 1: $x_1 \sim p_\theta(x_1)$
  • Step 2: $x_2 \sim p_\theta(x_2 | x_1)$
  • Step 3: $x_3 \sim p_\theta(x_3 | x_1, x_2)$
  • Step $t$: $x_t \sim p_\theta(x_t | x_1, x_2, \ldots, x_{t-1})$

This creates the dependency chain:

$$\begin{aligned} \text{Time step 1:} \quad &x_1 \\ \text{Time step 2:} \quad &x_1 \rightarrow x_2 \\ \text{Time step 3:} \quad &x_1, x_2 \rightarrow x_3 \\ &\vdots \\ \text{Time step } t\text{:} \quad &x_1, x_2, \ldots, x_{t-1} \rightarrow x_t \end{aligned}$$
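The generation loop itself is short. The sketch below uses a hypothetical stand-in for $p_\theta$ over the tokens {0, 1}: a Laplace-style rule in which the probability of a 1 depends on how many 1s the prefix already contains, so every earlier token influences the next draw. Both `cond_probs` and `generate` are illustrative names.

```python
import random


def cond_probs(prefix):
    """Hypothetical p(x_t | x_<t) over tokens {0, 1}.

    Laplace-style rule: the probability of a 1 grows with the number of 1s
    already in the prefix, so every earlier token influences the next one.
    """
    p_one = (1 + sum(prefix)) / (2 + len(prefix))
    return [1.0 - p_one, p_one]


def generate(length, seed=0):
    """Sample x_1, ..., x_length one token at a time, feeding each back in."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        probs = cond_probs(sequence)  # condition on all tokens so far
        token = rng.choices([0, 1], weights=probs)[0]
        sequence.append(token)        # the sampled output becomes input
    return sequence


print(generate(10))
```

Each iteration needs the result of the previous one, so the loop cannot be parallelized across time steps. That is the practical cost of the dependency chain shown above.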

Training vs. Inference#

During Training (Teacher Forcing):

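  • The model receives the ground truth sequence as input, so each position is conditioned on the true preceding tokens rather than on the model's own samples
  • Because no prediction depends on another prediction, all conditional probabilities can be computed in parallel when the architecture allows it (for example, the masked convolutions of PixelCNN)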
  • The model learns conditional distribution patterns without generating sequences step-by-step

During Inference (True Autoregression):

  • The model generates tokens sequentially, one at a time
  • Each new token is sampled from the learned distribution and fed back as input
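  • Because the model was trained only on ground truth prefixes, it never learned to recover from its own mistakes; this mismatch between training and inference is known as exposure bias, and it allows prediction errors to compound during generation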
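The contrast between the two regimes can be sketched in a few lines. Here `cond_probs` is again a hypothetical stand-in for a trained model over the tokens {0, 1}; `teacher_forced_nll` scores a ground truth sequence using true prefixes, while `greedy_rollout` conditions on the model's own outputs. Greedy decoding is used instead of sampling only to keep the example deterministic.

```python
import math


def cond_probs(prefix):
    """Hypothetical stand-in for a trained model's p(x_t | x_<t) over {0, 1}."""
    p_one = (1 + sum(prefix)) / (2 + len(prefix))
    return [1.0 - p_one, p_one]


def teacher_forced_nll(target):
    """Negative log-likelihood with ground truth prefixes at every position.

    Each term depends only on `target`, never on a model output, so the terms
    are independent of one another and could be evaluated in parallel.
    """
    losses = [-math.log(cond_probs(target[:t])[x]) for t, x in enumerate(target)]
    return sum(losses)


def greedy_rollout(length):
    """Inference: each step conditions on the model's own earlier outputs."""
    sequence = []
    for _ in range(length):
        probs = cond_probs(sequence)
        # Pick the most probable token (ties resolve to the lower index).
        sequence.append(max(range(len(probs)), key=probs.__getitem__))
    return sequence


print(teacher_forced_nll([1, 1, 0, 1]))
print(greedy_rollout(5))
```

In this toy model the rollout keeps reinforcing whatever its first step chose, because every later step sees only the model's own history. It is a small-scale picture of how an early choice propagates when there is no ground truth prefix to correct it.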

The training phase resembles intensive practice on decomposition problems, while the real autoregressive challenge emerges during inference when the model must generate coherent sequences using only its own predictions as context.

Limitations#

Autoregressive modeling has inherent limitations for sequences with complex contextual dependencies. In natural language, a word’s meaning often depends on both preceding and subsequent context. This bidirectional dependency means that autoregressive models, conditioning only on previous tokens, may miss crucial contextual information from future tokens.

This limitation has motivated the development of bidirectional models that capture dependencies in both directions.
