Conditional Distribution Modeling#

Mathematical Foundation#

Chain Rule Decomposition#

Any joint distribution can be decomposed into a product of conditional distributions using the chain rule:

$$p(x_1, x_2, \dots, x_n) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2) \dots p(x_n | x_1, \dots, x_{n-1})$$

This decomposition is particularly well suited to modeling sequences of tokens, where each token $x_i$ is a discrete element such as a word, an image patch, or an amino acid.
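The identity can be checked numerically on a small example. The sketch below assumes a hypothetical joint distribution over three binary variables (the weights are arbitrary illustrative numbers), derives each conditional by marginalization, and compares the chain-rule product with the joint probability.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
# The weights are arbitrary positive numbers chosen for illustration only.
weights = {
    xs: 1 + xs[0] + 2 * xs[1] * xs[2] + xs[0] * xs[2]
    for xs in product((0, 1), repeat=3)
}
total = sum(weights.values())
joint = {xs: w / total for xs, w in weights.items()}


def marginal(prefix):
    """p(x_1..x_k = prefix), summing the joint over the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)


def conditional(x, prefix):
    """p(x_k = x | x_1..x_{k-1} = prefix)."""
    return marginal(prefix + (x,)) / marginal(prefix)


def chain_rule_product(xs):
    """p(x_1) p(x_2 | x_1) ... p(x_n | x_1..x_{n-1}) for one assignment."""
    prob = 1.0
    for i, x in enumerate(xs):
        prob *= conditional(x, xs[:i])
    return prob


# Largest absolute difference between the product and the joint probability.
max_gap = max(abs(chain_rule_product(xs) - joint[xs]) for xs in joint)
print(max_gap)
```

The gap is zero up to floating-point rounding for every assignment, since the intermediate marginals cancel in the product.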

Sequential Properties#

While the chain rule decomposition applies universally to multivariate distributions, sequential data introduces important constraints. For sequences with temporal dependencies, the variables are not interchangeable due to their inherent ordering.

Extension to Grid-Structured Data#

Unlike sequential data with a natural temporal ordering, grid-structured data (images, spatial data, and the spatial dimensions of video frames) has no inherent sequential structure. However, autoregressive modeling can still be applied by imposing an artificial ordering on the spatial elements.

For images, this is achieved by linearizing the 2D pixel grid into a 1D sequence using various strategies:

Raster scan ordering: left-to-right, top-to-bottom traversal
Spiral ordering: starting from center or corner positions
Random permutation: shuffling pixel positions randomly
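The three strategies above can be sketched as functions that return the order in which grid positions are visited. The function names are hypothetical, the spiral variant shown starts at the top-left corner and moves clockwise inward, and the image is a small list of lists used only for illustration.

```python
import random


def raster_order(height, width):
    """Left-to-right, top-to-bottom traversal."""
    return [(r, c) for r in range(height) for c in range(width)]


def spiral_order(height, width):
    """Clockwise spiral starting at the top-left corner and moving inward."""
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]
        order += [(r, right) for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def random_order(height, width, seed=0):
    """A fixed random permutation of all positions (seeded for repeatability)."""
    order = raster_order(height, width)
    random.Random(seed).shuffle(order)
    return order


def linearize(image, order):
    """Read pixel values in the given order to obtain a 1D sequence."""
    return [image[r][c] for r, c in order]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
raster = linearize(image, raster_order(3, 3))
spiral = linearize(image, spiral_order(3, 3))
shuffled = linearize(image, random_order(3, 3))
```

Whatever ordering is chosen, it must be the same fixed permutation during training and generation so that each $x_i$ always refers to the same grid position.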

After linearization, we apply the same conditional chain:

$$p(I) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2) \cdots p(x_n | x_1, \ldots, x_{n-1})$$

where $I$ represents the image and $x_i$ denotes the $i$-th pixel in the chosen ordering.

Key considerations for grid-structured data:

Spatial correlation: neighboring pixels exhibit stronger correlations than distant ones
Dimensional mismatch: natural 2D structure versus imposed 1D ordering
Order sensitivity: different linearization strategies can significantly impact modeling behavior

Despite these challenges, autoregressive image models such as PixelRNN and PixelCNN have successfully demonstrated the ability to capture complex image distributions and generate high-quality samples.

Conditional Distribution Parameterization#

Naive Parameterization Approach#

Each conditional distribution in the chain rule decomposition could be parameterized independently with its own parameter set $\theta_i$:

$$p_{\theta_1, \theta_2, \dots, \theta_n}(x_1, \dots, x_n) = p_{\theta_1}(x_1)\, p_{\theta_2}(x_2 | x_1) \dots p_{\theta_n}(x_n | x_1, \dots, x_{n-1})$$

However, this approach introduces significant computational challenges:

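Parameter explosion: Each position requires its own parameter set, so the number of parameter sets grows as $O(n)$; because the $i$-th conditional also takes $i-1$ inputs, the total parameter count typically grows faster than linearly in $n$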
Memory overhead: Storing and managing separate networks for each conditional distribution
Training complexity: Optimizing multiple independent parameter sets simultaneously
Generalization issues: Limited parameter sharing reduces the model’s ability to learn common patterns

To address these limitations, modern autoregressive models employ weight sharing, where all conditional distributions share the same parameter set $\theta$:

$$p_\theta(x_1, \dots, x_n) = p_\theta(x_1)\, p_\theta(x_2 | x_1) \dots p_\theta(x_n | x_1, \dots, x_{n-1})$$

Benefits of weight sharing:

Parameter efficiency: Fixed parameter count regardless of sequence length
Translation invariance: Model learns position-agnostic conditional patterns
Better generalization: Shared parameters enable learning from all positions simultaneously
Scalability: Enables processing of variable-length sequences without architectural changes
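A rough parameter count makes the contrast concrete. The sketch below rests on two illustrative assumptions that do not come from the text: the naive approach uses one softmax-linear model per position over one-hot encoded context tokens, and the shared approach uses a single-layer recurrent network with a hypothetical hidden size.

```python
def per_position_param_count(n, vocab_size):
    """Separate softmax-linear model per position (assumed architecture).

    Position i sees i-1 one-hot encoded context tokens, so it needs
    (i-1) * vocab_size * vocab_size weights plus vocab_size biases.
    """
    total = 0
    for i in range(1, n + 1):
        inputs = (i - 1) * vocab_size
        total += inputs * vocab_size + vocab_size
    return total


def shared_rnn_param_count(vocab_size, hidden_size):
    """One simple recurrent network reused at every position (assumed architecture)."""
    embed = vocab_size * hidden_size                     # token embeddings
    recurrent = hidden_size * hidden_size + hidden_size  # hidden-to-hidden weights + bias
    output = hidden_size * vocab_size + vocab_size       # output projection + bias
    return embed + recurrent + output


for n in (10, 100, 1000):
    print(n, per_position_param_count(n, 50), shared_rnn_param_count(50, 64))
```

Under these assumptions the per-position count grows quadratically with the sequence length, while the shared count does not depend on $n$ at all.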

Autoregressive Models#

Definition and Method#

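The term “autoregressive” combines “auto” (self) and “regressive” (from regression, i.e. predicting one variable from others): the model regresses each element of a sequence on that sequence's own earlier elements. During generation it uses its own previous outputs as inputs for the next prediction, creating a self-referential chain where outputs become inputs.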

Autoregressive models predict sequences by using previously generated elements to predict the next one, enabling generation of sequences of arbitrary length through iterative application.

Inductive Bias#

The inductive bias in autoregressive models stems from the assumption that the conditional distribution function remains stationary across all sequence positions. The same neural network architecture and parameters $\theta$ model every conditional probability $p_\theta(x_i | x_1, \ldots, x_{i-1})$ regardless of position $i$.

Key aspects of this inductive bias:

Position-invariant conditional patterns: The model assumes that the mechanism for predicting the next token given previous context follows the same pattern throughout the sequence
Shared representation learning: Using identical parameters across positions enables the model to learn generalizable features
Implicit stationarity assumption: The conditional relationship $p(x_t | x_{<t})$ is assumed consistent across time steps

Benefits:

Efficient parameter utilization: Enables learning from all sequence positions simultaneously
Generalization to variable-length sequences: The same model handles sequences of any length
Transfer of learned patterns: Knowledge from one position informs predictions at other positions

Limitations:

Position-specific patterns: Some sequences may have position-dependent conditional structures this bias cannot capture
Oversimplification: Real-world sequences may exhibit non-stationary conditional dependencies

Dynamic Generation Process#

The autoregressive generation process creates a dynamic dependency chain:

Step 1: $x_1 \sim p_\theta(x_1)$
Step 2: $x_2 \sim p_\theta(x_2 | x_1)$
Step 3: $x_3 \sim p_\theta(x_3 | x_1, x_2)$
Step t: $x_t \sim p_\theta(x_t | x_1, x_2, \ldots, x_{t-1})$

This creates the dependency chain:

\begin{align}
\text{Time step 1:} \quad &x_1 \\
\text{Time step 2:} \quad &x_1 \rightarrow x_2 \\
\text{Time step 3:} \quad &x_1, x_2 \rightarrow x_3 \\
&\vdots \\
\text{Time step t:} \quad &x_1, x_2, \ldots, x_{t-1} \rightarrow x_t
\end{align}
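The generation loop can be sketched in a few lines. Here `next_token_probs` is a hypothetical hand-written stand-in for $p_\theta(x_t | x_1, \ldots, x_{t-1})$, not a trained model. What matters is that the same function is called at every step and that each sampled token is appended to the context for the next step.

```python
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def next_token_probs(prefix):
    """Hypothetical stand-in for p_theta(x_t | x_1..x_{t-1}).

    It favors tokens that have appeared less often in the prefix.
    The same function (same "parameters") is used at every position.
    """
    scores = [1.0 / (1 + prefix.count(tok)) for tok in VOCAB]
    total = sum(scores)
    return [s / total for s in scores]


def sample_sequence(length, seed=0):
    """Sample x_1, ..., x_length one token at a time."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        probs = next_token_probs(sequence)                  # condition on all previous tokens
        token = rng.choices(VOCAB, weights=probs, k=1)[0]   # x_t ~ p(x_t | x_<t)
        sequence.append(token)                              # output becomes input for the next step
    return sequence


sample = sample_sequence(8)
print(sample)
```

Because step $t$ cannot start until $x_{t-1}$ has been sampled, the loop is inherently sequential.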

Training vs. Inference#

During Training (Teacher Forcing):

The model receives the entire ground truth sequence as input
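All conditional probabilities can be computed in parallel when the architecture uses causal masking (as in Transformers or masked convolutions), because every prefix is already known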
The model learns conditional distribution patterns without generating sequences step-by-step

During Inference (True Autoregression):

The model generates tokens sequentially, one at a time
Each new token is sampled from the learned distribution and fed back as input
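Because the model was trained only on ground-truth prefixes, its own prediction errors can compound during generation; this mismatch between training and inference is known as exposure bias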
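The two regimes above can be contrasted on a toy example. In this sketch `next_token_probs` is again a hypothetical hand-written stand-in for $p_\theta(x_t | x_{<t})$, and greedy decoding is used instead of sampling purely to keep the output deterministic.

```python
import math

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def next_token_probs(prefix):
    """Hypothetical stand-in for p_theta(x_t | x_<t); not a trained model."""
    scores = [1.0 / (1 + prefix.count(tok)) for tok in VOCAB]
    total = sum(scores)
    return dict(zip(VOCAB, (s / total for s in scores)))


def teacher_forced_nll(target):
    """Negative log-likelihood with every prefix taken from the ground truth.

    Each term depends only on the target sequence, never on a model output,
    so the terms could be evaluated independently (and hence in parallel).
    """
    terms = [
        -math.log(next_token_probs(target[:t])[target[t]])
        for t in range(len(target))
    ]
    return sum(terms)


def greedy_generate(length):
    """Free-running generation: each prefix consists of the model's own outputs."""
    sequence = []
    for _ in range(length):
        probs = next_token_probs(sequence)
        sequence.append(max(VOCAB, key=probs.get))
    return sequence


target = ["a", "a", "b", "c"]
nll = teacher_forced_nll(target)
generated = greedy_generate(4)
print(nll, generated)
```

During training the loss sums the per-position terms of the chain rule in log space, whereas during generation an early mistake changes every later prefix.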

In other words, training only ever asks the model to predict one step ahead from a correct prefix, while the real autoregressive challenge emerges during inference, when the model must generate coherent sequences using only its own predictions as context.

Limitations#

Autoregressive modeling has inherent limitations for sequences with complex contextual dependencies. In natural language, a word’s meaning often depends on both preceding and subsequent context. Because an autoregressive model conditions only on previous tokens, its prediction at a given position cannot use contextual information from future tokens, even when that information is crucial.

This limitation has motivated the development of bidirectional models that capture dependencies in both directions.