Conditional Distribution Modeling
Mathematical Foundation
Chain Rule Decomposition
Any joint distribution can be decomposed into a product of conditional distributions using the chain rule:
$$p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
This decomposition is particularly well suited to modeling sequences of tokens, where each token is a discrete element such as a word, an image patch, or an amino acid.
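As a quick numerical check of the chain rule, the sketch below builds a small joint distribution over three binary variables and confirms that the product of its conditionals reproduces the joint. The table values and the helper names (`marginal`, `conditional`, `chain_rule_product`) are illustrative assumptions, not part of any particular model.

```python
import itertools
import random

random.seed(0)

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {x: w / total for x, w in zip(outcomes, weights)}


def marginal(prefix):
    """p(x_1, ..., x_k): sum the joint over all outcomes starting with prefix."""
    k = len(prefix)
    return sum(p for x, p in joint.items() if x[:k] == prefix)


def conditional(x, t):
    """p(x_t | x_1, ..., x_{t-1}) for the 0-based position t of outcome x."""
    return marginal(x[: t + 1]) / marginal(x[:t])


def chain_rule_product(x):
    """Multiply the conditionals p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2)."""
    prob = 1.0
    for t in range(len(x)):
        prob *= conditional(x, t)
    return prob


# The product of conditionals should match the joint for every outcome.
max_error = max(abs(chain_rule_product(x) - joint[x]) for x in outcomes)
print(f"largest deviation from the joint: {max_error:.2e}")
```

The deviation is zero up to floating-point rounding, because each marginal in a numerator cancels against the denominator of the next conditional.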
Sequential Properties
While the chain rule decomposition applies universally to multivariate distributions, sequential data introduces important constraints. For sequences with temporal dependencies, the variables are not interchangeable due to their inherent ordering.
Extension to Grid-Structured Data
Unlike sequential data with a natural temporal ordering, grid-structured data (images, the spatial dimensions of video frames, other spatial data) has no inherent one-dimensional ordering. However, autoregressive modeling can still be applied by imposing an artificial ordering on the spatial elements.
For images, this is achieved by linearizing the 2D pixel grid into a 1D sequence using various strategies:
- Raster scan ordering: left-to-right, top-to-bottom traversal
- Spiral ordering: starting from center or corner positions
- Random permutation: shuffling pixel positions randomly
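After linearization, we apply the same conditional chain:
$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid x_1, \dots, x_{i-1})$$
where $\mathbf{x}$ represents the image, $N$ is the number of pixels, and $x_i$ denotes the $i$-th pixel in the chosen ordering.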
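The three linearization strategies can be sketched in a few lines. This is a minimal illustration on a 3x3 grid stored as a list of lists; the function names are hypothetical, and the spiral shown here starts at the top-left corner and moves clockwise inward, which is only one of several possible spiral conventions.

```python
import random


def raster_order(height, width):
    """Left-to-right, top-to-bottom traversal of (row, col) positions."""
    return [(r, c) for r in range(height) for c in range(width)]


def spiral_order(height, width):
    """Clockwise spiral that starts at the top-left corner and moves inward."""
    top, bottom, left, right = 0, height - 1, 0, width - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]
        order += [(r, right) for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order


def random_order(height, width, seed=0):
    """A fixed random permutation of the pixel positions."""
    order = raster_order(height, width)
    random.Random(seed).shuffle(order)
    return order


def linearize(image, order):
    """Read the pixels of a 2D list in the given order, giving a 1D sequence."""
    return [image[r][c] for r, c in order]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

raster_seq = linearize(image, raster_order(3, 3))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
spiral_seq = linearize(image, spiral_order(3, 3))   # [1, 2, 3, 6, 9, 8, 7, 4, 5]
random_seq = linearize(image, random_order(3, 3))   # a fixed shuffle of 1..9
```

Whatever ordering is chosen, it must be fixed and used consistently, since the conditionals are defined relative to it.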
Key considerations for grid-structured data:
- Spatial correlation: neighboring pixels exhibit stronger correlations than distant ones
- Dimensional mismatch: natural 2D structure versus imposed 1D ordering
- Order sensitivity: different linearization strategies can significantly impact modeling behavior
Despite these challenges, autoregressive image models such as PixelRNN and PixelCNN have successfully demonstrated the ability to capture complex image distributions and generate high-quality samples.
Conditional Distribution Parameterization
Naive Parameterization Approach
Each conditional distribution in the chain rule decomposition could be parameterized independently with its own parameter set $\theta_t$:
$$p(x_t \mid x_1, \dots, x_{t-1}; \theta_t), \quad t = 1, \dots, T$$
However, this approach introduces significant computational challenges:
- Parameter explosion: Each position requires its own parameter set, so the total parameter count grows at least linearly with the sequence length
- Memory overhead: Storing and managing separate networks for each conditional distribution
- Training complexity: Optimizing multiple independent parameter sets simultaneously
- Generalization issues: Limited parameter sharing reduces the model’s ability to learn common patterns
Weight Sharing Solution
To address these limitations, modern autoregressive models employ weight sharing, where all conditional distributions share the same parameter set $\theta$:
$$p(x_t \mid x_1, \dots, x_{t-1}; \theta), \quad t = 1, \dots, T$$
Benefits of weight sharing:
- Parameter efficiency: Fixed parameter count regardless of sequence length
- Translation invariance: Model learns position-agnostic conditional patterns
- Better generalization: Shared parameters enable learning from all positions simultaneously
- Scalability: Enables processing of variable-length sequences without architectural changes
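To make the parameter-count contrast concrete, here is a rough sketch under strong simplifying assumptions that are not taken from the text above: each conditional is a single linear-softmax layer over one-hot encoded context tokens, the naive variant gives every position its own layer sized to its full context, and the shared variant uses one layer over a fixed window of the most recent tokens. Both function names are hypothetical.

```python
def naive_param_count(seq_len, vocab_size):
    """One linear-softmax layer per position, each with its own weights.

    Position t sees (t - 1) one-hot context tokens, so its weight matrix has
    (t - 1) * vocab_size inputs and vocab_size outputs, plus a bias vector.
    """
    total = 0
    for t in range(1, seq_len + 1):
        context_dim = (t - 1) * vocab_size
        total += context_dim * vocab_size + vocab_size
    return total


def shared_param_count(window, vocab_size):
    """A single linear-softmax layer reused at every position.

    Assumes a fixed context window of `window` one-hot tokens, so the count
    does not depend on the sequence length at all.
    """
    return window * vocab_size * vocab_size + vocab_size


vocab_size = 10
for seq_len in (3, 10, 100):
    naive = naive_param_count(seq_len, vocab_size)
    shared = shared_param_count(window=2, vocab_size=vocab_size)
    print(f"T={seq_len:>3}: naive={naive:>7}, shared={shared}")
```

Under these assumptions the naive count grows quadratically with the sequence length, while the shared count stays constant. Real architectures share parameters in more flexible ways (recurrence, attention), but the scaling argument is the same.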
Autoregressive Models
Definition and Method
The term “autoregressive” combines “auto” (self) and “regression” (prediction). The model uses its own previous outputs as inputs for generating the next prediction, creating a self-referential chain where outputs become inputs.
Autoregressive models predict sequences by using previously generated elements to predict the next one, enabling generation of sequences of arbitrary length through iterative application.
Inductive Bias
The inductive bias in autoregressive models stems from the assumption that the conditional distribution function remains stationary across all sequence positions. The same neural network architecture and parameters model every conditional probability $p(x_t \mid x_1, \dots, x_{t-1})$ regardless of position $t$.
Key aspects of this inductive bias:
- Position-invariant conditional patterns: The model assumes that the mechanism for predicting the next token given previous context follows the same pattern throughout the sequence
- Shared representation learning: Using identical parameters across positions enables the model to learn generalizable features
- Implicit stationarity assumption: The conditional relationship is assumed consistent across time steps
Benefits:
- Efficient parameter utilization: Enables learning from all sequence positions simultaneously
- Generalization to variable-length sequences: The same model handles sequences of any length
- Transfer of learned patterns: Knowledge from one position informs predictions at other positions
Limitations:
- Position-specific patterns: Some sequences may have position-dependent conditional structures this bias cannot capture
- Oversimplification: Real-world sequences may exhibit non-stationary conditional dependencies
Dynamic Generation Process
The autoregressive generation process creates a dynamic dependency chain:
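Step 1: $x_1 \sim p(x_1; \theta)$
Step 2: $x_2 \sim p(x_2 \mid x_1; \theta)$
Step 3: $x_3 \sim p(x_3 \mid x_1, x_2; \theta)$
Step t: $x_t \sim p(x_t \mid x_1, \dots, x_{t-1}; \theta)$
This creates the dependency chain:
$$x_1 \rightarrow x_2 \rightarrow x_3 \rightarrow \cdots \rightarrow x_t$$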
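The generation steps can be sketched as a simple sampling loop. The conditional model here, `next_token_probs`, is a hand-written stand-in for a learned network and is purely an assumption for illustration: it prefers the cyclic pattern a, b, c and makes the end-of-sequence token more likely as the context grows.

```python
import random

VOCAB = ["a", "b", "c", "<eos>"]


def next_token_probs(context):
    """Hypothetical stand-in for p(x_t | x_1, ..., x_{t-1}; theta).

    The same function, and therefore the same "parameters", is used at
    every position in the sequence.
    """
    weights = [1.0] * len(VOCAB)
    if context:
        # Favor the cyclic successor of the last token: a -> b -> c -> a.
        successor = (VOCAB.index(context[-1]) + 1) % 3
        weights[successor] += 3.0
    # Ending the sequence becomes more likely as the context grows.
    weights[-1] = 0.2 * len(context)
    total = sum(weights)
    return [w / total for w in weights]


def generate(max_len=20, seed=0):
    """Sample one token at a time, feeding each sample back as context."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(max_len):
        probs = next_token_probs(sequence)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        sequence.append(token)  # the output becomes part of the next input
    return sequence


sample = generate(seed=0)
print(" ".join(sample))
```

Each iteration conditions on everything sampled so far, which is why generation cannot be parallelized across positions in the way that likelihood evaluation on a known sequence can.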
Training vs. Inference
During Training (Teacher Forcing):
- The model receives the entire ground truth sequence as input
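- All conditional probabilities can be computed in parallel, because every prefix is already known from the ground truth (causally masked architectures exploit this; recurrent models still process positions in order)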
- The model learns conditional distribution patterns without generating sequences step-by-step
During Inference (True Autoregression):
- The model generates tokens sequentially, one at a time
- Each new token is sampled from the learned distribution and fed back as input
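- Because the model was trained only on ground-truth prefixes, its own prediction errors can compound during generation; this mismatch between training and inference is known as exposure bias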
Training is therefore closer to practice on individual next-token prediction problems with the correct context supplied. The real autoregressive challenge emerges during inference, when the model must generate coherent sequences using only its own predictions as context.
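The teacher-forcing objective can be sketched as follows. As before, `next_token_probs` is a hypothetical hand-written conditional model rather than a trained network; the point is only that every loss term is computed from a ground-truth prefix, never from the model's own samples.

```python
import math

VOCAB = ["a", "b", "c"]


def next_token_probs(context):
    """Hypothetical conditional model that prefers the cycle a -> b -> c -> a."""
    weights = [1.0] * len(VOCAB)
    if context:
        successor = (VOCAB.index(context[-1]) + 1) % len(VOCAB)
        weights[successor] += 3.0
    total = sum(weights)
    return [w / total for w in weights]


def teacher_forced_nll(target):
    """Average negative log-likelihood with ground-truth prefixes as context."""
    # Each term depends only on target[:t], which is known in advance, so the
    # terms are independent of one another and could be evaluated in parallel.
    losses = [
        -math.log(next_token_probs(target[:t])[VOCAB.index(target[t])])
        for t in range(len(target))
    ]
    return sum(losses) / len(losses)


nll_cyclic = teacher_forced_nll(["a", "b", "c", "a"])   # matches the model's bias
nll_repeat = teacher_forced_nll(["a", "a", "a", "a"])   # violates it
print(f"cyclic: {nll_cyclic:.3f}  repeated: {nll_repeat:.3f}")
```

The sequence that follows the model's preferred pattern receives the lower loss. At inference time the prefix `target[:t]` is replaced by the model's own earlier samples, which is exactly where the two regimes diverge.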
Limitations
Autoregressive modeling has inherent limitations for sequences with complex contextual dependencies. In natural language, a word’s meaning often depends on both preceding and subsequent context. This bidirectional dependency means that autoregressive models, conditioning only on previous tokens, may miss crucial contextual information from future tokens.
This limitation has motivated the development of bidirectional models that capture dependencies in both directions.

