t-SNE - Ada Lovelemon

t-SNE

First published: 2025-12-27


The core objective of t-SNE (t-distributed Stochastic Neighbor Embedding) is to reduce the dimensionality of high-dimensional data while preserving local neighborhood structure.

More precisely, t-SNE tries to make “who is close to whom” in the high-dimensional space look similar in the low-dimensional embedding. It is primarily a visualization method (2D/3D), not a general-purpose dimensionality reduction step whose output coordinates feed downstream models or distance-based metrics.

What t-SNE preserves (and what it doesn’t)

Preserves: local neighborhoods / nearest-neighbor relationships.
Does not promise: global distances, cluster sizes, or meaningful absolute axes.
Typical outcome: points form visually separated islands, but inter-island distances are not reliable.

1. High-Dimensional Similarities $p_{j\mid i}$

Let the dataset be $\mathcal{X} = \{x_i\}_{i=1}^N$, with $x_i \in \mathbb{R}^D$.

For each center point $x_i$, t-SNE defines a conditional probability that $x_j$ is a neighbor of $x_i$ using a Gaussian kernel with a point-specific bandwidth $\sigma_i$:

$$p_{j\mid i} = \frac{\exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\lVert x_i - x_k \rVert^2}{2\sigma_i^2}\right)},\quad p_{i\mid i}=0.$$

Why per-point $\sigma_i$? Because data density varies across the space: a single global kernel width would give points in dense regions too many effective neighbors and points in sparse regions too few.
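As a minimal NumPy sketch of the formula above, the function below computes the full matrix of $p_{j\mid i}$ for a given vector of bandwidths. The function name `conditional_probs` and the toy data are illustrative assumptions, and the dense $N \times N$ computation is only suitable for small $N$.

```python
import numpy as np

def conditional_probs(X, sigmas):
    """Row i holds p_{j|i}: Gaussian affinities around x_i with bandwidth sigmas[i]."""
    sq_norms = np.sum(X**2, axis=1)
    # Pairwise squared Euclidean distances, clipped at 0 to absorb round-off.
    D2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    logits = -D2 / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)              # enforces p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Toy data: 6 points in 4 dimensions, all bandwidths set to 1 for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
P_cond = conditional_probs(X, sigmas=np.ones(6))
```

Each row of `P_cond` is a probability distribution over the other points, so rows sum to 1 while columns generally do not.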

Perplexity and choosing $\sigma_i$

Instead of picking $\sigma_i$ directly, t-SNE chooses it so that the conditional distribution $P_i = \{p_{j\mid i}\}_{j\neq i}$ has a target perplexity:

$$\mathrm{Perp}(P_i) = 2^{H(P_i)},\quad H(P_i) = -\sum_{j \neq i} p_{j\mid i}\,\log_2 p_{j\mid i}.$$

Intuition: perplexity is an “effective number of neighbors”. In practice, $\sigma_i$ is found by binary search to match the desired perplexity.
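The binary search can be sketched for a single point as follows. This is one possible implementation under stated assumptions: the helper names `perplexity` and `find_sigma`, the search bracket, the tolerance, and the log-scale bisection are illustrative choices, not something the method prescribes. The search works because perplexity grows monotonically with $\sigma_i$.

```python
import numpy as np

def perplexity(p):
    """2**H(p) with H measured in bits; zero entries contribute nothing."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def find_sigma(sq_dists, target, tol=1e-5, max_steps=100):
    """Binary search sigma for one point, given squared distances to all *other* points."""
    lo, hi = 1e-10, 1e10                     # assumed bracket for sigma
    for _ in range(max_steps):
        sigma = np.sqrt(lo * hi)             # geometric midpoint (bisection on a log scale)
        logits = -sq_dists / (2.0 * sigma**2)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        perp = perplexity(p)
        if abs(perp - target) < tol:
            break
        if perp > target:                    # too many effective neighbors: shrink sigma
            hi = sigma
        else:                                # too few: grow sigma
            lo = sigma
    return sigma, p

# A uniform distribution over k neighbors has perplexity exactly k.
uniform_perp = perplexity(np.full(8, 1.0 / 8.0))

# Toy example: squared distances 1, 2, ..., 20 and a target perplexity of 5.
sq_dists = np.arange(1.0, 21.0)
sigma, p = find_sigma(sq_dists, target=5.0)
```

The uniform case makes the "effective number of neighbors" reading concrete: spreading the probability evenly over 8 points gives a perplexity of 8.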

Symmetrization $p_{ij}$

t-SNE uses a symmetric joint distribution over pairs:

$$p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N},\quad p_{ii}=0,\quad \sum_{i\neq j} p_{ij}=1.$$
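In code, symmetrization is a one-liner on the conditional matrix. The sketch below uses a random row-stochastic matrix as a stand-in for real $p_{j\mid i}$ values (an assumption made only to keep the example self-contained).

```python
import numpy as np

def symmetrize(P_cond):
    """Turn conditional probabilities p_{j|i} (rows sum to 1) into joint p_{ij}."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)

# Stand-in conditional matrix: random non-negative rows with zero diagonal, normalized per row.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
np.fill_diagonal(A, 0.0)
P_cond = A / A.sum(axis=1, keepdims=True)

P = symmetrize(P_cond)
row_mass = P.sum(axis=1)   # total probability mass attached to each point
```

Because every row of `P_cond` sums to 1, each entry of `row_mass` is at least $1/(2N)$, so even an isolated point keeps some influence on the objective.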

2. Low-Dimensional Similarities $q_{ij}$

Let $y_i \in \mathbb{R}^d$ be the low-dimensional embedding ($d=2$ or $3$). Instead of a Gaussian, t-SNE uses a heavy-tailed Student-$t$ distribution (with 1 degree of freedom, i.e. Cauchy):

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},\quad q_{ii}=0.$$

This choice addresses the crowding problem: in low dimensions there is not enough room to place all moderately distant points at faithful distances, so they get squeezed together. Heavy tails let moderately distant points sit farther apart in the embedding without a large penalty, which relieves the pressure toward the center.
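A small NumPy sketch of $q_{ij}$, followed by a comparison of the two kernels at the same distance. The function name `low_dim_affinities` and the test distance of 3 are illustrative assumptions.

```python
import numpy as np

def low_dim_affinities(Y):
    """Joint Student-t (1 d.o.f.) affinities q_ij, normalized over all pairs."""
    sq_norms = np.sum(Y**2, axis=1)
    D2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = 1.0 / (1.0 + D2)         # unnormalized heavy-tailed kernel
    np.fill_diagonal(W, 0.0)     # q_ii = 0
    return W / W.sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))
Q = low_dim_affinities(Y)

# Unnormalized kernel values at the same distance: the Student-t tail decays much more slowly.
d = 3.0
gaussian_tail = np.exp(-d**2 / 2.0)
student_tail = 1.0 / (1.0 + d**2)
```

Note that $Q$ is normalized once over all pairs, unlike the per-row normalization used for $p_{j\mid i}$.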

3. Objective: KL Divergence $\mathrm{KL}(P\,\|\,Q)$

t-SNE fits the embedding by minimizing:

$$\mathcal{L}(Y) = \mathrm{KL}(P\,\|\,Q) = \sum_{i\neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}.$$

Important asymmetry: $\mathrm{KL}(P\,\|\,Q)$ heavily penalizes pairs with large $p_{ij}$ but small $q_{ij}$, i.e. true neighbors that end up far apart in the embedding. The opposite error (small $p_{ij}$, large $q_{ij}$) costs much less, so the objective prioritizes local neighbor preservation.
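The asymmetry can be checked numerically by looking at what a single pair contributes to the sum. The sketch below is illustrative: `kl_divergence`, the `eps` guard, and the example probabilities are assumptions chosen for the demonstration.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over entries with p_ij > 0; eps guards against log of zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Tiny joint distribution over 3 points, compared against a uniform Q.
P = np.array([[0.0, 0.4, 0.1],
              [0.4, 0.0, 0.0],
              [0.1, 0.0, 0.0]])
Q = np.full((3, 3), 1.0 / 6.0)
np.fill_diagonal(Q, 0.0)
loss = kl_divergence(P, Q)

# Contribution of one pair to the sum in the two opposite error cases.
missed_neighbor = 0.1 * np.log(0.1 / 0.001)    # neighbors in P, far apart in the embedding
false_neighbor = 0.001 * np.log(0.001 / 0.1)   # not neighbors in P, close in the embedding
```

The magnitude of `missed_neighbor` is about a hundred times that of `false_neighbor`, which is why the optimizer works hardest to keep true neighbors together.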

4. Gradient (Attractive vs Repulsive Forces)

The gradient w.r.t. a point $y_i$ has a clean “forces” form:

$$\frac{\partial \mathcal{L}}{\partial y_i} = 4\sum_{j\neq i} (p_{ij}-q_{ij})\,\frac{(y_i-y_j)}{1+\lVert y_i-y_j\rVert^2}.$$

Attractive term: $p_{ij}$ pulls close neighbors together.
Repulsive term: $q_{ij}$ pushes points apart (preventing collapse).
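The gradient formula translates directly into vectorized NumPy. The sketch below is a dense $O(N^2)$ version for small inputs; the function name `tsne_grad` and the random stand-in for $P$ are assumptions for illustration.

```python
import numpy as np

def tsne_grad(P, Y):
    """Exact gradient of KL(P || Q) with respect to every y_i (dense, O(N^2))."""
    diff = Y[:, None, :] - Y[None, :, :]            # diff[i, j] = y_i - y_j
    W = 1.0 / (1.0 + np.sum(diff**2, axis=-1))      # (1 + |y_i - y_j|^2)^-1
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    coeff = (P - Q) * W                             # attraction (P) minus repulsion (Q)
    return 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)

# Stand-in symmetric joint distribution and a random 2D embedding.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
P = A + A.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = rng.normal(size=(6, 2))

G = tsne_grad(P, Y)
```

Since the pairwise forces are equal and opposite, the rows of `G` sum to zero: the loss does not change when the whole embedding is translated.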

5. The Algorithm (Practical t-SNE)

In practice, t-SNE is optimized by gradient descent with momentum and a few well-known tricks.

Step-by-step

(Optional but recommended) Preprocess $X$: standardize features; often apply PCA to 30–50 dims.
Compute an approximate kNN graph (for speed) and distances.
For each $i$, binary search $\sigma_i$ to match the chosen perplexity, yielding $p_{j\mid i}$.
Symmetrize to get $p_{ij}$.
Initialize $y_i$ (random or PCA).
Optimize $\mathcal{L}(Y)$ with gradient descent.
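The steps above can be strung together in a compact, exact (dense) sketch for toy-sized data. This is not how production libraries implement t-SNE: it skips preprocessing, the kNN approximation, momentum, and early exaggeration, and the function names, learning rate, iteration count, and toy data are all untuned assumptions.

```python
import numpy as np

def joint_p(X, perplexity=10.0, steps=50):
    """Binary search sigma_i per point, then symmetrize into p_ij."""
    N = X.shape[0]
    s = np.sum(X**2, axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    P = np.zeros((N, N))
    for i in range(N):
        d = np.delete(D2[i], i)                  # squared distances to the other points
        lo, hi = 1e-10, 1e10                     # assumed bracket for sigma_i
        for _ in range(steps):
            sigma = np.sqrt(lo * hi)             # bisection on a log scale
            p = np.exp(-(d - d.min()) / (2.0 * sigma**2))
            p /= p.sum()
            H = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            if 2.0**H > perplexity:
                hi = sigma
            else:
                lo = sigma
        P[i, np.arange(N) != i] = p
    return (P + P.T) / (2.0 * N)

def tsne(X, perplexity=10.0, n_iter=500, lr=5.0, seed=0):
    """Plain gradient descent on KL(P || Q) from a small random initialization."""
    P = joint_p(X, perplexity)
    rng = np.random.default_rng(seed)
    Y = 1e-4 * rng.normal(size=(X.shape[0], 2))
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        W = 1.0 / (1.0 + np.sum(diff**2, axis=-1))
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()
        grad = 4.0 * np.sum(((P - Q) * W)[:, :, None] * diff, axis=1)
        Y -= lr * grad
    return Y, P

# Toy data: two Gaussian blobs of 20 points each in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 10)), rng.normal(size=(20, 10)) + 10.0])
Y, P = tsne(X)
```

For anything beyond a few hundred points, the library implementations shown later in this post are the practical choice.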

Two common training tricks

(a) Early exaggeration

For an initial phase, replace $p_{ij}$ with $\alpha p_{ij}$, where $\alpha>1$ (often 4–12). This encourages clusters to separate early before fine local fitting.
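A sketch of how the exaggeration schedule might look inside an optimization loop. The helper `exaggeration_factor` is hypothetical, and the values $\alpha = 12$ and 250 exaggerated iterations are illustrative settings, not requirements of the method.

```python
import numpy as np

def exaggeration_factor(iteration, alpha=12.0, n_exaggerated_iters=250):
    """Multiplier applied to p_ij at a given iteration (hypothetical helper)."""
    return alpha if iteration < n_exaggerated_iters else 1.0

# Stand-in joint distribution over 4 points.
P = np.full((4, 4), 1.0 / 12.0)
np.fill_diagonal(P, 0.0)

P_early = exaggeration_factor(0) * P      # used in the gradient during the early phase
P_late = exaggeration_factor(250) * P     # plain P once exaggeration is switched off
```

During the early phase the exaggerated matrix no longer sums to 1; it is only used to strengthen the attractive term of the gradient.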

(b) Learning rate and momentum

Learning rate too small: points may clump and move slowly. Learning rate too large: embedding may “explode” and become unstable.
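The momentum update itself is short. The sketch below shows one common form (a velocity term that accumulates past gradients); the function name `momentum_step` and the default values are assumptions, and real implementations often add per-parameter gains on top.

```python
import numpy as np

def momentum_step(Y, velocity, grad, learning_rate=200.0, momentum=0.8):
    """One gradient-descent update with momentum; returns the new (Y, velocity)."""
    velocity = momentum * velocity - learning_rate * grad
    return Y + velocity, velocity

# Two steps with a constant gradient show how the velocity builds up.
Y = np.zeros(2)
v = np.zeros(2)
g = np.ones(2)
Y, v = momentum_step(Y, v, g, learning_rate=2.0, momentum=0.5)   # v = -2, Y = -2
Y, v = momentum_step(Y, v, g, learning_rate=2.0, momentum=0.5)   # v = -3, Y = -5
```

The second step is larger than the first even though the gradient is unchanged, which is what lets momentum speed up slow, consistent movement.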

6. Complexity and Accelerations

Naive t-SNE uses all pairwise interactions and costs $O(N^2)$ in memory/time.

Common accelerations:

Barnes–Hut t-SNE: approximates repulsive forces with a quadtree/octree, typically $O(N\log N)$ (mostly for 2D/3D).
FIt-SNE / FFT-based: uses interpolation + FFT for faster repulsive force computation, often near $O(N)$ in practice.
Approximate neighbors: compute $p_{ij}$ using kNN only (sparse $P$), which is crucial for large $N$.
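To illustrate the sparse-$P$ idea, the sketch below keeps only each point's $k$ nearest neighbors and normalizes the Gaussian affinities over that set. Several simplifications are assumed: the neighbor search is brute force (a real pipeline would use an approximate index), $\sigma$ is fixed rather than searched per point, and $k = 3 \times$ perplexity is one commonly used setting.

```python
import numpy as np

def knn_conditional_probs(X, k, sigma=1.0):
    """p_{j|i} restricted to the k nearest neighbors of each point (fixed sigma for brevity)."""
    s = np.sum(X**2, axis=1)
    D2 = np.maximum(s[:, None] + s[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, np.inf)                        # never pick a point as its own neighbor
    idx = np.argpartition(D2, k - 1, axis=1)[:, :k]     # indices of the k nearest neighbors
    d2 = np.take_along_axis(D2, idx, axis=1)
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma**2)
    p = np.exp(logits)
    return idx, p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
perplexity = 10
idx, p = knn_conditional_probs(X, k=3 * perplexity)

stored_entries = idx.size            # N * k values kept
dense_entries = X.shape[0] ** 2      # N * N values in the dense formulation
```

The storage for $P$ drops from $N^2$ to $Nk$ entries, which is what makes the attractive term cheap for large $N$.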

7. Engineering Notes (Hyperparameters + Pitfalls)

Recommended preprocessing

Standardize features (mean 0, variance 1) unless distances are already meaningful.
PCA to 30–50 dims often improves stability and speed (and denoises).
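A dependency-free sketch of both preprocessing steps, using an SVD for the PCA projection. The helper names and the random toy matrix are assumptions; in practice `StandardScaler` and `PCA` from scikit-learn do the same job.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each feature to mean 0 and (approximately) variance 1."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def pca_reduce(X, n_components):
    """Project centered data onto its top principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100)) * rng.uniform(0.1, 10.0, size=100)   # features on mixed scales
X_std = standardize(X)
X_50 = pca_reduce(X_std, n_components=50)
```

The columns of `X_50` are ordered by decreasing variance, so truncating further keeps the most informative directions first.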

Perplexity: how to pick

Small (5–30): focuses on very local structure; more fragmented islands.
Medium (30–100): each point attends to a wider neighborhood, so the layout is smoother with fewer islands.
Rule of thumb: perplexity should be smaller than $N/3$ and large enough to capture the neighborhood scale you care about.
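When sweeping perplexity, the rule of thumb can be applied as a simple filter. The helper `candidate_perplexities` and its default grid are hypothetical, chosen only to illustrate the $N/3$ bound.

```python
def candidate_perplexities(n_samples, grid=(5, 30, 50, 100)):
    """Keep only the perplexities from the grid that stay below n_samples / 3."""
    limit = n_samples / 3
    return [p for p in grid if p < limit]

small_dataset = candidate_perplexities(60)     # limit is 20
medium_dataset = candidate_perplexities(200)   # limit is about 66.7
```

Running t-SNE at each surviving value and comparing the plots is a cheap way to see which structures are stable across neighborhood scales.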

Common failure modes

Interpreting island distances as meaningful global geometry.
Interpreting island size/density as real density (t-SNE distorts densities).
Comparing two different runs without fixing randomness: different seeds can produce different layouts.
Applying t-SNE directly to raw pixels or unnormalized features; PCA/standardization typically helps.

About “new points” (out-of-sample)

Classic t-SNE is non-parametric: it optimizes $\{y_i\}$ for the training set only. If you need to embed new samples:

Use parametric t-SNE (learn a neural net to predict $y$).
Or use libraries that support adding points approximately (e.g. openTSNE).

8. Minimal Code Examples (Python)

scikit-learn

import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# X: [N, D]
X = ...
X = StandardScaler().fit_transform(X)

Z = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    init="pca",
    # Older scikit-learn releases name this parameter `n_iter` instead of `max_iter`.
    max_iter=1000,
    random_state=42,
).fit_transform(X)

Notes:

init="pca" usually improves stability.
learning_rate="auto" is often a good default in recent sklearn.

openTSNE (often faster / more flexible)

import numpy as np
from openTSNE import TSNE
from sklearn.decomposition import PCA

X = ...
# Reduce to 50 principal components before running t-SNE.
X_50 = PCA(n_components=50, random_state=42).fit_transform(X)

embedding = TSNE(
    n_components=2,
    perplexity=30,
    initialization="pca",
    random_state=42,
).fit(X_50)

# The fitted embedding is an ndarray subclass; view it as a plain array.
Z = embedding.view(np.ndarray)

SNE: uses Gaussian in both spaces; suffers more from crowding.
t-SNE: fixes crowding via Student-$t$ in low-dim.
UMAP: often preserves more global structure, faster for big datasets, supports transform (out-of-sample) more naturally.
PCA: linear; preserves global variance directions; good baseline and a common pre-step for t-SNE.
