828 个字词
4 分钟
t-SNE
首次发布: 2025-12-27
... 次访问

The core objective of t-SNE (t-distributed Stochastic Neighbor Embedding) is to reduce the dimensionality of high-dimensional data while preserving local neighborhood structure.

More precisely, t-SNE tries to make “who is close to whom” in high-dimensional space look similar in the low-dimensional embedding. It is primarily a visualization method (2D/3D), not a general-purpose dimensionality reduction for downstream metrics.

What t-SNE preserves (and what it doesn’t)

  • Preserves: local neighborhoods / nearest-neighbor relationships.
  • Does not promise: global distances, cluster sizes, or meaningful absolute axes.
  • Typical outcome: points form visually separated islands, but inter-island distances are not reliable.

1. High-Dimensional Similarities pjip_{j\mid i}#

Let the dataset be X={xi}i=1N\mathcal{X} = \{x_i\}_{i=1}^N, with xiRDx_i \in \mathbb{R}^D.

For each center point xix_i, t-SNE defines a conditional probability that xjx_j is a neighbor of xix_i using a Gaussian kernel with a point-specific bandwidth σi\sigma_i:

pji=exp(xixj22σi2)kiexp(xixk22σi2),pii=0.p_{j\mid i} = \frac{\exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\lVert x_i - x_k \rVert^2}{2\sigma_i^2}\right)},\quad p_{i\mid i}=0.

Why per-point σi\sigma_i? Because data density varies across the space; a single global kernel width often fails.

Perplexity and choosing σi\sigma_i#

Instead of picking σi\sigma_i directly, t-SNE chooses it so that the conditional distribution Pi={pji}jiP_i = \{p_{j\mid i}\}_{j\neq i} has a target perplexity:

Perp(Pi)=2H(Pi),H(Pi)=jipjilog2pji.\mathrm{Perp}(P_i) = 2^{H(P_i)},\quad H(P_i) = -\sum_{j \neq i} p_{j\mid i}\,\log_2 p_{j\mid i}.

Intuition: perplexity is an “effective number of neighbors”. In practice, σi\sigma_i is found by binary search to match the desired perplexity.

Symmetrization pijp_{ij}#

t-SNE uses a symmetric joint distribution over pairs:

pij=pji+pij2N,pii=0,ijpij=1.p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N},\quad p_{ii}=0,\quad \sum_{i\neq j} p_{ij}=1.

2. Low-Dimensional Similarities qijq_{ij}#

Let yiRdy_i \in \mathbb{R}^d be the low-dimensional embedding (d=2d=2 or 33). Instead of a Gaussian, t-SNE uses a heavy-tailed Student-tt distribution (with 1 degree of freedom, i.e. Cauchy):

qij=(1+yiyj2)1kl(1+ykyl2)1,qii=0.q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},\quad q_{ii}=0.

This choice addresses the crowding problem: in low dimensions, many moderately-close points compete for limited area. Heavy tails allow moderately distant points to stay separated without forcing everyone into the center.

3. Objective: KL Divergence KL(PQ)\mathrm{KL}(P\,\|\,Q)#

t-SNE fits the embedding by minimizing:

L(Y)=KL(PQ)=ijpijlogpijqij.\mathcal{L}(Y) = \mathrm{KL}(P\,\|\,Q) = \sum_{i\neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}.

Important asymmetry: KL(PQ)\mathrm{KL}(P\,\|\,Q) heavily penalizes when a high-probability neighbor in PP is far apart in QQ (i.e. it prioritizes local neighbor preservation).

4. Gradient (Attractive vs Repulsive Forces)#

The gradient w.r.t. a point yiy_i has a clean “forces” form:

Lyi=4ji(pijqij)(yiyj)1+yiyj2.\frac{\partial \mathcal{L}}{\partial y_i} = 4\sum_{j\neq i} (p_{ij}-q_{ij})\,\frac{(y_i-y_j)}{1+\lVert y_i-y_j\rVert^2}.
  • Attractive term: pijp_{ij} pulls close neighbors together.
  • Repulsive term: qijq_{ij} pushes points apart (preventing collapse).

5. The Algorithm (Practical t-SNE)#

In practice, t-SNE is optimized by gradient descent with momentum and a few well-known tricks.

Step-by-step#

  1. (Optional but recommended) Preprocess XX: standardize features; often apply PCA to 30–50 dims.
  2. Compute approximate kNN graph (for speed) and distances.
  3. For each ii, binary search σi\sigma_i to match the chosen perplexity, yielding pjip_{j\mid i}.
  4. Symmetrize to get pijp_{ij}.
  5. Initialize yiy_i (random or PCA).
  6. Optimize L(Y)\mathcal{L}(Y) with gradient descent.

Two common training tricks#

(a) Early exaggeration#

For an initial phase, replace pijp_{ij} with αpij\alpha p_{ij}, where α>1\alpha>1 (often 4–12). This encourages clusters to separate early before fine local fitting.

(b) Learning rate and momentum#

Learning rate too small: points may clump and move slowly. Learning rate too large: embedding may “explode” and become unstable.

6. Complexity and Accelerations#

Naive t-SNE uses all pairwise interactions and costs O(N2)O(N^2) in memory/time.

Common accelerations:

  • Barnes–Hut t-SNE: approximates repulsive forces with a quadtree/octree, typically O(NlogN)O(N\log N) (mostly for 2D/3D).
  • FIt-SNE / FFT-based: uses interpolation + FFT for faster repulsive force computation, often near O(N)O(N) in practice.
  • Approximate neighbors: compute pijp_{ij} using kNN only (sparse PP), which is crucial for large NN.

7. Engineering Notes (Hyperparameters + Pitfalls)#

  • Standardize features (mean 0, variance 1) unless distances already meaningful.
  • PCA to 30–50 dims often improves stability and speed (and denoises).

Perplexity: how to pick#

  • Small (5–30): focuses on very local structure; more fragmented islands.
  • Medium (30–100): smoother global neighborhood; fewer islands.
  • Rule of thumb: perplexity should be smaller than N/3N/3 and large enough to capture the neighborhood scale you care about.

Common failure modes#

  • Interpreting island distances as meaningful global geometry.
  • Interpreting island size/density as real density (t-SNE distorts densities).
  • Comparing two different runs without fixing randomness: different seeds can produce different layouts.
  • Applying t-SNE directly to raw pixels or unnormalized features; PCA/standardization typically helps.

About “new points” (out-of-sample)#

Classic t-SNE is non-parametric: it optimizes {yi}\{y_i\} for the training set only. If you need to embed new samples:

  • Use parametric t-SNE (learn a neural net to predict yy).
  • Or use libraries that support adding points approximately (e.g. openTSNE).

8. Minimal Code Examples (Python)#

scikit-learn#

import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# X: [N, D]
X = ...
X = StandardScaler().fit_transform(X)

Z = TSNE(
    # Depending on your sklearn version, this may be `max_iter` instead of `n_iter`.
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    init="pca",
    max_iter=1000,
    random_state=42,
).fit_transform(X)

Notes:

  • init="pca" usually improves stability.
  • learning_rate="auto" is often a good default in recent sklearn.

openTSNE (often faster / more flexible)#

from openTSNE import TSNE
from sklearn.decomposition import PCA

X = ...
X_50 = PCA(n_components=50, random_state=42).fit_transform(X)

embedding = TSNE(
    n_components=2,
    perplexity=30,
    initialization="pca",
    random_state=42,
).fit(X_50)

Z = embedding.view(np.ndarray)
  • SNE: uses Gaussian in both spaces; suffers more from crowding.
  • t-SNE: fixes crowding via Student-tt in low-dim.
  • UMAP: often preserves more global structure, faster for big datasets, supports transform (out-of-sample) more naturally.
  • PCA: linear; preserves global variance directions; good baseline and a common pre-step for t-SNE.

留言板