The core objective of t-SNE (t-distributed Stochastic Neighbor Embedding) is to reduce the dimensionality of high-dimensional data while preserving local neighborhood structure.
More precisely, t-SNE tries to make “who is close to whom” in high-dimensional space look similar in the low-dimensional embedding. It is primarily a visualization method (2D/3D), not a general-purpose dimensionality reduction for downstream metrics.
What t-SNE preserves (and what it doesn’t)
- Preserves: local neighborhoods / nearest-neighbor relationships.
- Does not promise: global distances, cluster sizes, or meaningful absolute axes.
- Typical outcome: points form visually separated islands, but inter-island distances are not reliable.
1. High-Dimensional Similarities
Let the dataset be , with .
For each center point , t-SNE defines a conditional probability that is a neighbor of using a Gaussian kernel with a point-specific bandwidth :
Why per-point ? Because data density varies across the space; a single global kernel width often fails.
Perplexity and choosing
Instead of picking directly, t-SNE chooses it so that the conditional distribution has a target perplexity:
Intuition: perplexity is an “effective number of neighbors”. In practice, is found by binary search to match the desired perplexity.
Symmetrization
t-SNE uses a symmetric joint distribution over pairs:
2. Low-Dimensional Similarities
Let be the low-dimensional embedding ( or ). Instead of a Gaussian, t-SNE uses a heavy-tailed Student- distribution (with 1 degree of freedom, i.e. Cauchy):
This choice addresses the crowding problem: in low dimensions, many moderately-close points compete for limited area. Heavy tails allow moderately distant points to stay separated without forcing everyone into the center.
3. Objective: KL Divergence
t-SNE fits the embedding by minimizing:
Important asymmetry: heavily penalizes when a high-probability neighbor in is far apart in (i.e. it prioritizes local neighbor preservation).
4. Gradient (Attractive vs Repulsive Forces)
The gradient w.r.t. a point has a clean “forces” form:
- Attractive term: pulls close neighbors together.
- Repulsive term: pushes points apart (preventing collapse).
5. The Algorithm (Practical t-SNE)
In practice, t-SNE is optimized by gradient descent with momentum and a few well-known tricks.
Step-by-step
- (Optional but recommended) Preprocess : standardize features; often apply PCA to 30–50 dims.
- Compute approximate kNN graph (for speed) and distances.
- For each , binary search to match the chosen perplexity, yielding .
- Symmetrize to get .
- Initialize (random or PCA).
- Optimize with gradient descent.
Two common training tricks
(a) Early exaggeration
For an initial phase, replace with , where (often 4–12). This encourages clusters to separate early before fine local fitting.
(b) Learning rate and momentum
Learning rate too small: points may clump and move slowly. Learning rate too large: embedding may “explode” and become unstable.
6. Complexity and Accelerations
Naive t-SNE uses all pairwise interactions and costs in memory/time.
Common accelerations:
- Barnes–Hut t-SNE: approximates repulsive forces with a quadtree/octree, typically (mostly for 2D/3D).
- FIt-SNE / FFT-based: uses interpolation + FFT for faster repulsive force computation, often near in practice.
- Approximate neighbors: compute using kNN only (sparse ), which is crucial for large .
7. Engineering Notes (Hyperparameters + Pitfalls)
Recommended preprocessing
- Standardize features (mean 0, variance 1) unless distances already meaningful.
- PCA to 30–50 dims often improves stability and speed (and denoises).
Perplexity: how to pick
- Small (5–30): focuses on very local structure; more fragmented islands.
- Medium (30–100): smoother global neighborhood; fewer islands.
- Rule of thumb: perplexity should be smaller than and large enough to capture the neighborhood scale you care about.
Common failure modes
- Interpreting island distances as meaningful global geometry.
- Interpreting island size/density as real density (t-SNE distorts densities).
- Comparing two different runs without fixing randomness: different seeds can produce different layouts.
- Applying t-SNE directly to raw pixels or unnormalized features; PCA/standardization typically helps.
About “new points” (out-of-sample)
Classic t-SNE is non-parametric: it optimizes for the training set only. If you need to embed new samples:
- Use parametric t-SNE (learn a neural net to predict ).
- Or use libraries that support adding points approximately (e.g. openTSNE).
8. Minimal Code Examples (Python)
scikit-learn
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# X: [N, D]
X = ...
X = StandardScaler().fit_transform(X)
Z = TSNE(
# Depending on your sklearn version, this may be `max_iter` instead of `n_iter`.
n_components=2,
perplexity=30,
learning_rate="auto",
init="pca",
max_iter=1000,
random_state=42,
).fit_transform(X)Notes:
init="pca"usually improves stability.learning_rate="auto"is often a good default in recent sklearn.
openTSNE (often faster / more flexible)
from openTSNE import TSNE
from sklearn.decomposition import PCA
X = ...
X_50 = PCA(n_components=50, random_state=42).fit_transform(X)
embedding = TSNE(
n_components=2,
perplexity=30,
initialization="pca",
random_state=42,
).fit(X_50)
Z = embedding.view(np.ndarray)9. Relationship to Related Methods (Quick Map)
- SNE: uses Gaussian in both spaces; suffers more from crowding.
- t-SNE: fixes crowding via Student- in low-dim.
- UMAP: often preserves more global structure, faster for big datasets, supports transform (out-of-sample) more naturally.
- PCA: linear; preserves global variance directions; good baseline and a common pre-step for t-SNE.

