Self-Distillation - Ada Lovelemon

Self-Distillation (Motivation & Definition)#

Self-distillation can be read as: the model teaches itself. During training we maintain a student and a teacher; the teacher is not an externally trained model but is usually an exponential moving average (EMA / momentum encoder) of the student parameters.
Compared to classical knowledge distillation, there are no human labels and no separate pretrained teacher. The supervision signal comes from the idea that “different augmented views of the same image should agree.”

Emerging Properties in Self-Supervised Vision Transformers (DINOv1, 2021)#

Paper: https://arxiv.org/abs/2104.14294
Codes: https://github.com/facebookresearch/dino

Overview#

This paper proposes DINO, a self-distillation framework with no labels, to pretrain ViTs. Besides the fact that the DINO method works quite well on this kind of architecture, there are also two interesting properties emerging from the learned features:

DINO features can be naturally applied for semantic segmentation (without fine-tuning):
- Just segment scenes with a nearest neighbor between consecutive frames can lead to competitive results on DAVIS-2017 benchmark;
- Different attention heads capture different semantic concepts (object parts, objects, background, etc).
DINO features also excel at classification tasks:
- Linear probing on ImageNet achieves 80.1% top-1 accuracy (ViT-B/8);
- k-NN classification (k=20) on ImageNet achieves 74.5% top-1 accuracy (ViT-S/16).

Self-Distillation means the DINO method involves two models—a student model and a teacher model, and both of the model share the same architecure but different parameters. Given a batch of input images $x \in \mathbb{R}^{B \times 3 \times H \times W}$ , apply two different random transformations on it to get two views $x_s$ and $x_t$ .

For the student model, after being feed with $x_s$ , it will output features $z_s \in \mathbb{R}^{B\times d}$ ; then apply a Softmax operation with temperature $\tau_s$ on $z_s$ ‘s feature dimension to get a student probability distribution $p_s$ .

p_s = \text{Softmax}\left(\frac{z_s}{\tau_s}, \text{dim}=1\right)

For the teacher model, after output features $z_t \in \mathbb{R}^{B\times d}$ when given $x_t$ , firstly center $z_t$ by subtracting a mean vector $c \in \mathbb{R}^{d}$ (dynamically updated during training), then apply the same Softmax operation with temperature $\tau_t$ to get the teacher probability distribution $p_t$ .

\begin{aligned} p_t = \text{Softmax}\left(\frac{z_t - c}{\tau_t}, \text{dim}=1\right) \\ c \leftarrow m \cdot c + (1-m) \cdot \frac{1}{B} \sum_{i=1}^{B} z_t^{(i)} \end{aligned}

Finally, we want the two distributions to be as close as possible, so we minimize the cross-entropy loss between them:

\min_{\theta_s} \text{CE}(p_t, p_s) = -\frac{1}{B}\sum_{i=1}^B\sum_{j=1}^d (p_t^{(ij)} \cdot \log p_s^{(ij)})

Note that only the student model is updated by back-propagation; the teacher model’s parameters are updated by an exponential moving average (EMA) of the student model’s parameters:

\theta_t \leftarrow \lambda \cdot \theta_t + (1-\lambda) \cdot \theta_s

DINO Framework

# gs, gt: student and teacher
# C: center buffer
# tps, tpt: student/teacher temperature
# m_net, m_center: EMA momentum for network and center

gt.params = gs.params
for x in loader:
    x_1, x_2 = augment_1(x), augment_2(x)  # random augmentations

    z_s1, z_s2 = gs(x_1), gs(x_2)          # student forward
    z_t1, z_t2 = gt(x_1), gt(x_2)          # teacher forward

    p_s1, p_s2 = softmax(z_s1 / tps), softmax(z_s2 / tps)  # student prob
    p_t1, p_t2 = softmax((z_t1 - C) / tpt), softmax((z_t2 - C) / tpt)  # teacher prob

    loss = 0.5 * (CE(p_t1, p_s2) + CE(p_t2, p_s1))  # cross-entropy loss
    loss.backward()

    gs_optimizer.zero_grad()
    gs_optimizer.step()  # update student

    # update teacher by EMA
    for p_s, p_t in zip(gs.params, gt.params):
        p_t.data = m_net * p_t.data + (1 - m_net) * p_s.data
    
    # update center
    C = m_center * C + (1 - m_center) * mean([z_t1, z_t2])

Technical Details#

Random Augmentation Strategy
Two major types of augmentations are applied here:
- BYOL’s augmentation: color jittering, Gaussian blur and solarization;
- Multi-crop: each input image $x$ is transformed into 2 global views $x_1^g, x_2^g$ (224x224, covering most of the image) and 6–10 local crops (96x96, randomly sampled patches), constructing a set of views $V = \{x_1^g, x_2^g, x_1^l, \ldots, x_n^l\}$ . The teacher only processes the global views, while the student processes all views (global + local). This design encourages the student to learn consistent representations across scales and promotes fine-grained semantic understanding—critical for downstream dense prediction tasks like segmentation.

Actual loss function:

\min_{\theta_s} \sum_{x\in\{x_1^g, x_2^g\}}\sum_{x'\in V, x' \ne x} \text{CE}\left(p_t(x), p_s(x')\right)

Centering to Prevent Collapse
While there are multiple stablizing methods including constrastive loss, clustering constraints, predictor and batch normalizations, only a centering and sharpening of the momentum teacher outputs (via using a low value of $\tau_t$ ) is enough to avoid model collapse—a degenerate solution where all inputs map to the same feature.
- Centering avoids the collapse induced by a dominant dimension, but encourages a uniform distribution.
- Sharpening prevents the model from outputting uniform distributions across all dimensions, while allowing some dimensions to dominate.
[CLS] token
An extra learnable token is prepended to the sequence, which acts to aggregate global information from the entire sequence. Although [CLS] token is primarily designed for supervised classification tasks, considering by adding a aggregative token can obtain similar benefits, we continue to use this name in DINO. A projection head is applied to the [CLS] token output to produce a global representation.
Attention Maps for Different Heads Different attention heads in the pretrained ViT capture different semantic concepts. For example, some heads focus on object parts, while others attend to whole objects or background regions. This diversity in attention patterns allows the model to learn rich and varied representations that are useful for downstream tasks like segmentation.

DINO Attention Maps

Writing Tricks#

Major Issue:
- Information Loss: Supervised pretraining usually reduces the rich visual information in an image to a single concept (label).
- Unlabeled Scales Up: Supervised pretraining requires large labeled datasets, which does not scale well with the increasing number of images.
Motivation: Self-supervised works well on NLP tasks, can be transferred to vision tasks?
Key Design: Like knowledge distillation.

iBOT: Image BERT Pre-Training with Online Tokenizer#

Paper: https://arxiv.org/abs/2111.07832
One-liner: extends DINO-style self-distillation from mostly global/CLS supervision to token-level targets, and combines it with masked image modeling.

How to think about it

DINO is closer to “make whole-image representations consistent across views”, while iBOT also “makes patch/token predictions consistent”.
In practice, iBOT often yields better fine-grained features (more friendly to detection/segmentation).

DINOv2: Learning Robust Visual Features Without Any Labels#

Paper: https://arxiv.org/abs/2304.07193
One-liner: scales the DINO/iBOT-style self-distillation recipe and refines training details to obtain very strong frozen features.

Main upgrades (reproduction-oriented)

Larger/cleaner data and improved training recipe (augmentations, regularization, optimization hyperparameters).
Uses token-level training signals (aligned with iBOT) plus more stable training details.
The resulting ViT features transfer well to classification/retrieval/segmentation/depth, acting as a “general-purpose vision backbone”.