Knowledge Distillation: A Visual Guide
Knowledge distillation transfers the rich representations learned by a large teacher network into a compact student model, enabling deployment-friendly models without starting from scratch. This guide walks through the three major distillation paradigms with interactive diagrams.
Why this guide? I wanted to summarise some of my own work on knowledge distillation and place it in context, showing how those ideas connect to the broader distillation literature.
1. Overview and the Teacher-Student Framework
The core idea, introduced by Hinton et al. [1], is elegantly simple: rather than training a small model purely on hard one-hot labels, we also train it to mimic the soft outputs of a large, pre-trained teacher. Soft outputs carry far more information than a binary correct/incorrect signal. A teacher that assigns 0.6 probability to "cat" and 0.35 to "lynx" is revealing something about visual similarity that a ground-truth label never could.
Modern surveys [2] organise distillation methods into three families based on what knowledge is transferred:
- Response-based methods match the teacher's final output layer.
- Feature-based methods match internal representations at one or more hidden layers.
- Relation-based methods match the relationships between samples or layers.
All three can be combined, and each adds a different inductive bias to the student. The diagram below gives a high-level picture; click any block to jump to its section.
2. Interactive Overview Diagram
Figure 1. The three families of knowledge distillation. Click any coloured block to jump to that section.
3. Response-Based Distillation
Response-based distillation is the original formulation [1]. The teacher's output logits are used as soft targets via a temperature \(T\) that controls how much the distribution is smoothed:
$$p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

At \(T=1\) this is the standard softmax. Larger \(T\) produces a softer distribution, exposing the relative ordering of all classes. The training loss combines standard cross-entropy with a KL divergence distillation term:
$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\mathrm{CE}}(y,\, p_S) \;+\; \alpha\, T^2 \cdot \mathrm{KL}\!\left(p_T(T) \,\|\, p_S(T)\right)$$

The \(T^2\) factor compensates for the smaller gradient magnitudes at high temperatures.
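The combined loss above can be sketched in a few lines of NumPy. The logit values, \(T=4\), and \(\alpha=0.7\) below are illustrative choices for the sketch, not values prescribed by [1]:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: larger T flattens the distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """(1 - alpha) * CE(y, p_S)  +  alpha * T^2 * KL(p_T(T) || p_S(T))."""
    p_s = softmax(student_logits)            # student at T=1 for the CE term
    ce = -np.log(p_s[label])
    p_t_T = softmax(teacher_logits, T)       # softened teacher targets
    p_s_T = softmax(student_logits, T)       # softened student predictions
    kl = np.sum(p_t_T * (np.log(p_t_T) - np.log(p_s_T)))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

If the student's logits exactly equal the teacher's, the KL term vanishes and only the cross-entropy term remains, which makes the decomposition easy to sanity-check.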
Variants: Common response-distillation variants include:
- Born-Again Networks (BAN) [3] iteratively re-train a model of the same architecture using the previous generation as teacher; successive generations consistently improve.
- Self-distillation uses shallower sub-networks within the same model as the teacher, requiring no separate training run.
- Label smoothing can be viewed as response distillation with a uniform teacher.
4. Feature-Based Distillation
Rather than only matching the final output, feature-based (or hint-based) distillation aligns intermediate hidden representations. Intermediate features capture hierarchical structure such as edges, textures, and object parts, much of which has been compressed away by the time the signal reaches the logit layer.
The challenge is that teacher and student usually have different widths. You cannot directly compute a loss between a \(\mathbb{R}^{C_T \times H \times W}\) teacher tensor and a \(\mathbb{R}^{C_S \times H \times W}\) student tensor when \(C_T \neq C_S\). This is solved by a lightweight projector.
The Projector / Adapter: A projector is a small module (often a single \(1{\times}1\) convolution or a two-layer MLP) that maps the student's feature tensor into the teacher's dimensionality so a direct comparison can be made:
$$\mathcal{L}_{\mathrm{feat}} = \left\| f_T - \varphi(f_S) \right\|_2^2$$

The projector \(\varphi\) is trained jointly with the student and discarded at inference time, adding zero cost to the deployed model. Some methods (e.g. FitNets [4]) place the projector on the student side, mapping student features into the teacher's space. Others use symmetric projectors or place them on the teacher side. The choice subtly affects what the student is forced to learn.
Miles and Mikolajczyk [15] provide a theoretical analysis showing that the projector implicitly encodes information about past training examples, enabling relational gradients for the student. They also show that representation normalisation is tightly coupled with projector training dynamics, and propose a soft maximum function to handle capacity gaps between teacher and student. Their analysis also suggests that a projector can still be beneficial even when feature dimensions already match, because it changes the optimisation geometry rather than merely fixing a width mismatch.
Why the projector helps: Following Miles and Mikolajczyk [15], it is useful to write the feature loss in matrix form. Let \(Z_s \in \mathbb{R}^{B \times d_s}\) and \(Z_t \in \mathbb{R}^{B \times d_t}\) be the student and teacher representations over a batch, and let \(W_p \in \mathbb{R}^{d_s \times d_t}\) be a bias-free linear projector. The basic squared loss is
$$D(Z_s, Z_t; W_p) = \frac{1}{2}\left\| Z_s W_p - Z_t \right\|_F^2.$$

Taking the gradient with respect to the projector gives a particularly revealing update rule:
$$\dot W_p = -\frac{\partial D}{\partial W_p} = -Z_s^\top Z_s W_p + Z_s^\top Z_t = C_{st} - C_s W_p,$$

where \(C_s = Z_s^\top Z_s\) is the student self-correlation matrix and \(C_{st} = Z_s^\top Z_t\) is the student-teacher cross-correlation matrix. This is the key point from [15]: the projector is not just a shape-matching layer. Its weights are driven by both the internal geometry of the student features and their cross-relationship to the teacher. In other words, the projector implicitly stores relational information that other KD methods often build explicitly through Gram matrices, kernels, or memory banks.
This also explains why a projector can help even when \(d_s = d_t\). If the student and teacher spaces were already perfectly aligned, the identity map would be enough. In practice they are not. The projector learns a better local geometry for the loss, so the student receives gradients in a coordinate system that is easier to match to the teacher. In the paper's interpretation, the projector is functioning as a compact, learnable encoder of cross-sample relational structure, not merely as a dimensionality bridge.
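The closed-form gradient \(\partial D / \partial W_p = C_s W_p - C_{st}\) is easy to verify numerically with a finite-difference check; all shapes here are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
B, d_s, d_t = 8, 6, 10
Z_s = rng.normal(size=(B, d_s))      # student batch representations
Z_t = rng.normal(size=(B, d_t))      # teacher batch representations
W_p = rng.normal(size=(d_s, d_t))    # bias-free linear projector

D = lambda W: 0.5 * np.sum((Z_s @ W - Z_t)**2)

# Analytic gradient: C_s W_p - C_st (the negative of the update direction above).
grad = Z_s.T @ Z_s @ W_p - Z_s.T @ Z_t

# Central finite difference on one entry confirms the closed form.
eps = 1e-6
E = np.zeros_like(W_p); E[0, 0] = eps
num = (D(W_p + E) - D(W_p - E)) / (2 * eps)
```

Both correlation matrices appear explicitly in `grad`, which is exactly the sense in which the projector "sees" batch-level structure.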
Cross-architecture distillation: Cross-architecture KD is usually harder than same-family KD because the student and teacher do not just differ in width or depth; they often encode different inductive biases. A CNN teacher and a transformer student, for example, organise information very differently. This is exactly the setting where an over-expressive projector can become a problem: it may learn the architecture bridge itself, while the student backbone learns much less.
VkD [19] makes this idea explicit. Their goal is to preserve the structural information in the student features while still aligning them to the teacher. They define an intra-batch kernel \(K_{ij} = k(Z^i_s, Z^j_s)\), and for kernels that can be expanded as \(k(Z^i_s, Z^j_s) = \sum_{n=0}^{\infty} a_n \langle Z^i_s, Z^j_s \rangle^n\), preserving the kernel reduces to preserving inner products. For a linear projection \(P\), that means enforcing \(\langle Z^i_s, Z^j_s \rangle = \langle Z^i_s P, Z^j_s P \rangle\), which in turn gives the orthogonality-style constraint \(P P^\top = I\) for the row-orthogonal case.
The practical consequence is important. If the singular values of \(P\) are all one, the projector is not free to squash, stretch, or collapse some directions just to make the loss smaller. More of the burden of matching the teacher is pushed back into the student backbone itself, which is why the paper describes the orthogonal projector as maximising the amount of knowledge distilled into the backbone. When \(d_s \neq d_t\), they implement this through a row-orthogonal projection on the Stiefel manifold rather than an unconstrained linear layer.
This is especially useful in cross-architecture transfer. VkD [19] reports strong results in harder CNN \(\leftrightarrow\) transformer settings, and argues that the gain comes partly from a softer transfer of inductive bias. In their analysis, a CNN teacher can pass translational equivariance to a transformer student more effectively when the projection preserves structure instead of inventing a new feature space of its own. They also use a simple pooled patch-token interface to connect transformer and CNN features, which keeps the bridge lightweight.
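The inner-product-preservation property is straightforward to demonstrate. The sketch below constructs one fixed row-orthogonal projector via a QR factorisation (VkD [19] instead optimises the projector on the Stiefel manifold during training); dimensions are toy values, with \(d_t \geq d_s\) so exact preservation is possible:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_s, d_t = 8, 6, 12             # d_t >= d_s allows exact row-orthogonality
Z_s = rng.normal(size=(B, d_s))    # student batch embeddings

# Build P with orthonormal rows: QR gives Q with orthonormal columns.
A = rng.normal(size=(d_t, d_s))
Q, _ = np.linalg.qr(A)             # Q: (d_t, d_s)
P = Q.T                            # P: (d_s, d_t), satisfies P P^T = I

# Pairwise inner products (hence any kernel expanded in them) are unchanged:
# <Z^i P, Z^j P> = Z^i P P^T (Z^j)^T = <Z^i, Z^j>.
G_before = Z_s @ Z_s.T
G_after = (Z_s @ P) @ (Z_s @ P).T
```

Because the Gram matrix is preserved exactly, the projector cannot "cheat" by reshaping the student's relational geometry; any improvement in the distillation loss has to come from the backbone.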
The same paper also notes that when the projector is updated with learning rate \(\alpha_p\) and weight decay \(\eta\),
$$W_p \leftarrow (1-\eta)W_p + \alpha_p \dot W_p,$$

so in the special case \(\eta = \alpha_p\), the projector behaves like a moving average of relational features. That makes it conceptually close to momentum-encoder style mechanisms used in contrastive methods, but without explicitly storing large external memory structures.
Why normalisation matters: Normalisation is often introduced as a simple way to keep gradients well-scaled, but [15] argues that it has a deeper role: it changes the fixed point that the projector converges toward. Setting the stationary condition \(\dot W_p = 0\) gives
$$C_{st} - C_s W_p = 0.$$

If the student features are whitened, or close enough to decorrelated that \(C_s \approx I\), then the fixed point becomes especially simple:
$$C_s = I \quad \Longrightarrow \quad W_p = C_{st}.$$

In that regime, the projector directly captures the cross-relationship between student and teacher features. This is why representation normalisation and projector training are tightly coupled: good normalisation makes it easier for the projector to preserve informative directions instead of collapsing them.
The empirical story in [15] matches this derivation. When they track the singular values of the projector during training, the better-performing normalisation schemes shrink fewer singular values toward zero. Shrinking a singular value to zero means collapsing some direction of the student representation, which throws away information before the distillation loss can exploit it. In their experiments, batch normalisation was the most consistently effective choice, while weaker or no normalisation led to more projector collapse and worse accuracy.
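The fixed-point claim can be checked numerically. The sketch below whitens a batch of toy student features with a Cholesky factor (one of several valid whitening choices) and confirms that the general stationary point \(W_p = C_s^{-1} C_{st}\) collapses to \(W_p = C_{st}\):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_s, d_t = 64, 5, 7
Z_s = rng.normal(size=(B, d_s))
Z_t = rng.normal(size=(B, d_t))

# Whiten the student features so that C_s = Z_w^T Z_w = I.
C_s = Z_s.T @ Z_s
L = np.linalg.cholesky(C_s)        # C_s = L L^T
Z_w = Z_s @ np.linalg.inv(L).T     # Z_w^T Z_w = L^{-1} C_s L^{-T} = I

# Stationary condition C_st - C_s W_p = 0 gives W_p = C_s^{-1} C_st,
# which reduces to W_p = C_st once C_s = I.
C_st = Z_w.T @ Z_t
W_star = np.linalg.solve(Z_w.T @ Z_w, C_st)
```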
A useful extension from VkD [19] is that the right normalisation depends on the task. For discriminative problems, simple standardisation often improves optimisation by making the distillation loss less sensitive to nuisance variation. For generative problems, whitening can be a better inductive bias. The reason is not just numerical conditioning: if the teacher features satisfy \(Z_t^\top Z_t = I\), then the L2 feature loss can be rewritten to yield a lower bound of the form \(\mathcal{L}_{\mathrm{distill}} \geq \mathrm{const} - \lambda \sum_{i \neq j} C_{j,i}^2\), where \(C\) measures cross-feature distances between student and teacher coordinates. Minimising the loss therefore also pushes up the off-diagonal cross-feature terms, which implicitly encourages decorrelated and therefore more diverse features.
This is why whitening makes particular sense for generative KD. Instead of adding a separate diversity loss that may fight the distillation objective, whitening bakes that pressure into the representation geometry itself. In the VkD experiments, this was especially valuable in data-limited image generation, where whitening helped avoid mode collapse and improved both realism and diversity.
Common projector choices:
- 1×1 conv preserves spatial resolution with minimal compute and no non-linearity.
- BN + ReLU + 1×1 conv adds capacity and handles distribution shift.
- MLP (flattened) is used when distilling global features such as ViT [CLS] tokens.
- Even when teacher and student have the same width, a projector can still help in practice by improving optimisation and preserving useful relational structure [15].
Figure 2. Simplified feature distillation pipeline inspired by the projector recipe [15]. The student feeds a projector for feature distillation and a student head for the task loss, while the teacher remains frozen.
FitNets and Attention Transfer: FitNets [4] was among the first to show that aligning mid-network features outperforms pure response distillation, especially when the student is much thinner than the teacher. They introduce a two-stage training procedure: first pre-train the student to mimic a chosen hint layer via a projector, then fine-tune end-to-end with both task and distillation losses.
Attention Transfer [5] aligns spatial attention maps, specifically the sum of squared activations across channels, normalised to unit norm:
$$A(F) = \sum_c |F_c|^2 \;\in\; \mathbb{R}^{H \times W}$$

$$\mathcal{L}_{\mathrm{AT}} = \sum_l \left\| \frac{A(F_T^l)}{\|A(F_T^l)\|} - \frac{A(F_S^l)}{\|A(F_S^l)\|} \right\|_2^2$$

This is cheaper than full feature alignment and often more transferable across architectures because attention maps are more architecture-agnostic than raw feature tensors.
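A sketch of the attention-transfer loss for one layer pair. Note that the teacher and student can have different channel counts, because summing over channels yields an \((H, W)\) map either way; the tensor shapes below are illustrative:

```python
import numpy as np

def attention_map(F):
    # F: (C, H, W) feature tensor -> (H, W) map of channel-summed squared activations.
    return np.sum(F**2, axis=0)

def at_loss(F_t, F_s):
    # Flatten and unit-normalise each map, then compare with squared L2 distance.
    A_t = attention_map(F_t).ravel()
    A_s = attention_map(F_s).ravel()
    A_t = A_t / np.linalg.norm(A_t)
    A_s = A_s / np.linalg.norm(A_s)
    return np.sum((A_t - A_s)**2)

rng = np.random.default_rng(0)
F_t = rng.normal(size=(64, 7, 7))   # teacher block output (toy shapes)
F_s = rng.normal(size=(16, 7, 7))   # thinner student block, same spatial size
```

Identical inputs give exactly zero loss, and the channel mismatch between the two blocks poses no problem, which is the practical appeal of the method.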
5. Relation-Based Distillation
Relation-based methods shift focus from individual feature vectors to the structure they form. Even if the student cannot replicate the teacher's exact feature values, it can learn to reproduce the pairwise or higher-order relationships among samples in the embedding space.
Figure 3. Relation-based distillation matches the geometry of the teacher and student embedding spaces. The loss acts on pairwise relationships rather than on individual coordinates.
Prominent Methods: Some of the most widely used relation-based methods are:
- RKD [6] matches pairwise distance and angle relationships between example embeddings, transferring relational structure even across very different architectures.
- CRD [7] frames distillation as a contrastive learning problem: teacher and student embeddings of the same sample should be similar, while different samples should be far apart.
- NST [8] matches the distribution of neuron activations using maximum mean discrepancy (MMD) rather than direct alignment.
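To make the relational idea concrete, here is a sketch of the distance-wise component of RKD [6]: match mean-normalised pairwise distance matrices across the batch. RKD uses a Huber loss; the MSE below is a simplification for the sketch, and the embedding sizes are toy values:

```python
import numpy as np

def pairwise_dists(Z):
    # Euclidean distance matrix over a batch of embeddings.
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)
    return np.sqrt(d2)

def rkd_distance_loss(Z_t, Z_s):
    D_t, D_s = pairwise_dists(Z_t), pairwise_dists(Z_s)
    mask = ~np.eye(len(D_t), dtype=bool)        # ignore the zero diagonal
    D_t = D_t / D_t[mask].mean()                # normalise by mean distance
    D_s = D_s / D_s[mask].mean()
    return np.mean((D_t[mask] - D_s[mask])**2)  # MSE stand-in for RKD's Huber

rng = np.random.default_rng(0)
Z_t = rng.normal(size=(16, 128))                # teacher embeddings
Z_s = 3.0 * Z_t                                 # rescaled copy: same geometry
```

A student embedding that is a rescaled copy of the teacher's scores essentially zero loss, because only the relational structure matters, not absolute coordinates or scale.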
6. Loss Functions
Most distillation recipes combine multiple objectives. The total training loss is typically a weighted sum:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} \;+\; \lambda_{\mathrm{resp}}\,\mathcal{L}_{\mathrm{response}} \;+\; \lambda_{\mathrm{feat}}\,\mathcal{L}_{\mathrm{feature}} \;+\; \lambda_{\mathrm{rel}}\,\mathcal{L}_{\mathrm{relation}}$$

| Loss | Formula | Used for |
|---|---|---|
| KL divergence | \(\mathrm{KL}(p_T \| p_S)\) | Response-based; soft label matching |
| L2 / MSE | \(\|f_T - \varphi(f_S)\|_2^2\) | Feature-based; direct channel alignment |
| Cosine loss | \(1 - \cos(f_T,\, \varphi(f_S))\) | Feature-based; direction-only alignment |
| LogSum / soft maximum | \(\log \sum_i |Z_s W_p - Z_t|_i^\alpha\) | Feature-based; more forgiving under large capacity gaps |
| Contrastive (NCE) | \(-\log \sigma(f_T \cdot f_S / \tau)\) | Relation-based (CRD) |
| Gram / MMD | \(\|G_T - G_S\|_F\) | Relation-based; distribution matching |
L2 / MSE: This is the cleanest feature-matching objective and is the starting point for the analysis in [15]. It penalises absolute coordinate-wise mismatch between \(Z_s W_p\) and \(Z_t\). Because it works directly in Euclidean coordinates, it is very sensitive to representation scale and covariance. That is exactly why the projector and the normalisation scheme matter so much: with poor scaling, some channels dominate the loss while others get effectively ignored.
Cosine-style alignment: Cosine losses focus on direction rather than magnitude. This can be helpful when teacher and student norms are inconsistent or when only angular similarity matters. The trade-off is that cosine losses deliberately suppress norm information, so they can be too weak when magnitude carries useful confidence or saliency cues. In practice, they often stabilise training but are less directly tied to the projector-dynamics analysis above than the squared loss.
Why normalisation changes the loss: A feature loss is never acting on "raw" features in isolation. It acts on whatever geometry the combination of \(W_p\) and the chosen normalisation creates. With no normalisation, the loss tends to chase high-variance directions. With stronger decorrelation, the projector can encode cross-feature relationships more faithfully. This is why [15] treats the projector and normalisation as a coupled system rather than two independent design choices. VkD [19] pushes this further by arguing that standardisation is often the better default for discriminative tasks, whereas whitening is especially appropriate for generative tasks because it softly encourages cross-feature diversity.
LogSum / soft maximum: The projector paper argues that plain L2 can become too rigid when the teacher-student capacity gap is large. If the student cannot realistically match every feature coordinate, forcing all errors to be treated equally can hurt the downstream task. Their proposed alternative is a soft maximum:
$$D(Z_s, Z_t; W_p) = \log \sum_i |Z_s W_p - Z_t|_i^\alpha,$$

where \(\alpha\) controls how sharply the loss focuses on larger mismatches. Intuitively, this behaves like a smoother "hard-example" feature loss: it downweights coordinates that are already close and puts more pressure on poorly aligned ones. In [15] this consistently helped in larger capacity-gap settings, with \(\alpha\) around 4-5 working well across several architecture pairs.
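A sketch of the soft-maximum loss and its focusing behaviour. The gradient of \(\log \sum_i e_i^\alpha\) with respect to coordinate error \(e_i\) is proportional to \(e_i^{\alpha-1} / \sum_j e_j^\alpha\), so the worst-aligned coordinate receives the largest weight; all shapes here are toy values:

```python
import numpy as np

def logsum_loss(Z_s, Z_t, W_p, alpha=4.0):
    """Soft-maximum feature loss: log sum_i |Z_s W_p - Z_t|_i^alpha."""
    err = np.abs(Z_s @ W_p - Z_t).ravel()
    return np.log(np.sum(err**alpha) + 1e-12)   # epsilon guards log(0)

rng = np.random.default_rng(0)
B, d_s, d_t = 8, 6, 10
Z_s = rng.normal(size=(B, d_s))
Z_t = rng.normal(size=(B, d_t))
W_p = rng.normal(size=(d_s, d_t))

# Per-coordinate gradient weights for alpha = 4: err^3 / sum(err^4).
err = np.abs(Z_s @ W_p - Z_t).ravel()
w = err**3 / np.sum(err**4)
```

Unlike plain L2, which treats every coordinate error equally (after squaring), the soft maximum lets the student quietly give up on coordinates it cannot match.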
Relation-based losses: Contrastive, Gram-matrix, and MMD-style objectives try to preserve geometry at the batch level rather than match features one-by-one. The projector paper's main argument is that some of this relational information can already be absorbed implicitly by a learned projector, which helps explain why a very simple linear projector plus good normalisation can rival much more elaborate loss constructions.
7. Method Comparison
| Paradigm | What is transferred | Architecture constraint | Key hyperparameter |
|---|---|---|---|
| Response-based | Soft output logits | None; only requires same number of classes | Temperature \(T\) |
| Feature-based | Hidden layer tensors | Requires alignment via a projector | Layer selection, projector design |
| Relation-based | Pairwise / batch structure | None; works across any architectures | Batch size, contrastive temperature \(\tau\) |
8. Language vs. Vision
Although the high-level taxonomy is shared, the practical form of distillation differs a lot between vision and language. Vision usually distils logits or spatial features. Language models often distil distributions over large vocabularies, generated sequences, or explicit reasoning traces.
Chain-of-thought distillation: In language reasoning, the teacher signal increasingly includes intermediate rationales rather than only final answers. Sequence-level KD [12] already moved in this direction by training on teacher-generated outputs, and later work such as Distilling Step-by-Step [16] and SCOTT [17] showed that smaller students can benefit directly from teacher reasoning traces. This kind of free-form rationale transfer is much less common in vision, where the supervision is usually tied to logits or hidden activations.
Feature distillation: Feature distillation is less common in language than in vision. In vision, intermediate maps have a natural spatial structure and clear layer-wise correspondence, so projector-based matching is often straightforward. In language models, hidden states are sequence-valued, lengths vary across examples, and token positions or attention heads do not line up as cleanly. Methods such as TinyBERT [11] and MiniLM [10] do distil internal representations, but they typically transfer attention distributions or selected hidden-state statistics rather than raw feature maps in the vision sense.
Tokenizer mismatch: When teacher and student use different tokenisers, token-level response distillation is not directly compatible. This is one of the most important practical issues in LLM distillation. Common workarounds include learning a vocabulary projection, aligning at the sentence-embedding level, or distilling from decoded reasoning traces instead of matching token probabilities one-to-one.
Speculative decoding: Distillation is not the only way to use a stronger model. Speculative decoding [18] keeps a larger verifier model in the loop at test time while a smaller draft model proposes tokens. This is a different trade-off from classic distillation: instead of compressing the teacher fully into the student during training, it leverages the stronger model directly at inference time to accelerate decoding while preserving the target model's output distribution.
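The core accept/reject rule of speculative decoding [18] can be demonstrated on toy distributions: draft a token from the small model's distribution \(p\), accept it with probability \(\min(1, q(x)/p(x))\) under the verifier's distribution \(q\), and on rejection resample from the normalised residual \(\max(0, q - p)\). The resulting samples are distributed exactly according to \(q\); the distributions below are made up for the sketch:

```python
import numpy as np

def speculative_sample(q, p, rng):
    """Draw one token whose law is exactly q, using draft distribution p."""
    x = rng.choice(len(p), p=p)                  # draft model proposes a token
    if rng.random() < min(1.0, q[x] / p[x]):     # verifier accepts w.p. min(1, q/p)
        return x
    resid = np.maximum(q - p, 0.0)               # on rejection, resample from the
    return rng.choice(len(q), p=resid / resid.sum())  # normalised residual

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                    # toy draft distribution
q = np.array([0.2, 0.5, 0.3])                    # toy target distribution
samples = [speculative_sample(q, p, rng) for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
```

When the draft agrees with the target often, most tokens are accepted cheaply; the verifier only "pays" for the cases where the two disagree, which is the source of the speedup.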
Distillation Tokens (DeiT): DeiT (Data-efficient Image Transformers) [20] introduced a dedicated distillation token to embed teacher supervision directly into the transformer's attention mechanism. Rather than adding an external loss on top of the classification objective, DeiT appends a learnable [DIST] token to the standard sequence of patch embeddings and the class token [CLS].
The [DIST] token is processed through all transformer layers alongside every patch and the class token, interacting with them via self-attention at every depth. At the output, its representation feeds a separate distillation head trained to match the teacher's prediction, while [CLS] feeds the standard classification head.
The teacher is typically a strong CNN (e.g. RegNet), providing either hard labels (argmax) or soft probability distributions. Because [DIST] can attend differently to all patches, it develops representations that reflect the teacher's inductive biases (e.g. CNN-style local spatial features) rather than simply mimicking [CLS].
At inference the two heads' predictions are averaged, which gives the best accuracy. DeiT is a notable hybrid: the loss is response-based (matching teacher outputs), but the mechanism operates at the internal representation level by routing teacher signals through the full depth of the network's attention.
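The two-head logic can be sketched without the transformer itself. The logits below are hypothetical head outputs, and the sketch shows only the hard-label distillation variant (CE on each head against its own target) plus the inference-time averaging:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Hypothetical outputs of the two DeiT heads for one image (toy logits).
cls_logits = np.array([2.0, 0.5, 0.1])    # from the [CLS] representation
dist_logits = np.array([1.5, 1.0, 0.2])   # from the [DIST] representation
y_true, y_teacher = 0, 0                  # ground-truth label, teacher argmax

# Hard-distillation loss: each head gets cross-entropy against its own target.
ce = lambda logits, y: -np.log(softmax(logits)[y])
loss = 0.5 * ce(cls_logits, y_true) + 0.5 * ce(dist_logits, y_teacher)

# At inference, DeiT averages the two heads' softmax predictions.
pred = softmax(cls_logits) + softmax(dist_logits)
```

The interesting part happens upstream of this sketch: because [DIST] attends to all patches independently of [CLS], the two heads can end up with usefully different views of the same image.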
Figure 4. DeiT's distillation token. A dedicated [DIST] token attends to all patch tokens and [CLS] at every layer. Its output drives a distillation head aligned to the frozen CNN teacher's prediction; the [CLS] head handles the standard task loss.
Reasoning Distillation (DeepSeek-R1): DeepSeek-R1 [21] demonstrates one of the most striking recent applications of distillation: the structured reasoning behaviour of a very large model can be transferred to models orders of magnitude smaller via a remarkably simple recipe.
The R1 training pipeline uses group relative policy optimisation (GRPO) to train a 671B-parameter mixture-of-experts model to produce long chain-of-thought reasoning traces: sequences that interleave exploration, self-correction, and answer consolidation. These traces are then used as supervision: a smaller student (7B–70B parameters) is fine-tuned on teacher-generated traces with standard cross-entropy, requiring no reinforcement learning of its own.
This is qualitatively different from response distillation (matching logits) or feature distillation (matching hidden states). What is transferred is not a probability distribution but the teacher's problem-solving process. Distilled models significantly outperform same-size counterparts trained without reasoning traces, suggesting the long structured sequences carry an implicit curriculum: early steps provide simpler sub-problem supervision, later steps require integrating prior reasoning.
DeepSeek-R1 also highlights a practical shift in where distillation bottlenecks arise for LLMs. Unlike vision distillation, the challenge is not architecture mismatch or projector design; it is generating high-quality teacher traces at scale and ensuring student context windows are large enough to absorb them.
9. Future Directions
Knowledge distillation is an active area and several threads are shaping its frontier.
Process and reward distillation. Systems like DeepSeek-R1 [21] signal a shift from output-level distillation toward process-level supervision: matching the teacher's step-by-step reasoning rather than its final answer. A related thread distils verifier or reward models from RLHF-trained teachers, useful when the reward signal is expensive to compute at inference but cheap to generate offline.
Test-time compute distillation. Large models can be given additional test-time compute, generating many candidate solutions and selecting the best, giving a stronger signal than a single forward pass. Distilling this capability into a student that approximates expensive search in a single step collapses the gap between slow-thinking and fast-inference models, and is closely related to best-of-\(N\) sampling, diffusion stitching [13], and speculative decoding [18].
Diffusion model distillation. Score-based and diffusion generative models require many denoising steps at inference. Progressive distillation and consistency models compress a multi-step teacher into a student that approximates the same mapping in far fewer steps, often just one or two. This is a direct analogue of response distillation applied to generative processes: instead of matching class probabilities, the student matches the teacher's denoised output at each step.
Multimodal distillation. Cross-modal knowledge transfer (e.g. distilling a vision-language model into a vision-only or text-only student) is an emerging challenge. The alignment problem is especially pronounced because modalities have fundamentally different structure and there is no obvious token-level correspondence between image patches and text tokens. Methods that distil shared embedding spaces or use captioning as an intermediate supervision signal are early steps in this direction.
Architecture-aware distillation. As models become more heterogeneous, mixing attention layers, state-space models, and mixture-of-experts blocks, the question of how to transfer knowledge across radically different inductive biases is increasingly important. VkD [19] and the orthogonal projector are one step in this direction, but robust cross-paradigm distillation (e.g. from a transformer to an SSM) and distillation for sparse expert models remain largely open research problems.
10. References
- [1] G. Hinton, O. Vinyals, J. Dean. Distilling the Knowledge in a Neural Network. NIPS Workshop, 2015. arXiv:1503.02531
- [2] J. Gou et al. Knowledge Distillation: A Survey. IJCV, 2021. arXiv:2006.05525
- [3] T. Furlanello et al. Born Again Neural Networks. ICML, 2018. arXiv:1805.04770
- [4] A. Romero et al. FitNets: Hints for Thin Deep Nets. ICLR, 2015. arXiv:1412.6550
- [5] S. Zagoruyko, N. Komodakis. Paying More Attention to Attention. ICLR, 2017. arXiv:1612.03928
- [6] W. Park et al. Relational Knowledge Distillation. CVPR, 2019. arXiv:1904.05068
- [7] Y. Tian, D. Krishnan, P. Isola. Contrastive Representation Distillation. ICLR, 2020. arXiv:1910.10699
- [8] Z. Huang, N. Wang. Like What You Like: Knowledge Distill via Neuron Selectivity Transfer. 2017. arXiv:1707.01219
- [9] V. Sanh et al. DistilBERT, a distilled version of BERT. 2019. arXiv:1910.01108
- [10] W. Wang et al. MiniLM: Deep Self-Attention Distillation. NeurIPS, 2020. arXiv:2002.10957
- [11] X. Jiao et al. TinyBERT: Distilling BERT for Natural Language Understanding. EMNLP, 2020. arXiv:1909.10351
- [12] Y. Kim, A. Rush. Sequence-Level Knowledge Distillation. EMNLP, 2016. arXiv:1606.07947
- [13] R. Miles et al. Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching. 2026. arXiv:2602.22871
- [14] B. Zhao, K. Cui. Decoupled Knowledge Distillation. CVPR, 2022. arXiv:2203.08679
- [15] R. Miles, K. Mikolajczyk. Understanding the Role of the Projector in Knowledge Distillation. AAAI, 2024. arXiv:2303.11098
- [16] C.-Y. Hsieh et al. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. Findings of ACL, 2023. arXiv:2305.02301
- [17] P. Wang et al. SCOTT: Self-Consistent Chain-of-Thought Distillation. ACL, 2023. ACL Anthology
- [18] Y. Leviathan, M. Kalman, Y. Matias. Fast Inference from Transformers via Speculative Decoding. ICML, 2023. arXiv:2211.17192
- [19] R. Miles, I. Elezi, J. Deng. VkD: Improving Knowledge Distillation using Orthogonal Projections. CVPR, 2024. arXiv:2403.06213
- [20] H. Touvron et al. Training data-efficient image transformers & distillation through attention. ICML, 2021. arXiv:2012.12877
- [21] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025. arXiv:2501.12948