Tutorial · Knowledge Distillation · Model Compression · March 2026 · Roy Miles

Knowledge Distillation: A Visual Guide

Knowledge distillation transfers the rich representations learned by a large teacher network into a compact student model, enabling deployment-friendly models without starting from scratch. This guide walks through the three major distillation paradigms with interactive diagrams.

Why? I wanted to summarise some of my work on knowledge distillation and show how those ideas connect to the broader distillation literature.

1. Overview and the Teacher-Student Framework

The core idea, introduced by Hinton et al. [1], is elegantly simple: rather than training a small model purely on hard one-hot labels, we also train it to mimic the soft outputs of a large, pre-trained teacher. Soft outputs carry far more information than a binary correct/incorrect signal. A teacher that assigns 0.6 probability to "cat" and 0.35 to "lynx" is revealing something about visual similarity that a ground-truth label never could.

Key insight. Knowledge is not just what the teacher predicts, but how confident it is across all classes, which intermediate features it builds, and how those features relate to each other.

Modern surveys [2] organise distillation methods into three families based on what knowledge is transferred:

- Response-based: the student matches the teacher's output logits or soft probabilities.
- Feature-based: the student matches intermediate hidden representations, typically through a projector.
- Relation-based: the student matches the pairwise or batch-level structure of the teacher's embedding space.

All three can be combined, and each adds a different inductive bias to the student. The diagram below gives a high-level picture.

2. Interactive Overview Diagram


Figure 1. The three families of knowledge distillation: response-based, feature-based, and relation-based transfer from a large pre-trained teacher to a compact student.

3. Response-Based Distillation

Response-based distillation is the original formulation [1]. The teacher's output logits are used as soft targets via a temperature \(T\) that controls how much the distribution is smoothed:

$$p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

At \(T=1\) this is the standard softmax. Larger \(T\) produces a softer distribution, exposing the relative ordering of all classes. The training loss combines standard cross-entropy with a KL divergence distillation term:

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\mathrm{CE}}(y,\, p_S) \;+\; \alpha\, T^2 \cdot \mathrm{KL}\!\left(p_T(T) \,\|\, p_S(T)\right)$$

The \(T^2\) factor compensates for the smaller gradient magnitudes at high temperatures.
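The two equations above can be made concrete in a few lines of plain Python. This is a minimal sketch of the Hinton-style loss (function names are illustrative, not from any library); a real implementation would operate on batched tensors.

```python
import math

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: larger T yields a softer distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """(1 - alpha) * CE(y, p_S) + alpha * T^2 * KL(p_T(T) || p_S(T))."""
    # Hard-label term uses the student's T=1 distribution.
    p_s = softmax_T(student_logits, T=1.0)
    ce = -math.log(p_s[label])
    # Soft-label term compares temperature-smoothed distributions.
    p_t_soft = softmax_T(teacher_logits, T)
    p_s_soft = softmax_T(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t_soft, p_s_soft))
    # T^2 compensates for the smaller soft-target gradients at high T.
    return (1 - alpha) * ce + alpha * T * T * kl
```

Raising `T` visibly flattens the distribution: the top class keeps its rank but gives up probability mass to the runner-up classes, which is exactly the "dark knowledge" the student learns from.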

Why not just use one-hot labels? Soft labels encode dark knowledge. A teacher assigning non-trivial probability to visually similar classes is communicating semantic structure for free, acting as implicit data augmentation.

Variants: A prominent response-distillation variant is decoupled knowledge distillation (DKD) [14], which splits the KL term into target-class and non-target-class components so the two can be weighted separately.

4. Feature-Based Distillation

Rather than only matching the final output, feature-based (or hint-based) distillation aligns intermediate hidden representations. Intermediate features capture hierarchical structure such as edges, textures, and object parts, which is compressed away by the time it reaches the logit layer.

The challenge is that teacher and student usually have different widths. You cannot directly compute a loss between a \(\mathbb{R}^{C_T \times H \times W}\) teacher tensor and a \(\mathbb{R}^{C_S \times H \times W}\) student tensor when \(C_T \neq C_S\). This is solved by a lightweight projector.

The Projector / Adapter: A projector is a small module (often a single \(1{\times}1\) convolution or a two-layer MLP) that maps the student's feature tensor into the teacher's dimensionality so a direct comparison can be made:

$$\mathcal{L}_{\mathrm{feat}} = \left\| f_T - \varphi(f_S) \right\|_2^2$$

The projector \(\varphi\) is trained jointly with the student and discarded at inference time, adding zero cost to the deployed model. Some methods (e.g. FitNets [4]) place the projector on the student side, mapping student features into the teacher's space. Others use symmetric projectors or place them on the teacher side. The choice subtly affects what the student is forced to learn.
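As a concrete sketch, the projected feature loss can be written in plain Python. The matrix helper and names here are illustrative; in practice \(\varphi\) would be a \(1{\times}1\) convolution or MLP applied to feature tensors.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def feature_loss(f_t, f_s, W_p):
    """|| f_T - phi(f_S) ||_2^2 with phi a bias-free linear projector W_p."""
    projected = matmul(f_s, W_p)   # student features mapped to teacher width
    return sum((t - p) ** 2
               for row_t, row_p in zip(f_t, projected)
               for t, p in zip(row_t, row_p))

# A 2-wide student projected into a 3-wide teacher space.
W_p = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
f_s = [[1.0, 2.0]]
f_t = [[1.0, 2.0, 1.0]]
# Only the third teacher channel is unmatched, so the loss is 1.0.
```

At deployment, `W_p` is simply dropped: the student backbone alone produces the features used for the task.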

Miles and Mikolajczyk [15] provide a theoretical analysis showing that the projector implicitly encodes information about past training examples, enabling relational gradients for the student. They also show that representation normalisation is tightly coupled with projector training dynamics, and propose a soft maximum function to handle capacity gaps between teacher and student. Their analysis also suggests that a projector can still be beneficial even when feature dimensions already match, because it changes the optimisation geometry rather than merely fixing a width mismatch.

Why the projector helps: Following Miles and Mikolajczyk [15], it is useful to write the feature loss in matrix form. Let \(Z_s \in \mathbb{R}^{B \times d_s}\) and \(Z_t \in \mathbb{R}^{B \times d_t}\) be the student and teacher representations over a batch, and let \(W_p \in \mathbb{R}^{d_s \times d_t}\) be a bias-free linear projector. The basic squared loss is

$$D(Z_s, Z_t; W_p) = \frac{1}{2}\left\| Z_s W_p - Z_t \right\|_F^2.$$

Taking the gradient with respect to the projector gives a particularly revealing update rule:

$$\dot W_p = -\frac{\partial D}{\partial W_p} = -Z_s^\top Z_s W_p + Z_s^\top Z_t = C_{st} - C_s W_p,$$

where \(C_s = Z_s^\top Z_s\) is the student self-correlation matrix and \(C_{st} = Z_s^\top Z_t\) is the student-teacher cross-correlation matrix. This is the key point from [15]: the projector is not just a shape-matching layer. Its weights are driven by both the internal geometry of the student features and their cross-relationship to the teacher. In other words, the projector implicitly stores relational information that other KD methods often build explicitly through Gram matrices, kernels, or memory banks.
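The update rule \(\dot W_p = C_{st} - C_s W_p\) can be verified numerically. The sketch below (helper names are mine, not from [15]) compares the analytic direction against a finite-difference gradient of \(D\) on small matrices:

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def loss_D(Zs, Zt, W):
    """D = 1/2 * || Zs W - Zt ||_F^2."""
    P = matmul(Zs, W)
    return 0.5 * sum((p - t) ** 2
                     for rp, rt in zip(P, Zt) for p, t in zip(rp, rt))

def update_direction(Zs, Zt, W):
    """dot(W_p) = C_st - C_s W_p, with C_s = Zs^T Zs and C_st = Zs^T Zt."""
    Cs = matmul(transpose(Zs), Zs)
    Cst = matmul(transpose(Zs), Zt)
    CsW = matmul(Cs, W)
    return [[c1 - c2 for c1, c2 in zip(r1, r2)] for r1, r2 in zip(Cst, CsW)]

# Finite-difference check on one entry: -dD/dW_00 should match the analytic value.
Zs = [[1.0, 2.0], [0.5, -1.0]]
Zt = [[0.3, 1.1], [2.0, 0.4]]
W = [[0.2, -0.1], [0.4, 0.3]]
G = update_direction(Zs, Zt, W)
eps = 1e-6
W_pert = [row[:] for row in W]
W_pert[0][0] += eps
fd = -(loss_D(Zs, Zt, W_pert) - loss_D(Zs, Zt, W)) / eps
assert abs(fd - G[0][0]) < 1e-3
```

Note that the update depends on `Zs` only through the correlation matrices, which is exactly the sense in which the projector accumulates relational statistics of the student features.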

This also explains why a projector can help even when \(d_s = d_t\). If the student and teacher spaces were already perfectly aligned, the identity map would be enough. In practice they are not. The projector learns a better local geometry for the loss, so the student receives gradients in a coordinate system that is easier to match to the teacher. In the paper's interpretation, the projector is functioning as a compact, learnable encoder of cross-sample relational structure, not merely as a dimensionality bridge.

Cross-architecture distillation: Cross-architecture KD is usually harder than same-family KD because the student and teacher do not just differ in width or depth, they often encode different inductive biases. A CNN teacher and a transformer student, for example, organise information very differently. This is exactly the setting where an over-expressive projector can become a problem: it may learn the architecture bridge itself, while the student backbone learns much less.

VkD [19] makes this idea explicit. Their goal is to preserve the structural information in the student features while still aligning them to the teacher. They define an intra-batch kernel \(K_{ij} = k(Z^i_s, Z^j_s)\), and for kernels that can be expanded as \(k(Z^i_s, Z^j_s) = \sum_{n=0}^{\infty} a_n \langle Z^i_s, Z^j_s \rangle^n\), preserving the kernel reduces to preserving inner products. For a linear projection \(P\), that means enforcing \(\langle Z^i_s, Z^j_s \rangle = \langle Z^i_s P, Z^j_s P \rangle\), which in turn gives the orthogonality-style constraint \(P P^\top = I\) for the row-orthogonal case.

The practical consequence is important. If the singular values of \(P\) are all one, the projector is not free to squash, stretch, or collapse some directions just to make the loss smaller. More of the burden of matching the teacher is pushed back into the student backbone itself, which is why the paper describes the orthogonal projector as maximising the amount of knowledge distilled into the backbone. When \(d_s \neq d_t\), they implement this through a row-orthogonal projection on the Stiefel manifold rather than an unconstrained linear layer.
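A small hand-rolled check (this is a sketch of the constraint, not the VkD implementation) makes the row-orthogonality property concrete: when the rows of \(P\) are orthonormal, inner products, and hence any kernel built from them, survive the projection unchanged.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(z, P):
    """Map z in R^{d_s} to z P in R^{d_t}; P is given as d_s rows of length d_t."""
    return [sum(z[i] * P[i][j] for i in range(len(z)))
            for j in range(len(P[0]))]

s = 1.0 / math.sqrt(2.0)
P = [[s,  s, 0.0],
     [s, -s, 0.0]]   # rows are orthonormal, so P P^T = I

# Inner products are preserved exactly under the projection.
z1, z2 = [1.0, 2.0], [-0.5, 3.0]
assert abs(dot(project(z1, P), project(z2, P)) - dot(z1, z2)) < 1e-9
```

Because all singular values of such a `P` are one, no direction of the student representation can be collapsed or inflated by the projector; any remaining mismatch with the teacher must be fixed by the backbone itself.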

This is especially useful in cross-architecture transfer. VkD [19] reports strong results in harder CNN \(\leftrightarrow\) transformer settings, and argues that the gain comes partly from a softer transfer of inductive bias. In their analysis, a CNN teacher can pass translational equivariance to a transformer student more effectively when the projection preserves structure instead of inventing a new feature space of its own. They also use a simple pooled patch-token interface to connect transformer and CNN features, which keeps the bridge lightweight.

The same paper also notes that when the projector is updated with learning rate \(\alpha_p\) and weight decay \(\eta\),

$$W_p \leftarrow (1-\eta)W_p + \alpha_p \dot W_p,$$

so in the special case \(\eta = \alpha_p\), the projector behaves like a moving average of relational features. That makes it conceptually close to momentum-encoder style mechanisms used in contrastive methods, but without explicitly storing large external memory structures.

Why normalisation matters: Normalisation is often introduced as a simple way to keep gradients well-scaled, but [15] argues that it has a deeper role: it changes the fixed point that the projector converges toward. Setting the stationary condition \(\dot W_p = 0\) gives

$$C_{st} - C_s W_p = 0.$$

If the student features are whitened, or close enough to decorrelated that \(C_s \approx I\), then the fixed point becomes especially simple:

$$C_s = I \quad \Longrightarrow \quad W_p = C_{st}.$$

In that regime, the projector directly captures the cross-relationship between student and teacher features. This is why representation normalisation and projector training are tightly coupled: good normalisation makes it easier for the projector to preserve informative directions instead of collapsing them.
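The fixed-point claim is easy to verify on a toy example. Below, the student features are perfectly whitened (\(Z_s^\top Z_s = I\)), so \(W_p = C_{st}\) should make the residual \(C_{st} - C_s W_p\) vanish (helper names are illustrative):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def stationarity_residual(Zs, Zt, W):
    """Largest entry of C_st - C_s W; zero exactly at the projector fixed point."""
    Cs = matmul(transpose(Zs), Zs)
    Cst = matmul(transpose(Zs), Zt)
    CsW = matmul(Cs, W)
    return max(abs(c1 - c2)
               for r1, r2 in zip(Cst, CsW) for c1, c2 in zip(r1, r2))

# Whitened student features: Zs^T Zs = I, so the fixed point is W_p = C_st.
Zs = [[1.0, 0.0], [0.0, 1.0]]
Zt = [[0.7, -0.2], [0.1, 0.9]]
Cst = matmul(transpose(Zs), Zt)      # here C_st = Zt
assert stationarity_residual(Zs, Zt, Cst) < 1e-12
```

Without whitening (\(C_s \neq I\)), the fixed point is instead \(W_p = C_s^{-1} C_{st}\), so the projector's converged weights mix the cross-correlation with the inverse student geometry rather than exposing it directly.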

The empirical story in [15] matches this derivation. When they track the singular values of the projector during training, the better-performing normalisation schemes shrink fewer singular values toward zero. Shrinking a singular value to zero means collapsing some direction of the student representation, which throws away information before the distillation loss can exploit it. In their experiments, batch normalisation was the most consistently effective choice, while weaker or no normalisation led to more projector collapse and worse accuracy.

A useful extension from VkD [19] is that the right normalisation depends on the task. For discriminative problems, simple standardisation often improves optimisation by making the distillation loss less sensitive to nuisance variation. For generative problems, whitening can be a better inductive bias. The reason is not just numerical conditioning: if the teacher features satisfy \(Z_t^\top Z_t = I\), then the L2 feature loss can be rewritten to yield a lower bound of the form \(\mathcal{L}_{\mathrm{distill}} \geq \mathrm{const} - \lambda \sum_{i \neq j} C_{j,i}^2\), where \(C\) measures cross-feature distances between student and teacher coordinates. Minimising the loss therefore also pushes up the off-diagonal cross-feature terms, which implicitly encourages decorrelated and therefore more diverse features.

This is why whitening makes particular sense for generative KD. Instead of adding a separate diversity loss that may fight the distillation objective, whitening bakes that pressure into the representation geometry itself. In the VkD experiments, this was especially valuable in data-limited image generation, where whitening helped avoid mode collapse and improved both realism and diversity.

Projector design choices, such as which side of the network the projector sits on, how expressive it is, and how the features are normalised, all shape what the student is forced to learn; Figure 2 sketches a typical pipeline.

Figure 2. Simplified feature distillation pipeline inspired by the projector recipe [15]. The student feeds a projector for feature distillation and a student head for the task loss, while the teacher remains frozen.

FitNets and Attention Transfer: FitNets [4] was among the first to show that aligning mid-network features outperforms pure response distillation, especially when the student is much thinner than the teacher. They introduce a two-stage training procedure: first pre-train the student to mimic a chosen hint layer via a projector, then fine-tune end-to-end with both task and distillation losses.

Attention Transfer [5] aligns spatial attention maps, specifically the sum of squared activations across channels, normalised to unit norm:

$$A(F) = \sum_c |F_c|^2 \;\in\; \mathbb{R}^{H \times W}$$

$$\mathcal{L}_{\mathrm{AT}} = \sum_l \left\| \frac{A(F_T^l)}{\|A(F_T^l)\|} - \frac{A(F_S^l)}{\|A(F_S^l)\|} \right\|_2^2$$

This is cheaper than full feature alignment and often more transferable across architectures because attention maps are more architecture-agnostic than raw feature tensors.
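The attention-transfer loss is simple enough to sketch directly; here features are nested lists of shape (channels, height, width), and names are illustrative. Note that summing over channels means teacher and student do not need the same channel count, which is part of why AT transfers well across architectures.

```python
import math

def attention_map(F):
    """A(F) = sum over channels of squared activations, normalised to unit norm."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    A = [[sum(F[c][i][j] ** 2 for c in range(C)) for j in range(W)]
         for i in range(H)]
    norm = math.sqrt(sum(a * a for row in A for a in row))
    return [[a / norm for a in row] for row in A]

def at_loss(F_t, F_s):
    """L2 distance between normalised teacher and student attention maps."""
    A_t, A_s = attention_map(F_t), attention_map(F_s)
    return sum((t - s) ** 2
               for rt, rs in zip(A_t, A_s) for t, s in zip(rt, rs))
```

A real implementation would sum this over the chosen layer pairs \(l\), as in the equation above.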

5. Relation-Based Distillation

Relation-based methods shift focus from individual feature vectors to the structure they form. Even if the student cannot replicate the teacher's exact feature values, it can learn to reproduce the pairwise or higher-order relationships among samples in the embedding space.

Figure 3. Relation-based distillation (e.g. RKD, CRD, NST) matches the geometry of the teacher and student embedding spaces: the loss acts on pairwise distances, angles, and batch structure rather than on individual coordinates.

Prominent Methods: Some of the most widely used relation-based methods are:

- RKD (Relational Knowledge Distillation): matches pairwise distances and angles between sample embeddings in the teacher and student spaces.
- CRD (Contrastive Representation Distillation): uses an NCE-style contrastive loss to pull corresponding teacher-student pairs together and push mismatched pairs apart.
- NST (Neuron Selectivity Transfer): aligns the distributions of activation patterns between networks via maximum mean discrepancy (MMD).

6. Loss Functions

Most distillation recipes combine multiple objectives. The total training loss is typically a weighted sum:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} \;+\; \lambda_{\mathrm{resp}}\,\mathcal{L}_{\mathrm{response}} \;+\; \lambda_{\mathrm{feat}}\,\mathcal{L}_{\mathrm{feature}} \;+\; \lambda_{\mathrm{rel}}\,\mathcal{L}_{\mathrm{relation}}$$
| Loss | Formula | Used for |
| --- | --- | --- |
| KL divergence | \(\mathrm{KL}(p_T \| p_S)\) | Response-based; soft label matching |
| L2 / MSE | \(\|f_T - \varphi(f_S)\|_2^2\) | Feature-based; direct channel alignment |
| Cosine loss | \(1 - \cos(f_T,\, \varphi(f_S))\) | Feature-based; direction-only alignment |
| LogSum / soft maximum | \(\log \sum_i |Z_s W_p - Z_t|_i^\alpha\) | Feature-based; more forgiving under large capacity gaps |
| Contrastive (NCE) | \(-\log \sigma(f_T \cdot f_S / \tau)\) | Relation-based (CRD) |
| Gram / MMD | \(\|G_T - G_S\|_F\) | Relation-based; distribution matching |

L2 / MSE: This is the cleanest feature-matching objective and is the starting point for the analysis in [15]. It penalises absolute coordinate-wise mismatch between \(Z_s W_p\) and \(Z_t\). Because it works directly in Euclidean coordinates, it is very sensitive to representation scale and covariance. That is exactly why the projector and the normalisation scheme matter so much: with poor scaling, some channels dominate the loss while others get effectively ignored.

Cosine-style alignment: Cosine losses focus on direction rather than magnitude. This can be helpful when teacher and student norms are inconsistent or when only angular similarity matters. The trade-off is that cosine losses deliberately suppress norm information, so they can be too weak when magnitude carries useful confidence or saliency cues. In practice, they often stabilise training but are less directly tied to the projector-dynamics analysis above than the squared loss.

Why normalisation changes the loss: A feature loss is never acting on "raw" features in isolation. It acts on whatever geometry the combination of \(W_p\) and the chosen normalisation creates. With no normalisation, the loss tends to chase high-variance directions. With stronger decorrelation, the projector can encode cross-feature relationships more faithfully. This is why [15] treats the projector and normalisation as a coupled system rather than two independent design choices. VkD [19] pushes this further by arguing that standardisation is often the better default for discriminative tasks, whereas whitening is especially appropriate for generative tasks because it softly encourages cross-feature diversity.

LogSum / soft maximum: The projector paper argues that plain L2 can become too rigid when the teacher-student capacity gap is large. If the student cannot realistically match every feature coordinate, forcing all errors to be treated equally can hurt the downstream task. Their proposed alternative is a soft maximum:

$$D(Z_s, Z_t; W_p) = \log \sum_i |Z_s W_p - Z_t|_i^\alpha,$$

where \(\alpha\) controls how sharply the loss focuses on larger mismatches. Intuitively, this behaves like a smoother "hard-example" feature loss: it downweights coordinates that are already close and puts more pressure on poorly aligned ones. In [15] this consistently helped in larger capacity-gap settings, with \(\alpha\) around 4-5 working well across several architecture pairs.
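The soft-maximum behaviour is visible in a one-line implementation (a sketch; names are illustrative). Because \(\log \max_i x_i \leq \log \sum_i x_i \leq \log n + \log \max_i x_i\), the loss tracks the worst-aligned coordinate up to an additive \(\log n\):

```python
import math

def logsum_loss(student_feat, teacher_feat, alpha=4.0):
    """log sum_i |s_i - t_i|^alpha: a soft maximum over coordinate errors."""
    return math.log(sum(abs(s - t) ** alpha
                        for s, t in zip(student_feat, teacher_feat)))

# One large mismatch (2.0) dominates; the small one (0.1) barely registers.
L = logsum_loss([1.0, 0.0, 2.0], [0.9, 0.0, 0.0], alpha=4.0)
worst_term = math.log(2.0 ** 4.0)
assert worst_term <= L <= math.log(3) + worst_term
```

Larger `alpha` sharpens the focus on the worst coordinates; `alpha=1` recovers a log of the summed absolute error.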

Relation-based losses: Contrastive, Gram-matrix, and MMD-style objectives try to preserve geometry at the batch level rather than match features one-by-one. The projector paper's main argument is that some of this relational information can already be absorbed implicitly by a learned projector, which helps explain why a very simple linear projector plus good normalisation can rival much more elaborate loss constructions.

7. Method Comparison

| Paradigm | What is transferred | Architecture constraint | Key hyperparameter |
| --- | --- | --- | --- |
| Response-based | Soft output logits | None; only requires the same number of classes | Temperature \(T\) |
| Feature-based | Hidden layer tensors | Requires alignment via a projector | Layer selection, projector design |
| Relation-based | Pairwise / batch structure | None; works across any architectures | Batch size, contrastive temperature \(\tau\) |
Takeaway. Feature-based distillation generally offers the largest absolute improvement but requires the most engineering. Combining all three families in a carefully tuned multi-objective loss is the approach used by current state-of-the-art methods such as DKD [14] and the projector recipe of Miles and Mikolajczyk [15]. At the same time, recent projector analyses suggest that very complex relational objectives may not always be necessary: a learned projector can already encode suitable relational structure, so in some settings a simple pairwise feature loss can go surprisingly far without adding an explicit relation loss [15, 19].

8. Language vs. Vision

Although the high-level taxonomy is shared, the practical form of distillation differs a lot between vision and language. Vision usually distils logits or spatial features. Language models often distil distributions over large vocabularies, generated sequences, or explicit reasoning traces.

Chain-of-thought distillation: In language reasoning, the teacher signal increasingly includes intermediate rationales rather than only final answers. Sequence-level KD [12] already moved in this direction by training on teacher-generated outputs, and later work such as Distilling Step-by-Step [16] and SCOTT [17] showed that smaller students can benefit directly from teacher reasoning traces. This kind of free-form rationale transfer is much less common in vision, where the supervision is usually tied to logits or hidden activations.

Feature distillation: Feature distillation is less common in language than in vision. In vision, intermediate maps have a natural spatial structure and clear layer-wise correspondence, so projector-based matching is often straightforward. In language models, hidden states are sequence-valued, lengths vary across examples, and token positions or attention heads do not line up as cleanly. Methods such as TinyBERT [11] and MiniLM [10] do distil internal representations, but they typically transfer attention distributions or selected hidden-state statistics rather than raw feature maps in the vision sense.

Tokenizer mismatch: When teacher and student use different tokenisers, token-level response distillation is not directly compatible. This is one of the most important practical issues in LLM distillation. Common workarounds include learning a vocabulary projection, aligning at the sentence-embedding level, or distilling from decoded reasoning traces instead of matching token probabilities one-to-one.

Speculative decoding: Distillation is not the only way to use a stronger model. Speculative decoding [18] keeps a larger verifier model in the loop at test time while a smaller draft model proposes tokens. This is a different trade-off from classic distillation: instead of compressing the teacher fully into the student during training, it leverages the stronger model directly at inference time to accelerate decoding while preserving the target model's output distribution.

Distillation Tokens (DeiT): DeiT (Data-efficient Image Transformers) [20] introduced a dedicated distillation token to embed teacher supervision directly into the transformer's attention mechanism. Rather than adding an external loss on top of the classification objective, DeiT appends a learnable [DIST] token to the standard sequence of patch embeddings and the class token [CLS].

The [DIST] token is processed through all transformer layers alongside every patch and the class token, interacting with them via self-attention at every depth. At the output, its representation feeds a separate distillation head trained to match the teacher's prediction, while [CLS] feeds the standard classification head:

$$\mathcal{L} = \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}(y,\,\hat{y}_{\mathrm{cls}}) + \tfrac{1}{2}\,\mathcal{L}_{\mathrm{KL}}(\hat{y}_{\mathrm{teacher}},\,\hat{y}_{\mathrm{dist}})$$

The teacher is typically a strong CNN (e.g. RegNet), providing either hard labels (argmax) or soft probability distributions. Because [DIST] can attend differently to all patches, it develops representations that reflect the teacher's inductive biases (e.g. CNN-style local spatial features) rather than simply mimicking [CLS]. At inference the two heads' predictions are averaged for optimal accuracy. DeiT is a notable response–representation hybrid: the loss is response-based (matching teacher outputs) but the mechanism operates at the internal representation level by routing teacher signals through the full depth of the network's attention.
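The two-head objective and the inference-time averaging can be sketched in a few lines of plain Python (function names are mine; real DeiT training operates on the transformer's [CLS] and [DIST] token outputs):

```python
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deit_loss(cls_logits, dist_logits, teacher_probs, label):
    """0.5 * CE(y, y_cls) + 0.5 * KL(y_teacher || y_dist)."""
    p_cls = softmax(cls_logits)
    p_dist = softmax(dist_logits)
    ce = -math.log(p_cls[label])
    kl = sum(pt * math.log(pt / pd)
             for pt, pd in zip(teacher_probs, p_dist) if pt > 0)
    return 0.5 * ce + 0.5 * kl

def deit_predict(cls_logits, dist_logits):
    """At inference the two heads' softmax outputs are averaged."""
    p_cls, p_dist = softmax(cls_logits), softmax(dist_logits)
    avg = [(a + b) / 2.0 for a, b in zip(p_cls, p_dist)]
    return max(range(len(avg)), key=avg.__getitem__)
```

With the hard-label variant of DeiT, the KL term is replaced by cross-entropy against the teacher's argmax class.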


Figure 4. DeiT's distillation token. A dedicated [DIST] token attends to all patch tokens and [CLS] at every layer. Its output drives a distillation head aligned to the frozen CNN teacher's prediction; the [CLS] head handles the standard task loss.

Reasoning Distillation (DeepSeek-R1): DeepSeek-R1 [21] demonstrates one of the most striking recent applications of distillation: the structured reasoning behaviour of a very large model can be transferred to models orders of magnitude smaller via a remarkably simple recipe.

The R1 training pipeline uses group relative policy optimisation (GRPO) to train a 671B-parameter mixture-of-experts model to produce long chain-of-thought reasoning traces: sequences that interleave exploration, self-correction, and answer consolidation. These traces are then used as supervision: a smaller student (7B–70B parameters) is fine-tuned on teacher-generated traces with standard cross-entropy, requiring no reinforcement learning of its own.

This is qualitatively different from response distillation (matching logits) or feature distillation (matching hidden states). What is transferred is not a probability distribution but the teacher's problem-solving process. Distilled models significantly outperform same-size counterparts trained without reasoning traces, suggesting the long structured sequences carry an implicit curriculum: early steps provide simpler sub-problem supervision, later steps require integrating prior reasoning.

DeepSeek-R1 also highlights a practical shift in where distillation bottlenecks arise for LLMs. Unlike vision distillation, the challenge is not architecture mismatch or projector design; it is generating high-quality teacher traces at scale and ensuring student context windows are large enough to absorb them.

Takeaway. Vision distillation benefits most from feature-level alignment of spatial maps with a simple projector. Language distillation benefits most from rich soft-label matching over large vocabularies, with feature alignment requiring careful attention-head and layer-mapping strategies. Transformer-specific techniques like DeiT's distillation token embed teacher signals into the model's attention mechanism, while large-scale reasoning distillation (DeepSeek-R1) shows that process-level supervision can outperform point-wise output matching.

9. Future Directions

Knowledge distillation is an active area and several threads are shaping its frontier.

Process and reward distillation. Systems like DeepSeek-R1 [21] signal a shift from output-level distillation toward process-level supervision: matching the teacher's step-by-step reasoning rather than its final answer. A related thread distils verifier or reward models from RLHF-trained teachers, useful when the reward signal is expensive to compute at inference but cheap to generate offline.

Test-time compute distillation. Large models can be given additional test-time compute, generating many candidate solutions and selecting the best, giving a stronger signal than a single forward pass. Distilling this capability into a student that approximates expensive search in a single step collapses the gap between slow-thinking and fast-inference models, and is closely related to best-of-\(N\) sampling, diffusion stitching [13], and speculative decoding [18].

Diffusion model distillation. Score-based and diffusion generative models require many denoising steps at inference. Progressive distillation and consistency models compress a multi-step teacher into a student that approximates the same mapping in far fewer steps, often just one or two. This is a direct analogue of response distillation applied to generative processes: instead of matching class probabilities, the student matches the teacher's denoised output at each step.

Multimodal distillation. Cross-modal knowledge transfer (e.g. distilling a vision-language model into a vision-only or text-only student) is an emerging challenge. The alignment problem is especially pronounced because modalities have fundamentally different structure and there is no obvious token-level correspondence between image patches and text tokens. Methods that distil shared embedding spaces or use captioning as an intermediate supervision signal are early steps in this direction.

Architecture-aware distillation. As models become more heterogeneous, mixing attention layers, state-space models, and mixture-of-experts blocks, the question of how to transfer knowledge across radically different inductive biases is increasingly important. VkD [19] and the orthogonal projector are one step in this direction, but robust cross-paradigm distillation (e.g. from a transformer to an SSM) and distillation for sparse expert models remain largely open research problems.

Common thread. Across all these directions, the core tension remains the same as in classical KD: the teacher holds richer structure than any single loss can capture. Progress comes from finding the right representation of that structure: whether it is a soft-label distribution, an intermediate feature, a reasoning trace, or a denoising trajectory, and a training signal that transfers it faithfully into a more efficient student.

10. References
