TinyML Distillation

Modern AI models are becoming increasingly large, demanding substantial computational resources and memory. This creates a gap between what these models require and what deployment hardware can provide. Knowledge distillation helps close this gap by transferring what a large model has learned into a smaller one, reducing model size, memory footprint, and ultimately, energy consumption.


📚 Lecture Overview: Knowledge Distillation (KD) for TinyML

This lecture explores Knowledge Distillation (KD), a technique for training smaller, efficient models (student models) by leveraging knowledge from larger, pre-trained models (teacher models). KD is particularly impactful in deploying complex neural networks on resource-constrained hardware, making it ideal for TinyML applications.


🔍 1. What is Knowledge Distillation?

  • 🎯 Motivation:
    Enable efficient AI models that operate on diverse hardware platforms, from cloud GPUs to tiny edge devices with limited compute and memory.

  • 💡 Core Idea:
    Transfer knowledge from a large, high-accuracy teacher model to a smaller, efficient student model.

  • ⚙️ Process:
    • Both models process the same input.
    • Training combines:
      • Standard classification loss (e.g., cross-entropy).
      • Distillation loss: Encourages the student's output to match the teacher's.
    • Temperature parameter (T) in softmax smooths output probabilities, transferring "dark knowledge."
  • 💬 Quote:
    "Can we use a larger model to guide a smaller model? So we have a larger Model A teacher model on the left. We have a smaller model, student model on the right."
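The process above can be sketched in a few lines of plain Python. This is a minimal illustration of the classic KD loss (hard-label cross-entropy plus a temperature-softened KL term); the function names, the default `T=4.0`, and the `alpha=0.5` weighting are illustrative choices, not values prescribed by the lecture.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T yields softer probabilities."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft distillation term.

    The T**2 factor keeps the soft term's gradient magnitude comparable
    across temperatures; alpha balances the two losses.
    """
    p_student = softmax(student_logits)        # T = 1 for the hard loss
    ce = -math.log(p_student[true_label])      # standard cross-entropy

    soft_teacher = softmax(teacher_logits, T)  # softened "dark knowledge"
    soft_student = softmax(student_logits, T)
    # KL(teacher || student) on the temperature-softened distributions
    kl = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student))

    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

Raising T flattens both distributions, so the student is also rewarded for matching the teacher's small probabilities on wrong classes, which is exactly the "dark knowledge" being transferred.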

🧠 2. What to Match?

KD can match more than the final logits; intermediate tensors are also used for richer knowledge transfer:

  • 🔢 Output Logits: Match final class probabilities with cross-entropy or L2 loss.
  • ⚖️ Intermediate Weights: Match weights using low-rank approximations or projections, even with dimensional differences.
  • 🌊 Intermediate Features: Align feature maps between teacher and student. Metrics like Maximum Mean Discrepancy (MMD) are used.
  • 🌀 Gradients: Match gradients of the loss w.r.t. inputs or activations to guide learning.
  • ⚡ Sparsity Patterns: Match ReLU sparsity patterns to mimic teacher neuron activations.
  • 🌐 Relational Information: Match relationships across multiple inputs or layers for richer knowledge transfer.

  • 💬 Quote:
    "What tensors can we match? … Starting from the output logits… Intermediate tensors, including the intermediate weights, also the intermediate features."
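As a concrete example of feature matching, here is a minimal pure-Python sketch of the (biased) squared MMD estimator with an RBF kernel, comparing a batch of teacher feature vectors against a batch of student feature vectors. The `gamma` bandwidth and function names are illustrative assumptions; real implementations operate on feature-map tensors and often mix several kernel bandwidths.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2(feats_teacher, feats_student, gamma=1.0):
    """Biased squared-MMD estimate between two sets of feature vectors:
    mean k(t, t') + mean k(s, s') - 2 * mean k(t, s).
    Zero when the two feature distributions coincide."""
    def mean_kernel(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return (mean_kernel(feats_teacher, feats_teacher)
            + mean_kernel(feats_student, feats_student)
            - 2 * mean_kernel(feats_teacher, feats_student))
```

During training, this quantity would be added to the student's loss so that gradient descent pulls the student's intermediate feature distribution toward the teacher's.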

๐Ÿค 3. Self and Online Distillation

  • 🔄 Self-Distillation:
    • No separate teacher needed.
    • A single model trains iteratively, with each version acting as the teacher for the next (e.g., Born-Again Networks).
  • 🤖 Online Distillation:
    • Teacher and student models train simultaneously, learning collaboratively (e.g., Deep Mutual Learning).
  • 🔗 Combined Approaches:
    • Deeper layers supervise shallower layers in the same model.
  • 💬 Quote:
    "If we don't have the teacher, how do we apply knowledge distillation to begin with? … Self and online distillation … Learn together with your classmates."
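The "learn together with your classmates" idea can be made concrete with the Deep Mutual Learning loss: each peer minimizes its own cross-entropy plus a KL term toward the other peer's current predictions. This sketch computes both losses for one example; the function names are illustrative, and a real implementation would alternate gradient updates between the two networks.

```python
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_losses(logits_a, logits_b, true_label):
    """Deep Mutual Learning (sketch): each student's loss is its own
    cross-entropy plus a KL term pulling it toward its peer's predictions."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = -math.log(p_a[true_label]) + kl(p_b, p_a)  # peer b supervises a
    loss_b = -math.log(p_b[true_label]) + kl(p_a, p_b)  # peer a supervises b
    return loss_a, loss_b
```

When the two peers agree exactly, both KL terms vanish and each loss reduces to the ordinary cross-entropy, so the mutual term only acts while the "classmates" still disagree.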

๐Ÿ› ๏ธ 4. Distillation for Different Tasks

KD extends beyond image classification to other tasks:

  • 🔍 Object Detection: Match features and bounding box predictions.
  • 🌈 Semantic Segmentation: Use feature imitation and adversarial losses for pixel-wise predictions.
  • 🎨 GANs: Compress resource-heavy generative models by matching intermediate features and outputs.
  • 📖 NLP: Match logits and attention maps in transformers for smaller, efficient models.
  • 🧑‍💻 LLMs & VLMs: Combine pruning with KD for significant size reduction and cost savings.

  • 💬 Quote:
    "We want to apply knowledge distillation to different tasks to solve real-world problems. … Starting with object detection. … Segmentation is a new task… We try to find to give a label for each pixel, pixel-wise prediction."

๐Ÿ—๏ธ 5. Network Augmentation

  • 📈 Motivation:
    Overcome underfitting in tiny models, where limited capacity makes standard techniques like data augmentation ineffective.

  • 🔧 Core Idea:
    Augment the model architecture during training for extra supervision, enabling the tiny model to learn as part of a larger network.

  • 🛠️ Process:
    • Expand the original model's width or depth temporarily during training.
    • Use shared weights between the base and augmented models.
    • Combine losses from both the base and augmented models.
  • 💬 Quote:
    "Can we do network augmentation, okay, to augment the model to get some extra supervision during training for the tiny model? … So in the end, we still want to deploy a small tiny model, like here, only with two channels rather than four channels, but during the learning process, we find some redundancy helps."
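The two-channels-versus-four-channels example in the quote can be sketched with a toy weight-shared linear model: the base model uses the first two channels, the augmented model uses all four, and training sums both losses so gradients from the wider model also update the shared weights. The function names and squared-error loss are simplifying assumptions for illustration.

```python
def forward(x, weights, width):
    """Weight-shared forward pass: use only the first `width` channels.
    The base model and the augmented model index into the SAME weights."""
    return sum(w * xi for w, xi in zip(weights[:width], x[:width]))

def netaug_loss(x, target, weights, base_width):
    """NetAug-style combined training loss (sketch): the deployed base
    model's loss plus one augmented (wider) model's loss."""
    base_out = forward(x, weights, base_width)    # e.g. 2 channels, deployed
    aug_out = forward(x, weights, len(weights))   # e.g. all 4 channels
    base_loss = (base_out - target) ** 2
    aug_loss = (aug_out - target) ** 2
    return base_loss + aug_loss
```

At deployment time only the `base_width` slice of the weights is kept, so the extra channels add supervision during training without costing anything on the device.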

🎯 Conclusion

Knowledge Distillation is a versatile and powerful technique for building efficient AI models. By transferring knowledge from teacher to student models, KD supports:

  • Diverse tasks beyond image classification.
  • Advanced techniques like network augmentation.

It remains a cornerstone for tackling real-world challenges in AI deployment on resource-constrained hardware.