TinyML Distillation
Modern AI models are becoming increasingly large, demanding substantial computational resources and memory. This creates a gap between what these models require and what deployment hardware can provide. Knowledge distillation helps close this gap by reducing model size, memory footprint, and, ultimately, energy consumption.
Lecture Overview: Knowledge Distillation (KD) for TinyML
This lecture explores Knowledge Distillation (KD), a technique for training smaller, efficient models (student models) by leveraging knowledge from larger, pre-trained models (teacher models). KD is particularly impactful in deploying complex neural networks on resource-constrained hardware, making it ideal for TinyML applications.
1. What is Knowledge Distillation?
- Motivation: Efficient AI models must run on diverse hardware platforms, from cloud GPUs to tiny edge devices with limited compute and memory.
- Core Idea: Transfer knowledge from a large, high-accuracy teacher model to a smaller, efficient student model.
- Process:
  - Both models process the same input.
  - Training combines:
    - Standard classification loss (e.g., cross-entropy).
    - Distillation loss: encourages the student's output distribution to match the teacher's.
  - A temperature parameter (T) in the softmax smooths the output probabilities, transferring "dark knowledge."
- Quote:
  "Can we use a larger model to guide a smaller model? So we have a larger Model A teacher model on the left. We have a smaller model, student model on the right."
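The combined loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's exact implementation: the weighting `alpha` and the `T^2` scaling on the soft term (a common convention that keeps gradient magnitudes comparable across temperatures) are assumptions of this sketch.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: higher T yields softer probabilities.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.5):
    """Hard cross-entropy on the true label plus a temperature-softened
    KL term pulling the student's distribution toward the teacher's."""
    # Hard loss: standard cross-entropy against the ground-truth label.
    p_student = softmax(student_logits, T=1.0)
    hard_loss = -np.log(p_student[true_label])
    # Soft loss: KL divergence between temperature-softened distributions,
    # scaled by T^2 so its gradient magnitude stays comparable as T grows.
    p_t = softmax(teacher_logits, T=T)
    p_s = softmax(student_logits, T=T)
    soft_loss = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T
    return alpha * hard_loss + (1 - alpha) * soft_loss
```

Raising `T` flattens both distributions, so the small probabilities the teacher assigns to wrong classes (the "dark knowledge") carry more weight in the soft term.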
2. What to Match?
KD can match more than the final logits; intermediate tensors enable richer knowledge transfer:
- Output Logits: match final class probabilities with cross-entropy or L2 loss.
- Intermediate Weights: match weights using low-rank approximations or projections, even when dimensions differ.
- Intermediate Features: align feature maps between teacher and student, using metrics such as Maximum Mean Discrepancy (MMD).
- Gradients: match gradients of the loss w.r.t. inputs or activations to guide learning.
- Sparsity Patterns: match ReLU sparsity patterns so the student mimics the teacher's neuron activations.
- Relational Information: match relationships across multiple inputs or layers for richer knowledge transfer.
- Quote:
  "What tensors can we match? … Starting from the output logits… Intermediate tensors, including the intermediate weights, also the intermediate features."
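When student and teacher features have different channel counts, a learned projection can bridge the gap before taking a distance. A minimal sketch of feature matching under that assumption (the function name and linear-projection choice are illustrative; in practice `proj` is trained jointly with the student, and richer metrics such as MMD can replace the plain L2 distance):

```python
import numpy as np

def feature_matching_loss(f_student, f_teacher, proj):
    """Mean squared error between teacher features and projected
    student features.

    f_student: (C_s,) student feature vector.
    f_teacher: (C_t,) teacher feature vector.
    proj:      (C_t, C_s) linear projection resolving the channel
               mismatch between student and teacher (illustrative).
    """
    diff = f_teacher - proj @ f_student
    return float(np.mean(diff ** 2))
```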
3. Self and Online Distillation
- Self-Distillation:
  - No separate teacher is needed.
  - A single model trains iteratively, with each version acting as the teacher for the next (e.g., Born-Again Networks).
- Online Distillation:
  - Teacher and student models train simultaneously, learning collaboratively (e.g., Deep Mutual Learning).
- Combined Approaches:
  - Deeper layers supervise shallower layers within the same model.
- Quote:
  "If we don't have the teacher, how do we apply knowledge distillation to begin with? … Self and online distillation … Learn together with your classmates."
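The "learn together with your classmates" idea of Deep Mutual Learning can be sketched as two peer models that each add a KL term toward the other's predictions, with neither acting as a fixed teacher. A simplified single-example version (the function names are illustrative):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def kl(p, q):
    # KL divergence KL(p || q) between two probability vectors.
    return float(np.sum(p * (np.log(p) - np.log(q))))

def mutual_losses(logits_a, logits_b, label):
    """Deep Mutual Learning sketch: each peer minimizes cross-entropy
    on the true label plus KL toward the other peer's distribution."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = -np.log(p_a[label]) + kl(p_b, p_a)  # peer B supervises A
    loss_b = -np.log(p_b[label]) + kl(p_a, p_b)  # peer A supervises B
    return loss_a, loss_b
```

Both networks update every step, so the "teacher" signal improves as training progresses rather than being frozen up front.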
4. Distillation for Different Tasks
KD extends beyond image classification to other tasks:
- Object Detection: match features and bounding-box predictions.
- Semantic Segmentation: use feature imitation and adversarial losses for pixel-wise predictions.
- GANs: compress resource-heavy generative models by matching intermediate features and outputs.
- NLP: match logits and attention maps in transformers to obtain smaller, efficient models.
- LLMs & VLMs: combine pruning with KD for significant size reduction and cost savings.
- Quote:
  "We want to apply knowledge distillation to different tasks to solve real-world problems. … Starting with object detection. … Segmentation is a new task… We try to find to give a label for each pixel, pixel-wise prediction."
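For the transformer case, "matching attention maps" can be as simple as a mean-squared error between corresponding attention matrices of teacher and student. A minimal sketch under the assumption that the two models share the same number of heads and sequence length (layer/head alignment strategies are a separate design choice):

```python
import numpy as np

def attention_transfer_loss(att_student, att_teacher):
    """MSE between student and teacher attention maps.

    att_student, att_teacher: (heads, L, L) row-stochastic attention
    matrices from corresponding layers (same shape assumed here).
    """
    return float(np.mean((att_student - att_teacher) ** 2))
```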
5. Network Augmentation
- Motivation: Overcome underfitting in tiny models, whose limited capacity makes standard techniques like data augmentation ineffective.
- Core Idea: Augment the model architecture during training for extra supervision, letting the tiny model learn as part of a larger network.
- Process:
  - Temporarily expand the original model's width or depth during training.
  - Share weights between the base and augmented models.
  - Combine losses from both the base and augmented models.
- Quote:
  "Can we do network augmentation, okay, to augment the model to get some extra supervision during training for the tiny model? … So in the end, we still want to deploy a small tiny model, like here, only with two channels rather than four channels, but during the learning process, we find some redundancy helps."
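The weight-sharing step can be sketched with a single linear layer standing in for the tiny model: the deployed base model is just a slice of the wider augmented model, so both forward passes reuse the same weights. This is a deliberate simplification of the idea, not the actual NetAug implementation:

```python
import numpy as np

def forward(x, W):
    # Single linear layer as a stand-in for the tiny model.
    return W @ x

def netaug_outputs(x, W_aug, base_channels):
    """Network augmentation sketch: the tiny base model is a
    weight-shared slice of a wider augmented model.

    W_aug: (C_aug, D) augmented weight matrix; the first
    `base_channels` rows form the model actually deployed
    (an illustrative simplification of channel-wise sharing).
    """
    y_aug = forward(x, W_aug)                   # wide, augmented forward pass
    y_base = forward(x, W_aug[:base_channels])  # shared-weight base slice
    return y_base, y_aug
```

During training, losses from both outputs are combined; at deployment time only the `base_channels` slice is kept, so the extra capacity costs nothing on-device.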
Conclusion
Knowledge Distillation is a versatile and powerful technique for building efficient AI models. By transferring knowledge from teacher to student models, KD supports:
- Diverse tasks beyond image classification.
- Advanced techniques like network augmentation.
It remains a cornerstone for tackling real-world challenges in AI deployment on resource-constrained hardware.
