TinyML Distillation

Modern AI models are becoming increasingly large, demanding substantial computational resources and memory. This creates a gap between what these models require and what deployment hardware can provide. Knowledge distillation helps close this gap by transferring what a large model has learned into a smaller one, reducing model size, memory footprint, and ultimately, energy consumption.


📚 Lecture Overview: Knowledge Distillation (KD) for TinyML

This lecture explores Knowledge Distillation (KD), a technique for training smaller, efficient models (student models) by leveraging knowledge from larger, pre-trained models (teacher models). KD is particularly impactful in deploying complex neural networks on resource-constrained hardware, making it ideal for TinyML applications.


🔍 1. What is Knowledge Distillation?

  • 🎯 Motivation:
    Enable efficient AI models that operate on diverse hardware platforms, from cloud GPUs to tiny edge devices with limited compute and memory.

  • 💡 Core Idea:
    Transfer knowledge from a large, high-accuracy teacher model to a smaller, efficient student model.

  • ⚙️ Process:
    • Both models process the same input.
    • Training combines:
      • Standard classification loss (e.g., cross-entropy).
      • Distillation loss: Encourages the student's output to match the teacher's.
    • Temperature parameter (T) in softmax smooths output probabilities, transferring "dark knowledge."
  • 💬 Quote:
    "Can we use a larger model to guide a smaller model? So we have a larger Model A teacher model on the left. We have a smaller model, student model on the right."
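The process above can be sketched in a few lines of plain Python. This is a minimal illustration of the classic KD loss (hard-label cross-entropy plus a temperature-softened KL term); the function names, the default `T=4.0`, and the `alpha=0.5` weighting are illustrative choices, not values prescribed by the lecture.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T yields softer probabilities."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with a soft distillation term.

    The T**2 factor keeps the soft term's gradient magnitude comparable
    across temperatures; alpha balances the two losses.
    """
    p_student = softmax(student_logits)        # T = 1 for the hard loss
    ce = -math.log(p_student[true_label])      # standard cross-entropy

    soft_teacher = softmax(teacher_logits, T)  # softened "dark knowledge"
    soft_student = softmax(student_logits, T)
    # KL(teacher || student) on the temperature-softened distributions
    kl = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student))

    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

Raising T flattens both distributions, so the student is also rewarded for matching the teacher's small probabilities on wrong classes, which is exactly the "dark knowledge" being transferred.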

🧠 2. What to Match?

KD can match more than the final logits; intermediate tensors are also used for richer knowledge transfer:

  • 🔢 Output Logits: Match final class probabilities with cross-entropy or L2 loss.
  • ⚖️ Intermediate Weights: Match weights using low-rank approximations or projections, even with dimensional differences.
  • 🌊 Intermediate Features: Align feature maps between teacher and student. Metrics like Maximum Mean Discrepancy (MMD) are used.
  • 🌀 Gradients: Match gradients of the loss w.r.t. inputs or activations to guide learning.
  • ⚡ Sparsity Patterns: Match ReLU sparsity patterns to mimic teacher neuron activations.
  • 🌐 Relational Information: Match relationships across multiple inputs or layers for richer knowledge transfer.

  • 💬 Quote:
    "What tensors can we match? … Starting from the output logits… Intermediate tensors, including the intermediate weights, also the intermediate features."
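As a concrete example of feature matching, here is a minimal pure-Python sketch of the (biased) squared MMD estimator with an RBF kernel, comparing a batch of teacher feature vectors against a batch of student feature vectors. The `gamma` bandwidth and function names are illustrative assumptions; real implementations operate on feature-map tensors and often mix several kernel bandwidths.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd2(feats_teacher, feats_student, gamma=1.0):
    """Biased squared-MMD estimate between two sets of feature vectors:
    mean k(t, t') + mean k(s, s') - 2 * mean k(t, s).
    Zero when the two feature distributions coincide."""
    def mean_kernel(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return (mean_kernel(feats_teacher, feats_teacher)
            + mean_kernel(feats_student, feats_student)
            - 2 * mean_kernel(feats_teacher, feats_student))
```

During training, this quantity would be added to the student's loss so that gradient descent pulls the student's intermediate feature distribution toward the teacher's.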

๐Ÿค 3. Self and Online Distillation

  • 🔄 Self-Distillation:
    • No separate teacher needed.
    • A single model trains iteratively, with each version acting as the teacher for the next (e.g., Born-Again Networks).
  • 🤖 Online Distillation:
    • Teacher and student models train simultaneously, learning collaboratively (e.g., Deep Mutual Learning).
  • 🔗 Combined Approaches:
    • Deeper layers supervise shallower layers in the same model.
  • 💬 Quote:
    "If we don't have the teacher, how do we apply knowledge distillation to begin with? … Self and online distillation … Learn together with your classmates."
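The "learn together with your classmates" idea can be made concrete with the Deep Mutual Learning loss: each peer minimizes its own cross-entropy plus a KL term toward the other peer's current predictions. This sketch computes both losses for one example; the function names are illustrative, and a real implementation would alternate gradient updates between the two networks.

```python
import math

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_losses(logits_a, logits_b, true_label):
    """Deep Mutual Learning (sketch): each student's loss is its own
    cross-entropy plus a KL term pulling it toward its peer's predictions."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = -math.log(p_a[true_label]) + kl(p_b, p_a)  # peer b supervises a
    loss_b = -math.log(p_b[true_label]) + kl(p_a, p_b)  # peer a supervises b
    return loss_a, loss_b
```

When the two peers agree exactly, both KL terms vanish and each loss reduces to the ordinary cross-entropy, so the mutual term only acts while the "classmates" still disagree.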

๐Ÿ› ๏ธ 4. Distillation for Different Tasks

KD extends beyond image classification to other tasks:

  • 🔍 Object Detection: Match features and bounding box predictions.
  • 🌈 Semantic Segmentation: Use feature imitation and adversarial losses for pixel-wise predictions.
  • 🎨 GANs: Compress resource-heavy generative models by matching intermediate features and outputs.
  • 📖 NLP: Match logits and attention maps in transformers for smaller, efficient models.
  • 🧑‍💻 LLMs & VLMs: Combine pruning with KD for significant size reduction and cost savings.

  • 💬 Quote:
    "We want to apply knowledge distillation to different tasks to solve real-world problems. … Starting with object detection. … Segmentation is a new task… We try to find to give a label for each pixel, pixel-wise prediction."

๐Ÿ—๏ธ 5. Network Augmentation

  • 📈 Motivation:
    Overcome underfitting in tiny models, where limited capacity makes standard techniques like data augmentation ineffective.

  • 🔧 Core Idea:
    Augment the model architecture during training for extra supervision, enabling the tiny model to learn as part of a larger network.

  • 🛠️ Process:
    • Expand the original model's width or depth temporarily during training.
    • Use shared weights between the base and augmented models.
    • Combine losses from both the base and augmented models.
  • 💬 Quote:
    "Can we do network augmentation, okay, to augment the model to get some extra supervision during training for the tiny model? … So in the end, we still want to deploy a small tiny model, like here, only with two channels rather than four channels, but during the learning process, we find some redundancy helps."
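The two-channels-versus-four-channels example in the quote can be sketched with a toy weight-shared linear model: the base model uses the first two channels, the augmented model uses all four, and training sums both losses so gradients from the wider model also update the shared weights. The function names and squared-error loss are simplifying assumptions for illustration.

```python
def forward(x, weights, width):
    """Weight-shared forward pass: use only the first `width` channels.
    The base model and the augmented model index into the SAME weights."""
    return sum(w * xi for w, xi in zip(weights[:width], x[:width]))

def netaug_loss(x, target, weights, base_width):
    """NetAug-style combined training loss (sketch): the deployed base
    model's loss plus one augmented (wider) model's loss."""
    base_out = forward(x, weights, base_width)    # e.g. 2 channels, deployed
    aug_out = forward(x, weights, len(weights))   # e.g. all 4 channels
    base_loss = (base_out - target) ** 2
    aug_loss = (aug_out - target) ** 2
    return base_loss + aug_loss
```

At deployment time only the `base_width` slice of the weights is kept, so the extra channels add supervision during training without costing anything on the device.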

🎯 Conclusion

Knowledge Distillation is a versatile and powerful technique for building efficient AI models. By transferring knowledge from teacher to student models, KD supports:

  • Diverse tasks beyond image classification.
  • Advanced techniques like network augmentation.

It remains a cornerstone for tackling real-world challenges in AI deployment on resource-constrained hardware.