TinyML On-device Training
Modern AI models are becoming increasingly large, demanding substantial computational resources and memory, which creates a gap between what these models require and what edge hardware can provide. Compression techniques such as pruning help close this gap by reducing model size, memory footprint, and ultimately energy consumption, but training on the device itself raises additional challenges.
This document summarizes key themes and findings from a lecture on on-device training and transfer learning. The central motivation is to enable machine learning model training on edge devices (e.g., phones, cars, embedded systems) to:
- Preserve privacy 🛡️
- Process sensitive data locally 🔒
- Adapt models to new sensor data 🔄
The lecture explores challenges associated with on-device training and presents techniques to address them, focusing on memory and computational efficiency.
🌟 Key Themes and Ideas
🔐 Privacy Concerns with Gradient Sharing
- Problem: Sharing gradients (model updates) with the cloud, even in federated learning, risks user privacy. Gradients can be exploited to reconstruct original data (Deep Leakage attack).
- “Sharing gradients is as dangerous as sharing the original user data!”
- Defense Strategies:
- Adding Noise: Only prevents leakage at noise levels that significantly hurt accuracy.
- “Simply applying noise cannot prevent deep leakage unless we allow significant accuracy drop.”
- Gradient Compression (Pruning): Pruning away up to 99% of the gradient makes leakage much harder while preserving accuracy (see the sketch below).
- “If you prune away 99% of the gradient, there’s very little left to match!”
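A minimal sketch of the gradient-compression defense, assuming a PyTorch setup in which each parameter's gradient is pruned to its top-1% entries by magnitude before it ever leaves the device (the `prune_gradient` helper and the toy model are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

def prune_gradient(grad: torch.Tensor, sparsity: float = 0.99) -> torch.Tensor:
    """Zero out all but the largest-magnitude entries of a gradient tensor."""
    k = max(1, int(grad.numel() * (1.0 - sparsity)))   # number of entries to keep
    flat = grad.flatten()
    keep = torch.topk(flat.abs(), k).indices
    mask = torch.zeros_like(flat)
    mask[keep] = 1.0
    return (flat * mask).view_as(grad)

# Toy example: compute gradients locally, prune 99% of them, share only the rest.
model = nn.Linear(128, 10)
x, y = torch.randn(4, 128), torch.randint(0, 10, (4,))
nn.functional.cross_entropy(model(x), y).backward()

shared = {name: prune_gradient(p.grad) for name, p in model.named_parameters()}
print({name: (g != 0).float().mean().item() for name, g in shared.items()})  # ~1% nonzero
```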
🧠 Memory Bottlenecks of On-Device Training
- Challenge: Training requires far more memory than inference, which is problematic for edge devices.
- “Edge devices have tight memory constraints. The training memory footprint of neural networks can easily exceed the limit.”
- Key Insights:
- Activation Memory is the bottleneck: the intermediate activations produced in the forward pass must be stored so backpropagation can use them.
- Training Memory > Inference Memory because activations for all layers must be kept until backpropagation reaches them (see the rough estimate below).
- Parameter-efficient transfer learning doesn’t always equate to memory efficiency, because activations, not parameters, dominate the training footprint.
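The point is easy to see with a back-of-the-envelope estimate. The sketch below (a rough illustration with made-up layer sizes, not the lecture's numbers) counts parameter elements versus forward-pass activation elements for a small CNN; even at batch size 8, activations dominate:

```python
import torch
import torch.nn as nn

# Rough, illustrative estimate: compare parameter memory with the activation
# memory that must stay resident until backpropagation is done with it.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 32 * 32, 10),
)

activation_elems = 0
def count_activations(module, inputs, output):
    global activation_elems
    activation_elems += output.numel()   # each layer's output is kept for backward

hooks = [m.register_forward_hook(count_activations) for m in model]
model(torch.randn(8, 3, 32, 32))         # one forward pass, batch of 8 RGB 32x32

param_elems = sum(p.numel() for p in model.parameters())
print(f"parameter memory : {param_elems * 4 / 1e6:.2f} MB (fp32)")
print(f"activation memory: {activation_elems * 4 / 1e6:.2f} MB (fp32, batch of 8)")
```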
🌱 Tiny Transfer Learning (TinyTL)
- Goal: Memory-efficient on-device transfer learning with minimal accuracy loss.
- Approach:
- Update Biases Only: Bias gradients can be computed without the stored input activations, so freezing the weights and updating only the biases removes most of the activation memory (see the sketch at the end of this section).
- “Updating biases does not require storing intermediate activations.”
- Lite Residual Learning: Adds lightweight residual branches to recover the capacity lost by updating biases only.
- “Add lite residual modules to increase model capacity.”
- Results: Saves up to 6.5x memory while maintaining high accuracy.
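A minimal sketch of the bias-only part of the recipe, assuming a generic pre-trained PyTorch backbone (the layer sizes and the new `head` are placeholders; the full TinyTL approach additionally inserts lite residual modules, which are omitted here):

```python
import torch
import torch.nn as nn

# Placeholder "pre-trained" backbone; TinyTL uses a MobileNet-style network.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(32, 5)                       # new classifier for the target task

# Freeze all weights; re-enable gradients only for biases (and the new head).
for name, p in backbone.named_parameters():
    p.requires_grad_(name.endswith("bias"))

trainable = [p for p in backbone.parameters() if p.requires_grad]
trainable += list(head.parameters())
optimizer = torch.optim.SGD(trainable, lr=0.01)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 5, (8,))
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
optimizer.step()                              # only biases and the head change
```

Because the frozen weights receive no gradient, the large intermediate activation maps that their weight gradients would require never need to be saved.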
✂️ Sparse Back-Propagation (SparseBP)
- Inspiration: Synaptic pruning in the brain—only update the most important parts of the model.
- “Synapses become sparse during adolescence.”
- Techniques:
- Sparse Layer Backpropagation: Update only the layers that matter most.
- Sparse Tensor Backpropagation: Update only parts of the weight tensors within a layer.
- Avoid Early Layers: Skip updating low-level feature extractors (a toy sketch of these choices appears at the end of this section).
- Optimization: Use evolutionary search to determine optimal backpropagation schemes.
- Results: Comparable accuracy to full backpropagation but with significantly reduced memory.
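A toy sketch of what a sparse back-propagation scheme looks like mechanically in PyTorch. Which layers to skip, which to update fully, and which rows of a tensor to touch are chosen here by hand purely for illustration; the lecture determines them with evolutionary search:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),   # early layer: skipped entirely
    nn.Linear(64, 64), nn.ReLU(),   # middle layer: sparse tensor update
    nn.Linear(64, 10),              # last layer: full update
)

# Sparse layer backprop: freeze the early feature extractor.
for p in model[0].parameters():
    p.requires_grad_(False)

# Sparse tensor backprop: mask the middle layer's weight gradient to 25% of rows.
row_mask = torch.zeros(64, 1)
row_mask[:16] = 1.0
model[2].weight.register_hook(lambda g: g * row_mask)

x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
nn.functional.cross_entropy(model(x), y).backward()

print((model[2].weight.grad != 0).float().mean())   # ~0.25: partial tensor update
```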
🔢 Quantized Training with Quantization Aware Scaling (QAS)
- Goal: Use lower precision (e.g., int8) for training to save memory and computation.
- Challenge: Real quantized training often suffers accuracy loss due to scaling mismatches.
- Solution:
- QAS re-scales the gradients of quantized weights and biases so their magnitudes match the FP32 counterparts, stabilizing training (a worked example appears at the end of this section).
- Results: Enables stable, accurate training with limited resources.
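A worked single-weight example of the re-scaling, under my reading of quantization-aware scaling (not a quote from the lecture) for symmetric int8 weights with W_fp32 ≈ s · W_int: the chain rule gives dL/dW_int = s · dL/dW_fp32, while the update we actually want on W_int is the FP32 update divided by s, so multiplying the int8 gradient by 1/s² reconciles the two. The names `s`, `w_int`, and `qas_rescale` are illustrative:

```python
import torch

def qas_rescale(grad_w_int: torch.Tensor, scale: float) -> torch.Tensor:
    # Compensate the gradient of an int8 weight (W_fp32 ≈ scale * W_int) so that
    # updating W_int is equivalent to the FP32 update.
    return grad_w_int * scale ** -2

s, lr = 0.05, 0.1
w_int = torch.tensor(40.0)               # quantized weight value (int8 on device)
w_fp32 = s * w_int                       # 2.0, the value the weight represents
g_fp32 = torch.tensor(2.0)               # gradient w.r.t. the FP32 weight

w_fp32_new = w_fp32 - lr * g_fp32        # desired FP32 result: 1.8
g_int = s * g_fp32                       # chain rule: dL/dW_int = s * dL/dW_fp32
w_int_new = w_int - lr * qas_rescale(g_int, s)

print(w_fp32_new.item(), (s * w_int_new).item())   # both 1.8: the updates now match
```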
🛠️ PockEngine: System Support for SparseBP & QAS
- Problem: Existing deep learning frameworks are not optimized for on-device training.
- Solution: Move computation from runtime to compile time.
- Compile-Time AutoDiff: Derivatives computed at compile time for better optimization.
- Static Computation Graph: Enables graph-level optimizations (e.g., sparse updates, operator reordering); a generic sketch of static-graph capture appears at the end of this section.
- Code Generation: Produces optimized binaries for target devices.
- Results:
- Training speed-up: 13–21x (on ARM CPUs vs. TensorFlow/PyTorch).
- Supports on-device training across various hardware platforms (e.g., Apple M1, Raspberry Pi, smartphones).
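A generic illustration (not PockEngine's actual implementation) of the compile-time idea, using torch.fx to capture a model as a static graph once, ahead of time; that frozen representation is what graph-level optimization and code generation can work on before anything runs on the edge device. The file name is a placeholder:

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))

# Capture a static computation graph once, at compile time rather than runtime.
traced = symbolic_trace(model)
print(traced.code)                        # the frozen graph a compiler could optimize

# Lower the frozen graph into an ahead-of-time artifact for the target device.
compiled = torch.jit.trace(traced, torch.randn(1, 32))
compiled.save("model_for_device.pt")
```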
🎯 Practical Applications & Results
- On-device training in 256KB of memory: TinyTL + SparseBP + QAS + PockEngine reduces memory from hundreds of MB to just 141 KB.
- On-device LLM fine-tuning: Enabled fine-tuning of LLaMA-2 7B on edge devices.
- Latency and Memory Improvements: SparseBP reduces latency and memory usage for LLMs while maintaining performance.
- Qualitative Analysis: SparseBP fine-tuned models outperform untuned models in answering questions.
🗝️ Key Takeaways
- On-device training ensures privacy and local adaptability but faces challenges of limited memory and compute.
- Techniques like TinyTL, SparseBP, QAS, and PockEngine make on-device training feasible without significant accuracy trade-offs.
- The future of machine learning lies in optimizing for edge devices while balancing accuracy, memory, and efficiency. 🚀
