TinyML: TinyEngine
Modern AI models are becoming increasingly large, demanding substantial computational resources and memory. This creates a gap between the computational demands of these models and the capabilities of the available hardware; closing it requires reducing model size, memory footprint, and ultimately energy consumption.
This briefing reviews MIT 6.5940 EfficientML.ai Lecture 11, focusing on deploying neural networks on resource-constrained edge devices, such as microcontrollers and laptops.
The Challenge of Edge AI
Edge devices, like smartphones, microcontrollers, and laptops, bring AI closer to users but face severe resource constraints compared to cloud GPUs or desktops.
| Feature | MacBook Pro (M1 Ultra) | Microcontroller (STM32F746) |
|---|---|---|
| CPU Cores | 20 | 1 |
| Clock Rate | 3.2 GHz | 216 MHz |
| GPU Cores | 64 | N/A |
| L1 Cache | 320 KB | 8 KB |
| Memory Capacity | 64 GB | 320 KB |
| Storage Capacity | 8 TB | 1 MB |
“AI has a lot of applications … but AI has limited computational resource constraints.”
Parallel Computing Techniques
The lecture emphasizes several techniques for accelerating neural network inference on edge devices.
Loop Optimization
- Loop Reordering: Improves data locality and reduces cache misses by reordering loop iterations.
- Example: Reordering the loops from `(i, j, k)` to `(i, k, j)` in matrix multiplication improved performance roughly 12× on an Intel Xeon CPU (see the sketch after this list).
- Loop Tiling: Divides loops into smaller โtilesโ to fit working sets into cache.
- Multi-level tiling targets different cache levels (L1, L2, etc.).
- Loop Unrolling: Reduces branching overhead by replicating loop bodies.
- Tradeoff: Increased binary size but faster execution for predictable iterations.
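To make these optimizations concrete, here is a minimal C sketch of the reordered and tiled matrix multiplications; the matrix dimension, tile size, and function names are illustrative assumptions rather than values from the lecture.

```c
#define N 512     // illustrative matrix dimension
#define TILE 32   // illustrative tile size, chosen so a tile's working set fits in L1

// Reordered (i, k, j) matrix multiplication: the innermost loop walks B and C
// row by row, so memory accesses are contiguous and cache-friendly.
// C is assumed to be zero-initialized by the caller.
void matmul_ikj(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float a = A[i][k];              // reused across the entire j loop
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}

// Tiled version: each TILE x TILE block of A, B, and C stays resident in cache
// while it is being used, reducing trips to DRAM.
void matmul_tiled(float A[N][N], float B[N][N], float C[N][N]) {
    for (int i0 = 0; i0 < N; i0 += TILE)
        for (int k0 = 0; k0 < N; k0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                for (int i = i0; i < i0 + TILE; i++)
                    for (int k = k0; k < k0 + TILE; k++) {
                        float a = A[i][k];
                        for (int j = j0; j < j0 + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

Loop unrolling would then replicate the innermost loop body (for example, processing four `j` values per iteration) to cut branching overhead at the cost of a larger binary.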
SIMD Programming
Single Instruction Multiple Data (SIMD) executes one operation on multiple data points simultaneously.
- Leverages vector registers for increased throughput and energy efficiency.
“Instead of just using one instruction to operate on one piece of data … we use one instruction to work on multiple pieces of data.”
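As a concrete illustration, here is a minimal sketch using ARM NEON intrinsics; the choice of instruction set, the function name, and the assumption that `n` is a multiple of 4 are mine, not the lecture's.

```c
#include <arm_neon.h>

// Element-wise multiply-accumulate: c[i] += a[i] * b[i], four floats per instruction.
// A scalar tail loop would be needed if n were not a multiple of 4.
void madd_f32_neon(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);  // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);  // load 4 floats from b
        float32x4_t vc = vld1q_f32(c + i);  // load 4 floats from c
        vc = vmlaq_f32(vc, va, vb);         // vc += va * vb, one lane-wise instruction
        vst1q_f32(c + i, vc);               // store 4 results back
    }
}
```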
Multithreading
Allows concurrent execution of threads within a process, boosting performance.
- Tools: Pthreads and OpenMP.
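A minimal OpenMP sketch (the function and its arguments are illustrative); compiled with `-fopenmp`, the runtime splits the loop iterations across the available cores.

```c
#include <omp.h>

// Parallel vector addition: each thread handles a contiguous chunk of the loop.
void vector_add_omp(const float *a, const float *b, float *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```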
CUDA Programming
Harnesses GPUs for massive parallel computation via grids of thread blocks.
- Key Strategy: Optimize memory spaces (global, shared, private) for efficiency.
CUDA Programming Explained
CUDA programming, introduced by Nvidia in 2006, enables developers to use GPUs to accelerate computation. Below is an explanation of the key concepts:
CUDA Thread Hierarchy
CUDA programming relies on a hierarchy of threads:
- Grid: A grid is a collection of thread blocks.
- Thread Block: A thread block is a group of threads that can cooperate and share data.
- Thread: A thread is the smallest unit of execution in CUDA.
Each thread has a unique ID that can be up to three-dimensional, making it easy to handle multidimensional arrays. The thread ID, along with the block ID, specifies the part of the workload the thread is responsible for. For example, in a matrix addition operation, each thread computes the sum of corresponding elements from input matrices A and B, storing the result in output matrix C.
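A minimal sketch of such a matrix-addition kernel; the kernel name and the flat row-major layout are illustrative assumptions.

```cuda
// Each CUDA thread adds exactly one element: C[row][col] = A[row][col] + B[row][col].
// The block and thread indices together identify which element the thread owns.
__global__ void matrix_add(const float *A, const float *B, float *C,
                           int height, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {      // guard against out-of-range threads
        int idx = row * width + col;
        C[idx] = A[idx] + B[idx];
    }
}
```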
Programming Model
CUDA programs consist of two parts: host code and kernel code.
- Host Code: This code runs on the CPU and is responsible for tasks such as:
- Allocating memory on the GPU
- Copying data between the CPU and GPU
- Launching kernels
- Kernel Code: This code runs on the GPU and is executed by CUDA threads. It defines the operations each thread performs.
The host code launches a grid of thread blocks, and each block executes the kernel code in parallel. For example, launching a grid of 6 thread blocks, each containing 12 threads, would launch 72 CUDA threads simultaneously.
Memory Model
CUDA has distinct host (CPU) and device (GPU) address spaces. Data must be moved between these spaces explicitly.
For example, to perform matrix addition on the GPU:
- Allocate memory on the GPU
- Copy the input matrices from the CPU to the GPU
- Launch a kernel to perform the addition
- Copy the output matrix back from the GPU to the CPU
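A hedged sketch of that host-side workflow, assuming the `matrix_add` kernel sketched earlier; error checking and stream management are omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Host code: allocate, copy in, launch, copy out (steps 1-4 above).
void matrix_add_on_gpu(const float *hA, const float *hB, float *hC,
                       int height, int width) {
    size_t bytes = (size_t)height * width * sizeof(float);
    float *dA, *dB, *dC;

    cudaMalloc((void **)&dA, bytes);                       // 1. allocate on the GPU
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);     // 2. copy inputs to the GPU
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                    // 3. launch a grid of blocks
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    matrix_add<<<grid, block>>>(dA, dB, dC, height, width);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);     // 4. copy the result back

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```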
CUDA kernels have access to three types of memory:
- Global Memory: Readable and writable by all threads. This is the largest and slowest type of memory.
- Shared Memory: Readable and writable by all threads within a block. Faster than global memory.
- Private Memory: Readable and writable only by the thread that owns it. Fastest but smallest type.
Speeding Up Matrix Multiplication
Matrix multiplication is a common operation that can be significantly accelerated using CUDA. Each thread block works on a portion of the matrices, and each thread is responsible for computing a single element of the output matrix.
- Shared memory can be used to store tiles of the input matrices, improving data locality and reducing the number of accesses to global memory.
- Synchronization mechanisms like `__syncthreads()` ensure that all threads within a block finish loading data or performing computations before proceeding to the next step.
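A minimal sketch of that tiling pattern; the tile width and the assumption that the matrix dimension N is a multiple of the tile are illustrative.

```cuda
#define TILE 16  // illustrative tile width; one thread block computes one TILE x TILE output tile

// C = A * B for square N x N matrices, with N assumed to be a multiple of TILE.
__global__ void matmul_shared(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait until both tiles are fully loaded

        for (int k = 0; k < TILE; k++)         // multiply the tiles out of shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // wait before the tiles are overwritten
    }
    C[row * N + col] = acc;
}
```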
Tensor Cores: Higher Throughput and More Data Types
Recent GPUs include Tensor Cores, specialized hardware units designed for high-performance matrix multiplication.
- CUDA Core vs. Tensor Core:
- A CUDA core can perform one FP32 or two FP16 multiply-accumulate operations per cycle.
- A Tensor Core can complete a small FP16 matrix multiply-accumulate (e.g., a 4×4×4 tile) per cycle, offering significantly higher throughput.
- Workload Size and Performance:
- Tensor Cores are particularly effective for large matrix multiplications, with the performance gap widening as the matrix size increases.
Matrix Multiplication Intrinsics (MMA)
To leverage Tensor Cores, CUDA provides matrix multiplication intrinsics (MMA). These intrinsics perform matrix multiplications on smaller tiles, which can be combined to compute larger matrix multiplications.
- For example, a 16x16x32 matrix multiplication can be computed using 16x8x16 MMA intrinsics, breaking down the matrices into smaller tiles.
- The MMA instructions are applied iteratively to these tiles, accumulating the results into the output matrix.
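The exact 16x8x16 intrinsics are exposed at the PTX level; as a hedged illustration of the same idea, the sketch below uses CUDA's higher-level WMMA API with 16x16x16 tiles, which also maps onto Tensor Cores (compute capability 7.0+). The kernel name, layouts, and the assumption that M, N, and K are multiples of 16 are mine, not the lecture's.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B, stepping over K in chunks of 16
// and accumulating in FP32. Launch example: wmma_gemm<<<dim3(N / 16, M / 16), 32>>>(...).
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
    int tile_row = blockIdx.y * 16;   // first row of this warp's C tile
    int tile_col = blockIdx.x * 16;   // first column of this warp's C tile

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 tiles of A and B into Tensor Core fragments.
        wmma::load_matrix_sync(a_frag, A + tile_row * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col, N);
        // One Tensor Core MMA step: acc += A_tile * B_tile.
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
    wmma::store_matrix_sync(C + tile_row * N + tile_col, acc_frag, N,
                            wmma::mem_row_major);
}
```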
Note: Making full use of Tensor Cores and MMA intrinsics requires more in-depth CUDA programming resources than this briefing covers.
Inference Optimizations for CNNs
Beyond general techniques, the lecture highlights tailored optimizations for convolutional neural networks:
1. Image to Column (Im2col) Convolution
- Rearranges the input so convolution can reuse highly optimized matrix multiplication (GEMM) kernels (see the sketch after this list).
- Variation: implicit GEMM avoids the extra memory overhead of materializing the rearranged input.
2. In-Place Depth-Wise Convolution
- Reuses input buffers for output, minimizing memory usage.
- Ideal for mobile-friendly architectures.
3. Memory Layout: NHWC vs. NCHW
- NHWC (batch, height, width, channels) suits point-wise convolutions for better locality.
- NCHW (batch, channels, height, width) works well for depth-wise convolutions.
4. Winograd Convolution
- Reduces the number of multiplications by transforming inputs and filters before an element-wise product.
- Particularly effective for the 3×3 convolutions common in CNNs (e.g., about 2.25× fewer multiplications when computing 2×2 output tiles).
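A minimal C sketch of im2col for a single CHW feature map, assuming stride 1 and no padding; the function name and layout choice are illustrative. Multiplying the reshaped filter matrix by the resulting column matrix then yields the convolution output.

```c
// Unroll every K x K receptive field of a C x H x W input into one column of
// a (C*K*K) x (out_h*out_w) matrix, so convolution becomes a single GEMM.
void im2col(const float *input, float *col, int C, int H, int W, int K) {
    int out_h = H - K + 1;
    int out_w = W - K + 1;
    for (int c = 0; c < C; c++)
        for (int kh = 0; kh < K; kh++)
            for (int kw = 0; kw < K; kw++) {
                int row = (c * K + kh) * K + kw;              // row in the column matrix
                for (int oh = 0; oh < out_h; oh++)
                    for (int ow = 0; ow < out_w; ow++) {
                        int out_idx = oh * out_w + ow;        // output pixel index
                        col[row * (out_h * out_w) + out_idx] =
                            input[(c * H + (oh + kh)) * W + (ow + kw)];
                    }
            }
}
```

Implicit GEMM performs the same arithmetic without ever writing this `col` buffer to memory, which is why it avoids the extra memory overhead noted above.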
“We want to minimize the amount of kernel calls and do as much as possible within the kernel … that’s called kernel fusion.”
TinyEngine: Efficient On-Device Inference
TinyEngine is an open-source library leveraging these optimizations to enable efficient inference.
- Combines quantization with optimized kernel implementations.
- Demonstrates capabilities by deploying LLMs like LLaMA 2 on laptops.
“By combining these techniques we can deploy a large language model locally on a laptop.”
Key Takeaways
- Edge AI requires highly efficient techniques due to severe resource constraints.
- Parallel computing techniques like loop optimization, SIMD, and multithreading are vital.
- CNN-specific optimizations, such as Im2col, Winograd convolution, and memory layouts, significantly enhance performance.
- TinyEngine provides a practical framework for deploying AI models on edge devices.
“You don’t want to go to the DRAM … you want to stay in the cache as much as possible.”
This lecture underscores the critical role of efficient programming and optimization strategies in making AI accessible on diverse, resource-constrained edge platforms.
