TinyML TinyEngine

Modern AI models are becoming increasingly large, demanding substantial computational resources and memory. This creates a gap between the computational demands of these models and the available hardware capabilities. Pruning addresses this gap by reducing model size, memory footprint, and ultimately, energy consumption.

Course link

This briefing reviews MIT 6.5940 EfficientML.ai Lecture 11, focusing on deploying neural networks on resource-constrained edge devices, such as microcontrollers and laptops.

🌍 The Challenge of Edge AI

Edge devices, like smartphones, microcontrollers, and laptops, bring AI closer to users but face severe resource constraints compared to cloud GPUs or desktops.

Feature	MacBook Pro (M1 Ultra)	Microcontroller (STM32F746)
CPU Cores	20	1
Clock Rate	3.2 GHz	216 MHz
GPU Cores	64	N/A
L1 Cache	320 KB	8 KB
Memory Capacity	64 GB	320 KB
Storage Capacity	8 TB	1 MB

“AI has a lot of applications … but AI has limited computational resource constraints.”

⚙️ Parallel Computing Techniques

The lecture emphasizes several techniques for accelerating neural network inference on edge devices.

🔄 Loop Optimization

Loop Reordering: Improves data locality and reduces cache misses by reordering loop iterations.
- Example: Reordering i, j, k to i, k, j in matrix multiplication improved performance 12x on Intel Xeon CPUs.
Loop Tiling: Divides loops into smaller “tiles” to fit working sets into cache.
- Multi-level tiling targets different cache levels (L1, L2, etc.).
Loop Unrolling: Reduces branching overhead by replicating loop bodies.
- Tradeoff: Increased binary size but faster execution for predictable iterations.

🛠️ SIMD Programming

Single Instruction Multiple Data (SIMD) executes one operation on multiple data points simultaneously.

Leverages vector registers for increased throughput and energy efficiency.

“Instead of just using one instruction to operate on one piece of data … we use one instruction to work on multiple pieces of data.”

🌐 Multithreading

Allows concurrent execution of threads within a process, boosting performance.

Tools: Pthreads and OpenMP.

🎮 CUDA Programming

Harnesses GPUs for massive parallel computation via grids of thread blocks.

Key Strategy: Optimize memory spaces (global, shared, private) for efficiency.

CUDA Programming Explained ⚡

CUDA programming, introduced by Nvidia in 2006, enables developers to use GPUs to accelerate computation. Below is an explanation of the key concepts:

CUDA Thread Hierarchy 🧵

CUDA programming relies on a hierarchy of threads:

Grid: A grid is a collection of thread blocks. 🌐
Thread Block: A thread block is a group of threads that can cooperate and share data. 🧑‍🤝‍🧑
Thread: A thread is the smallest unit of execution in CUDA. 🧵

Each thread has a unique ID that can be up to three-dimensional, making it easy to handle multidimensional arrays. The thread ID, along with the block ID, specifies the part of the workload the thread is responsible for. For example, in a matrix addition operation, each thread computes the sum of corresponding elements from input matrices A and B, storing the result in output matrix C.

Programming Model 🖥️

CUDA programs consist of two parts: host code and kernel code.

Host Code: This code runs on the CPU and is responsible for tasks such as:
- Allocating memory on the GPU 💾
- Copying data between the CPU and GPU ↔️
- Launching kernels 🚀
Kernel Code: This code runs on the GPU and is executed by CUDA threads. It defines the operations each thread performs.

The host code launches a grid of thread blocks, and each block executes the kernel code in parallel. For example, launching a grid of 6 thread blocks, each containing 12 threads, would launch 72 CUDA threads simultaneously.

Memory Model 🧠

CUDA has distinct host (CPU) and device (GPU) address spaces. Data must be moved between these spaces explicitly.

For example, to perform matrix addition on the GPU:

Allocate memory on the GPU 🖥️
Copy the input matrices from the CPU to the GPU 📤📥
Launch a kernel to perform the addition 🔥
Copy the output matrix back from the GPU to the CPU 🔄

CUDA kernels have access to three types of memory:

Global Memory: Readable and writable by all threads. This is the largest and slowest type of memory. 🏋️‍♂️
Shared Memory: Readable and writable by all threads within a block. Faster than global memory. ⚡
Private Memory: Readable and writable only by the thread that owns it. Fastest but smallest type. 🚀

Speeding Up Matrix Multiplication ➗

Matrix multiplication is a common operation that can be significantly accelerated using CUDA. Each thread block works on a portion of the matrices, and each thread is responsible for computing a single element of the output matrix.

Shared memory can be used to store tiles of the input matrices, improving data locality and reducing the number of accesses to global memory. 📊
Synchronization mechanisms like __syncthreads() ensure that all threads within a block finish loading data or performing computations before proceeding to the next step. ⏳

Tensor Cores: Higher Throughput and More Data Types ⚙️

Recent GPUs include Tensor Cores, specialized hardware units designed for high-performance matrix multiplication.

CUDA Core vs. Tensor Core:
- A CUDA core can perform one FP32 or two FP16 multiply-accumulate operations per cycle.
- A Tensor Core can complete an entire matrix multiplication in FP16 per cycle, offering significantly higher throughput. 📈
Workload Size and Performance:
- Tensor Cores are particularly effective for large matrix multiplications, with the performance gap widening as the matrix size increases. 🚀

Matrix Multiplication Intrinsics (MMA) 🔢

To leverage Tensor Cores, CUDA provides matrix multiplication intrinsics (MMA). These intrinsics perform matrix multiplications on smaller tiles, which can be combined to compute larger matrix multiplications.

For example, a 16x16x32 matrix multiplication can be computed using 16x8x16 MMA intrinsics, breaking down the matrices into smaller tiles.
The MMA instructions are applied iteratively to these tiles, accumulating the results into the output matrix. 🔄

Note: In-depth resources on CUDA programming are needed to learn how to specifically utilize Tensor Cores and MMA intrinsics.

🔍 Inference Optimizations for CNNs

Beyond general techniques, the lecture highlights tailored optimizations for convolutional neural networks:

1. Image to Column (Im2col) Convolution

Rearranges input data to use optimized matrix multiplication kernels.
Variation: Implicit GEMM reduces additional memory overhead.

2. In-Place Depth-Wise Convolution

Reuses input buffers for output, minimizing memory usage.
Ideal for mobile-friendly architectures.

3. Memory Layout: NHWC vs. NCHW

NHWC (batch, height, width, channels) suits point-wise convolutions for better locality.
NCHW (batch, channels, height, width) works well for depth-wise convolutions.

4. Winograd Convolution

Reduces multiplications by applying transformations to inputs and filters.
Effective for larger kernel sizes.

“We want to minimize the amount of Kernel calls and do as much as possible within the Kernel … that’s called Kernel fusion.”

🛠️ TinyEngine: Efficient On-Device Inference

TinyEngine is an open-source library leveraging these optimizations to enable efficient inference.

Combines quantization with optimized kernel implementations.
Demonstrates capabilities by deploying LLMs like LLaMA 2 on laptops.

“By combining these techniques we can deploy a large language model locally on a laptop.”

🎯 Key Takeaways

Edge AI requires highly efficient techniques due to severe resource constraints.
Parallel computing techniques like loop optimization, SIMD, and multithreading are vital.
CNN-specific optimizations, such as Im2col, Winograd convolution, and memory layouts, significantly enhance performance.
TinyEngine provides a practical framework for deploying AI models on edge devices.

“You don’t want to go to the DRAM … you want to stay in the cache as much as possible.”

This lecture underscores the critical role of efficient programming and optimization strategies in making AI accessible on diverse, resource-constrained edge platforms.