📚 Hardware Acceleration

Course Link

This document reviews the main themes and key takeaways from Deep Learning Systems: Algorithms and Implementation** at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.

I. Introduction and Motivation

alt_text

⚡ Necessity of Acceleration

The increasing computational demands of large models and datasets require hardware accelerators like GPUs.

“In all cases, we need accelerations to keep up with the computing demand brought by big models and datasets.”

🛠️ Understanding Low-Level Details

Understanding hardware acceleration techniques offers insights into why code runs fast or slow, and how high-level machine learning code translates to hardware instructions.

“It’s cool to know how high-level ML code from linear algebra translates to real instructions.”

🧩 Custom Operator Implementation

Hardware acceleration knowledge enables custom operator implementation on specific hardware.

“If I want to develop a new operator, how can I implement it on the available acceleration hardware?”

II. Layers in Machine Learning Frameworks

🏗️ Hierarchical Structure

ML frameworks consist of layers, including computational graphs, tensor linear algebra libraries, and underlying hardware.

“The highest layer involves tensor linear algebra and computational graphs.”

🔢 Tensor Linear Algebra

Libraries manage multi-dimensional arrays (tensors) and perform arithmetic operations.

“Tensor algebra libraries let us create arrays and run operations like matrix multiplication and addition.”

🔧 Need for Optimization

Different environments (CPUs, GPUs, mobile) need optimized tensor libraries.

“How can we optimize tensor algebra libraries for different environments?”

III. General Acceleration Techniques

📊 Vectorization

Concept: Uses special hardware instructions (vector registers) to perform operations on multiple data points simultaneously.

“Modern CPUs have vector instructions to load contiguous memory into vector registers for parallel arithmetic.”

Alignment: Memory must align (e.g., 16-byte multiples).

“Floating-point data must align to multiples of four.”

Aligned Allocation: Memory allocation must align to maximize vectorization.

“We use aligned allocation instead of default memory allocation.”

alt_text

🗂️ Data Layout and Strides

Row vs. Column Major: Ways to store arrays in flat memory.

Row Major: Stores data row by row.
Column Major: Stores data column by column.

“Row major stores elements row-wise, while column major stores them column-wise.”

Strides: Defines offset for incrementing dimensions.

“Strides generalize data layouts beyond row/column major.”

Advantages: Enables zero-copy operations (slicing, transposing).

“Strides allow flexible data transformations without extra memory.”

Disadvantages: Can hinder vectorization.

“Strided arrays may slow down vectorized operations, requiring compaction.”

Contiguous Arrays: Operations often require contiguous memory.

“Libraries offer functions like as_contiguous to reorganize arrays.”

🧵 Parallelization

Concept: Divides tasks across CPU cores/threads.

“Loops can be parallelized by assigning tasks to different CPU cores.”

OpenMP: Library for parallelism (#pragma omp parallel for).

IV. Case Study: Matrix Multiplication

✏️ Vanilla Implementation

Complexity: O(n³) operations.

“Matrix multiplication involves O(n³) computations.”

Surprise: Many BLAS libraries still use O(n³) implementations.

“Most BLAS libraries stick with vanilla O(n³) methods.”

📌 Concept

🎯 Goal: Compute the product of two matrices, A and B, to obtain matrix C.
🧮 Formula: Each element $( C[i][j] )$ is the dot product of the i-th row of A and the j-th column of B.
🔢 Mathematical Representation:
$[ C_{ij} = \sum_k A_{ik} \times B_{kj} ]$ where k represents the summation over the shared dimension.

⚙️ Process Breakdown

🔄 Nested Loops:
- The outer two loops iterate over rows ( $( i )$ ) and columns ( $( j )$ ) of matrix C.
- The innermost loop calculates the dot product by iterating over $( k )$.
🏗️ Initialization:
$( C[i][j] )$ is initialized to zero.
For each $( k )$, the product of $( A[i][k] )$ and $( B[k][j] )$ is added to $( C[i][j] )$.

# A basic implementation in pseudocode is shown below
for i = 0 to n:
  for j = 0 to n:
    C[i][j] = 0
    for k = 0 to n:
      C[i][j] += A[i][k] * B[j][k]

📊 Example

🧩 Matrices:
Let’s consider two 2x2 matrices A and B.
$[ A = \begin{bmatrix} a_{11} & a_{12} \ a_{21} & a_{22} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & b_{12} \ b_{21} & b_{22} \end{bmatrix} ]$
🧩 Computation:
Following the vanilla algorithm, the resulting matrix $( C = A \times B )$ is computed as:
$[ C = \begin{bmatrix} c_{11} & c_{12} \ c_{21} & c_{22} \end{bmatrix} ]$ where:
$[ c_{11} = a_{11} \cdot b_{11} + a_{12} \cdot b_{21} ]$ $[ c_{12} = a_{11} \cdot b_{12} + a_{12} \cdot b_{22} ]$ $[ c_{21} = a_{21} \cdot b_{11} + a_{22} \cdot b_{21} ]$ $[ c_{22} = a_{21} \cdot b_{12} + a_{22} \cdot b_{22} ]$

⏱️ Complexity

🚀 Time Complexity:
$[ O(n^3) ]$
- Three nested loops iterating $( n )$ times each.

🧠 Memory Considerations

⚡ Memory Hierarchy:
- Data access times differ depending on the storage level:
  - 💾 DRAM (slowest)
  - 🔐 L2 Cache
  - 🏎️ L1 Cache (faster)
  - 🚀 Registers (fastest)
📈 Optimization Insight:
- The vanilla implementation doesn’t optimize for memory hierarchy.
- Tiling and other sophisticated techniques improve efficiency by leveraging faster memory levels.

alt_text

🎯 Key Takeaway

Simple but Inefficient:
Vanilla matrix multiplication is conceptually easy but may not be the most efficient approach.
Memory access patterns and tiling strategies can significantly enhance performance.

🧱 Memory Hierarchy Importance

Latency: Memory access times vary (e.g., DRAM vs. L1 cache).

“DRAM access takes 200ns, but L1 cache takes only 0.5ns.”

Goal: Optimize memory access to use faster caches.

🏗️ Architecture Aware Analysis

Simplified Model: Focus on DRAM and registers.
Cost: DRAM-to-register load dominates (2n³).
Registers: Only 3 registers (A, B, C) in the basic model.

alt_text

🧮 Register Tiled Implementation

Tiling: Computes submatrices instead of single elements.

“We divide matrices into submatrices to reuse data and reduce memory load.”

Reduced Load: DRAM load drops to n³/v1 + n³/v2.

The register tiled matrix multiplication is an optimization technique that enhances matrix multiplication performance by dividing matrices into smaller sub-matrices (tiles) that fit into registers – the fastest CPU memory. This reduces memory access to slower levels like DRAM by reusing data in registers.

alt_text

🎯 Concept:

🔹 Instead of computing one element at a time (like vanilla implementation), register tiling computes sub-matrices.
🔹 Smaller matrix blocks are loaded into registers, computations are performed on these blocks, and results are stored back into memory. This allows data reuse while in fast register memory.

⚙️ Implementation Details:

📏 Matrices are divided into blocks of size:
- 🔹 $(v1 \times v3)$ for $(A)$
- 🔹 $(v2 \times v3)$ for $(B)$
- 🔹 $(v1 \times v2)$ for $(C)$
🔄 Nested loops iterate over sub-matrices.
📐 The innermost loop performs a dot product of $(A)$ and $(B)$ sub-matrices, adding to $(C)$.
🔢 $(v1, v2, v3)$ are tiling factors based on hardware register space, chosen to maximize reuse.

💻 Pseudocode:

for i = 0 to n/v1:
  for j = 0 to n/v2:
    register float c[v1][v2] = 0
    for k = 0 to n/v3:
      register float a[v1][v3] = A[i][k]
      register float b[v2][v3] = B[j][k]
      c += dot(a, b.T)
    C[i][j] = c

🔹 $(n)$ – matrix dimension
🔹 $(v1, v2, v3)$ – tiling factors

📊 Memory Access Analysis:

💾 Data Loading Cost:
- 🔹 $(A: \frac{n^3}{v2})$
- 🔹 $(B: \frac{n^3}{v1})$
- ✅ Data reuse reduces loading cost.
📦 Register Cost:
- 🔹 $(A: v1 \times v3)$
- 🔹 $(B: v2 \times v3)$
- 🔹 $(C: v1 \times v2)$

📈 Example:

🧩 Multiply two $(n \times n)$ matrices $(A)$ and $(B)$, with $(n = 6)$.
🧩 Tiling factors: $(v1 = 2, v2 = 3, v3 = 1)$.
🔹 $(A)$ is tiled into $(2 \times 1)$, $(B)$ into $(3 \times 1)$, and $(C)$ into $(2 \times 3)$.
🔄 Submatrices are multiplied using nested loops, dot products occur in registers, and results are saved to memory.
Matrices:
We are multiplying two square matrices $(A)$ and $(B)$, each of size $(6 \times 6)$.
The resulting matrix $(C)$ will also be of size $(6 \times 6)$.
Tiling Factors:
- $(v1 = 2)$ – The tile size for matrix $(A)$’s rows and matrix $(C)$’s rows.
- $(v2 = 3)$ – The tile size for matrix $(B)$’s columns and matrix $(C)$’s columns.
- $(v3 = 1)$ – The tile size along the shared dimension (dot product dimension).

🔹 Tiling Breakdown:

Matrix $(A)$:
- $(A)$ is divided into submatrices of size $(2 \times 1)$.
- Since $(n = 6)$, we need $(6 / 2 = 3)$ tiles along the rows and $(6 / 1 = 6)$ tiles along the columns.
- This results in $(3 \times 6)$ tiles of size $(2 \times 1)$.
Matrix $(B)$:
- $(B)$ is divided into submatrices of size $(3 \times 1)$.
- We need $(6 / 3 = 2)$ tiles along the rows and $(6 / 1 = 6)$ tiles along the columns.
- This results in $(2 \times 6)$ tiles of size $(3 \times 1)$.
Matrix $(C)$:
- $(C)$ is divided into submatrices of size $(2 \times 3)$.
- We need $(6 / 2 = 3)$ tiles along the rows and $(6 / 3 = 2)$ tiles along the columns.
- This results in $(3 \times 2)$ tiles of size $(2 \times 3)$.

🔄 Calculation Walkthrough:

Loop Structure:
1. Outer Loop – Iterates over $(C)$’s row tiles (3 tiles of height 2).
2. Middle Loop – Iterates over $(C)$’s column tiles (2 tiles of width 3).
3. Inner Loop – Iterates over the shared dimension (6 tiles of width 1).
Submatrix Multiplication:
- For each tile of $(C[i][j])$ (of size $(2 \times 3)$), the algorithm performs:
  - Load a $(2 \times 1)$ submatrix of $(A)$.
  - Load a $(3 \times 1)$ submatrix of $(B)$.
  - Perform the dot product to update a $(2 \times 3)$ submatrix of $(C)$.
- This process is repeated across the shared dimension $(k)$, accumulating the results.

🔢 Example Calculation (with Numbers):

Suppose:
$[ A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16}
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26}
a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36}
a_{41} & a_{42} & a_{43} & a_{44} & a_{45} & a_{46}
a_{51} & a_{52} & a_{53} & a_{54} & a_{55} & a_{56}
a_{61} & a_{62} & a_{63} & a_{64} & a_{65} & a_{66} \end{bmatrix} ]$
$[ B = \begin{bmatrix} b_{11} & b_{12} & b_{13} & b_{14} & b_{15} & b_{16}
b_{21} & b_{22} & b_{23} & b_{24} & b_{25} & b_{26}
b_{31} & b_{32} & b_{33} & b_{34} & b_{35} & b_{36}
b_{41} & b_{42} & b_{43} & b_{44} & b_{45} & b_{46}
b_{51} & b_{52} & b_{53} & b_{54} & b_{55} & b_{56}
b_{61} & b_{62} & b_{63} & b_{64} & b_{65} & b_{66} \end{bmatrix} ]$

Using $(v1 = 2, v2 = 3, v3 = 1)$:

The first tile from $(A)$ is:
$[ A_{00} = \begin{bmatrix} a_{11}
a_{21} \end{bmatrix} ]$
The first tile from $(B)$ is:
$[ B_{00} = \begin{bmatrix} b_{11}
b_{21}
b_{31} \end{bmatrix} ]$
Their dot product gives part of:
$[ C_{00} = \begin{bmatrix} c_{11} & c_{12} & c_{13}
c_{21} & c_{22} & c_{23} \end{bmatrix} ]$

This process continues across all tiles until the entire matrix $(C)$ is computed.

🌟 Benefits:

📉 Reduced Memory Access: Data reuse in registers lowers DRAM-register data transfer.
📊 Memory Hierarchy: Keeps data in registers (faster memory) longer.

Reuse in Action:

Load a $(2 \times 1)$ tile from $(A)$. This tile is reused to compute 3 columns of the corresponding $(2 \times 3)$ tile in $(C)$.
Load a $(3 \times 1)$ tile from $(B)$. This tile is reused to compute 2 rows of the corresponding $(2 \times 3)$ tile in $(C)$.

For example:

The first tile of $(A)$:
$[ A_{00} = \begin{bmatrix} a_{11}
a_{21} \end{bmatrix} ]$
is reused for every tile of $(B)$ along the shared dimension.
Similarly, the tile from $(B)$:
$[ B_{00} = \begin{bmatrix} b_{11}
b_{21}
b_{31} \end{bmatrix} ]$
is reused for every tile of $(A)$ along the shared dimension.

At the start of the outer loop:

Load a $(2 \times 1)$ tile from $(A)$. Load a $(3 \times 1)$ tile from $(B)$. In the inner loop:

Perform a dot product between the two tiles. This contributes to one element of the resulting $(2 \times 3)$ tile in $(C)$. The same tiles in registers are reused to compute multiple columns/rows of $(C)$. Only after the full inner loop completes:

The result tile is stored back into $(C)$ in DRAM.

⚠️ Considerations:

🎯 Choosing Tiling Factors:
- 🔹 Factors must reduce memory cost but fit within register space.
- 🔹 Experimentation may be needed for optimal values.
- 🔹 $(v3)$ can be set to $(1)$ as it doesn’t affect loading cost.

🆚 Comparison to Vanilla Multiplication:

🔹 Vanilla Multiplication:
- Computes each element individually, leading to frequent DRAM-register data transfers.
🔹 Register Tiling:
- Reduces memory loads, significantly improving efficiency.

🚀 Conclusion:

Register tiled matrix multiplication optimizes performance by leveraging memory hierarchy and data reuse. Tiling matrices to fit in registers boosts matrix multiplication speed, utilizing the CPU’s fastest memory.

🗄️ Cache Line Aware Tiling

alt_text L1 Cache: Data first loads into L1 cache, then into registers.
Blocking Factors: Use blocking (b1, b2) to reduce DRAM access.

“Blocking factors b1 and b2 determine submatrix sizes loaded from DRAM to L1.”

Cache line aware tiling is a further optimization technique for matrix multiplication that builds upon register tiling by incorporating the concept of cache memory, specifically the L1 cache, into the data access strategy. This approach recognizes that accessing data from the L1 cache is much faster than from DRAM, and it aims to keep frequently accessed data within the L1 cache to reduce memory access times.

🔹 Cache line aware tiling introduces an additional level of blocking (tiling) that is aware of the L1 cache, reducing data transfers between DRAM and L1 cache.
🔹 The matrix is divided into blocks that fit within the L1 cache. A block of data is loaded into the L1 cache, used for computations, and replaced by the next block, minimizing slower DRAM accesses.
🔹 Register tiling can still be applied inside each L1 cache block to further optimize computations.

⚙️ Implementation Details:

🔄 The algorithm involves nested loops that iterate over sub-matrices, similar to register tiling, but with additional outer loops for loading data from DRAM into L1 cache.
🧩 Step 1: A block of matrix $(A)$ is loaded from DRAM into the L1 cache.
🧩 Step 2: A loop iterates over blocks of matrix $(B)$, loading these into the L1 cache.
🧩 Step 3: The dot product between these blocks is computed and stored. Register tiling is applied within the sub-procedure by loading sub-blocks into registers.
🔹 Two levels of tiling are used:
- L1 cache-aware tiling with factors $(b1)$ and $(b2)$.
- Register-aware tiling within L1 cache blocks using factors $(v1)$ and $(v2)$.

🧩 Putting it All Together

🔄 Multi-Level Tiling

Putting it all together in the context of matrix multiplication refers to combining both cache-aware tiling and register tiling to maximize computational efficiency by leveraging the memory hierarchy. This approach recognizes that memory access times vary greatly, with registers being the fastest, followed by L1 cache, and then DRAM. By combining tiling techniques that manage data locality at each level of this hierarchy, a more optimized matrix multiplication can be achieved.

alt_text

Final Cost: Includes L1-to-register and DRAM-to-L1 transfers.

🔹 Cache-Aware Tiling:

This is the outer level of tiling and focuses on moving data between DRAM and the L1 cache.
The matrices are divided into blocks based on the size of the L1 cache.
The tiling factors here are $b1$ and $b2$.
Blocks of matrix $A$ with dimensions $b1 \times n$ are loaded into the L1 cache.
Blocks of matrix $B$ with dimensions $b2 \times n$ are also loaded into L1 cache.
This step reduces the number of accesses to DRAM, which is the slowest memory.

🔹 Register Tiling:

This is the inner level of tiling and is applied within the L1 cache blocks to manage data movement between L1 cache and registers.
The blocks within the L1 cache are further divided into smaller sub-blocks to fit within registers.
The tiling factors here are $v1$ and $v2$.
Sub-blocks of matrix $A$ with dimensions $v1$ are loaded into registers.
Sub-blocks of matrix $B$ with dimensions $v2$ are loaded into registers.
The actual dot product computations occur on these register-resident sub-blocks.
This step reduces the number of accesses to the L1 cache.

⚙️ Implementation Details:

The combined implementation involves nested loops that iterate over the sub-matrices defined by both the L1 cache tiling factors $(b1, b2)$ and the register tiling factors $(v1, v2).
The outermost loops iterate over the L1 cache blocks, loading them from DRAM to L1 cache.
The inner loops perform the dot product within each L1 cache block, using register tiling.

🧑‍💻 Pseudocode:

for i = 0 to n/b1:
  l1cache float a[b1/v1][n][v1] = A[i]
  for j = 0 to n/b2:
    l1cache float b[b2/v2][n][v2] = B[j]
    for x = 0 to b1/v1
      for y = 0 to b2/v2
        register float c[v1][v2] = 0
        for k = 0 to n
           register float ar[v1] = a[x][k][:]
           register float br[v2] = b[y][k][:]
           c += dot(ar, br.T)

$n$ is the matrix dimension,
$b1, b2$ are the cache tiling factors, and
$v1, v2$ are the register tiling factors.

DRAM to L1 Cache Load Cost:

The cost of loading matrix $A$ from DRAM to L1 cache is approximately $n^2$.
The cost of loading matrix $B$ from DRAM to L1 cache is approximately $n^3 / b1$.

L1 Cache to Register Load Cost:

The cost of loading $A$ from L1 cache to register is approximately $n^3 / v2$.
The cost of loading $B$ from L1 cache to register is approximately $n^3 / v1$.

Total Load Cost:

The total load cost combines the DRAM to L1 cache and L1 cache to register loads:
$l1speed \times (n^3 / v2 + n^3 / v1) + dramspeed \times (n^2 + n^3 / b1)$.

⚠️ Constraints:

The sizes of the L1 cache blocks $(b1$ and $b2)$ must be chosen so that $b1 \times n + b2 \times n$ is less than the size of the L1 cache.
$b1$ must be a multiple of $v1$, and $b2$ must be a multiple of $v2$.

🚀 Key Benefits:

Reduced Memory Access Time: By using L1 cache and registers effectively, this method minimizes the number of accesses to slow DRAM, which significantly speeds up the computation.
Data Reuse: Data loaded into the L1 cache is reused multiple times, and data in registers is reused even more. The reuse of ‘a’ elements is along the $j$ dimension, with a reuse factor of $v2$, and the reuse of ‘b’ elements is along the $i$ dimension, with a reuse factor of $v1$.
Combined Optimization: By combining L1 cache and register tiling, this method leverages the entire memory hierarchy, leading to maximum performance gains.

❓ Why don’t directly copy to register but to cache first and then register?

The reason for copying data to the L1 cache before loading it into registers instead of directly loading it into registers involves the structure of modern memory hierarchies and the optimization of memory access patterns. Here are the key reasons:

Memory Hierarchy Efficiency
Registers are extremely fast but limited in size, whereas the L1 cache is larger and faster than DRAM. By loading data into the L1 cache first, it can be reused multiple times before accessing slower memory, improving overall performance.
Cache Line Access
CPUs fetch data in cache lines (typically 64 bytes). It’s more efficient to load entire cache lines into the L1 cache, after which smaller parts can be loaded into registers for computation, reducing DRAM accesses.
Data Reuse and Coalescing
Once data is in the L1 cache, it can be reused without repeatedly accessing DRAM. This optimizes memory bandwidth and reduces latency.
Minimizing Register Pressure
Registers are limited in number, so using the L1 cache as a buffer allows the CPU to handle larger data sets while keeping the register space efficient.
Efficient Data Loading and Eviction
The L1 cache is managed by the CPU’s cache controller. It automatically handles evicting and reusing data, which is less complex than manually managing registers.
Separation of Concerns (CPU Design)
CPUs are optimized to load data in large blocks to the cache. After that, registers can be used for finer-grained computations, allowing the system to leverage the multi-level memory hierarchy efficiently.

Loading data into the L1 cache first allows CPUs to exploit data locality, reduce memory access costs, and manage the limited register space more efficiently, leading to better overall performance in matrix multiplication and other computations.

💡 Key Insight: Memory Load Reuse

Reuse Focus: Maximizing data reuse minimizes memory load times.

“Efficient algorithms load data once and reuse it multiple times.”

Loop Reuse: Loop tiling increases reuse by isolating dimensions.

“Tiling the J-dimension by v enables reuse of A for v times.”

V. Conclusion

🔍 Summary of Techniques

Vectorization
Data Layout (Row/Column Major, Strides)
Parallelization
Matrix Multiplication Optimization

📈 Hardware Awareness

Understanding hardware (memory hierarchy, caches) is essential for optimization.

🔁 Reuse Principle

Final Insight: Load data once and reuse it often for maximum performance.

“Memory reuse is key to acceleration techniques.”

Matrix Multiplication Example

Consider the matrix multiplication $C = A \times B^T$, where $C[i][j] = \sum(A[i][k] \times B[j][k], \text{ axis } k)$.
In this calculation, the access of elements in matrix $A$, specifically $A[i][k]$, is independent of the index $j$. This means that for a given row $i$ of $A$, the same elements can be reused across multiple $j$ indices when computing elements in $C$.
By tiling the $j$ dimension, the elements of $A$ can be loaded into faster memory and reused multiple times while iterating over different values of $j$. This is done by using a tiling factor ‘$v$’ along the $j$ dimension, which enables the reuse of the same element of $A$ ‘$v$’ times.
Similarly, the access of elements in matrix $B$, specifically $B[j][k]$, is independent of the index $i$. Thus, tiling along the $i$ dimension allows for reuse of elements of $B$.

How Tiling Enables Reuse

When you tile the $j$ dimension, you effectively load a block of $A$ into a faster level of memory and then iterate through the $j$ dimension as needed to compute the results. Because the access to elements in $A$ are independent of $j$, they can be reused across all computations involving the elements in that block.
For example, when using register tiling, blocks of $A$ can be loaded into registers and reused for the computations for all elements within a block of $C$. When using cache-aware tiling, blocks of $A$ can be loaded into L1 cache and reused for computations involving all elements of the tile in the $j$ dimension.