πŸ“š Convolutional Networks Implementation

This document reviews the main themes and key takeaways from **Deep Learning Systems: Algorithms and Implementation** at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.


πŸ–₯️ Implementing Convolutions in Code

The lecture focuses on how to implement convolutions in code, moving from basic for-loops to more efficient matrix operations. The implementation is done live in a notebook, and the presenter encourages the audience to code along. πŸŽ“

Key Concepts Discussed πŸ“š

πŸ” Storage Order

  • How tensors involved in convolutions are stored in memory:
    • Input Images: Stored in NHWC format (Batch, Height, Width, Channels).
    • Kernel Weights: Stored in K x K x C_in x C_out format.
    • This differs from PyTorch’s default NCHW format, which the presenter argues is less intuitive for convolution operations.

    Details:

    • Input images are stored as a 4D tensor:
      • First dimension: batches
      • Second: height of the image
      • Third: width of the image
      • Fourth: channels
    • Kernel weights are stored as a 4D tensor:
      • Kernel height
      • Kernel width
      • Input channels
      • Output channels

πŸ”— Reference Implementation

  • PyTorch is used as a reference to verify the correctness of custom convolution implementations.
  • Tensors are converted between NHWC (lecture) and NCHW (PyTorch).
import numpy as np
import torch
import torch.nn as nn

def conv_reference(Z, weight):
    # NHWC -> NCHW
    Z_torch = torch.tensor(Z).permute(0,3,1,2)
    
    # KKIO -> OIKK
    W_torch = torch.tensor(weight).permute(3,2,0,1)
    
    # run convolution
    out = nn.functional.conv2d(Z_torch, W_torch)
    
    # NCHW -> NHWC
    return out.permute(0,2,3,1).contiguous().numpy()

Z = np.random.randn(10,32,32,8)
W = np.random.randn(3,3,8,16)
out = conv_reference(Z,W)
print(out.shape)  # (10, 30, 30, 16): output spatial size is (H-K+1) x (W-K+1)

🐒 Naive Convolution

  • First implementation uses nested for-loops:
    • Iterates over batches, channels, and image positions to perform the convolution.
    • Slow but useful for understanding the process.
def conv_naive(Z, weight):
    N,H,W,C_in = Z.shape
    K,_,_,C_out = weight.shape
    
    # "valid" convolution: the output is (H-K+1) x (W-K+1) spatially
    out = np.zeros((N,H-K+1,W-K+1,C_out))
    for n in range(N):                          # batch
        for c_in in range(C_in):                # input channel
            for c_out in range(C_out):          # output channel
                for y in range(H-K+1):          # output row
                    for x in range(W-K+1):      # output column
                        for i in range(K):      # kernel row
                            for j in range(K):  # kernel column
                                out[n,y,x,c_out] += Z[n,y+i,x+j,c_in] * weight[i,j,c_in,c_out]
    return out

out2 = conv_naive(Z,W)
print(np.linalg.norm(out - out2))

πŸ—οΈ Convolution as Matrix Multiplication

  • Explains that a convolution with a 1x1 filter is equivalent to matrix multiplication.
  • Steps:
    1. Reshape the input tensor and filter weights into matrices.
    2. Perform convolution using matrix multiplication.
  • Advantages:
    • Significantly faster than naive for-loops.
    • Leverages optimized matrix multiplication routines.

    Implementation Details:

    • Two-layer deep for-loop iterates over kernel locations, using matrix multiplication for the rest.
def conv_matrix_mult(Z, weight):
    N,H,W,C_in = Z.shape
    K,_,_,C_out = weight.shape
    out = np.zeros((N,H-K+1,W-K+1,C_out))
    
    for i in range(K):
        for j in range(K):
            # each kernel offset (i,j) contributes a (C_in x C_out) matmul applied at every output pixel
            out += Z[:,i:i+H-K+1,j:j+W-K+1,:] @ weight[i,j]
    return out
    
Z = np.random.randn(100,32,32,8)
W = np.random.randn(3,3,8,16)

out = conv_reference(Z,W)
out2 = conv_matrix_mult(Z,W)
print(np.linalg.norm(out - out2))
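
To make the 1x1 claim above concrete, here is a minimal sketch (reusing the conv_reference helper defined earlier) showing that a 1x1 convolution is a single matrix multiplication over the channel dimension:

Z = np.random.randn(10,32,32,8)
W1 = np.random.randn(1,1,8,16)

out_ref = conv_reference(Z, W1)
out_mm = Z @ W1[0,0]            # matmul broadcasts over the N, H, W dimensions
print(np.linalg.norm(out_ref - out_mm))  # ~0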

πŸš€ im2col (Image to Column) Operation

  • The most efficient implementation uses the im2col technique:
    • Rearranges input data into a matrix for convolution via matrix multiplication.
    • Utilizes numpy’s as_strided function to create a view of the input tensor containing overlapping blocks without copying data.

    Steps:

    1. Use as_strided to create overlapping blocks in memory.
    2. Apply ascontiguousarray to create a memory copy laid out contiguously for matrix multiplication.
    3. Multiply the reshaped input matrix by reshaped filter weights.

    Notes:

    • as_strided is essentially free (no memory copy).
    • reshape involves more complex operations to compact the array.

What are strides?

Strides are a tuple of integers that represent the number of bytes you need to skip in memory to move from one element to the next along each dimension of the array. For a 1D array, the stride is simply the size of each element in bytes. For multi-dimensional arrays, the strides define how to move through the array’s memory layout to access elements in different dimensions.

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(arr.strides)  # Output: (12, 4)
  β€’ In this example, arr is a 2D array of 32-bit integers (4 bytes each); with NumPy's default int64 dtype the strides would be (24, 8) instead.
  β€’ The stride (12, 4) tells us that:
    β€’ To move to the next element in the same row (along the columns), we need to skip 4 bytes (the size of one element).
    β€’ To move to the next row (along the rows), we need to skip 12 bytes (3 columns * 4 bytes per element).

Tiling example

import numpy as np

n = 6
A = np.arange(n**2, dtype=np.float32).reshape(n,n)
print(A)

"""
[[ 0.  1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10. 11.]
 [12. 13. 14. 15. 16. 17.]
 [18. 19. 20. 21. 22. 23.]
 [24. 25. 26. 27. 28. 29.]
 [30. 31. 32. 33. 34. 35.]]
"""

import ctypes
print(np.frombuffer(ctypes.string_at(A.ctypes.data, A.nbytes), dtype=A.dtype, count=A.size))

"""
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.]
"""

# Tiling: view A as a 3x3 grid of 2x2 blocks.
# Element strides are (block-row: 12, block-col: 2, within-block row: 6, within-block col: 1),
# multiplied by the 4-byte float32 item size to get byte strides.
B = np.lib.stride_tricks.as_strided(A, shape=(3,3,2,2), strides=np.array((12,2,6,1))*4)
print(B)
"""
2x2 tiling block
[[[[ 0.  1.]
   [ 6.  7.]]

  [[ 2.  3.]
   [ 8.  9.]]
...
]
"""
print(np.frombuffer(ctypes.string_at(B.ctypes.data, size=B.nbytes), B.dtype, B.size))
"""
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.]
"""

C = np.ascontiguousarray(B)
print(np.frombuffer(ctypes.string_at(C.ctypes.data, size=C.nbytes), C.dtype, C.size))
"""
[ 0.  1.  6.  7.  2.  3.  8.  9.  4.  5. 10. 11. 12. 13. 18. 19. 14. 15.
 20. 21. 16. 17. 22. 23. 24. 25. 30. 31. 26. 27. 32. 33. 28. 29. 34. 35.]
"""

print(C.strides)
"""
C will be compact as we can see from the strides
(48, 16, 8, 4)
"""

im2col for multi-channel convolutions

The im2col operation is a technique used to transform the input image into a matrix, so that the entire convolution can be performed with a single matrix multiplication. This is more efficient than the nested for-loops or the per-kernel-offset matrix multiplication above, both of which still require Python-level iteration. πŸš€

πŸ› οΈ How im2col Works
  • Core Idea:
    • Creates a matrix where each column represents a flattened receptive field (a small block) from the input image.
    • For multi-channel images, the receptive field includes all channels for that spatial block.
  • Key Steps:
    1. Use the as_strided function to create a new view of the input tensor with custom strides.
      • This allows extraction of overlapping blocks from the input image without copying data.
    2. Set the shape and strides parameters of as_strided carefully:
      • Ensures each block of the tensor represents an overlapping receptive field of the input, including all channels.
    3. Reshape the tensor to combine spatial dimensions of the receptive field (kernel size x kernel size) and input channels into a single dimension.
      • Forms the columns of the im2col matrix.
    4. Reshape the filter weights to make them compatible with the im2col matrix multiplication.
    5. Perform the convolution using matrix multiplication of the im2col matrix and the reshaped filter weights.
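
Before the full multi-channel version below, here is a minimal single-image, single-channel sketch of steps 1-2 (shapes and names are illustrative): the strided view packs every overlapping K x K window of the input without copying any data.

H, K = 6, 3
Z1 = np.arange(H*H, dtype=np.float32).reshape(H, H)
Hs, Ws = Z1.strides

# (output row, output col, kernel row, kernel col) view of all overlapping K x K windows
blocks = np.lib.stride_tricks.as_strided(Z1, shape=(H-K+1, H-K+1, K, K),
                                         strides=(Hs, Ws, Hs, Ws))

# each block is exactly the receptive field at that output location
print(np.array_equal(blocks[2,1], Z1[2:2+K, 1:1+K]))  # True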
πŸŒ€ Duplicate Items in im2col
  • Why Duplicates Exist:
    • The im2col operation creates overlapping receptive fields, leading to duplicate items in the created matrix.
    • Example: In a 3x3 convolution, the center pixel is included in all nine overlapping blocks.
  • Purpose of Overlap:
    • Necessary for sliding the filter over the input image to compute the convolution.
πŸ’Ύ Reshape and Memory Handling
  • as_strided Efficiency:
    • The as_strided function does not copy memory; it creates a new view with different strides.
    • Efficient, but requires caution when modifying data in this view.
  • Memory Expansion:
    • The reshape operation after as_strided flattens the overlapping receptive fields and kernel weights.
    • Because of overlaps, the resulting matrix is larger than the original tensor, requiring additional memory.
  • Ensuring Contiguity:
    • Use ascontiguousarray to create a contiguous memory copy for matrix multiplication.
    • Why Necessary?
      • Matrix operations require contiguous memory layouts.
  • Performance Notes:
    • as_strided is essentially free (no data copy).
    • Reshape and ascontiguousarray involve more complex memory-handling operations:
      • Reshape compacts overlapping fields into a matrix.
      • ascontiguousarray ensures data is stored contiguously in memory for efficiency.
    • Implementing your own ND array library requires manual memory handling for ascontiguousarray.
def conv_im2col(Z, weight):
    N,H,W,C_in = Z.shape
    K,_,_,C_out = weight.shape
    Ns, Hs, Ws, Cs = Z.strides
    
    inner_dim = K * K * C_in
    # 6D view over all overlapping K x K x C_in receptive fields (no copy);
    # the reshape into the im2col matrix is what actually copies the data
    A = np.lib.stride_tricks.as_strided(Z, shape = (N, H-K+1, W-K+1, K, K, C_in),
                                        strides = (Ns, Hs, Ws, Hs, Ws, Cs)).reshape(-1,inner_dim)
    # single matrix multiplication against the flattened (K*K*C_in, C_out) filter
    out = A @ weight.reshape(-1, C_out)
    return out.reshape(N,H-K+1,W-K+1,C_out)

Z = np.random.randn(100,32,32,8)
W = np.random.randn(3,3,8,16)
out = conv_reference(Z,W)
out2 = conv_im2col(Z,W)
print(np.linalg.norm(out - out2))
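
As a rough check of the memory expansion mentioned earlier, a minimal sketch comparing the size of the im2col matrix with the original input (using the shapes from the example above; W_ is used here only to avoid clobbering the kernel variable W):

N, H, W_, C_in, K = 100, 32, 32, 8, 3
im2col_elems = N * (H-K+1) * (W_-K+1) * K * K * C_in
input_elems = N * H * W_ * C_in
print(im2col_elems / input_elems)  # ~7.9, close to the K*K = 9 worst case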

πŸ“Š Performance Comparison

  • Naive for-loops: Slowest implementation.
  • Matrix multiplication: Faster but slower than im2col.
  • im2col: ~2x faster than matrix multiplication but still slower than PyTorch reference.

By combining these methods, the lecture demonstrates how to move from an intuitive but slow implementation to a highly optimized convolution computation. πŸ†