Convolutional Networks Implementation
This document reviews the main themes and key takeaways from **Deep Learning Systems: Algorithms and Implementation** at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.
Implementing Convolutions in Code
The lecture focuses on how to implement convolutions in code, moving from basic for-loops to more efficient matrix operations. The implementation is done live in a notebook, and the presenter encourages the audience to code along.
Key Concepts Discussed
Storage Order
- How tensors involved in convolutions are stored in memory:
- Input Images: Stored in NHWC format (Batch, Height, Width, Channels).
- Kernel Weights: Stored in K x K x C_in x C_out format.
- This differs from PyTorch's default NCHW format, which the presenter argues is less intuitive for convolution operations.
Details:
- Input images are stored as a 4D tensor:
- First dimension: batches
- Second: height of the image
- Third: width of the image
- Fourth: channels
- Kernel weights are stored as a 4D tensor:
- Kernel height
- Kernel width
- Input channels
- Output channels
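As a quick orientation (a minimal sketch with illustrative shapes matching the code below), indexing into these two layouts looks like this:
```python
import numpy as np

# NHWC input: batch of 10 images, 32x32 pixels, 8 channels (illustrative shapes)
Z = np.random.randn(10, 32, 32, 8)
# K x K x C_in x C_out weights: 3x3 kernel mapping 8 -> 16 channels
W = np.random.randn(3, 3, 8, 16)

z = Z[0, 5, 7, 2]  # image 0, row 5, column 7, channel 2
w = W[1, 1, 2, 4]  # center kernel tap, input channel 2, output channel 4
```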
Reference Implementation
- PyTorch is used as a reference to verify the correctness of custom convolution implementations.
- Tensors are converted between NHWC (lecture) and NCHW (PyTorch).
```python
import numpy as np
import torch
import torch.nn as nn

def conv_reference(Z, weight):
    # NHWC -> NCHW
    Z_torch = torch.tensor(Z).permute(0, 3, 1, 2)
    # KKIO -> OIKK
    W_torch = torch.tensor(weight).permute(3, 2, 0, 1)
    # Run PyTorch's convolution as the ground truth
    out = nn.functional.conv2d(Z_torch, W_torch)
    # NCHW -> NHWC
    return out.permute(0, 2, 3, 1).contiguous().numpy()

Z = np.random.randn(10, 32, 32, 8)
W = np.random.randn(3, 3, 8, 16)
out = conv_reference(Z, W)
print(out.shape)  # (10, 30, 30, 16): spatial dims shrink to (H-K+1) x (W-K+1)
```
Naive Convolution
- First implementation uses nested for-loops:
- Iterates over batches, channels, and image positions to perform the convolution.
- Slow but useful for understanding the process.
```python
def conv_naive(Z, weight):
    N, H, W, C_in = Z.shape
    K, _, _, C_out = weight.shape
    out = np.zeros((N, H - K + 1, W - K + 1, C_out))
    # Seven nested loops: batches, channels, output positions, kernel taps
    for n in range(N):
        for c_in in range(C_in):
            for c_out in range(C_out):
                for y in range(H - K + 1):
                    for x in range(W - K + 1):
                        for i in range(K):
                            for j in range(K):
                                out[n, y, x, c_out] += Z[n, y + i, x + j, c_in] * weight[i, j, c_in, c_out]
    return out

out2 = conv_naive(Z, W)
print(np.linalg.norm(out - out2))  # should be ~0
```
Convolution as Matrix Multiplication
- Explains that a convolution with a 1x1 filter is equivalent to a matrix multiplication over the channel dimension (see the sketch after this list).
- Steps:
- Reshape the input tensor and filter weights into matrices.
- Perform convolution using matrix multiplication.
- Advantages:
- Significantly faster than naive for-loops.
- Leverages optimized matrix multiplication routines.
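As a minimal sketch of this equivalence (illustrative shapes; `W1` is a hypothetical 1x1 kernel), a (1, 1, C_in, C_out) convolution reduces to a single matrix multiplication over the channel dimension:
```python
import numpy as np

Z = np.random.randn(10, 32, 32, 8)   # NHWC input
W1 = np.random.randn(1, 1, 8, 16)    # 1x1 kernel, 8 -> 16 channels

# @ broadcasts over N and H, multiplying each (W, C_in) slice by (C_in, C_out)
out = Z @ W1[0, 0]
out2 = (Z.reshape(-1, 8) @ W1[0, 0]).reshape(10, 32, 32, 16)
print(np.linalg.norm(out - out2))    # ~0: the two forms agree
```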
Implementation Details:
- A doubly nested for-loop iterates over the K x K kernel offsets; for each offset, a single matrix multiplication handles batches, spatial positions, and channels at once.
```python
def conv_matrix_mult(Z, weight):
    N, H, W, C_in = Z.shape
    K, _, _, C_out = weight.shape
    out = np.zeros((N, H - K + 1, W - K + 1, C_out))
    # One matrix multiplication per kernel offset (i, j)
    for i in range(K):
        for j in range(K):
            out += Z[:, i:i + H - K + 1, j:j + W - K + 1, :] @ weight[i, j]
    return out

Z = np.random.randn(100, 32, 32, 8)
W = np.random.randn(3, 3, 8, 16)
out = conv_reference(Z, W)
out2 = conv_matrix_mult(Z, W)
print(np.linalg.norm(out - out2))  # should be ~0
```
im2col (Image to Column) Operation
- The most efficient implementation uses the im2col technique:
- Rearranges input data into a matrix so the convolution becomes a single matrix multiplication.
- Utilizes numpy's `as_strided` function to create a view of the input tensor containing overlapping blocks without copying data.
Steps:
- Use `as_strided` to create overlapping blocks in memory.
- Apply `ascontiguousarray` to create a memory copy laid out contiguously for matrix multiplication.
- Multiply the reshaped input matrix by the reshaped filter weights.
Notes:
- `as_strided` is essentially free (no memory copy).
- `reshape` involves more complex operations to compact the array.
What are strides?
Strides are a tuple of integers that represent the number of bytes you need to skip in memory to move from one element to the next along each dimension of the array. For a 1D array, the stride is simply the size of each element in bytes. For multi-dimensional arrays, the strides define how to move through the array's memory layout to access elements in different dimensions.
```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)
print(arr.strides)  # Output: (12, 4)
```
- In this example, arr is a 2D array of 32-bit integers (the explicit dtype matters: with the default int64, the strides would be (24, 8)).
- The strides (12, 4) tell us that:
- To move to the next element in the same row (along the columns), we need to skip 4 bytes (the size of an int32).
- To move to the next row (along the rows), we need to skip 12 bytes (3 columns * 4 bytes per element).
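As a small sketch of how strides translate to byte offsets (the offset formula is standard; the variable names are illustrative):
```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

# Byte offset of element (y, x) = y * strides[0] + x * strides[1]
y, x = 1, 2
offset = y * arr.strides[0] + x * arr.strides[1]  # 1*12 + 2*4 = 20 bytes
raw = arr.tobytes()
value = np.frombuffer(raw[offset:offset + arr.itemsize], dtype=arr.dtype)[0]
print(value, arr[y, x])  # 6 6: both read the same memory location
```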
Tiling example
```python
import numpy as np

n = 6
A = np.arange(n**2, dtype=np.float32).reshape(n, n)
print(A)
"""
[[ 0.  1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10. 11.]
 [12. 13. 14. 15. 16. 17.]
 [18. 19. 20. 21. 22. 23.]
 [24. 25. 26. 27. 28. 29.]
 [30. 31. 32. 33. 34. 35.]]
"""

import ctypes
# Dump the raw buffer: A is laid out row-major (C order)
print(np.frombuffer(ctypes.string_at(A.ctypes.data, A.nbytes), dtype=A.dtype, count=A.size))
"""
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.]
"""

# Tiling: view A as a 3x3 grid of 2x2 blocks.
# Strides are given in elements and multiplied by 4 (bytes per float32).
B = np.lib.stride_tricks.as_strided(A, shape=(3, 3, 2, 2), strides=np.array((12, 2, 6, 1)) * 4)
print(B)
"""
2x2 tiling blocks:
[[[[ 0.  1.]
   [ 6.  7.]]
  [[ 2.  3.]
   [ 8.  9.]]
  ...
]
"""

# B is only a view: its underlying buffer is still A's original layout
print(np.frombuffer(ctypes.string_at(B.ctypes.data, size=B.nbytes), B.dtype, B.size))
"""
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.]
"""

C = np.ascontiguousarray(B)
# C is a real copy: each 2x2 block is now stored contiguously
print(np.frombuffer(ctypes.string_at(C.ctypes.data, size=C.nbytes), C.dtype, C.size))
"""
[ 0.  1.  6.  7.  2.  3.  8.  9.  4.  5. 10. 11. 12. 13. 18. 19. 14. 15.
 20. 21. 16. 17. 22. 23. 24. 25. 30. 31. 26. 27. 32. 33. 28. 29. 34. 35.]
"""

print(C.strides)
"""
(48, 16, 8, 4)  # compact strides confirm C is contiguous
"""
```
im2col for multi-channel convolutions
The im2col operation is a technique that transforms an input image into a matrix, enabling the convolution to be performed as a single matrix multiplication. This is more efficient than nested for-loops or the per-offset matrix-multiplication version above, which still iterates over kernel locations.
How im2col Works
- Core Idea:
- Creates a matrix where each column represents a flattened receptive field (a small block) from the input image.
- For multi-channel images, the receptive field includes all channels for that spatial block.
- Key Steps (a single-channel sketch follows this list):
  - Use the `as_strided` function to create a new view of the input tensor with custom strides.
    - This allows extraction of overlapping blocks from the input image without copying data.
  - Set the `shape` and `strides` parameters of `as_strided` carefully:
    - Ensures each block of the tensor represents an overlapping receptive field of the input, including all channels.
  - Reshape the tensor to combine the spatial dimensions of the receptive field (kernel size x kernel size) and the input channels into a single dimension.
    - This forms the columns of the im2col matrix.
  - Reshape the filter weights to make them compatible with the im2col matrix multiplication.
  - Perform the convolution as a matrix multiplication of the im2col matrix and the reshaped filter weights.
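Here is a minimal single-channel sketch of these steps (sizes and names are illustrative; the full multi-channel version appears below):
```python
import numpy as np

# Single-channel im2col: 6x6 image, 3x3 kernel
A = np.arange(36, dtype=np.float32).reshape(6, 6)
K = 3
Hs, Ws = A.strides

# Step 1: strided view of all overlapping 3x3 receptive fields (no copy)
B = np.lib.stride_tricks.as_strided(
    A, shape=(6 - K + 1, 6 - K + 1, K, K), strides=(Hs, Ws, Hs, Ws))

# Step 2: flatten each field into one row of the im2col matrix (this copies)
cols = B.reshape(-1, K * K)              # shape (16, 9)

# Step 3: the convolution is one matrix-vector product with the flat kernel
w = np.random.randn(K, K).astype(np.float32)
out = (cols @ w.reshape(-1)).reshape(4, 4)
print(out.shape)                         # (4, 4) = (6-3+1, 6-3+1)
```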
Duplicate Items in im2col
- Why Duplicates Exist:
- The im2col operation creates overlapping receptive fields, leading to duplicate items in the created matrix.
- Example: In a 3x3 convolution, the center pixel is included in all nine overlapping blocks.
- Purpose of Overlap:
- Necessary for sliding the filter over the input image to compute the convolution (the sketch below illustrates the memory cost).
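To make the duplication concrete, here is a small illustrative check (arbitrary sizes) showing that the materialized im2col matrix takes roughly K^2 times the input's memory:
```python
import numpy as np

Z = np.random.randn(100, 32, 32, 8).astype(np.float32)
N, H, W, C_in = Z.shape
K = 3
Ns, Hs, Ws, Cs = Z.strides

# Strided view of all receptive fields, then a contiguous copy (the expansion)
view = np.lib.stride_tricks.as_strided(
    Z, shape=(N, H - K + 1, W - K + 1, K, K, C_in),
    strides=(Ns, Hs, Ws, Hs, Ws, Cs))
cols = np.ascontiguousarray(view).reshape(-1, K * K * C_in)
print(cols.nbytes / Z.nbytes)  # ~7.9, close to K*K = 9 (edge pixels are copied fewer times)
```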
Reshape and Memory Handling
- `as_strided` Efficiency:
  - The `as_strided` function does not copy memory; it creates a new view with different strides.
  - Efficient, but requires caution when modifying data through this view.
- Memory Expansion:
  - The reshape operation after `as_strided` flattens the overlapping receptive fields (the kernel weights are reshaped separately).
  - Because of the overlaps, the resulting matrix is larger than the original tensor, requiring additional memory.
- Ensuring Contiguity:
  - Use `ascontiguousarray` to create a contiguous memory copy for matrix multiplication.
  - Why necessary? Matrix operations require contiguous memory layouts.
- Performance Notes:
  - `as_strided` is essentially free (no data copy).
  - `reshape` and `ascontiguousarray` involve more complex memory handling: reshape compacts the overlapping fields into a matrix, and `ascontiguousarray` ensures the data is stored contiguously for efficiency.
  - Implementing your own ND array library requires manual memory handling for `ascontiguousarray`.
```python
def conv_im2col(Z, weight):
    N, H, W, C_in = Z.shape
    K, _, _, C_out = weight.shape
    Ns, Hs, Ws, Cs = Z.strides
    inner_dim = K * K * C_in
    # Strided view of all overlapping K x K x C_in receptive fields (no copy);
    # the reshape then compacts it into the im2col matrix (this copies)
    A = np.lib.stride_tricks.as_strided(Z, shape=(N, H - K + 1, W - K + 1, K, K, C_in),
                                        strides=(Ns, Hs, Ws, Hs, Ws, Cs)).reshape(-1, inner_dim)
    out = A @ weight.reshape(-1, C_out)
    return out.reshape(N, H - K + 1, W - K + 1, C_out)

Z = np.random.randn(100, 32, 32, 8)
W = np.random.randn(3, 3, 8, 16)
out = conv_reference(Z, W)
out2 = conv_im2col(Z, W)
print(np.linalg.norm(out - out2))  # should be ~0
```
Performance Comparison
- Naive for-loops: Slowest implementation.
- Matrix multiplication: Faster but slower than im2col.
- im2col: ~2x faster than matrix multiplication but still slower than the PyTorch reference (a rough timing harness is sketched below).
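One way to reproduce this comparison (exact numbers vary with hardware and BLAS library; a small batch keeps the naive version tractable):
```python
import timeit
import numpy as np

Z = np.random.randn(10, 32, 32, 8)
W = np.random.randn(3, 3, 8, 16)

# Time each implementation defined above once on the same inputs
for fn in (conv_naive, conv_matrix_mult, conv_im2col, conv_reference):
    t = timeit.timeit(lambda: fn(Z, W), number=1)
    print(f"{fn.__name__}: {t:.4f} s")
```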
By combining these methods, the lecture demonstrates how to move from an intuitive but slow implementation to a highly optimized convolution computation.
