Normalization and Regularization
This document reviews the main themes and key takeaways from **Deep Learning Systems: Algorithms and Implementation** at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.
Initialization and Optimization
- Weight Initialization
- Initializing weights is critical for training deep networks.
- Example: For ReLU networks, setting the variance of weights to 2/n (where n is the input dimension) maintains activation variance across layers (see the sketch after this list).
- Improper initialization can hinder training even with extensive optimization.
- Impact of Initialization on Training
- Initial weights influence the entire training process.
- Networks initialized with different weights may achieve similar performance, but their training dynamics differ.
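A minimal NumPy sketch of the 2/n rule mentioned above (the function name and layer sizes are illustrative, not taken from the course code): with this initialization, activations keep a stable scale even through many ReLU layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_init(fan_in, fan_out):
    # He/Kaiming-style initialization for ReLU layers: Var(W) = 2 / fan_in
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Push random inputs through many linear + ReLU layers; with variance 2/n the
# activation scale stays roughly constant instead of exploding or vanishing.
x = rng.normal(size=(100, 256))
for _ in range(50):
    x = np.maximum(x @ kaiming_init(256, 256), 0.0)
print(x.std())  # stays on the order of 1
```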
Normalization Techniques
Layer Normalization (LayerNorm)
- Definition: Normalizes activations within each layer to have a mean of 0 and variance of 1.
- Benefits:
- Tackles exploding or vanishing activations.
- Ensures consistent activation norms across layers.
- Drawbacks:
- Can make it harder to train fully connected networks to reach low loss.
- Example: Relative norms of different examples might carry valuable classification information that LayerNorm may obscure.
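The following is a minimal sketch of LayerNorm (learnable scale and shift omitted; not the course's implementation) that also illustrates the drawback above: two examples with very different norms become indistinguishable after normalization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each example (row) to zero mean and unit variance over its features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
# Both rows normalize to roughly [-1.22, 0, 1.22]: the difference in their
# overall norms, which might carry class information, is no longer visible.
print(layer_norm(x))
```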
Batch Normalization (BatchNorm)
- Definition: Normalizes activations of a specific feature across all examples in a mini-batch.
- Benefits:
- Retains useful discriminatory information by allowing different examples to have varying norms.
- Challenges:
- Introduces dependency between mini-batch examples.
- Solution: Use running averages for mean and variance during inference.
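A minimal sketch of that training/inference split (class and attribute names here are illustrative; real frameworks differ in the details):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            # Statistics over the mini-batch: every example depends on the others.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # At inference, use running averages so examples are processed independently.
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta
```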
Matrix Example
Suppose you have a 3D activation matrix (a mini-batch of activations) with dimensions:
- Batch size = 2 (2 examples)
- Number of features = 3 (3 features per example)
- Number of elements per feature = 4 (4 elements per feature)
Let the matrix of activations be:
\[
\mathbf{A} = \begin{bmatrix}
\text{Example 1:} & \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix} \\[1.2em]
\text{Example 2:} & \begin{bmatrix} 2 & 4 & 6 & 8 \\ 10 & 12 & 14 & 16 \\ 18 & 20 & 22 & 24 \end{bmatrix}
\end{bmatrix}
\]
Layer Normalization
- Normalization is applied across the feature dimension (per example).
- For each example, we compute the mean and variance across the three features at each element position (column).
Step-by-Step Example (LayerNorm on Example 1)
- Compute the mean and variance across the features for each activation column of Example 1:
  - Mean of each column:
    \(\mu_1 = \frac{1 + 5 + 9}{3} = 5, \quad \mu_2 = 6, \quad \mu_3 = 7, \quad \mu_4 = 8\)
  - Variance of each column (shown for column 1):
    \(\sigma_1^2 = \frac{(1-5)^2 + (5-5)^2 + (9-5)^2}{3} \approx 10.67\)
- Normalize each activation in the column (subtract the mean, divide by the standard deviation):
This normalizes the activations of each individual example separately.
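Filling in that step for column 1 of Example 1 (values 1, 5, 9), and ignoring the small constant usually added to the variance for numerical stability:

\(\hat{a}_1 = \frac{1 - 5}{\sqrt{10.67}} \approx -1.22, \quad \hat{a}_2 = \frac{5 - 5}{\sqrt{10.67}} = 0, \quad \hat{a}_3 = \frac{9 - 5}{\sqrt{10.67}} \approx 1.22\)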
Batch Normalization
- Normalization is applied across the batch dimension for each feature independently.
- We compute the mean and variance across all examples for each individual feature.
Step-by-Step Example (BatchNorm on Feature 1 across Examples)
- Feature 1 from Example 1 and Example 2:
  - Example 1, feature 1: \((1, 2, 3, 4)\)
  - Example 2, feature 1: \((2, 4, 6, 8)\)
- Compute the mean and variance across both examples at each element position:
  - Mean for feature 1 across both examples:
    \(\mu_1 = \frac{1 + 2}{2} = 1.5, \quad \mu_2 = \frac{2 + 4}{2} = 3, \quad \mu_3 = \frac{3 + 6}{2} = 4.5, \quad \mu_4 = \frac{4 + 8}{2} = 6\)
  - Variance for feature 1 across both examples (shown for the first element):
    \(\sigma_1^2 = \frac{(1-1.5)^2 + (2-1.5)^2}{2} = 0.25\)
- Normalize each feature using batch statistics:
This normalizes each feature independently across the batch.
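As a quick check, the first element of feature 1 (value 1 in Example 1, value 2 in Example 2) normalizes to:

\(\hat{a}^{(1)}_1 = \frac{1 - 1.5}{\sqrt{0.25}} = -1, \qquad \hat{a}^{(2)}_1 = \frac{2 - 1.5}{\sqrt{0.25}} = 1\)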
Key Differences
| Feature | Layer Normalization | Batch Normalization |
|---|---|---|
| Normalization Axis | Across features for each example | Across batch for each feature |
| Statistics Computed | Mean/variance computed per example | Mean/variance computed per feature across the mini-batch |
| Usage | Works well for RNNs, Transformers | Common in CNNs and feedforward networks |
| Dependency | No batch dependency | Depends on the batch size |
Key Insights
- LayerNorm works per sample: its statistics are computed within each example, so the normalization of one example never depends on the others in the batch.
- BatchNorm normalizes across samples for a specific feature, preserving relationships within features but allowing batch statistics to influence normalization.
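Both normalizations can be reproduced on the matrix example above with a few lines of NumPy (a sketch without learnable parameters; the axis choices follow the worked example):

```python
import numpy as np

# Mini-batch from the matrix example: shape (batch=2, features=3, elements=4).
A = np.array([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
              [[2, 4, 6, 8], [10, 12, 14, 16], [18, 20, 22, 24]]], dtype=float)
eps = 1e-5

# LayerNorm (as in the worked example): statistics over the feature axis, per example.
ln_mean = A.mean(axis=1, keepdims=True)   # Example 1 columns -> [5, 6, 7, 8]
A_ln = (A - ln_mean) / np.sqrt(A.var(axis=1, keepdims=True) + eps)

# BatchNorm: statistics over the batch axis, per feature.
bn_mean = A.mean(axis=0, keepdims=True)   # feature 1 -> [1.5, 3, 4.5, 6]
A_bn = (A - bn_mean) / np.sqrt(A.var(axis=0, keepdims=True) + eps)

print(A_ln[0])     # each column of Example 1 becomes roughly [-1.22, 0, 1.22]
print(A_bn[:, 0])  # feature 1 becomes about -1 for Example 1 and +1 for Example 2
```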
Regularization Techniques
Implicit Regularization
- Definition: Regularizing effects arise naturally from algorithms and architectures.
- Example: The mini-batch noise in Stochastic Gradient Descent (SGD) implicitly constrains the solutions the network explores.
Explicit Regularization
- Definition: Deliberate modifications to control the network's complexity.
L2 Regularization (Weight Decay)
- Adds a penalty term to the loss function based on the squared norm of the weights.
- Benefits:
- Encourages smaller weights, leading to smoother functions and reduced overfitting.
- Implementation: Often integrated into optimizers like SGD or Adam.
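Both views, the loss penalty and the per-step weight shrinkage, can be sketched in a few lines (the λ value of 1e-4 is just a placeholder hyperparameter, not a course-recommended setting):

```python
import numpy as np

def l2_penalized_loss(data_loss, weights, lam=1e-4):
    # Total loss = data loss + (lambda / 2) * sum of squared weight norms.
    return data_loss + 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def sgd_step_with_weight_decay(W, grad, lr=0.1, lam=1e-4):
    # The penalty's gradient is lam * W, so each update also shrinks the weights:
    # W <- (1 - lr * lam) * W - lr * grad, hence the name "weight decay".
    return W - lr * (grad + lam * W)
```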
Dropout
- Randomly sets a fraction of activations to zero during training.
- Benefits:
- Forces the network to learn robust features that do not rely on specific activations.
- Acts as a stochastic approximation of the full network computation.
- During testing, dropout is turned off to leverage all learned features.
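A minimal sketch of (inverted) dropout with drop probability p; scaling by 1/(1 - p) during training keeps the expected activation equal to the test-time value, where dropout is simply disabled:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    if not training or p == 0.0:
        return x                         # test time: keep all activations
    mask = rng.random(x.shape) >= p      # keep each activation with probability 1 - p
    return x * mask / (1.0 - p)          # rescale so the expectation matches test time
```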
Interaction of Optimization, Initialization, Normalization, and Regularization
- Deep learning involves interconnected design choices like:
- Optimizer selection
- Weight initialization
- Normalization techniques
- Regularization strategies
Case Study: BatchNorm
- Initially proposed to address internal covariate shift.
- Subsequent research has questioned this explanation, suggesting instead that BatchNorm's benefit comes from smoothing the optimization landscape.
- Practical Impact:
- Enhances robustness to distribution shifts, where test data differs from training data.
Key Takeaways
- Normalization and regularization are essential for efficient training and generalization in deep learning.
- The interplay between design choices impacts performance, and understanding these interactions is crucial.
- Scientific experimentation and analysis help uncover the mechanisms behind various techniques.
- Despite the empirical nature of deep learning, diverse architectural choices can yield comparable performance, showcasing the flexibility and robustness of modern systems.
