πŸ“š Normalization and Regularization

Course Link

This document reviews the main themes and key takeaways from *Deep Learning Systems: Algorithms and Implementation* at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.


🏁 Initialization and Optimization

  • Weight Initialization
    • Initializing weights is critical for training deep networks.
    • Example: For ReLU networks, setting the variance of weights to 2/n (where n is the input dimension) maintains activation variance across layers (Kaiming/He initialization); see the sketch after this list.
    • ⚠️ Improper initialization can hinder training even with extensive optimization.
  • Impact of Initialization on Training
    • Initial weights influence the entire training process.
    • Networks initialized with different weights may achieve similar performance, but their training dynamics differ.
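
A minimal NumPy sketch of this 2/n initialization (the function name `init_relu_layer` and the layer sizes are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_relu_layer(n_in, n_out):
    """Draw weights with variance 2/n_in (Kaiming/He-style initialization for ReLU)."""
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))

# The activation scale stays roughly constant across layers instead of
# exploding or vanishing.
x = rng.normal(size=(1000, 256))
for _ in range(10):
    x = np.maximum(x @ init_relu_layer(256, 256), 0.0)  # linear layer followed by ReLU
    print(round(float(np.sqrt((x ** 2).mean())), 3))     # RMS of activations stays near 1
```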

πŸ§ͺ Normalization Techniques

πŸ”Ή Layer Normalization (LayerNorm)

  • Definition: Normalizes activations within each layer to have a mean of 0 and variance of 1 (see the formula after this list).
  • Benefits:
    • Tackles exploding or vanishing activations.
    • Ensures consistent activation norms across layers.
  • Drawbacks:
    • Can make it harder to train fully connected networks to reach low loss.
    • Example: Relative norms of different examples might carry valuable classification information that LayerNorm may obscure.
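
In symbols, for the activation vector \(x\) of a single example (a standard formulation; the small constant \(\epsilon\) for numerical stability and the learnable scale/shift \(\gamma, \beta\) are conventional additions not spelled out above):

\[ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \]

where \(\mu\) and \(\sigma^2\) are the mean and variance of that example's activations.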

πŸ”Ή Batch Normalization (BatchNorm)

  • Definition: Normalizes activations of a specific feature across all examples in a mini-batch.
  • Benefits:
    • Retains useful discriminatory information by allowing different examples to have varying norms.
  • Challenges:
    • Introduces dependency between mini-batch examples.
    • Solution: Keep running averages of the mean and variance during training and use them during inference (see the sketch below).
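
A minimal NumPy sketch of that train/inference split, assuming a momentum-style running average (the class name, momentum value, and the omission of learnable parameters are simplifications for illustration):

```python
import numpy as np

class ToyBatchNorm1D:
    """BatchNorm over a (batch, features) array; no learnable gamma/beta for brevity."""

    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.eps, self.momentum = eps, momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training=True):
        if training:
            # Per-feature statistics over the mini-batch (the batch dependency).
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Inference: use the accumulated running averages, so each example
            # is normalized independently of whatever else is in the batch.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)
```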

πŸ“Š Matrix Example

Suppose you have a 3D activation tensor (a mini-batch of activations) with dimensions:

  • Batch size = 2 (2 examples)
  • Number of features = 3 (3 features per example)
  • Number of elements per feature = 4 (4 elements per feature)

Let the matrix of activations be:

\[ \mathbf{A} = \begin{bmatrix} \text{Example 1:} & \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix} \\ \text{Example 2:} & \begin{bmatrix} 2 & 4 & 6 & 8 \\ 10 & 12 & 14 & 16 \\ 18 & 20 & 22 & 24 \end{bmatrix} \end{bmatrix} \]

πŸ”Ή Layer Normalization

  • Normalization is applied across the feature dimension (per example).
  • For each example, we compute the mean and variance over its three features at each element position.

Step-by-Step Example (LayerNorm on Example 1)

  1. Compute the mean and variance of all features for each activation column in Example 1:
\[ \mathbf{A}_{\text{Example 1}} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix} \]
  • Mean of each column:
    \(\mu_1 = \frac{1 + 5 + 9}{3} = 5, \quad \mu_2 = 6, \quad \mu_3 = 7, \quad \mu_4 = 8\)

  • Variance of each column:
    \(\sigma_1^2 = \frac{(1-5)^2 + (5-5)^2 + (9-5)^2}{3} = 10.67\)

\[\sigma_2^2 = 10.67, \quad \sigma_3^2 = 10.67, \quad \sigma_4^2 = 10.67\]
  2. Normalize each activation in the column (subtract the mean, divide by the standard deviation):
\[ \mathbf{A}_{\text{norm}} = \frac{\mathbf{A}_{\text{Example 1}} - \mu}{\sigma} = \begin{bmatrix} \frac{1 - 5}{\sqrt{10.67}} & \frac{2 - 6}{\sqrt{10.67}} & \frac{3 - 7}{\sqrt{10.67}} & \frac{4 - 8}{\sqrt{10.67}} \\ \frac{5 - 5}{\sqrt{10.67}} & \frac{6 - 6}{\sqrt{10.67}} & \frac{7 - 7}{\sqrt{10.67}} & \frac{8 - 8}{\sqrt{10.67}} \\ \frac{9 - 5}{\sqrt{10.67}} & \frac{10 - 6}{\sqrt{10.67}} & \frac{11 - 7}{\sqrt{10.67}} & \frac{12 - 8}{\sqrt{10.67}} \end{bmatrix} \]

This normalizes the activations of each individual example separately.
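
A quick NumPy check of the computation above (it reproduces this example's per-column normalization directly rather than calling a library LayerNorm):

```python
import numpy as np

A1 = np.array([[1,  2,  3,  4],
               [5,  6,  7,  8],
               [9, 10, 11, 12]], dtype=float)

mu = A1.mean(axis=0)               # [5., 6., 7., 8.]
var = A1.var(axis=0)               # [10.67, 10.67, 10.67, 10.67]
A1_norm = (A1 - mu) / np.sqrt(var)
print(A1_norm)                     # each column now has mean 0 and variance 1
```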

πŸ”Ή Batch Normalization

  • Normalization is applied across the batch dimension for each feature independently.
  • We compute the mean and variance across all examples for each individual feature.

Step-by-Step Example (BatchNorm on Feature 1 across Examples)

  1. Feature 1 from Example 1 and Example 2:
\[ \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 4 & 6 & 8 \end{bmatrix} \quad \begin{matrix} (\text{Example 1, Feature 1}) \\ (\text{Example 2, Feature 1}) \end{matrix} \]
  2. Compute the mean and variance across both examples at each position of feature 1:
    • Mean for feature 1 across both examples:
      \(\mu_1 = \frac{1 + 2}{2} = 1.5, \quad \mu_2 = \frac{2 + 4}{2} = 3, \quad \mu_3 = \frac{3 + 6}{2} = 4.5, \quad \mu_4 = \frac{4 + 8}{2} = 6\)
  • Variance for feature 1 across both examples:
    \(\sigma_1^2 = \frac{(1-1.5)^2 + (2-1.5)^2}{2} = 0.25, \quad \sigma_2^2 = 1, \quad \sigma_3^2 = 2.25, \quad \sigma_4^2 = 4\)
  3. Normalize each feature using batch statistics:
\[ \mathbf{A}_{\text{norm}} = \frac{\mathbf{A} - \mu}{\sigma} \]

This normalizes each feature independently across the batch.
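
The analogous NumPy check for feature 1 across the two examples (again mirroring this example's per-position statistics):

```python
import numpy as np

# Feature 1 from Example 1 and Example 2, stacked along the batch dimension.
feat1 = np.array([[1, 2, 3, 4],
                  [2, 4, 6, 8]], dtype=float)

mu = feat1.mean(axis=0)              # [1.5, 3. , 4.5, 6. ]
var = feat1.var(axis=0)              # [0.25, 1. , 2.25, 4. ]
feat1_norm = (feat1 - mu) / np.sqrt(var)
print(feat1_norm)                    # [[-1. -1. -1. -1.]
                                     #  [ 1.  1.  1.  1.]]
```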


πŸ€” Key Differences

| Feature | Layer Normalization | Batch Normalization |
| --- | --- | --- |
| Normalization axis | Across features for each example | Across the batch for each feature |
| Statistics computed | Mean/variance per example | Mean/variance per feature across the mini-batch |
| Usage | Works well for RNNs, Transformers | Common in CNNs and feedforward networks |
| Dependency | No batch dependency | Depends on the batch size |

πŸ”‘ Key Insights

  1. LayerNorm works per sample: each example is normalized using only its own statistics, independent of the rest of the batch.
  2. BatchNorm normalizes across samples for a specific feature, preserving relationships within features but allowing batch statistics to influence normalization.

πŸ”’ Regularization Techniques

πŸ”Ή Implicit Regularization

  • Definition: Regularizing effects arise naturally from algorithms and architectures.
  • Example: Stochastic Gradient Descent (SGD) introduces mini-batch noise that implicitly biases training toward a restricted set of the solutions the network could otherwise reach.

πŸ”Ή Explicit Regularization

  • Definition: Deliberate modifications to control the network’s complexity.

πŸ“‰ L2 Regularization (Weight Decay)

  • Adds a penalty term to the loss function based on the squared norm of the weights.
  • Benefits:
    • Encourages smaller weights, leading to smoother functions and reduced overfitting.
    • Implementation: Often integrated directly into optimizers like SGD or Adam as a weight-decay term (see the sketch below).
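
A minimal sketch of how the penalty \(\frac{\lambda}{2}\lVert W \rVert^2\) turns into weight decay under plain SGD (the function name and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_step_with_weight_decay(W, grad_loss, lr=0.1, weight_decay=1e-4):
    """One SGD step on loss(W) + (weight_decay / 2) * ||W||^2.

    The penalty contributes weight_decay * W to the gradient, which is why
    the update is equivalent to shrinking ("decaying") the weights slightly
    before applying the usual gradient step.
    """
    return W - lr * (grad_loss + weight_decay * W)
    # Equivalently: (1 - lr * weight_decay) * W - lr * grad_loss
```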

πŸ”€ Dropout

  • Randomly sets a fraction of activations to zero during training.
  • Benefits:
    • Forces the network to learn robust features that do not rely on specific activations.
    • Acts as a stochastic approximation of the full network computation.
  • During testing, dropout is turned off so that all learned features are used (see the sketch below).
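
A minimal NumPy sketch of (inverted) dropout, assuming a drop probability p and rescaling by 1/(1 - p) at training time so that activations need no adjustment at test time:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero out activations with probability p during training."""
    if not training or p == 0.0:
        return x                       # test time: all activations are used as-is
    mask = rng.random(x.shape) >= p    # keep each activation with probability 1 - p
    return x * mask / (1.0 - p)        # rescale so the expected activation matches test time
```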

πŸ”„ Interaction of Optimization, Initialization, Normalization, and Regularization

  • Deep learning involves interconnected design choices like:
    • Optimizer selection
    • Weight initialization
    • Normalization techniques
    • Regularization strategies

πŸ” Case Study: BatchNorm

  • Initially proposed to address internal covariate shift.
    • Later research has debated the actual mechanism, suggesting that BatchNorm helps primarily by smoothing the optimization landscape.
  • Practical Impact:
    • Enhances robustness to distribution shifts, where test data differs from training data.

πŸŽ“ Key Takeaways

  • Normalization and regularization are essential for efficient training and generalization in deep learning.
  • The interplay between design choices impacts performance, and understanding these interactions is crucial.
  • πŸ”¬ Scientific experimentation and analysis help uncover the mechanisms behind various techniques.
  • Despite the empirical nature of deep learning, diverse architectural choices can yield comparable performance, showcasing the flexibility and robustness of modern systems.