📚 Introduction of Automatic Differentiation

Course Link

This document reviews the main themes and key takeaways from Deep Learning Systems: Algorithms and Implementation** at Carnegie Mellon University, taught by J. Zico Kolter and Tianqi Chen.

1. Machine Learning Components

Every machine learning algorithm has three fundamental elements:

Hypothesis Class: A function mapping inputs to predictions, parameterized by a set of parameters $( \theta )$.
Loss Function: A measure of how good the predictions are, often in relation to a training dataset. The goal is to find $( \theta )$ that minimizes this function.
Optimization Method: Algorithms like Stochastic Gradient Descent (SGD) are used to minimize the loss function by:
1. Computing the gradient of the loss function with respect to $( \theta )$.
2. Taking steps in the negative gradient direction.

2. Gradient Computation 🧮

Gradient computation is crucial for optimization. Three common methods are:

Numerical Differentiation:
- Approximates the gradient using a small step size $( \epsilon )$.
- Pros: Simple to implement.
- Cons: Numerical errors and computational inefficiency.
- 📝 “In practice, we rarely want to approximate gradients this way… it suffers from numerical error and is less efficient.”
Symbolic Differentiation:
- Derives the gradient formula analytically.
- Cons: Can lead to redundant computations.
Automatic Differentiation (AD):
- Leverages a computational graph to represent functions and applies the chain rule efficiently.
- Modes:
  - Forward Mode AD: Computes derivatives of all outputs w.r.t. one input in one pass. Efficient for few inputs and many outputs.
  - Reverse Mode AD: Computes derivatives of one output w.r.t. all inputs in one pass. Ideal for neural networks (many inputs, scalar output like loss). Forms the basis of backpropagation.

3. Computational Graphs & Reverse Mode AD 🔄

Computational Graph:

A directed acyclic graph representing the function, where:
- Nodes represent intermediate values.
- Edges represent operations.
Enables efficient gradient computation using Reverse Mode AD.

alt_text

Forward Mode AD:

alt_text

Reverse Mode AD Process:

Reverse mode automatic differentiation (AD) is a powerful technique for efficiently computing gradients of scalar functions with respect to multiple inputs. Below is an overview of the process and its extensions to computational graphs, tensors, and data structures.

Initialize: Set the adjoint (partial derivative) of the output node to 1.
Traverse: Traverse the graph in reverse topological order.
Compute Adjoint: For each node:
- Sum partial adjoints from output nodes.
- Use the chain rule to compute adjoints for input nodes.

\[[ \text{Adjoint of input nodes} \implies \text{Gradient of the function w.r.t. inputs.} ]\]

alt_text

A. Adjoint Definition

For each node $( v )$ in the computational graph, define an adjoint: $[ \bar{v} = \frac{\partial \text{output}}{\partial v} ]$ This adjoint represents the sensitivity of the final output to changes in $( v )$.

alt_text

B. Reverse Traversal

Start at the output node and traverse the graph in reverse topological order.
Compute adjoints for all intermediate nodes using the chain rule.

C. Chain Rule Application

The adjoint of a node $( v_i )$ is computed as: $[ \bar{v}_i = \sum_{j \in \text{successors}} \bar{v}_j \cdot \frac{\partial v_j}{\partial v_i} ]$

D. Handling Multiple Pathways

If a node has multiple outgoing edges (used in several computations), its adjoint is the sum of contributions from all pathways.

Computational Graph Extension ✨

Instead of directly computing adjoint values, construct a new extended computational graph that represents adjoint computations.

alt_text

This explanation focuses on the adjoint computation for $( v_2 )$ in a computational graph where multiple pathways contribute to the gradients. The example highlights how reverse mode AD handles these scenarios effectively.

A. Original Computation

The computational graph represents the function: $[ y = \exp(v_1) + v_1 \cdot \exp(v_1) ]$

Nodes in the Graph:

$( v_1 )$: Input value.
$( v_2 = \exp(v_1) )$: Exponential of $( v_1 )$.
$( v_3 = v_1 \cdot v_2 )$: Product of $( v_1 )$ and $( v_2 )$.
$( v_4 = v_2 + v_3 )$: Final output $( y )$.

B. Adjoint Computation Process

Step 1: Initialization

Assign an adjoint value of $( 1 )$ to the output node ($( v_4 )$): $[ \bar{v}_4 = 1 ]$

Step 2: Reverse Traversal

Traverse the computational graph in reverse topological order, computing adjoints for each node.

Step 3: Handling Multiplication

For the multiplication node ($( v_3 = v_1 \cdot v_2 )$):

Inputs: $( v_1 )$ and $( v_2 )$.
Compute partial adjoints: $[ \bar{v}_{2 \to 3} = \bar{v}_3 \cdot v_1 ]$ $[ \bar{v}_{1 \to 3} = \bar{v}_3 \cdot v_2 ]$

Step 4: Handling Addition

For the addition node ($( v_4 = v_2 + v_3 )$):

Inputs: $( v_2 )$ and $( v_3 )$.
Compute partial adjoints:
- The adjoint of $( v_2 )$ with respect to $( v_4 )$: $[ \bar{v}_{2 \to 4} = \bar{v}_4 ]$
- The adjoint of $( v_3 )$ with respect to $( v_4 )$: $[ \bar{v}_{3 \to 4} = \bar{v}_4 ]$

Step 5: Summing Partial Adjoints for $( v_2 )$

Since $( v_2 )$ contributes to both $( v_3 )$ (multiplication) and $( v_4 )$ (addition), its total adjoint is the sum of partial adjoints: $[ \bar{v}_2 = \bar{v}_{2 \to 3} + \bar{v}_{2 \to 4} ]$

In the extended computational graph, this summation is represented as:

Inputs to $( \bar{v}_2 )$:
- $( \bar{v}_{2 \to 3} )$ from the multiplication node.
- $( \bar{v}_{2 \to 4} )$ from the addition node.
Operation: Summation node computes: $[ \bar{v}_2 = (\bar{v}_3 \cdot v_1) + (\bar{v}_4) ]$

Step 6: Final Adjoint of $( v_2 )$

The computed adjoint of $( v_2 )$ reflects the combined effects of its contributions to both pathways.

Advantages:

Reusability: The extended graph can compute gradients for different input values without re-deriving adjoint computations.
Higher-Order Gradients: Compute gradients of gradients (e.g., $( \nabla^2 )$) by applying reverse mode AD to the extended graph.
Optimization: Opportunities to simplify and optimize gradient computations.

Backpropagation vs. Reverse Mode AD ⚖️

alt_text

Aspect	Backpropagation	Reverse Mode AD
Graph Usage	Operates directly on the original graph.	Constructs an extended graph for adjoint computations.
Higher-Order Gradients	Requires additional manual steps.	Handled seamlessly by extending the computational graph.
Optimization	Limited to the original graph structure.	Extended graph allows further optimization.

Modern frameworks favor reverse mode AD due to its flexibility, reusability, and ability to handle complex data types.

Generalization to Tensors 🧮

Reverse mode AD extends naturally to multi-dimensional tensors:

The adjoint of a tensor is another tensor where: $[ \bar{T}[i, j] = \frac{\partial \text{output}}{\partial T[i, j]} ]$

By applying scalar AD rules element-wise, gradients for tensor operations (e.g., matrix multiplication) are computed efficiently.

Differentiable Programming:

Define adjoints and propagation rules for complex data structures.
Enable gradient computation for programs mixing diverse data types.

4. Implementation Details ⚙️

Modern deep learning frameworks utilize Reverse Mode AD with enhancements:

Gradient of Gradient:
- Automatically compute higher-order gradients by extending the computational graph.
- 📝 “You are getting this gradient of gradient for free by building this reverse mode AD.”
Optimization:
- The resulting gradient graph can itself be optimized computationally.
Generalization:
- Supports tensors and complex data structures (e.g., dictionaries, tuples).
- Enables Differentiable Programming, allowing gradients in programs with diverse data types.

5. Key Takeaways 🎯

Reverse Mode AD underpins modern deep learning frameworks for efficient gradient computation.
Understanding computational graphs and adjoints is essential for grasping Reverse Mode AD.
The flexibility of Reverse Mode AD enables differentiation for complex data structures, empowering advanced applications in machine learning.

Glossary 📖

Term	Definition
Hypothesis Class	A set of models (functions) mapping inputs to predictions.
Loss Function	Quantifies error between predictions and true values.
Optimization Method	Algorithms (e.g., SGD) for finding parameters that minimize the loss function.
Stochastic Gradient Descent (SGD)	Optimization method using gradients computed on small, random batches of data.
Numerical Differentiation	Approximates gradients by evaluating small intervals.
Symbolic Differentiation	Analytically derives gradients using calculus.
Computational Graph	Directed graph of operations for efficient gradient computation.
Forward Mode AD	Computes gradients of all outputs w.r.t. one input in one pass.
Reverse Mode AD	Computes gradients of one output w.r.t. all inputs in one pass.
Adjoint	Partial derivative of the output w.r.t. an intermediate value in the graph.
Backpropagation	Gradient computation in neural networks using Reverse Mode AD.
Gradient of Gradient	Higher-order gradient computation for advanced optimization.
Differentiable Programming	Enables gradient computation in programs with diverse data structures.