♟️ AlphaGo, AlphaGo Zero, and AlphaZero - Deep Reinforcement Learning Meets Search
This blog explores the structure and training process of AlphaGo and its successors, AlphaGo Zero and AlphaZero, illustrating how deep reinforcement learning and search are combined to achieve superhuman performance in complex board games.
🧩 1. Introduction and Go Game Complexity
- Game Description:
  Go is played on a 19×19 grid with 361 points.
  A game state $s$ represents the arrangement of black, white, and empty intersections.
- State Representation:
  A simplified state can be modeled as a $19 \times 19 \times 2$ tensor — one plane for black stones, one for white (see the encoding sketch after the table below).
  The original AlphaGo used $19 \times 19 \times 48$ feature planes to encode richer context (e.g., liberties, ko, move history).
- Action Space:
  Each move corresponds to placing a stone on an empty intersection:
  \(\mathcal{A} = \{1, 2, 3, \dots, 361\}\)
  The “pass” move is also a valid action.
- Complexity:
  The number of legal Go board positions is on the order of $10^{170}$ — vastly greater than Chess ($\sim 10^{47}$), making brute-force search infeasible.

| Game | State-Space Complexity | Game-Tree Complexity |
|---|---|---|
| Chess | $\sim 10^{43}$ | $\sim 10^{120}$ |
| Go (19 × 19) | $\sim 10^{170}$–$10^{180}$ | $\sim 10^{360}$ |
| Checkers | $\sim 10^{20}$ | $\sim 10^{40}$ |
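To make the simplified two-plane representation concrete, here is a minimal NumPy sketch. The integer board encoding (0 = empty, 1 = black, 2 = white) is an illustrative assumption, not AlphaGo’s actual feature set:

```python
import numpy as np

BOARD_SIZE = 19  # 19x19 grid, 361 intersections

def encode_state(board: np.ndarray) -> np.ndarray:
    """Encode a board into the simplified 19x19x2 tensor described above.

    `board` is assumed to hold 0 = empty, 1 = black stone, 2 = white stone.
    Plane 0 marks black stones, plane 1 marks white stones.
    """
    state = np.zeros((BOARD_SIZE, BOARD_SIZE, 2), dtype=np.float32)
    state[..., 0] = (board == 1)  # black plane
    state[..., 1] = (board == 2)  # white plane
    return state

# Example: an empty board with one black stone at the center point.
board = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=np.int8)
board[9, 9] = 1
print(encode_state(board).shape)  # (19, 19, 2)
```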
🧠 2. AlphaGo Training Strategy (High-Level)

AlphaGo’s training involves three key stages:
- Supervised Learning (Behavior Cloning):
  Train an initial policy network to imitate human expert moves.
- Reinforcement Learning (Self-Play Policy Gradient):
  Refine the policy network by playing against itself.
- Value Network Training:
  Train a value network to predict the win probability of a given state.
During play, Monte Carlo Tree Search (MCTS) combines the policy and value networks to choose moves.

🏗️ 3. Policy and Value Network Architecture
- Input Representation:
  For AlphaGo Zero, the input is a $19 \times 19 \times 17$ tensor:
  - 8 planes for black’s stone positions over the last 8 time steps
  - 8 planes for white’s stone positions over the last 8 time steps
  - 1 plane indicating whose turn it is
- Policy Network Output:
  $\pi(a \mid s; \theta)$
  Outputs a probability distribution over all legal actions via Softmax:
  \(\sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) = 1\)
- Value Network Output:
  $v(s; w) \in [-1, 1]$
  A scalar estimating the expected game outcome (win = +1, loss = −1).
  Often, the policy and value heads share convolutional layers (see the network sketch below).
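A minimal PyTorch sketch of a shared-trunk policy/value network in this spirit. The tiny two-layer convolutional stack and layer sizes are illustrative assumptions; the real networks are much deeper residual models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk with a policy head and a value head (toy sizes)."""

    def __init__(self, in_planes: int = 17, board: int = 19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Policy head: one logit per intersection plus one for "pass".
        self.policy_head = nn.Linear(64 * board * board, board * board + 1)
        # Value head: a single scalar squashed into [-1, 1].
        self.value_head = nn.Linear(64 * board * board, 1)

    def forward(self, x):
        h = self.trunk(x).flatten(1)
        policy = F.softmax(self.policy_head(h), dim=-1)  # sums to 1 over actions
        value = torch.tanh(self.value_head(h))           # in [-1, 1]
        return policy, value

net = PolicyValueNet()
s = torch.zeros(1, 17, 19, 19)  # one AlphaGo-Zero-style input tensor
p, v = net(s)
print(p.shape, v.shape)  # torch.Size([1, 362]) torch.Size([1, 1])
```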
🧍♂️ 4. Training Step 1 — Behavior Cloning (Imitation Learning)

- Goal: Initialize a policy network to imitate human expert moves.
- Method: Treat the problem as multi-class classification (361 possible moves).
- Observe a state $s_t$ from human games.
- Predict action distribution $p_t = \pi(\cdot \mid s_t; \theta)$.
- True label $y_t$ = one-hot vector of human move $a_t^\ast$.
- Minimize Cross-Entropy Loss:
\(L = -\sum_{a \in \mathcal{A}} y_t(a) \, \log p_t(a)\)
- Limitation:
When the model encounters unseen states, it performs poorly — hence the need for self-play reinforcement learning to generalize beyond human data.
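A hedged sketch of one behavior-cloning update, reusing the `PolicyValueNet` sketch from Section 3 for convenience (historically AlphaGo’s supervised policy was a separate network) and using random placeholder tensors instead of a real dataset of human games:

```python
import torch
import torch.nn.functional as F

net = PolicyValueNet()  # assumed in scope from the Section 3 sketch; value head unused here
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Placeholder batch: 32 states and the index of the human expert move in each.
states = torch.randn(32, 17, 19, 19)
expert_moves = torch.randint(0, 362, (32,))

policy, _ = net(states)                 # p_t = pi(. | s_t; theta)
log_p = torch.log(policy + 1e-8)        # avoid log(0)
loss = F.nll_loss(log_p, expert_moves)  # cross-entropy against the one-hot label y_t

optimizer.zero_grad()
loss.backward()
optimizer.step()
```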
⚙️ 5. Training Step 2 — Policy Gradient (Self-Play RL)

After imitation learning, the agent improves via self-play using the policy gradient method.
- Setup:
  The current network (Player) plays against an older snapshot (Opponent).
  Only the Player’s parameters are updated.
- Reward:
  \(r_T = \begin{cases} +1, & \text{if win} \\ -1, & \text{if lose} \end{cases}\)
  Intermediate rewards are $0$.
- Return:
  Since Go’s outcome is binary, every state in a game has the same return: $U_t = r_T$.
- Policy Gradient Objective:
  \(\nabla_\theta J(\theta) = \mathbb{E}_t \!\left[ \nabla_\theta \log \pi(a_t \mid s_t; \theta) \, U_t \right]\)
- Parameter Update (gradient ascent with learning rate $\beta$; see the sketch below):
  \(\theta \leftarrow \theta + \beta \, \nabla_\theta J(\theta)\)
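A minimal REINFORCE-style sketch of this update, again assuming the `PolicyValueNet` from Section 3 is in scope; the self-play loop that would actually produce the trajectory is replaced by random placeholder data:

```python
import torch

policy_net = PolicyValueNet()   # the "Player"; an older snapshot would serve as the Opponent
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-3)  # learning rate beta

# Placeholder trajectory from one self-play game: states, chosen actions, final reward.
states = torch.randn(50, 17, 19, 19)
actions = torch.randint(0, 362, (50,))
r_T = 1.0                       # +1 win, -1 loss; every return U_t equals r_T

policy, _ = policy_net(states)
log_pi = torch.log(policy.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)

# Gradient ascent on J(theta): minimize the negative of log pi(a_t|s_t) * U_t.
loss = -(log_pi * r_T).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```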
📈 6. Training Step 3 — Value Network Training

- Goal: Learn $v(s; w) \approx V^\pi(s) = \mathbb{E}[U_t \mid S_t = s]$.
- Loss Function:
  \(L(w) = \sum_{t=0}^{T} \big( v(s_t; w) - U_t \big)^2\)
- Optimization:
Use gradient descent to minimize mean squared error between predicted and actual returns.
This transforms the value network into a learned evaluator of board positions.
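A sketch of this regression step under the same assumptions as before (the `PolicyValueNet` from Section 3, with only its value head used, and placeholder returns in place of outcomes from finished self-play games):

```python
import torch

value_net = PolicyValueNet()             # only the value head v(s; w) is of interest here
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

states = torch.randn(64, 17, 19, 19)     # states s_t sampled from self-play games
returns = torch.randint(0, 2, (64,)).float() * 2 - 1  # U_t = r_T, i.e. +1 or -1

_, v = value_net(states)                 # predicted v(s_t; w)
loss = ((v.squeeze(1) - returns) ** 2).mean()  # squared error between v and U_t

optimizer.zero_grad()
loss.backward()
optimizer.step()
```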
🌳 7. Execution — Monte Carlo Tree Search (MCTS)

At runtime, AlphaGo uses MCTS to plan moves.
Each simulation repeats four stages:
1️⃣ Selection
Choose action $a$ with the highest PUCT score: \(\text{score}(a) = Q(a) + \eta \cdot \frac{\pi(a \mid s; \theta)}{1 + N(a)}\)
- $Q(a)$ — mean value from previous rollouts
- $N(a)$ — visit count
- $\pi(a \mid s; \theta)$ — prior probability from the policy network
2️⃣ Expansion
Expand the tree by sampling the opponent’s response:
\(a_t' \sim \pi(\cdot \mid s_t'; \theta)\)
to form the next state $s_{t+1}$.
3️⃣ Evaluation
Estimate the value of $s_{t+1}$ using: \(V(s_{t+1}) = \tfrac{1}{2} \, v(s_{t+1}; w) + \tfrac{1}{2} \, r_T\) — averaging the network’s predicted value and the rollout result.
4️⃣ Backup
Propagate $V(s_{t+1})$ back up the tree to update: \(Q(a_t) = \frac{1}{N(a_t)} \sum V(s_{t+1})\)
Move Selection:
After many simulations, choose the move $a_t$ with the highest visit count $N(a_t)$.
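A compact sketch of the Selection score and the Backup bookkeeping. The `prior`, `Q`, and `N` dictionaries are illustrative stand-ins for a real tree node, and the exploration weight `eta=5.0` is an arbitrary assumption:

```python
def puct_score(Q, N, prior, a, eta=5.0):
    """Selection score: mean value Q(a) plus a prior-weighted exploration bonus."""
    return Q[a] + eta * prior[a] / (1 + N[a])

def select_action(Q, N, prior, legal_actions):
    """Selection step: pick the action with the highest score."""
    return max(legal_actions, key=lambda a: puct_score(Q, N, prior, a))

def backup(Q, N, a, value):
    """Backup step: fold the new leaf value into the running mean Q(a)."""
    N[a] += 1
    Q[a] += (value - Q[a]) / N[a]

# Toy usage: three legal actions with priors from a hypothetical policy network.
legal = [0, 1, 2]
prior = {0: 0.5, 1: 0.3, 2: 0.2}
Q = {a: 0.0 for a in legal}
N = {a: 0 for a in legal}
a = select_action(Q, N, prior, legal)  # picks action 0 (highest prior, all Q equal)
backup(Q, N, a, value=0.5)
print(a, Q[a], N[a])                   # 0 0.5 1
```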
🧬 8. AlphaGo Zero and AlphaZero
🌀 AlphaGo Zero

- Removes human data — learns purely from self-play.
- Uses MCTS visit counts as the target distribution for training the policy: $L = \text{CrossEntropy}(n, p)$
  where $n$ is the vector of normalized visit counts from MCTS and $p$ is the network’s predicted move probabilities.
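Turning raw visit counts into the target $n$ is just a normalization, as in this toy sketch (the temperature parameter used in the real pipeline is omitted):

```python
import numpy as np

visits = np.array([120, 30, 10], dtype=np.float64)  # N(a) for three candidate moves
n = visits / visits.sum()                           # normalized visit counts = policy target
print(n)  # [0.75   0.1875 0.0625]
```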
🌍 AlphaZero

A general version capable of mastering Go, Chess, and Shogi from scratch.
- Unified Network:
  \(f_\theta(s) = (p, v)\) where:
  - $p$ — policy probabilities over actions
  - $v$ — expected game outcome
- Training Loss (see the sketch below):
  \(l = (z - v)^2 - \pi^{\top} \log p + c \, \lVert \theta \rVert^2\) where:
  - $z$ — actual game outcome ($+1$ / $-1$)
  - $\pi$ — search probabilities from MCTS
  - $c$ — regularization coefficient
- Search Efficiency:
  AlphaZero evaluates roughly 80,000 positions per second in Chess (versus Stockfish’s roughly 70 million), demonstrating the power of neural-guided selective search.
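A sketch of that loss term by term, assuming the `PolicyValueNet` from Section 3 and placeholder tensors for the outcomes $z$ and search probabilities $\pi$; the weight-decay term is usually folded into the optimizer, but it is written out here to mirror the formula:

```python
import torch

def alphazero_loss(p, v, pi, z, params, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch."""
    value_loss = ((z - v.squeeze(-1)) ** 2).mean()
    policy_loss = -(pi * torch.log(p + 1e-8)).sum(dim=-1).mean()
    l2 = c * sum((w ** 2).sum() for w in params)
    return value_loss + policy_loss + l2

# Toy batch: 8 positions, 362 actions.
net = PolicyValueNet()
s = torch.randn(8, 17, 19, 19)
p, v = net(s)
pi = torch.softmax(torch.randn(8, 362), dim=-1)  # stand-in for MCTS visit distributions
z = torch.randint(0, 2, (8,)).float() * 2 - 1    # +1 / -1 game outcomes
loss = alphazero_loss(p, v, pi, z, net.parameters())
loss.backward()
```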
| Feature | AlphaGo (vs. Lee Sedol) | AlphaGo Zero |
|---|---|---|
| Input Data | Human expert games + self-play | Purely self-play from random moves |
| Neural Networks | Separate Policy and Value networks | A single, unified Policy–Value network |
| Training | Complex multi-stage pipeline | Single reinforcement learning loop |
| Search Features | Used fast rollouts for evaluation | Relied solely on the value network output |
| Performance | Defeated world champion Lee Sedol (4–1) | Defeated AlphaGo 100–0 |
🏁 Summary
| Stage | Objective | Learning Type | Description |
|---|---|---|---|
| 1 | Behavior Cloning | Supervised | Learn to mimic human moves |
| 2 | Policy Gradient | Reinforcement | Improve policy via self-play |
| 3 | Value Network | Supervised (Regression) | Predict win probability |
| — | MCTS Execution | Search | Combine policy + value for decision-making |
