Reinforcement Learning (RL) — From Fundamentals to PPO & GRPO in LLMs (I)

This blog provides a detailed study of Reinforcement Learning (RL), starting from fundamental concepts and algorithms, moving through advanced policy optimization techniques like PPO and TRPO, and concluding with the application of RL in Large Language Models (LLMs) via PPO and GRPO.

alt_text

🏁 Section 1: Reinforcement Learning Basics and Fundamental Algorithms

Feel free to check my previous blogs for more detailed explanations of the fundamentals in reinforcement learning. Here we just have a quick recap of the basics.

1️⃣ What is Reinforcement Learning (RL)?

Reinforcement Learning gained attention following the 2025 Turing Award and the release of DeepSeek-R1. A historic milestone was AlphaGo’s victory in 2016, powered by RL.

alt_text

RL vs. Supervised Learning (SL):
In SL, models learn mappings from labeled data.
In RL, the agent learns a policy that maximizes reward through interaction with the environment.
Interaction-Based Learning:
RL problems (like Go) are solved by trial and feedback, not by enumerating all possibilities.
Core Loop:
The agent observes state$s$, takes action$a$, receives reward$r$, and transitions to next state$s’$.

alt_text

2️⃣ Markov Decision Process (MDP)

The MDP formalism defines RL mathematically.

Process:$s_t, a_t, r_{t+1}, s_{t+1}$- Return:
The discounted sum of rewards with discount factor$\gamma$:$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$
Key Concepts:
- Policy $\pi(a \mid s)$: Action distribution given a state.
- **Transition Probability: **$P(s’ \mid s,a)$.
- Value Functions:
  - State Value: $V(s) = \mathbb{E}[G_t \mid s_t = s]$
  - Action Value: $Q(s,a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$
  - **Bellman Equation: **$V(s) = \mathbb{E}_{a,r,s’}[r + \gamma V(s’)]$

3️⃣ State Value and Action Value Bellman Equations

Reinforcement Learning problems are often formalized as a Markov Decision Process (MDP), which introduces randomness in both policy behavior and environment transitions.
Because of this, the key quantities — policy $\pi$ and state transitions — must be expressed as expectations.

alt_text

The central objective is to evaluate how desirable a state or an action is, using value functions that measure the expected return.
These value functions are recursively defined by the Bellman Equations.

🧮 A. Discounted Return

The Return $G_t$ is the total future reward accumulated from time step $t$:$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$ Since many environments may continue indefinitely, RL applies a discount factor $\gamma \in [0,1)$ to prioritize short-term rewards: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ This leads to a recursive relationship: $G_t = R_{t+1} + \gamma G_{t+1}$ Here, the return $G_t$ consists of immediate rewards $R_{t+1}$ and future rewards discounted by $\gamma$.

📈 B. State Value Function $V(s)$

The State Value Function measures the expected return starting from a given state $s$.

Definition: $V_\pi(s) = \mathbb{E}_\pi [ G_t \mid S_t = s ]$
**Bellman Equation (Recursive Form): **$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a) [ R(s,a,s') + \gamma V_\pi(s') ]$

$\pi(a \mid s)$: Probability of taking action$a$in state $s$ under policy $\pi$
$P(s’ \mid s,a)$: Transition probability from $s$ to $s’$ after action $a$
$R(s,a,s’)$: Expected reward from the transition

Thus,$V(s)$ represents the expected total reward from starting at$s$and following policy $\pi$.

🎯 C. Action Value Function $Q(s,a)$

The Action Value Function quantifies the expected return starting from state $s$ and taking a specific action $a$.

Definition: **$Q_\pi(s,a) = \mathbb{E}_\pi [ G_t \mid S_t = s, A_t = a ]$Bellman Equation (Recursive Form): **$Q_\pi(s,a) = \sum_{s'} P(s' \mid s,a) [ R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s',a') ]$

$Q(s,a)$captures both the immediate reward and the expected discounted future return after taking action $a$.

🔗 D. Relationship between $V(s)$ and $Q(s,a)$

The two functions are linked as follows: $V_\pi(s) = \sum_a \pi(a \mid s) Q_\pi(s,a)$ That is, the state value equals the expected action value, weighted by the policy’s probabilities.

4️⃣ The Top-Down Framework of Reinforcement Learning

The ultimate objective of RL is to learn a policy that maximizes the expected cumulative reward.
From a high-level perspective, RL algorithms can be organized under the Top-Down Framework, which consists of two main families: Value-Based and Policy-Based methods.

alt_text

🧭 A. Value-Based Methods

Value-based methods estimate the value functions $V(s)$ or $Q(s,a)$, and derive the policy by choosing the action with the highest value.

Key Concepts and Algorithms:

Temporal Difference (TD) Learning:
Updates estimates after every step, blending ideas from Monte Carlo and dynamic programming.
- SARSA (On-policy TD):
  Updates using the next action actually taken by the current policy.
- Q-Learning (Off-policy TD):
  Uses the greedy action (max over$Q$) for the target, learning the optimal$Q^*$independent of the behavior policy.
Monte Carlo Methods:
Estimate returns using the complete episode trajectory.
Deep Q-Network (DQN):
Extends Q-learning by approximating$Q(s,a)$with neural networks, enabling RL in large or continuous state spaces.

🧠 B. Policy-Based Methods

Policy-based methods directly optimize the policy parameters to maximize the expected return, without needing an explicit value table.

alt_text

Key Algorithms and Ideas:

Policy Gradient (PG):
Performs gradient ascent on policy parameters $\theta$: $\nabla_\theta J(\theta) = \mathbb{E}_{s,a}[\nabla_\theta \log \pi_\theta(a \mid s) Q_\pi(s,a)]$
REINFORCE:
Uses Monte Carlo estimates of the full discounted return $G_t$ as $Q(s,a)$.
Actor-Critic:
Combines two networks:
- Actor: Learns the policy $\pi(a \mid s;\theta)$
- Critic: Learns the value function $V(s;w)$ or $Q(s,a;w)$, updated via TD learning.
Trust Region Policy Optimization (TRPO):
Stabilizes updates by constraining policy change using a trust region based on KL-divergence.
Proximal Policy Optimization (PPO):
Simplifies TRPO using clipping or penalty terms to control policy updates, making it the standard choice in modern LLM RL fine-tuning.
Group Relative Policy Optimization (GRPO):
Used in models like DeepSeek-R1, where the advantage is computed relative to group-level statistics instead of relying on a separate Value Model.

In summary, Value-based methods focus on learning what is good, while Policy-based methods directly learn how to act.
Both approaches converge in advanced algorithms such as Actor-Critic and PPO, which underpin the latest generation of reinforcement learning and LLM training systems.

🧩 Why Policy-Based Methods Build on and Outperform Value-Based Methods

While Value-Based and Policy-Based methods are often presented as two separate branches of Reinforcement Learning, in practice, modern Policy-Based algorithms integrate Value functions internally — combining the strengths of both approaches.

1) Direct Optimization of the Policy

Value-based agents learn an indirect mapping (via Q or V) and must pick the best action greedily.
Policy-based methods optimize the policy end-to-end using gradient signals from performance: $J(\theta) = \mathbb{E}_{\pi_\theta} [ G_t ]$ This allows smooth updates and stochastic policies that can handle uncertainty and exploration naturally.

2) Better for Continuous or High-Dimensional Action Spaces

Value-based methods require computing $\max_a Q(s,a)$, which is trivial in discrete action spaces but intractable for continuous actions (e.g., robotic control, token logits in LLMs).
Policy-based approaches parameterize actions as continuous distributions (e.g., Gaussian policies), enabling gradient-based optimization in continuous domains.

3) Stable and Efficient Learning via Variance Reduction

The Actor-Critic architecture uses a Value Function (Critic) to reduce variance in policy gradient estimates.
The Critic serves as a baseline, ensuring updates reflect true performance improvements rather than random noise: $A(s,a) = Q(s,a) - V(s)$ - This combination yields faster convergence and greater stability than pure value-based learning.

4) Natural Integration with Human Feedback and LLMs

In LLM training (e.g., PPO, GRPO), the “policy” corresponds directly to the model’s output distribution over tokens.
Value-based methods can’t easily represent probabilistic text generation, while policy gradients can directly adjust the model’s output probabilities to maximize human preference scores (Reward Model output).
This makes policy-based RL the only feasible way to fine-tune generative models like GPT or DeepSeek.

5) Improved Exploration and Diversity

Value-based methods tend to collapse into deterministic greedy policies (exploitation-heavy).
Policy-based methods maintain stochastic policies, enabling continual exploration — essential in sparse or uncertain reward settings.

Summary: Unified Actor–Critic View

Aspect	Value-Based	Policy-Based	Actor–Critic (Hybrid)
Objective	Learn Q/V and derive policy	Learn policy directly	Learn both policy and value
Action Space	Discrete	Continuous or discrete	Both
Exploration	ε-greedy heuristics	Inherent stochasticity	Balanced
Stability	Can oscillate or diverge	Smooth gradient updates	Stable via critic
LLM Suitability	Not practical	Excellent	State-of-the-art (PPO/GRPO)

5️⃣ Temporal Difference (TD) Learning

Temporal Difference (TD) Learning is a value-based method used within the top-down framework of Reinforcement Learning (RL) to estimate value functions efficiently.
Its defining feature is the step-by-step parameter update, occurring after every environment interaction, rather than waiting for an entire episode to complete.

The Core Idea: Step-by-Step Learning

TD learning bridges the gap between one-step and multi-step learning.
It continuously refines value estimates using partial experience, allowing faster convergence in dynamic or infinite-horizon environments.

Illustrative Example (Ride-Sharing Analogy):

Initial Estimate: At the start of a ride, the expected travel time is estimated at 30 minutes.
Update: After 10 minutes of observed travel (a known, fixed amount), the system updates its prediction for the remaining time to 18 minutes.
TD Target: The total estimated trip time becomes$10 + 18 = 28$minutes — a more accurate estimate than the original 30 minutes.
Refinement: The new estimate (28) becomes the target used to adjust the previous prediction.

This mirrors the TD learning mechanism, where the observed immediate reward (e.g., the elapsed time $R_t$) and the discounted predicted value of the next state form the TD Target: $\text{TD Target} = R_{t+1} + \gamma V(s_{t+1})$
The value update rule adjusts the current estimate toward this target: $V(s_t) \leftarrow V(s_t) + \alpha \big[ R_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$
This quantity inside the brackets is the TD error (δ): $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ TD learning thus uses the prediction of the next state to refine the current state estimate before the full outcome is even known.

Bias–Variance Trade-off

TD learning lies between Monte Carlo methods (which require full episodes) and one-step updates (which are instantaneous).
This intermediate approach creates a distinct bias–variance trade-off:

Feature	Monte Carlo	Temporal Difference (TD)
Steps	Multi-step (waits for episode completion)	One-step (updates after every step)
Bias	Small (uses full trajectory reward)	Larger (relies on estimated future value)
Variance	Large (accumulates stochastic variance)	Small (only one-step uncertainty)

Because TD methods only depend on one transition at a time, they exhibit lower variance, enabling faster and more stable learning, even though some bias is introduced by the bootstrapping mechanism.

TD methods are typically used to estimate the action-value function $Q(s,a)$.
Depending on the scale of the problem, values can be stored in:

Lookup tables (small, discrete state spaces)
Neural networks (large or continuous state spaces, e.g., DQN / Deep Q-Network)

🔹 A. SARSA (State–Action–Reward–State–Action)

SARSA is an on-policy TD control algorithm.

Policy Use: The same policy (often ε-greedy) is used for both data collection and target computation.
That means the behavior policy (used to act) = target policy (used to learn).
Goal: To estimate $Q_\pi(s,a)$, the action-value function of the current policy.

**Update Rule: **$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big]$

The Meaning of “On-Policy” vs “Off-Policy”

On-Policy:
The agent learns about the same policy it is using to act.
→ The behavior policy (used to explore and generate data) = the target policy (used to learn).

In other words, the agent evaluates and improves the same policy that is interacting with the environment.
Off-Policy:
The agent learns about a different (usually better) policy than the one it uses to collect data.
→ The behavior policy ≠ the target policy.

This allows the agent to explore with one policy (e.g., ε-greedy) while learning toward another (e.g., purely greedy, optimal).

So SARSA is on-policy because it evaluates and improves the same policy it follows while interacting with the environment. If the agent sometimes takes suboptimal exploratory actions, SARSA’s estimates reflect that behavior — it learns the true performance of its ε-greedy policy.

The update target includes $Q(s_{t+1}, a_{t+1})$ — the Q-value for the action actually chosen in the next state by the current policy.
This means SARSA learns the value of the policy it is executing, including its exploratory behavior (e.g., ε-greedy randomness).

🔹 B. Q-Learning

Q-Learning is an off-policy TD control algorithm.

Policy Use: Uses a behavior policy (e.g., ε-greedy) to explore, but updates using a target policy that selects the greedy action maximizing the next Q-value.
Goal: To estimate the optimal action-value function$Q^*(s,a)$, independent of the behavior policy.

Update Rule: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$
Q-learning thus learns off-policy, using exploratory data while converging toward the optimal policy.

The target uses the maximum Q-value across all actions in the next state:$\max_{a’} Q(s_{t+1}, a’)$.
This represents the optimal greedy action, not necessarily the one taken by the current behavior policy.

As a result:

The agent can behave exploratively (using ε-greedy or random moves) to collect experience,
But it learns toward an optimal greedy policy — the one that always picks the action with the highest Q-value.

TD Learning in Actor–Critic Architectures

Even in policy-based frameworks like Actor–Critic, TD principles remain central.

The Critic network estimates value functions $V(s)$ or $Q(s,a)$.
The Critic is updated using TD learning, ensuring low-variance, online updates.
The Actor (policy network) uses the Advantage Function derived from the Critic to guide its updates: $A(s,a) = Q(s,a) - V(s)$ This combination allows:
The Critic to stabilize training through fast, incremental TD updates
The Actor to make gradient-based improvements using a reliable value baseline

TD learning therefore forms the foundation not only of value-based RL (e.g., Q-learning) but also of modern policy-based algorithms such as PPO, A3C, and GRPO.

🚀 Section 2: Policy Gradient and Trust Region Optimization

1️⃣ Policy Gradient (PG)

Policy Gradient directly parameterizes the policy$\pi_\theta(a \mid s)$.

Objective: $J(\theta) = \mathbb{E}_{\pi_\theta}[Q(s,a)]$
**Gradient: **$\nabla_\theta J(\theta) = \mathbb{E}_{s,a}[\nabla_\theta \log \pi_\theta(a \mid s) Q(s,a)]$

2️⃣ REINFORCE and Actor-Critic

REINFORCE:
Uses Monte Carlo estimates of return$G_t$.
Actor-Critic:
- Actor: Updates policy via gradient ascent.
- Critic: Estimates value via TD learning.
Advantage Function:$A(s,a) = Q(s,a) - V(s)$ Reduces gradient variance by subtracting a baseline.

3️⃣ Trust Region Policy Optimization (TRPO)

alt_text

TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) are advanced algorithms within the policy-based framework of Reinforcement Learning (RL).
They were developed to address the instability problem that plagues basic Policy Gradient (PG) methods — specifically, the tendency for large updates to destabilize or “crash” a policy.

Both are categorized as Policy Gradient methods, but they differ in how they constrain policy updates to maintain stable learning.

🧩 Motivation: Controlling Step Size

In traditional Policy Gradient methods, determining the correct update step size is difficult.
If the step is too large, the new policy$\pi_{\theta’}$can deviate drastically from the old one$\pi_{\theta}$, causing divergence or collapse.

TRPO introduces a mechanism to restrict the policy update to a small, safe region — the Trust Region — ensuring steady improvement without destructive jumps.

💡 Core Concept: The Trust Region

TRPO maximizes the expected policy improvement
but constrains how far the new policy can move from the old policy.
The optimization objective is defined as: $\max_{\theta'} \; \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} A_\pi(s,a) \right]$ subject to the constraint: $\mathbb{E}_s \left[ D_{KL}\left(\pi_\theta(\cdot \mid s) \;\mid\; \pi_{\theta'}(\cdot \mid s)\right) \right] \le \delta$ where

$A_\pi(s,a)$: Advantage function
$D_{KL}$: Kullback–Leibler divergence
$\delta$: the maximum allowed policy shift (trust region size)

This KL constraint ensures that the new policy remains close to the old policy, preventing large, destabilizing updates.

⚙️ Key Technical Approximations

To make the above optimization tractable, TRPO applies two key approximations:

Stage 1 — Policy Improvement Surrogate (First Approximation)

Ignoring State Distribution Change
Assumes that if policy updates are small, the state visitation distribution doesn’t change significantly
Importance Sampling
Because data comes from the old policy $\pi_\theta$, TRPO uses importance weighting to estimate expectations under the new policy$\pi_{\theta’}$:

The true objective of a policy improvement step is the expected return under the new policy $\pi_{\theta’}$: $J(\theta') = \mathbb{E}_{\tau \sim \pi_{\theta'}} \left[ \sum_t \gamma^t R_t \right]$ However, this expectation depends on the new state distribution induced by$\pi_{\theta’}$, which is expensive to compute.
TRPO approximates it using the old policy’s distribution $\pi_{\theta}$: $J(\theta') \approx \mathbb{E}_{s,a \sim \pi_{\theta}} \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_{\theta}(a \mid s)} A_\pi(s,a) \right]$ Here:

The importance sampling ratio $\frac{\pi_{\theta’}(a \mid s)}{\pi_\theta(a \mid s)}$ corrects for using old-policy data.
The advantage function $A_\pi(s,a)$ estimates how much better (or worse) each action is than average under the current policy.

🧩 This is the first approximation:

Replace the true new-policy expectation with a surrogate expectation under the old policy.

Stage 2 — Trust Region Constraint (Second Approximation)

Even with the surrogate, updating $\pi_{\theta’}$ freely can still cause large policy shifts and divergence.
To control this, TRPO adds a constraint on the KL divergence between $\pi_{\theta’}$ and $\pi_\theta$: $\max_{\theta'} \; \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} A_\pi(s,a) \right] \quad \text{s.t.} \quad \mathbb{E}_s \left[ D_{KL}\left(\pi_\theta(\cdot \mid s) \,\mid\, \pi_{\theta'}(\cdot \mid s)\right) \right] \le \delta$ This introduces a constrained optimization problem.

But solving this exactly is hard, since the expectation over all states and actions is nonlinear.
So TRPO makes a second approximation:

Linearize the objective around current parameters (first-order Taylor expansion).
Quadratically approximate the KL constraint (second-order expansion).

This yields a quadratic constrained optimization (similar to natural gradient descent): $\max_{\theta'} \; g^T (\theta' - \theta) \quad \text{s.t.} \quad \frac{1}{2} (\theta' - \theta)^T H (\theta' - \theta) \le \delta$ where

$g = \nabla_\theta L(\theta)$: policy gradient (first derivative of surrogate objective)
$H$: Fisher Information Matrix (approximating curvature of KL divergence)

✅ This two-stage approximation allows TRPO to perform monotonic policy updates using second-order methods (e.g., conjugate gradient descent) — guaranteed not to step too far.

🧮 TRPO Objective Summary

The TRPO optimization can thus be summarized as: $\begin{aligned} \text{maximize:} \quad & \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} A_\pi(s,a) \right] \\ \text{subject to:} \quad & \mathbb{E}_s \left[ D_{KL}(\pi_\theta(\cdot \mid s) \mid \pi_{\theta'}(\cdot \mid s)) \right] \le \delta \end{aligned}$ This formulation guarantees monotonic policy improvement within the trust region.

🎯 Importance Sampling in TRPO and PPO

Importance Sampling is a core mathematical technique in Reinforcement Learning (RL), especially within TRPO and PPO, enabling the reuse of data generated from an older policy to estimate expectations under a new one.
It plays a crucial role in stabilizing and improving sample efficiency in policy gradient methods.

Definition and Mechanism

Importance Sampling allows you to compute the expected value of a function under one probability distribution (P) using samples drawn from a different distribution (Q) — often called a proxy or sampling distribution.

Formally, the expected value of$f(x)$under$P$can be rewritten as: $\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x) f(x) = \sum_x Q(x) \frac{P(x)}{Q(x)} f(x) = \mathbb{E}_{x \sim Q} \left[ \frac{P(x)}{Q(x)} f(x) \right]$ Here:

$P(x)$: probability of $x$ under the target distribution (the one we want to estimate)
$Q(x)$: probability of $x$ under the sampling distribution (the one we can sample from)
The ratio$\frac{P(x)}{Q(x)}$ is called the importance weight

This importance weight acts as a correction factor, adjusting for the mismatch between the sampling and target distributions.
It assigns more weight to samples that are more likely under the target distribution and less weight to those that are less likely.

✅ In essence:

Importance Sampling allows us to reuse samples from one distribution to estimate quantities under another — saving computation and increasing efficiency.

Application in TRPO and PPO

In Policy Gradient methods like TRPO and PPO, data (trajectories of states and actions) is generated using the old policy $\pi_\theta$,
but during training we want to optimize the new policy $\pi_{\theta’}$.

This creates a distribution mismatch — data collected from $\pi_\theta$ must be used to estimate expectations under $\pi_{\theta’}$.
This is precisely where Importance Sampling is applied.

Target Policy (P): The new policy $\pi_{\theta’}$ — the one being optimized.
Sampling Policy (Q): The old policy $\pi_\theta$ (or $\pi_{old}$) — the one used to generate experience.

Thus, the expectation under the new policy can be rewritten using samples from the old policy: $\mathbb{E}_{s,a \sim \pi_{\theta'}} [f(s,a)] = \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} f(s,a) \right]$ Here, $\frac{\pi_{\theta’}(a \mid s)}{\pi_\theta(a \mid s)}$ is the Importance Sampling ratio — the same ratio used in TRPO and PPO objectives.

🔹 In TRPO

The TRPO surrogate objective maximizes the expected advantage function, weighted by this Importance Sampling ratio: $L_{TRPO}(\theta') = \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} A_\pi(s,a) \right]$-$A_\pi(s,a)$: Advantage function under the old policy.

The ratio reweights old samples to estimate what the new policy would have achieved.

This mechanism allows TRPO to reuse old-policy data without resampling — a critical efficiency gain for large-scale RL.

Importance Sampling in PPO-Clip

alt_text

PPO inherits the same mechanism but adds a clipping modification to prevent large, unstable updates.

The PPO-Clip objective uses the same ratio: $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$ and constrains it within a bounded range$[1 - \epsilon, 1 + \epsilon]$, typically with $\epsilon = 0.1$ or $0.2$.

The clipped objective is: $L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \big( r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \big) \right]$ This clipping acts as a soft trust region, ensuring that:

If $r_t$ strays too far from 1 (i.e., new policy differs too much from old), the update is restricted.
This prevents destructive jumps in policy space while keeping learning efficient.

✅ Key Benefit:

PPO retains the data efficiency of off-policy Importance Sampling,
but controls policy divergence through the clipping mechanism — combining the stability of TRPO with the simplicity of SGD optimization.

4️⃣ Proximal Policy Optimization (PPO)

PPO, introduced by OpenAI, simplifies TRPO while achieving similar performance.
It avoids the complex second-order optimization of TRPO and instead relies on first-order gradient methods with simple clipping or penalties.

PPO serves as the foundation of modern RL applications, including LLM fine-tuning (e.g., ChatGPT RLHF and DeepSeek-R1’s GRPO).

PPO-Penalty (Adaptive KL Penalty)

PPO-Penalty directly integrates the KL divergence constraint into the loss function as a dynamic penalty term.

Objective: $L^{\text{penalty}}(\theta') = \mathbb{E}_t \left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \hat{A}_t - \beta \, D_{KL}(\pi_{\theta'} \mid \pi_\theta) \right]$ where $\beta$ controls how strongly the policy deviation is penalized.

Adaptive β Adjustment:

If $D_{KL}$ is too small → decrease $\beta$ (allow larger steps, faster learning)
If $D_{KL}$ is too large → increase $\beta$ (reduce step size, stabilize learning)

This dynamic adjustment keeps learning both efficient and stable.

PPO-Clip (Clipping Mechanism)

The most widely used PPO variant is PPO-Clip, which replaces the explicit KL penalty with a clipped ratio function.

Clipped Objective: $L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \big( r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \big) \right]$where$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$ and $\epsilon$(typically 0.1–0.3) defines the clipping range.

If $r_t$ stays within $[1-\epsilon, 1+\epsilon]$, the update proceeds normally.
If $r_t$ exceeds the bounds, the update is clipped, preventing the policy from drifting too far.

✅ Result:
Stable updates without needing explicit KL constraints or complex second-order methods.

Generalized Advantage Estimation (GAE) in PPO

alt_text

Generalized Advantage Estimation (GAE) is an advanced technique for estimating the Advantage Function$\hat{A}$, used in modern policy-based reinforcement learning algorithms such as PPO and TRPO.
The Advantage Function measures how much better a particular action$a$is compared to the average expected return of a state$s$: $A(s, a) = Q(s, a) - V(s)$ GAE improves the stability and accuracy of policy gradient updates by providing a low-variance yet unbiased estimate of $\hat{A}$.

Purpose and Context of GAE

The core challenge GAE solves is the bias–variance trade-off in estimating returns.

There are two traditional ways to estimate value in RL:

Monte Carlo (Multi-step):
Uses full-episode rewards to estimate returns.
→ Low bias, but high variance (unstable, slow convergence).
Temporal Difference (TD, One-step):
Uses the immediate reward plus discounted estimated value of the next state.
→ Low variance, but high bias (depends heavily on current value estimates).

GAE balances these two extremes by mixing multiple n-step returns, controlled by a parameter$\lambda$(lambda).
This results in smoother, more accurate estimates for the Advantage Function.

GAE Formulation and Mechanism

GAE combines advantage estimates computed over multiple time steps in a recursive manner, blending information from both short-term and long-term returns.

One-Step Advantage Estimate (TD Error)

The foundation of GAE is the one-step Temporal Difference (TD) error, denoted $\delta_t$: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ This represents how much the actual return at time$t$differs from the predicted value of the current state.

$R_{t+1}$: immediate reward
$\gamma$: discount factor
$V(S_t)$: current state value estimate (baseline)

The one-step advantage estimate$\hat{A}_t^{(1)}$is simply this$\delta_t$.

Multi-Step Advantage Estimates

Instead of relying only on one-step updates, GAE builds multi-step estimates by chaining together TD errors:

Two-step advantage: $\hat{A}_t^{(2)} = \delta_t + \gamma \delta_{t+1}$
Three-step advantage: $\hat{A}_t^{(3)} = \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2}$ In general, the K-step advantage sums discounted TD errors over$K$future steps.

Introducing λ — The Balancing Hyperparameter

To blend these multi-step estimates efficiently, GAE introduces the λ (lambda) hyperparameter$\in [0,1]$, which controls how quickly the weighting decays with each additional step.

The final GAE estimate combines all possible multi-step advantages as a weighted sum, exponentially decayed by $(\gamma \lambda)^l$: $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$ where:

$\gamma$: discount factor for future rewards
$\lambda$: trade-off parameter between bias and variance
$\delta_{t+l}$: TD error at time$t+l$

Interpretation

When λ = 0 → GAE reduces to 1-step TD (low variance, high bias).
When λ = 1 → GAE becomes Monte Carlo (low bias, high variance).
Intermediate values (e.g., λ = 0.95) provide a smooth compromise.

💡 Intuitive Summary

Method	Steps	Bias	Variance	Description
TD(0)	1-step	High	Low	Immediate update, fast but biased
Monte Carlo	Full episode	Low	High	Accurate but noisy
GAE (λ = 0.95)	Mixed	Moderate	Moderate	Balanced trade-off

GAE thus provides a robust, stable, and efficient estimate of the advantage function, improving learning speed and convergence in modern RL algorithms.

GAE bridges Monte Carlo and TD methods.
It computes advantage using multi-step discounted TD errors.
The λ hyperparameter controls the bias–variance balance.
Used in PPO, TRPO, and Actor–Critic algorithms to achieve smooth, stable, and sample-efficient learning.

📌 Summary & What’s Next

Here’s a quick recap of Part I before we dive into PPO-in-practice for LLMs:

✅ What You Should Now Understand

MDP Core
RL is modeled as a Markov Decision Process with stochastic dynamics: $(s_t, a_t, r_{t+1}, s_{t+1})$, and discounted return $G_t = R_{t+1} + \gamma G_{t+1}$.
Value Functions & Bellman
You can evaluate “how good” states/actions are via:
- State value: $V_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$
- Action value:
  $Q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$ with recursive Bellman equations, and the link: $V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s,a)$
Top-Down RL Framework
Two big families:
- Value-Based (learn $V/Q$ first; act greedily): TD, SARSA (on-policy), Q-Learning (off-policy), DQN.
- Policy-Based (optimize $\pi_\theta$ directly): PG, REINFORCE, Actor–Critic.
TD Learning Intuition
Step-by-step updates with TD error $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$.
Lower variance than Monte Carlo; may be biased due to bootstrapping.
Why Policy-Based (with a Critic) Wins in Practice
Works with continuous/high-dimensional action spaces (e.g., token logits), enables stochastic exploration, and—via a Critic as baseline—reduces gradient variance: $A(s,a) = Q(s,a) - V(s)$.
Stabilizing PG: TRPO → PPO
Raw policy gradients can “crash” with big steps.
- TRPO: maximizes a surrogate objective subject to a KL trust-region (second-order methods).
- PPO: simplifies with clipped ratios (or adaptive KL penalty) using first-order optimization — now the default for large-scale RL.
Importance Sampling (IS)
Reuses data from the old policy to estimate objectives for the new policy via ratios $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$.
GAE (Generalized Advantage Estimation)
“Advanced TD error” that blends multi-step TD with decay: $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \, \delta_{t+l}$

Strikes a bias–variance balance (typical $\gamma = 0.99, \lambda = 0.95$).

🧭 What’s Coming in Part II (PPO & GRPO for LLMs)

In the next part, we’ll go from theory to practice:

PPO Training Loop for LLMs (RLHF/RLAIF)
- Phase 1: Set up Actor $\pi_\theta$, Old Policy $\pi_{\theta_{\text{old}}}$, Critic $V_\phi$, and Reward Model.
- Phase 2: Collect trajectories by generating completions with frozen $\pi_{\theta_{\text{old}}}$; record old log-probs.
- Phase 3: Compute rewards (RM + KL to reference), values, and advantages via GAE.
- Phase 4: Optimize Actor with PPO-Clip and Critic with TD loss over multiple epochs; copy back $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$.
LLM-Specific Details
- Token-level vs. sequence-level rewards, masks, truncation, KL control to a reference model, entropy bonus, advantage normalization, early stopping on KL.
GRPO (DeepSeek-style)
- Group generation per prompt, rule-based or RM rewards, no Critic.
- Group standardization advantage: $A = \frac{R - \text{mean}(R)}{\text{STD}(R)}$.
- Lower memory footprint; KL constraint to reference for stability.

Goal: Provide a clear, reproducible PPO/GRPO recipe you can implement for LLM alignment—complete with diagrams, equations, and mini checklists.