Reinforcement Learning (RL) — From Fundamentals to PPO & GRPO in LLMs (II)
This blog continues with advanced policy optimization techniques such as TRPO and PPO, and concludes with the application of RL to Large Language Models (LLMs) via PPO and GRPO.
🚀 Section 2: Policy Gradient and Trust Region Optimization
… In the last blog, we introduced the fundamentals of PPO. Let’s now walk through the PPO training process step by step.
🧭 PPO Training Process — Step-by-Step Phases (with LLM Examples)
The Proximal Policy Optimization (PPO) training process can be divided into four main phases, forming the foundation of RLHF (Reinforcement Learning from Human Feedback) or RLAIF used in modern Large Language Model (LLM) training.

Each phase connects to the Actor–Critic structure, Generalized Advantage Estimation (GAE), and the on/off-policy hybrid design of PPO.
🧩 Phase 1: Preparation of Models and Data Structures
PPO requires three core components that work together in an Actor–Critic structure:
- Policy Model (πθ / Actor)
  - The main trainable model that outputs a probability distribution over possible actions.
  - In LLMs, this corresponds to predicting token probabilities via SoftMax.
  - Updated each iteration via the PPO objective (Clip or Penalty form).
- Old Policy Model (πθₒₗd / Reference Policy)
  - A frozen snapshot of the Actor before the current update cycle.
  - Used to collect rollouts (trajectories) and compute importance sampling ratios for off-policy correction.
  - Provides stability by ensuring all data in a batch come from the same, fixed policy distribution.
- Value Model (Vϕ / Critic)
  - A separate network that estimates the expected return $ V(s) $.
  - Trained via Temporal Difference (TD) loss to serve as a baseline for computing advantages.
  - Reduces variance in policy gradients and stabilizes learning.
🧠 LLM Example:
In RLHF training:
- Actor (πθ): GPT-like model generating completions.
- Old Policy (πold): frozen copy used for rollout generation.
- Critic (Vϕ): small value head attached to the model, predicting scalar “expected reward.”
The system collects tuples:
\[(s_t, a_t, r_t, s_{t+1}, P_{\theta_{old}}(a_t \mid s_t))\]
for use in later advantage estimation and policy updates.
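That tuple maps naturally onto a small record type; a minimal sketch (field names are illustrative, not from any particular RL library):

```python
from typing import NamedTuple

class Transition(NamedTuple):
    """One PPO experience sample collected under the frozen old policy."""
    state: str          # prompt + tokens generated so far
    action: int         # token id chosen at this step
    reward: float       # r_t (often 0 until the final token)
    next_state: str     # state after appending the chosen token
    old_logprob: float  # log P_old(a_t | s_t), frozen at collection time

# Example: storing one step of a rollout
step = Transition(state="Explain overfitting", action=1234,
                  reward=0.0, next_state="Explain overfitting in",
                  old_logprob=-2.1)
```

The `old_logprob` field is the piece PPO cannot recompute later: it must be recorded at collection time, under the frozen policy.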
🎬 Phase 2: Trajectory Collection (Data Generation)
The Old Policy Model interacts with the environment (or prompt dataset) to produce experience samples.
🧩 General PPO Process
- Sample States (Prompts):
  The agent receives a state $ s_t $. In standard RL, this is the environment’s observation; in LLMs, this corresponds to a text prompt or context.
- Generate Action Distribution:
  The Old Policy $ \pi_{\theta_{old}} $ outputs probabilities for each possible action $ a_t $.
  - Continuous control → Gaussian policy.
  - LLM → SoftMax over vocabulary tokens.
- Sample Actions:
  Draw an action $ a_t \sim \pi_{\theta_{old}}(a_t \mid s_t) $ and record its log-probability $ \log P_{\theta_{old}}(a_t \mid s_t) $.
- Receive Feedback:
  The environment provides:
  - Reward $ r_t $ — immediate feedback for the chosen action.
  - Next state $ s_{t+1} $ — what the environment (or text) looks like after that action.
- Store Trajectories:
  Save the full sequence
  \[(s_t, a_t, r_t, s_{t+1}, P_{\theta_{old}}(a_t \mid s_t))\]
  These rollouts form the on-policy dataset for the next update phase.
🧠 LLM Example: PPO for RLHF
- Sample Prompts:
  Draw prompts such as “Explain overfitting in two sentences.”
- Generate Actions (Tokens):
  The Old Policy $ \pi_{\theta_{old}} $ autoregressively samples tokens one by one.
  Each token = one RL “action.” Log-probabilities are stored for every token.
- Receive Feedback:
  - A Reward Model (RM) scores the final completion (e.g., based on helpfulness).
  - A KL penalty keeps behavior close to a reference model $ \pi_{ref} $ (like base GPT-3):
  \[R_{\text{total}} = R_{\text{RM}} - \beta \cdot D_{\text{KL}}\]
  Example: $ R_{\text{RM}} = 0.8 $, $ D_{\text{KL}} = 0.05 $, $ \beta = 6 $ → $ R_{\text{total}} = 0.8 - 6 \times 0.05 = 0.5 $
- Store Trajectories:
  Each completion (sequence of tokens) is stored as
  \[(s_t, a_t, r_t, P_{\theta_{old}}(a_t \mid s_t))\]
  where:
  - $ s_t $: prompt + preceding tokens
  - $ a_t $: next token chosen
  - $ r_t $: sequence-level or token-level reward
  - $ P_{\theta_{old}} $: old policy probability
| Step | PPO Meaning | LLM Equivalent |
|---|---|---|
| 1 | Sample environment states | Sample text prompts |
| 2 | Policy outputs action distribution | Token logits (SoftMax) |
| 3 | Sample an action | Sample a token |
| 4 | Receive reward and next state | Reward Model + KL penalty |
| 5 | Store trajectory | Save prompt, tokens, rewards, log-probs |
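The collection loop can be sketched with a toy softmax policy over a four-token vocabulary (the vocabulary, logits, and prompt are invented for illustration):

```python
import math
import random

random.seed(0)

VOCAB = ["over", "fitting", "under", "<eos>"]

def old_policy_logits(state):
    # Stand-in for the frozen old policy's forward pass (toy numbers).
    return [1.0, 0.5, -0.5, 0.2]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def collect_rollout(prompt, max_len=5):
    """Sample tokens from the old policy, recording log-probs for later ratios."""
    state, actions, logprobs = prompt, [], []
    for _ in range(max_len):
        probs = softmax(old_policy_logits(state))
        a = random.choices(range(len(VOCAB)), weights=probs)[0]
        actions.append(a)
        logprobs.append(math.log(probs[a]))   # log pi_old(a_t | s_t)
        state = state + " " + VOCAB[a]
        if VOCAB[a] == "<eos>":
            break
    return actions, logprobs

actions, logprobs = collect_rollout("Explain overfitting")
```

A real implementation would batch prompts and run a transformer forward pass, but the bookkeeping — one action and one stored log-probability per token — is exactly this.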
⚖️ Phase 3: Advantage Calculation (Critic + GAE)
After collecting trajectories, PPO estimates how much better or worse each action was relative to expectations.
- Value Prediction:
  Feed all $ s_t, s_{t+1} $ into the Critic to get $ V(s_t) $, $ V(s_{t+1}) $.
- TD Error (δ):
  \[\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]
  Measures how surprising the outcome was.
- Generalized Advantage Estimation (GAE):
  \[\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}\]
  Smoothly combines multi-step TD errors to trade off bias and variance.
🧠 LLM Example:
- Reward for full completion $ r_T = 0.5 $, others $ r_t = 0 $.
- Critic predicts $ V(s_T) = 0.3 $, $ V(s_{T-1}) = 0.25 $.
- δ and GAE propagate positive advantage to tokens that contributed to the better output, and negative to unhelpful ones.
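The δ and GAE propagation above can be checked numerically (assuming the episode ends at $T$, so $V(s_{T+1}) = 0$):

```python
gamma, lam = 0.99, 0.95

# Rewards and critic values for the last two steps (numbers from the example).
r = [0.0, 0.5]            # r_{T-1}, r_T
V = [0.25, 0.3, 0.0]      # V(s_{T-1}), V(s_T), V(s_{T+1}) = 0 (terminal)

# TD errors: delta_t = r_t + gamma * V_{t+1} - V_t
delta = [r[t] + gamma * V[t + 1] - V[t] for t in range(2)]

# GAE, computed backwards: A_t = delta_t + gamma * lam * A_{t+1}
adv = [0.0, 0.0]
adv[1] = delta[1]
adv[0] = delta[0] + gamma * lam * adv[1]
```

Both advantages come out positive here — the final-token reward (0.5) exceeded the critic’s expectation (0.3), and the discounted surprise propagates back to the preceding token.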
🧮 Phase 4: Optimization and Model Updates
This is where PPO performs its signature trust-region optimization — improving the policy safely while keeping it close to its old version.
🔹 Step 1: Compute New Policy Probabilities
At the start of optimization:
\[\pi_\theta \leftarrow \pi_{\theta_{old}}\]
So initially, both models are identical — $ r_t(\theta) = 1.0 $ everywhere.
We still run a forward pass through $ \pi_\theta $ to:
- Enable gradient computation, and
- Compare new vs. old probabilities during training.
Compute the importance ratio:
\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\]
🧠 LLM Example:
For the token “overfitting”:
- $ \log \pi_{old} = -2.1 $, $ \log \pi_\theta = -1.8 $.
- Ratio $ r_t = e^{0.3} = 1.35 $ → new model increased its probability by 35%.
Initially, $ r_t = 1.0 $, but after gradient updates, the ratios diverge slightly.
🔹 Step 2: Actor Update (PPO Objective)
Optimize the PPO-Clip loss:
\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \Big]\]
- $ \epsilon $ (typically 0.1–0.2) bounds trust-region changes.
- Gradient ascent increases probabilities of actions with positive advantages.
🧠 LLM Example:
$ r_t = 1.35 $, $ \epsilon = 0.2 $ → clipped to 1.2.
$ \hat{A}_t = 0.2 $ → unclipped term = 0.27, clipped = 0.24 → choose min = 0.24 for stability.
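These numbers can be verified with a minimal per-token helper (a sketch of the clipped term only, not a full training step):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token PPO-Clip objective term: min(r*A, clip(r)*A)."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Example from the text: r_t = 1.35 is clipped to 1.2, so with A = 0.2
# the term is min(0.27, 0.24) = 0.24.
term = ppo_clip_term(ratio=1.35, advantage=0.2)
```

Note that the `min` keeps the *pessimistic* value: a ratio that has already drifted past $1+\epsilon$ contributes no extra gradient incentive to drift further.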
🔹 Step 3: Critic Update (TD Learning)
The Value Model is trained with TD loss:
\[L_V(\phi) = (V_\phi(s_t) - R_t^{\text{target}})^2\]
where $ R_t^{\text{target}} $ is the bootstrapped return from rewards and next-state values.
This improves the Critic’s baseline accuracy and ensures that advantages eventually center around zero.
🔹 Step 4: Data Reuse and Epoch Training
PPO is semi-on-policy:
- Data come from $ \pi_{old} $ (on-policy).
- But the importance ratios allow limited off-policy reuse for multiple epochs.
Each epoch:
- Shuffle and mini-batch trajectories.
- Recompute $ r_t(\theta) $, $ L^{\text{CLIP}} $, $ L_V $.
- Perform several gradient updates (3–10 epochs).
- Optionally add an entropy bonus:
\[L_{\text{ent}} = c_H \, \mathbb{E}_t\big[H(\pi_\theta(\cdot \mid s_t))\big]\]
to encourage exploration.
🧠 LLM Example:
- A batch of 4 prompt-response pairs.
- Train for 4 epochs with Adam optimizer.
- Each epoch recomputes ratios and advantages.
- After training, copy weights:
\[\pi_{\theta_{old}} \leftarrow \pi_\theta\]
Then collect new rollouts and repeat.
🔹 Step 5: Combined PPO Objective
\[L(\theta, \phi) = \underbrace{\mathbb{E}_t\!\left[\min(r_t \hat{A}_t, \text{clip}(r_t,1-\epsilon,1+\epsilon)\hat{A}_t)\right]}_{\text{Actor (PPO-Clip)}} - c_V \, \underbrace{\mathbb{E}_t[(V_\phi(s_t)-R_t^{\text{target}})^2]}_{\text{Critic Loss}} + c_H \, \underbrace{\mathbb{E}_t[H(\pi_\theta(\cdot \mid s_t))]}_{\text{Entropy Bonus}}\]
Typical coefficients:
- $ c_V = 0.5 $
- $ c_H = 0.01 $
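A single-sample version of the combined objective, written in minimization form with those coefficients, can be sketched as (all input numbers are illustrative):

```python
def ppo_combined_loss(ratio, advantage, value, value_target, entropy,
                      eps=0.2, c_v=0.5, c_h=0.01):
    """Single-sample PPO loss to minimize: -actor + c_v*critic - c_h*entropy."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    actor = min(ratio * advantage, clipped * advantage)   # PPO-Clip term
    critic = (value - value_target) ** 2                  # value regression
    return -actor + c_v * critic - c_h * entropy

# Illustrative token: clipped actor term 0.24, critic error 0.2^2, entropy 2.0
loss = ppo_combined_loss(ratio=1.35, advantage=0.2,
                         value=0.3, value_target=0.5, entropy=2.0)
```

In practice this is averaged over all valid tokens in a minibatch and differentiated with respect to both $\theta$ and $\phi$ at once.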
🔁 PPO Training Loop Summary
| Stage | What Happens | LLM Analogy |
|---|---|---|
| Phase 1 | Prepare policy, reference, and value models | Actor = GPT head, Critic = value head |
| Phase 2 | Generate rollouts with frozen πₒₗd | Generate completions via old model |
| Phase 3 | Compute rewards + GAE advantages | Combine RM reward + KL penalty into tokenwise advantages |
| Phase 4 | Optimize Actor + Critic | Update πθ and Vϕ → copy πₒₗd ← πθ and repeat |
At the start of each round,
$ \pi_\theta = \pi_{old} $,
so the first forward pass yields $ r_t = 1 $.
After one gradient step, the policies diverge slightly —
and that small, clipped improvement step is what makes PPO both stable and effective.
| Phase | Key Objective | Mechanism | Example |
|---|---|---|---|
| 1. Preparation | Initialize Actor, Critic, and Reference Models | Separate networks with shared embeddings (in LLMs) | Copy policy weights to π_old |
| 2. Trajectory Collection | Generate data | Old policy generates completions and rewards | “Explain overfitting” → RM = 0.8, KL = 0.05 |
| 3. Advantage Calculation | Evaluate action quality | Use TD + GAE for smooth advantage estimates | γ=0.99, λ=0.95 |
| 4. Optimization | Update Actor and Critic | Use PPO-Clip objective, TD value loss, multiple epochs | r_t=1.35 → clipped to 1.2, A=0.2 |
🤖 Section 3: Reinforcement Learning in LLMs
The application of Reinforcement Learning (RL) to Large Language Models (LLMs) is crucial for aligning model behavior with human preferences and specific goals. The LLM training process—particularly through Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO)—relies on multiple interacting models that provide feedback, stability, and direction for learning.
The overall pipeline treats the LLM as the core policy (Actor) and introduces auxiliary models to produce reward signals and stabilize optimization.
🧩 Part 1: Large Language Model Mechanics as an RL Agent

A causal (autoregressive) language model forms the foundation for RL fine-tuning.
- LLM as Policy Model (Actor):
  The LLM serves as the Policy Model (π), outputting a probability distribution over possible next tokens (actions) given the current state (the input prompt and previously generated tokens).
- Token Generation:
  The LLM processes input embeddings through the causal transformer and outputs a logits vector of size \(1 \times \text{vocab\_size}\), representing token probabilities.
  Token generation proceeds auto-regressively, where each token is treated as an action in the RL framework.
- LLM as Value Model (Critic):
  The same transformer architecture can be repurposed as a Value Model (V) by adding a linear projection head that maps the final hidden state to a single scalar (1×1).
  This scalar represents the expected value of the generated sequence or state.
💬 Part 2: Training the Reward Model (RM)

Since LLMs don’t intrinsically produce reward signals, a Reward Model (RM) must be trained using human preference data.
- Data Collection:
  Human annotators rank pairs (or sets) of model responses to the same prompt, indicating which output is preferred. Ranking is often more consistent than absolute scoring.
- Model Training:
  The RM is trained using the Bradley–Terry model, optimized via Maximum Likelihood Estimation (MLE).
  The loss function encourages the RM to assign a higher reward value to responses preferred by humans:
  \[\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\big(r_\psi(x, y_w) - r_\psi(x, y_l)\big)\Big]\]
  where $ y_w $ is the preferred response and $ y_l $ the rejected one.
- Reward Model Output:
  The trained RM takes (prompt + completion) as input and outputs a single scalar reward indicating the quality of that completion.
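The Bradley–Terry objective reduces to a logistic loss on the reward margin; a minimal sketch (the reward values are illustrative, not from a trained model):

```python
import math

def bt_pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry / MLE pairwise loss: -log sigmoid(r_w - r_l)."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin between preferred and rejected rewards gives a smaller loss,
# pushing the RM to separate good completions from bad ones.
loose = bt_pairwise_loss(0.6, 0.5)   # small margin -> loss near log(2)
tight = bt_pairwise_loss(3.0, -1.0)  # large margin -> loss near 0
```

Only the *difference* of rewards enters the loss, which is why RM scores are meaningful relatively but have no absolute scale.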
⚙️ Part 3: Step-by-Step PPO Training in LLMs
Proximal Policy Optimization (PPO) in LLM training typically coordinates four cooperating models:
- Policy Model $ \pi_\theta $ — Actor
- Old Policy / Reference Policy $ \pi_{\theta_{\text{old}}} $ — data generator / denominator in ratios
- Value Model $ V_\phi $ — Critic
- Reward Model (RM) — scalar reward provider (or rule-based scorer)
Iteration anatomy:
1) Collect trajectories with frozen $ \pi_{\theta_{\text{old}}} $ →
2) Score and compute advantages with RM + $ V_\phi $ (GAE) →
3) Optimize $ \pi_\theta $ (actor) and $ V_\phi $ (critic) for a few epochs →
4) Refresh $ \pi_{\theta_{\text{old}}} \leftarrow \pi_\theta $ and repeat.
🧭 Step 1: Data Collection and Probability Generation

- Freeze the data generator:
  Set $ \pi_{\theta_{\text{old}}} \leftarrow \pi_\theta $ and stop gradients on $ \pi_{\theta_{\text{old}}} $.
  This guarantees the batch is on-policy with respect to a fixed policy.
- Generate trajectories (prompts → completions):
  Sample prompts $ x $, then roll out tokens $ a_{1:T} \sim \pi_{\theta_{\text{old}}}(\cdot \mid s_t) $, where the state $ s_t $ is the prompt plus the previous tokens.
- Record old log-probabilities (denominators):
  Store $ \log \pi_{\theta_{\text{old}}}(a_t \mid s_t) $ for all tokens. These serve as the denominator in the importance ratio later.
- (Optional) Masks and truncation:
  Keep attention masks; mark special tokens; truncate overly long completions to a max length to bound compute and variance.
Notes
- You are not updating anything here. You only collect states $ s_t $, actions $ a_t $, and old log-probabilities $ \log \pi_{\theta_{\text{old}}}(a_t \mid s_t) $ (plus masks).
💎 Step 2: Reward and Advantage Calculation

- Intrinsic Reward (sequence-level or token-shaped):
  The Reward Model (or rule-based scorer) outputs a scalar $ R_{\text{RM}} $ for each completion.
  - Commonly the sequence-level reward is assigned only at the final token; intermediate tokens get $ r_t = 0 $:
  \[r_t = \begin{cases} 0, & t < T \\ R_{\text{RM}}, & t = T \end{cases}\]
  - Some setups shape rewards across tokens (e.g., format adherence, intermediate checks).
- KL Divergence Penalty (alignment to reference):
  Add a KL penalty to keep behavior close to a reference model $ \pi_{\text{ref}} $ (often the base pretrained model):
  \[R_{\text{Total}} = R_{\text{RM}} - \beta \cdot D_{KL}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\]
  - $ \beta $ may be adaptive: increase it if the observed KL exceeds a target; decrease it otherwise.
  The KL term is computed between the current policy and the reference policy over all tokens, usually as:
  \[D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) = \sum_t \pi_\theta(a_t \mid s_t) \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}\]
  In practice, this is approximated as a per-token penalty:
  \[r_t^{\text{KL}} = - \beta \left( \log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t) \right)\]
  so each token contributes its own negative reward proportional to how far it diverges from the reference model.
- Combine the sequence-level intrinsic reward and the token-level KL penalty:
  \[r^{\text{total}}_t = \begin{cases} -\beta \cdot D_{\text{KL},t}, & t < T \\ R_{\text{RM}} - \beta \cdot D_{\text{KL},T}, & t = T \end{cases}\]
  That means:
  - Every token gets its own per-token total reward due to the KL penalty.
  - The final token also includes the RM reward in addition to the KL penalty.
- Value predictions (baselines):
  Run the Critic to obtain $ V_t = V_\phi(s_t) $ and $ V_{t+1} = V_\phi(s_{t+1}) $.
- Temporal-Difference error and GAE:
  Define the TD error:
  \[\delta_t = r_t + \gamma V_{t+1} - V_t\]
  Then compute Generalized Advantage Estimation (per token):
  \[\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}\]
  - Typical $ \gamma = 0.99 $, $ \lambda = 0.95 $.
  - Normalize advantages per batch (zero mean, unit variance) to stabilize learning.
- Value targets (for critic loss):
  The value target is what the critic should predict. A common bootstrapped target is:
  \[R^{\text{target}}_t = \hat{A}_t + V_t\]
  Each token’s target is built from its corresponding advantage and its old value estimate. It is called a bootstrapped target because it uses both:
  - the advantage estimate (which depends on actual rewards + future estimates), and
  - the current value prediction (baseline).
  This corresponds to the estimated total return for that timestep. The critic loss later minimizes:
  \[\mathcal{L}_{\text{value}} = \frac{1}{2} (V_\phi(s_t) - R^{\text{target}}_t)^2\]
Notes
- If rewards are sparse (only at final token), GAE propagates signal back through tokens with decay $ (\gamma \lambda)^l $.
- Keep careful masking so padding tokens do not affect losses or statistics.
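The piecewise total-reward formula from this step can be sketched in a few lines (the log-probabilities, RM score, and β are made-up illustration values):

```python
def token_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token KL penalties, plus the sequence-level RM reward on the last token."""
    rewards = []
    T = len(logp_policy)
    for t in range(T):
        kl_t = logp_policy[t] - logp_ref[t]   # per-token KL approximation
        r = -beta * kl_t                      # every token pays its own KL penalty
        if t == T - 1:
            r += rm_score                     # RM reward only at the final token
        rewards.append(r)
    return rewards

# Three-token completion: the policy is slightly more confident than the
# reference on tokens 0 and 2, identical on token 1.
rewards = token_rewards(logp_policy=[-1.8, -2.0, -0.9],
                        logp_ref=[-2.1, -2.0, -1.0],
                        rm_score=0.8)
```

These per-token totals are exactly what feeds the TD errors and GAE above; sparse RM reward plus dense KL penalty is the usual RLHF shape.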
🧮 Step 3: Policy and Value Updates
At the start of optimization, set the trainable actor to the old weights:
$ \pi_\theta \leftarrow \pi_{\theta_{\text{old}}} $.
The first forward pass therefore yields ratio $ r_t = 1 $, but we still need it to build the computation graph for backprop.
After the first gradient step, $ \pi_\theta $ diverges slightly from $ \pi_{\theta_{\text{old}}} $.
- Compute new probabilities (numerators):
  Forward the stored tokens through the current trainable $ \pi_\theta $ and form the importance ratio:
  \[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\]
- Policy update (Actor) with PPO-Clip:
  Optimize the clipped objective:
  \[L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\; \text{clip}\!\big(r_t(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_t \Big) \right]\]
  - Typical $ \epsilon \in [0.1, 0.2] $.
  - Each token has its own advantage \(\hat{A}_t\) (from GAE).
  - The clipping ensures that if a token’s probability ratio \(r_t(\theta)\) drifts outside \([1-\epsilon, 1+\epsilon]\), its gradient contribution is limited.
  - Average this over all valid tokens in the batch: the actor objective is literally a per-token clipped loss, averaged across tokens and samples.
- Add an entropy bonus to encourage exploration:
  \[L_{\text{ent}} = -\alpha \, \mathbb{E}_t\!\big[ H(\pi_\theta(\cdot \mid s_t)) \big]\]
For each token step t, we compute the entropy of the policy distribution:
\[H(\pi_\theta(\cdot \mid s_t)) = - \sum_a \pi_\theta(a \mid s_t) \, \log \pi_\theta(a \mid s_t)\]
where:
- $ \pi_\theta(a \mid s_t) $ is the probability of choosing token $ a $ given the state (context) $ s_t $.
- The summation runs over all possible tokens in the vocabulary.
- High entropy → the model spreads probability across multiple tokens → more exploration.
- Low entropy → the model becomes overly confident → less diversity, more deterministic behavior.
- Value update (Critic) with TD loss:
  Regress the value function toward targets:
  \[L_V(\phi) = \big(V_\phi(s_t) - R^{\text{target}}_t\big)^2\]
  - Often use value clipping (like PPO’s value-head clipping) or a Huber loss to reduce instability.
- Data reuse (semi-on-policy with limited off-policy correction):
  Run K epochs (e.g., 3–8) over minibatches of the same rollout batch. Each epoch recomputes $ r_t(\theta) $ and the losses.
  - Early-stop if KL to the reference exceeds a threshold.
  - Track and log approximate KL, clip fraction, value loss, and entropy.
- Refresh for next iteration:
  After finishing the epochs:
  - Copy weights: $ \pi_{\theta_{\text{old}}} \leftarrow \pi_\theta $
  - Collect fresh data with the new $ \pi_{\theta_{\text{old}}} $
In PPO for LLMs, the actor (policy) and value (critic) are trained together in one combined loss function during each minibatch update. Combined loss (one minibatch):
\[L(\theta,\phi) = \underbrace{\mathbb{E}_t\!\left[\min(r_t \hat{A}_t,\; \text{clip}(r_t,1-\epsilon,1+\epsilon)\hat{A}_t)\right]}_{\text{Actor}} - c_V \, \underbrace{\mathbb{E}_t\!\big[(V_\phi - R^{\text{target}})^2\big]}_{\text{Critic}} + c_H \, \underbrace{\mathbb{E}_t\!\big[H(\pi_\theta)\big]}_{\text{Entropy}}\]
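The entropy term $ H(\pi_\theta) $ in that combined loss can be checked on toy distributions (the probabilities below are invented, not real token distributions):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p log p (terms with p = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal spread -> high entropy
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # confident -> low entropy
```

The uniform distribution attains the maximum $ \log 4 \approx 1.386 $, while the peaked one is far lower — which is why the entropy bonus pushes back against premature determinism.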
Implementation tips
- Advantage whitening (normalize per batch) is standard.
- Length normalization / penalty can prevent overly long outputs.
- Clip fraction near 0.1–0.3 is a healthy training signal; exploding to ~1.0 suggests learning rate or ε is too high.
- Learning rates: often actor and critic share the same optimizer but may use different lrs/decays.
🔚 Recap Checklist (What to verify each iteration)
- Rollouts came from a frozen $ \pi_{\theta_{\text{old}}} $.
- Rewards include RM score and KL penalty to $ \pi_{\text{ref}} $.
- Advantages computed with GAE and normalized.
- Ratios $ r_t $ computed with trainable $ \pi_\theta $ (initially equal to old, diverges after first step).
- Actor optimized with PPO-Clip; Critic with TD loss; Entropy bonus applied.
- Epochs (3–8), minibatching, early KL stop, and copy-back $ \pi_{\theta_{\text{old}}} \leftarrow \pi_\theta $.
Mental model:
Collect with old → Score → Compute GAE → Update new (bounded by clip/KL) → Make old = new → Repeat.
🧠 Part 4: Group Relative Policy Optimization (GRPO)

GRPO (Group Relative Policy Optimization) is an advanced policy-based Reinforcement Learning (RL) algorithm developed by DeepSeek for optimizing Large Language Models (LLMs), notably in DeepSeekMath and DeepSeek-R1.
It builds upon PPO (Proximal Policy Optimization), which itself extends TRPO (Trust Region Policy Optimization) — but GRPO introduces a key innovation: group-based advantage estimation that eliminates the need for a Value Model (Critic).
1️⃣ Core Concept: Group Generation
The defining idea behind GRPO is its group-based data generation mechanism.
- Group Definition:
  For each single prompt (state), the model generates multiple completions — a group of candidate trajectories.
- Data Generation:
  A single input prompt is replicated multiple times so that the model (or reference policy) can generate diverse completions in parallel.
  Each completion (trajectory) is then scored individually with a reward signal.
This process transforms a single-prompt interaction into a mini-batch of competing trajectories, which are used to compute relative rewards and advantages within that group.
🧠 LLM Example:
Prompt: “Prove that the derivative of sin(x) is cos(x).”
The model generates 4 different completions:
- A correct symbolic proof.
- A short textual explanation.
- A wrong proof.
- A partially correct derivation.
Each of these responses gets an individual reward (via Reward Model or rule check), and the relative ranking among them defines their advantages.
2️⃣ Flexible Reward Mechanisms
GRPO supports multiple reward strategies, depending on the task:
- Reward Model (RM) Approach — e.g., DeepSeekMath
- Uses a standard, trained Reward Model that scores completions based on human preferences.
- The RM is trained using ranking data (e.g., Bradley–Terry model) and produces a scalar reward per completion.
- Rule-Based Reward Approach — e.g., DeepSeek-R1
- For domains with objective correctness (e.g., math, programming), rewards can be derived directly from task-specific checks.
- Examples include:
- Regex matching to verify the answer format.
- Unit tests for code generation tasks.
- Mathematical equality checks for symbolic reasoning.
- This approach eliminates the need for a trained Reward Model, reducing complexity and resource use.
🧠 LLM Example:
If a math model produces four solutions, GRPO can simply check:
- ✅ Correct final answer → +1.0
- ⚠️ Right reasoning but wrong numeric result → +0.5
- ❌ Incorrect → 0.0
These numeric rewards directly drive the policy update.
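A rule-based scorer of this kind can be a few lines of code; a sketch with a hypothetical answer-checking scheme mirroring the tiers above (real systems would use symbolic equality checks or unit tests rather than string comparison):

```python
def rule_based_reward(final_answer, expected, reasoning_ok):
    """Score a math completion without a trained Reward Model."""
    if final_answer == expected:
        return 1.0          # correct final answer
    if reasoning_ok:
        return 0.5          # right reasoning but wrong numeric result
    return 0.0              # incorrect

# Hypothetical completions for "differentiate sin(x)":
assert rule_based_reward("cos(x)", "cos(x)", True) == 1.0
assert rule_based_reward("sin(x)", "cos(x)", True) == 0.5
assert rule_based_reward("x^2", "cos(x)", False) == 0.0
```

Because the scorer is deterministic and cheap, it removes both the cost and the reward-hacking surface of a learned RM in objective domains.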
3️⃣ Advantage Calculation via Group Standardization
GRPO replaces the Critic-based Generalized Advantage Estimation (GAE) of PPO with a relative, group-based approach.
- No Critic Model:
  Unlike PPO, GRPO removes the Value Model (Critic) entirely. This simplifies training by removing the dependency on $ V(s) $ and the TD-error pipeline.
- Relative Advantage:
  Instead of comparing actions to $ V(s) $, GRPO compares each completion’s reward to the average reward in the same group.
- Standardization (Normalization):
  The advantage for each completion is computed as:
  \[A = \frac{R - \text{mean}(R)}{\text{STD}(R)}\]
  where $ R $ is the reward of the completion, and mean/std are computed over all completions for that same prompt.
This relative normalization ensures that the model learns to prefer completions better than its own average, rather than depending on an external baseline model.
🧠 LLM Example:
For a group of 4 completions with rewards: [1.0, 0.8, 0.4, 0.2]
- Mean = 0.6, Std = 0.32
- Advantages ≈ [+1.25, +0.63, −0.63, −1.25]
→ The best completion receives the strongest positive advantage; the worst receives the strongest negative advantage.
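The group standardization can be reproduced exactly; note that the example above rounds the std to 0.32, while the unrounded population std (≈0.316) gives advantages of about ±1.26 and ±0.63:

```python
import math

def group_advantages(rewards):
    """GRPO advantage: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]

adv = group_advantages([1.0, 0.8, 0.4, 0.2])
```

By construction the advantages sum to zero: every group contains winners and losers relative to its own mean, with no critic involved.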
4️⃣ Model Components and Efficiency
Although GRPO and PPO share conceptual foundations, their model requirements and computational efficiency differ significantly:
| Feature | PPO | GRPO | Notes |
|---|---|---|---|
| Policy Model (π) | Required | Required | The main trainable model generating completions. |
| Reference / Old Policy | Required (for importance sampling) | Often same as π | GRPO may share parameters between new and old policies. |
| Reward Model (RM) | Required | Optional (rule-based possible) | DeepSeekMath uses RM; DeepSeek-R1 uses rule-based scoring. |
| Value Model (Critic) | Required | Removed | GRPO eliminates GAE and TD updates. |
| Advantage Calculation | GAE using $ V(s) $ | Group reward standardization | Relative to group mean and std. |
| KL Constraint | Trust region vs. πold | KL penalty vs. πref | Keeps policy aligned with base model. |
5️⃣ Training Process Overview

- Prompt Sampling: Select prompts from dataset.
- Group Generation: For each prompt, sample N completions (e.g., N=4).
- Reward Computation: Evaluate each completion via RM or rule-based checks.
- Group Advantage Calculation: Normalize rewards → compute $ A = \frac{R - \text{mean}(R)}{\text{STD}(R)} $.
- Policy Update:
- Use PPO-like objective: \(L(\theta) = \mathbb{E}\left[ \min\left( r_t A, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A \right) \right]\)
- With KL constraint to reference model $ \pi_{ref} $: \(R_{\text{total}} = R - \beta D_{KL}(\pi_\theta \,\|\, \pi_{ref})\)
- Iteration: Repeat over multiple prompts and mini-batches.
6️⃣ Advantages of GRPO
- 🚀 Fewer Models Required:
  No Critic network; optional Reward Model → lighter training setup.
- 💾 Memory Efficiency:
  Removing GAE and TD updates → less GPU memory required.
- ⚖️ Stable and Interpretable:
  The group normalization inherently bounds advantage magnitudes, improving stability.
- 🎯 Task-Specific Adaptability:
  Rule-based reward evaluation enables efficient alignment on objective domains (math, coding).
- 🔁 Simplified Loop:
  Often uses a single shared policy for both old and new roles, removing the need for separate importance sampling passes.
💡 In essence:
GRPO transforms PPO’s absolute reward framework into a relative competition between completions from the same prompt.
Each prompt becomes a “mini-tournament,” where the model learns from the relative ranking of its own outputs — achieving alignment efficiently, without needing a separate Critic network.
🎬 Analogy

- PPO resembles a movie production process — the Policy (Director) consults the Critic (Value Model) and tests ideas with a Focus Group (Reward Model) to refine outcomes.
- GRPO, in contrast, works like a live focus-group test, directly comparing multiple versions of a scene and optimizing based on relative audience reaction within the group.
🧩 Summary
Reinforcement Learning forms the backbone of modern alignment and optimization in LLMs:
- TD Learning → Q-Learning → Policy Gradient → TRPO → PPO → GRPO
- The evolution shows a trade-off between stability, efficiency, and alignment quality.
- In the LLM era, PPO and GRPO bridge human feedback with model fine-tuning, enabling more aligned and capable AI systems.
