🧭 Policy-Based Reinforcement Learning — Directly Learning to Act
Policy-Based Reinforcement Learning (RL), also known as Policy Learning, focuses on directly modeling and optimizing the agent’s policy $\pi$, i.e., the agent’s behavior function.
This contrasts with Value-Based RL (like DQN), which indirectly learns the policy by estimating the optimal value function $Q^*(s, a)$.
This blog explains the foundations, mathematics, and algorithms of Policy-Based RL, including REINFORCE and Actor–Critic methods.
1. Policy Function Approximation
🎯 What Is a Policy?
The policy function $\pi(a \mid s)$ defines how an agent chooses actions based on the current state.
It is a probability distribution over possible actions.
Example:
\[\pi(\text{left} \mid s) = 0.2, \quad \pi(\text{right} \mid s) = 0.1, \quad \pi(\text{up} \mid s) = 0.7\]
When the agent is in state $s$, it randomly samples an action $A \sim \pi(\cdot \mid s)$, meaning “up” is most likely to be chosen.
⚙️ Why We Need a Policy Network
- Simple Cases: If there are few states and actions, the policy can be stored in a lookup table. Example:
| State | Left | Right | Up |
|---|---|---|---|
| $s_1$ | 0.3 | 0.4 | 0.3 |
| $s_2$ | 0.1 | 0.2 | 0.7 |
But in real-world problems (like robotics or video games), states are high-dimensional and continuous — impossible to store in a table.
- Scalability Challenge: When the number of states or actions grows large (or infinite), tabular methods break down.
- Neural Network Solution: To handle large spaces, we use a Policy Network $\pi(a \mid s; \theta)$ parameterized by trainable weights $\theta$. The network directly outputs the probabilities of all possible actions given state $s$.
🧠 Policy Network Architecture
- Input: State $s$ (e.g., image, sensor readings, position vector).
- Hidden Layers: Convolutional or dense layers to extract features.
- Output: Action probabilities via a softmax layer, ensuring:
\[\sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) = 1\]
Example:
If the input is a game screenshot, the network might output:
| Action | Probability |
|---|---|
| Left | 0.15 |
| Right | 0.10 |
| Jump | 0.75 |
The agent then randomly samples an action from this distribution (favoring “Jump”).
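As a concrete sketch, here is what such a policy network might look like in PyTorch (the layer sizes, state dimension, and action count are illustrative assumptions, not tied to any particular environment):

```python
# Minimal policy network sketch: maps a state vector to action probabilities.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax makes the outputs a valid probability distribution over actions:
        # non-negative and summing to 1.
        return torch.softmax(self.net(state), dim=-1)

# Sampling an action from the policy distribution pi(. | s; theta):
policy = PolicyNetwork(state_dim=4, num_actions=3)
probs = policy(torch.randn(4))                      # e.g., tensor([0.15, 0.10, 0.75])
action = torch.distributions.Categorical(probs).sample()
```

The `Categorical` distribution handles the sampling step; the same object can later supply $\log \pi(a \mid s; \theta)$ when we compute gradients.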
2. Policy Objective and Policy Gradient
🎯 Objective Function
The goal is to find parameters $\theta$ that maximize the expected performance of the policy.
This is expressed as:
\[J(\theta) = \mathbb{E}[V(S; \theta)]\]
Here, $V(s; \theta)$ is the expected value of being in state $s$ under policy $\pi(a \mid s; \theta)$:
\[V(s; \theta) = \sum_a \pi(a \mid s; \theta) \, Q_\pi(s, a)\]
🧮 Policy Gradient Ascent
To maximize $J(\theta)$, we update parameters via gradient ascent:
\[\theta \leftarrow \theta + \beta \, \frac{\partial V(s; \theta)}{\partial \theta}\]
where $\beta$ is the learning rate (step size).
The policy gradient — the derivative of the expected return with respect to $\theta$ — is given by:
\[\frac{\partial V(s; \theta)}{\partial \theta} = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)} \left[ \frac{\partial \log \pi(A \mid s; \theta)}{\partial \theta} \, Q_\pi(s, A) \right]\]
This formula says:
- Sample actions $A$ from the current policy.
- Weight their gradients by their corresponding Q-values (how good that action was).
🔍 Intuition
The term $\frac{\partial \log \pi(A \mid s; \theta)}{\partial \theta}$ acts as a directional signal — telling the network how to adjust its parameters to make good actions more probable and bad actions less probable.
Example:
If “Jump” yields high future reward, the network will increase $\pi(\text{Jump} \mid s; \theta)$.
If “Duck” leads to losing points, the probability of “Duck” will decrease.
3. Understanding the Policy Gradient Objective and Approximate State-Value Function
1️⃣ The Approximate State-Value Function $V(s; \theta)$
In Policy-Based Reinforcement Learning, we use a policy network $\pi(a \mid s; \theta)$ to represent the probability of taking an action $a$ given a state $s$, where $\theta$ are the trainable parameters.
The approximate state-value function is defined as: \(V(s; \theta) = \sum_a \pi(a \mid s; \theta) \, Q_\pi(s, a)\)
This represents the expected return starting from state $s$, following the stochastic policy $\pi(\cdot \mid s; \theta)$ thereafter.
- The term $\pi(a \mid s; \theta)$ gives the probability of each possible action.
- The term $Q_\pi(s, a)$ gives the expected reward from taking that action.
- Their product and summation capture the expected value over all possible actions.
Hence, $V(s; \theta)$ measures how good the current policy (parameterized by $\theta$) is when starting from state $s$.
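As a small worked example, reuse the earlier action probabilities $\pi(\text{left} \mid s) = 0.2$, $\pi(\text{right} \mid s) = 0.1$, $\pi(\text{up} \mid s) = 0.7$ and suppose (purely hypothetically) that $Q_\pi(s, \text{left}) = 1$, $Q_\pi(s, \text{right}) = 0$, and $Q_\pi(s, \text{up}) = 5$. Then:
\[V(s; \theta) = 0.2 \times 1 + 0.1 \times 0 + 0.7 \times 5 = 3.7\]
Shifting probability mass toward “up” would raise this value, which is exactly what the policy gradient update does.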
2️⃣ The Objective Function $J(\theta)$ and Its Expectation
The objective function $J(\theta)$ is defined as the expectation of the state-value function: \(J(\theta) = \mathbb{E}_{S \sim p_\pi(S)}[V(S; \theta)]\)
Here:
- $S$ represents states sampled from the state distribution under the current policy.
- The expectation $\mathbb{E}[V(S; \theta)]$ measures the average performance of the policy over all states it encounters.
So, maximizing $J(\theta)$ means maximizing the expected long-term return of the policy.
Because it’s an expectation, we can only compute a sampled estimate during training — this is what makes it a stochastic gradient.
3️⃣ Why We Take the Gradient of $J(\theta)$
To improve the policy, we perform gradient ascent on $J(\theta)$: \(\theta \leftarrow \theta + \beta \, \nabla_\theta J(\theta)\)
where:
- $\beta$ is the learning rate,
- $\nabla_\theta J(\theta)$ is the policy gradient — the direction in parameter space that most increases the expected return.
This is analogous to climbing a hill — each update nudges $\theta$ uphill toward higher rewards.
4️⃣ Deriving the Policy Gradient from $V(s; \theta)$
Starting with: \(V(s; \theta) = \sum_a \pi(a \mid s; \theta) \, Q_\pi(s, a)\)
Taking the derivative with respect to $\theta$ gives: \(\nabla_\theta V(s; \theta) = \sum_a \nabla_\theta \pi(a \mid s; \theta) \, Q_\pi(s, a)\)
- The term $\nabla_\theta \pi(a \mid s; \theta)$ measures how the policy probabilities change when $\theta$ changes.
- The term $Q_\pi(s, a)$ acts as a weight — actions with high Q-values will push the gradient stronger in their direction.
Using the log-derivative trick: \(\nabla_\theta \pi(a \mid s; \theta) = \pi(a \mid s; \theta) \, \nabla_\theta \log \pi(a \mid s; \theta)\)
we can rewrite the gradient as: \(\nabla_\theta V(s; \theta) = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)} \!\left[ \nabla_\theta \log \pi(A \mid s; \theta) \, Q_\pi(s, A) \right]\)
This is the Policy Gradient Theorem, which forms the mathematical foundation for algorithms like REINFORCE and Actor–Critic.
5️⃣ Stochastic Policy Gradient Estimate (Practical Form)
In practice, we approximate this expectation using samples: \(g(a_t, \theta_t) = \nabla_\theta \log \pi(a_t \mid s_t; \theta_t) \, Q_\pi(s_t, a_t)\)
This gives an unbiased stochastic estimate of the true gradient.
The policy parameters are then updated as:
\(\theta_{t+1} = \theta_t + \beta \, g(a_t, \theta_t)\)
- In REINFORCE, $Q_\pi(s_t, a_t)$ is replaced by the observed discounted return $u_t$.
- In Actor–Critic, $Q_\pi(s_t, a_t)$ is approximated by a critic network that learns via TD learning.
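Whichever estimate stands in for $Q_\pi(s_t, a_t)$, the per-timestep update has the same shape. Below is a minimal PyTorch sketch of it (the function and argument names are illustrative assumptions; `policy` is assumed to output action probabilities as in the earlier sketch, and the learning rate $\beta$ lives inside the optimizer):

```python
# Sketch of the sampled policy-gradient update g(a_t, theta_t).
# `q_value_fn` stands in for whatever approximation of Q_pi is available
# (a Monte Carlo return in REINFORCE, or a critic network in Actor-Critic).
import torch

def policy_gradient_step(policy, optimizer, state, q_value_fn):
    probs = policy(state)                      # pi(. | s_t; theta_t)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()                     # a_t ~ pi(. | s_t; theta_t)
    log_prob = dist.log_prob(action)           # log pi(a_t | s_t; theta_t)
    q_estimate = q_value_fn(state, action)     # q_t ~ Q_pi(s_t, a_t), treated as a constant
    # Gradient *ascent* on q_t * log pi is implemented as descent on its negation.
    loss = -log_prob * q_estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action
```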
6️⃣ Why $Q_\pi(s, a)$ Appears in the Policy Gradient
The Q-function serves as a score multiplier:
- If $Q_\pi(s, a)$ is large → increase the probability $\pi(a \mid s)$.
- If $Q_\pi(s, a)$ is small or negative → decrease $\pi(a \mid s)$.
Thus, $Q_\pi(s, a)$ directly determines how strongly the policy should favor or avoid certain actions.
This ensures that the policy gradually shifts probability mass toward actions that yield higher expected returns, leading to improved behavior over time.
🧭 Summary
| Concept | Meaning | Formula |
|---|---|---|
| Policy Network | Defines action probabilities | $\pi(a \mid s; \theta)$ |
| Approx. State Value | Expected return under $\pi$ | $V(s; \theta) = \sum_a \pi(a \mid s; \theta) \, Q_\pi(s, a)$ |
| Objective Function | Expected performance | $J(\theta) = \mathbb{E}[V(S; \theta)]$ |
| Policy Gradient | Direction to improve policy | $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi(A \mid S; \theta) \, Q_\pi(S, A)]$ |
| Stochastic Estimate | Sample-based update | $g(a_t, \theta_t) = \nabla_\theta \log \pi(a_t \mid s_t; \theta_t) \, Q_\pi(s_t, a_t)$ |
🧮 Two Forms of the Policy Gradient — From Summation to Expectation
The policy gradient is the derivative of the approximate state-value function $V(s; \theta)$ with respect to the policy parameters $\theta$.
It is fundamental because Policy-Based Reinforcement Learning aims to maximize the objective function $J(\theta) = \mathbb{E}[V(S; \theta)]$ using gradient ascent.
There are two key mathematical forms of the policy gradient, derived from the same principle but applied differently depending on whether the action space is discrete or continuous.
🧩 Form 1: The Summation Form (Derivative of $V(s; \theta)$)
This is the original, exact formulation — it explicitly sums over all possible actions in the discrete action set $\mathcal{A}$.
\[V(s; \theta) = \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, Q_\pi(s, a)\]
Taking the derivative with respect to $\theta$:
\(\nabla_\theta V(s; \theta) = \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s; \theta) \, Q_\pi(s, a)\)
This shows:
- The gradient depends on how the policy probability $\pi(a \mid s; \theta)$ changes when $\theta$ changes.
- The Q-function acts as a weight — if an action yields a higher $Q_\pi(s, a)$, its probability gets reinforced more strongly.
🧠 Interpretation:
Each action’s contribution to the gradient is proportional to both how much the policy changes and how valuable the action is.
⚠️ Limitation:
- The summation form is only practical for discrete action spaces, where enumerating all actions is feasible.
- In continuous control (e.g., steering angles, torques), the action set $\mathcal{A}$ is infinite, so direct summation is impossible.
🧩 Form 2: The Expectation Form (Policy Gradient Theorem)
To generalize for continuous actions and enable sample-based learning, we use the log-derivative trick: \(\nabla_\theta \pi(a \mid s; \theta) = \pi(a \mid s; \theta) \, \nabla_\theta \log \pi(a \mid s; \theta)\)
Substituting into the previous formula: \(\nabla_\theta V(s; \theta) = \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \, \nabla_\theta \log \pi(a \mid s; \theta) \, Q_\pi(s, a)\)
This can be expressed compactly as an expectation: \(\nabla_\theta V(s; \theta) = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)} \!\left[ \nabla_\theta \log \pi(A \mid s; \theta) \, Q_\pi(s, A) \right]\)
🎯 Advantages of the Expectation Form:
- Works for continuous and discrete actions.
- Allows sampling-based estimation — no need to enumerate all actions.
- Enables stochastic gradient ascent (used in algorithms like REINFORCE and Actor–Critic).
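One common way to exploit this in continuous control (sketched below; the architecture and the state-independent standard deviation are assumptions, not something prescribed by the theorem) is to let the network output the parameters of a Gaussian and sample actions from it, since only $\log \pi(a \mid s; \theta)$ is ever needed:

```python
# Sketch of a Gaussian policy head for continuous actions (sizes are illustrative).
# Because the expectation form only needs log pi(a | s; theta), we never have to
# sum over the (infinite) action set.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.body(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = GaussianPolicy(state_dim=8, action_dim=2)
dist = policy(torch.randn(8))
action = dist.sample()                      # a_t ~ pi(. | s_t; theta)
log_prob = dist.log_prob(action).sum()      # log pi(a_t | s_t; theta), summed over action dims
```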
Practical Stochastic Estimate (Sample-Based Update)
In practice, we sample one action $a_t$ from the policy distribution $\pi(\cdot \mid s_t; \theta_t)$ and compute: \(g(a_t, \theta_t) = \nabla_\theta \log \pi(a_t \mid s_t; \theta_t) \, Q_\pi(s_t, a_t)\)
This $g(a_t, \theta_t)$ is an unbiased stochastic estimate of the true gradient.
Then we perform a gradient ascent update: \(\theta_{t+1} = \theta_t + \beta \, g(a_t, \theta_t)\)
Here:
- $\beta$ is the learning rate.
- $Q_\pi(s_t, a_t)$ can be approximated by:
- The discounted return $u_t$ (Monte Carlo / REINFORCE), or
- The critic’s estimate (in Actor–Critic methods).
🔍 Summary Comparison
| Form | Expression | Works For | Description |
|---|---|---|---|
| Summation Form | $\nabla_\theta V(s; \theta) = \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a \mid s; \theta) \, Q_\pi(s, a)$ | Discrete actions | Exact but computationally expensive for large action spaces |
| Expectation Form | $\nabla_\theta V(s; \theta) = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)} \left[ \nabla_\theta \log \pi(A \mid s; \theta) \, Q_\pi(s, A) \right]$ | Discrete + Continuous | Enables stochastic sampling and gradient ascent |
| Sampled Estimate | $g(a_t, \theta_t) = \nabla_\theta \log \pi(a_t \mid s_t; \theta_t) \, Q_\pi(s_t, a_t)$ | Both | Used in REINFORCE and Actor–Critic updates |
4. The Policy Gradient Algorithm
Because computing the exact expectation $\mathbb{E}[\cdot]$ is often infeasible, Policy-Based RL uses sampling to estimate gradients — this leads to stochastic policy gradient algorithms. The Expectation Form of the policy gradient is a theoretical identity, not an algorithm by itself.
It expresses the gradient of the expected return as an expectation over actions sampled from the policy:
- The formula tells us what direction to move in parameter space to improve the policy.
- It assumes we could perfectly evaluate $Q_\pi(s, a)$ for every state and action — which we cannot do directly.
- Therefore, in practice, we approximate this expectation by sampling trajectories and estimating returns.
The Monte Carlo (MC) process is one way to approximate the expectation above.
🔄 Single-Step Update Process
At each timestep $t$:
- Observe State: $s_t$
- Sample Action: $a_t \sim \pi(\cdot \mid s_t; \theta_t)$
- Estimate $Q_\pi$: Compute an estimate $q_t \approx Q_\pi(s_t, a_t)$
- Compute Log-Gradient: \(d_{\text{log}, t} = \frac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\)
- Compute Gradient Estimate: \(g(a_t, \theta_t) = q_t \, d_{\text{log}, t}\). This is an unbiased estimate of the true policy gradient.
- Update Parameters: \(\theta_{t+1} = \theta_t + \beta \, g(a_t, \theta_t)\)
This process updates the policy so that actions with higher estimated rewards become more likely in the future.
5. Methods for Estimating $Q_\pi(s, a)$
The way we approximate $Q_\pi(s, a)$ defines different policy gradient algorithms:
🧩 Option 1: REINFORCE (Monte Carlo Method)
Since $Q_\pi(s, a)$ is unknown, the REINFORCE algorithm replaces it with the observed discounted return from a sampled trajectory: \(u_t = \sum_{k=t}^{H} \gamma^{k-t} r_k\) where $r_k$ is the reward received at step $k$ and $H$ is the final timestep of the episode.
Then we estimate the gradient using a single sample: \(g(a_t, \theta_t) = \nabla_\theta \log \pi(a_t \mid s_t; \theta_t) \, u_t\)
and perform the parameter update: \(\theta_{t+1} = \theta_t + \beta \, g(a_t, \theta_t)\)
So, REINFORCE is a Monte Carlo implementation of the Policy Gradient Theorem —
it computes the gradient empirically by playing full episodes and using their observed returns to approximate the true expectation.
Thus:
- The update itself still happens per timestep.
- But the information it uses (the discounted return $u_t$) comes from the entire sampled trajectory.
- This is why we say REINFORCE is Monte Carlo — because it waits until the episode ends to compute returns.
🧠 In short:
Sampling one action at a time is part of the update rule,
but collecting the reward signal from a full trajectory makes the method Monte Carlo.
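A compact sketch of this procedure (the helper name, the `gamma=0.99` discount, and the convention that the per-step log-probabilities were stored while acting are assumptions for illustration):

```python
# REINFORCE update from one finished episode.
# `log_probs`: list of stored log pi(a_t | s_t; theta) tensors, one per timestep.
# `rewards`:   list of rewards observed at each timestep of the episode.
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    # Compute the discounted return u_t for every timestep, working backwards.
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.insert(0, u)
    returns = torch.tensor(returns)

    # Gradient ascent on sum_t log pi(a_t | s_t; theta) * u_t,
    # implemented as descent on the negated sum.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```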
⚖️ Option 2: Actor–Critic Method
Instead of waiting until the end of the episode, the Actor–Critic framework uses two neural networks:
| Component | Role | Function |
|---|---|---|
| 🎭 Actor | Policy Network | Selects actions via $\pi(a \mid s; \theta)$ |
| 🧮 Critic | Value Network | Estimates $Q_\pi(s, a)$ or $V_\pi(s)$ |
The Critic provides instant feedback to the Actor, stabilizing and accelerating training.
So we no longer need to wait until the episode finishes — we can update every timestep.
This makes Actor–Critic updates faster and lower-variance than Monte Carlo updates.
The update rule becomes: \(\theta_{t+1} = \theta_t + \beta \, \big(r_t + \gamma V(s_{t+1}) - V(s_t)\big) \, \frac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\)
This term $\big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)$ is the Temporal Difference (TD) error,
indicating how much better or worse the outcome was than expected.
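A minimal sketch of one such update in PyTorch (the function and argument names, the two separate optimizers, and the squared-TD-error critic loss are illustrative assumptions; `critic` is assumed to return a scalar estimate of $V(s)$):

```python
# One Actor-Critic update: the TD error scales the actor's log-prob gradient,
# while the critic is trained to track the TD target r_t + gamma * V(s_{t+1}).
import torch

def actor_critic_step(critic, actor_opt, critic_opt,
                      s_t, a_log_prob, r_t, s_next, gamma=0.99, done=False):
    v_t = critic(s_t)                                               # V(s_t)
    with torch.no_grad():
        v_next = torch.zeros_like(v_t) if done else critic(s_next)  # V(s_{t+1})
        td_target = r_t + gamma * v_next
    td_error = (td_target - v_t).detach()                           # TD error

    # Actor: ascend td_error * grad log pi(a_t | s_t; theta).
    actor_loss = -a_log_prob * td_error
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic: reduce the squared TD error so V(s_t) moves toward the TD target.
    critic_loss = (td_target - v_t) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```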
🧠 Summary and Insights
| Concept | Description | Formula / Idea |
|---|---|---|
| Policy Function | Defines how actions are chosen | $\pi(a \mid s; \theta)$ |
| Objective Function | Expected value of the policy | $J(\theta) = \mathbb{E}[V(S; \theta)]$ |
| Policy Gradient | Direction to improve policy | $\mathbb{E}[\nabla_\theta \log \pi(A \mid S; \theta) \, Q_\pi(S, A)]$ |
| REINFORCE | Monte Carlo method using full-episode returns | $q_t = u_t$ |
| Actor–Critic | Uses separate networks for policy and value | TD-based updates |
💡 Key Takeaways
- Policy-Based RL directly optimizes behavior, rather than inferring it from value estimates.
- It’s particularly powerful for:
- Continuous action spaces (e.g., robotics control).
- Stochastic environments.
- Learning diverse behaviors via exploration.
- Algorithms like REINFORCE and Actor–Critic form the foundation of modern methods like A2C, PPO, and SAC.
In essence, Policy-Based RL lets the agent learn how to act directly, discovering strategies that maximize long-term rewards — one gradient step at a time.
