Value-Based Reinforcement Learning Foundations
Value-based reinforcement learning (RL) focuses on estimating how valuable it is to take a particular action in a given state — quantified as the expected discounted future reward.
This approach underpins algorithms like Q-learning and Deep Q-Networks (DQN).
1. Value-Based Reinforcement Learning Foundations
🧩 The Action-Value Function $Q_\pi(s, a)$
The Action-Value Function for a policy $\pi$, denoted $Q_\pi(s_t, a_t)$, is the expected discounted return given that the agent starts from state $s_t$, takes action $a_t$, and then follows policy $\pi$ thereafter.
\[ Q_\pi(s_t, a_t) = \mathbb{E}\big[\, U_t \mid S_t = s_t, A_t = a_t \,\big] \]
The discounted return $U_t$ is defined as:
\[ U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \]
where $\gamma$ (gamma) is the discount factor that determines how much the agent values future rewards.
The expectation $\mathbb{E}[\cdot]$ accounts for:
- Future actions $A_{t+1}, A_{t+2}, \ldots$ drawn from the policy $\pi$.
- Future states $S_{t+1}, S_{t+2}, \ldots$ drawn from the environment's transition probability $p(s' \mid s, a)$.
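As a quick, illustrative sketch (the reward values and $\gamma$ below are made-up assumptions, not from the text), the discounted return $U_t$ can be computed directly from a reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: four future rewards, discounted by gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0, 5.0]))  # 1 + 0 + 0.81*2 + 0.729*5 = 6.265
```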
🧠 Example:
Imagine an agent playing Super Mario.
If Mario is at position $s_t$ and chooses the action “jump,” then $Q_\pi(s_t, \text{jump})$ is the expected total score (coins, enemies defeated, etc.) Mario can still accumulate if he continues playing according to policy $\pi$.
🌟 The Optimal Action-Value Function $Q^\ast(s, a)$
The goal of value-based RL is to find the optimal action-value function, $Q^\ast(s, a)$.
\[Q^\ast(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t)\]
- Definition: $Q^\ast(s, a)$ represents the maximum expected discounted return achievable by taking action $a$ in state $s$, and then behaving optimally thereafter.
- Significance: No policy can outperform $Q^\ast(s, a)$; it represents the best possible outcome.
- Optimal Action Selection:
\(a^\ast = \arg\max_a Q^\ast(s, a)\)
The optimal policy $\pi^\ast$ always chooses the action with the highest $Q^\ast(s, a)$.
- Challenge: $Q^\ast(s, a)$ is unknown, and directly estimating it for complex environments is computationally infeasible.
🎮 Example:
If Mario’s $Q^\ast(s, a)$ values for the current frame are:
| Action | $Q^\ast(s,a)$ |
|---|---|
| Jump | 2500 |
| Left | 1000 |
| Right | 3000 |
Then the optimal policy will choose “Right”, since it yields the maximum expected reward.
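A one-line sketch of that greedy choice, using the illustrative numbers from the table above:

```python
# Hypothetical Q* values from the Mario table above (illustrative numbers only).
q_star = {"jump": 2500, "left": 1000, "right": 3000}
best_action = max(q_star, key=q_star.get)   # a* = argmax_a Q*(s, a)
print(best_action)                          # "right"
```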
2. Deep Q-Network (DQN)
The Deep Q-Network (DQN) is a breakthrough solution that uses a neural network to approximate $Q^\ast(s, a)$.
Introduced by DeepMind (2015), it allowed agents to play Atari games directly from pixels — achieving human-level performance.
🧮 Network Structure
- Function Approximation:
DQN uses a neural network $Q(s, a; w)$, parameterized by weights $w$, to approximate $Q^\ast(s, a)$.
- Input: The state $s$ (e.g., the current game screen).
- Output: A vector of Q-values for all possible actions $a \in A$.
Example:
If the action space is {left, right, jump}, the network might output:
| Action | Q(s, a; w) |
|---|---|
| Left | 2000 |
| Right | 1000 |
| Jump | 3000 |
Here, the agent selects $a_t = \arg\max_a Q(s_t, a; w)$ → “Jump”, since it gives the highest predicted value.
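As a rough sketch of such a network (the layer sizes, state dimension, and action set here are assumptions for illustration, not details from the DQN paper), a small PyTorch module mapping a state vector to one Q-value per action could look like:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a vector of Q-values, one per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # output: [Q(s, a_1; w), ..., Q(s, a_n; w)]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=3)     # e.g. actions: left, right, jump
state = torch.randn(1, 4)                      # a dummy state vector (batch of 1)
action = q_net(state).argmax(dim=1).item()     # a_t = argmax_a Q(s_t, a; w)
```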
🧭 Exploration vs. Exploitation
To balance exploration (trying new actions) and exploitation (choosing the best-known action), DQN often uses an ε-greedy policy:
- With probability ε, choose a random action (explore).
- With probability 1 - ε, choose $\arg\max_a Q(s, a; w)$ (exploit).
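A minimal ε-greedy sketch, assuming a `q_net` like the one sketched earlier and a state tensor of shape `(1, state_dim)`:

```python
import random
import torch

def epsilon_greedy(q_net, state, epsilon: float, n_actions: int) -> int:
    """With probability epsilon explore; otherwise exploit argmax_a Q(s, a; w)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # explore: random action
    with torch.no_grad():
        return q_net(state).argmax(dim=1).item()    # exploit: greedy action
```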
3. Temporal Difference (TD) Learning
Temporal Difference (TD) Learning is the key algorithm used to train the DQN.
It updates the Q-function incrementally, based on new experience, without waiting for the episode to finish.
🔁 Recursive Identity
The return $U_t$ can be defined recursively:
\[ U_t = R_t + \gamma\, U_{t+1} \]
Since $Q(s_t, a_t; w)$ approximates the expected return $\mathbb{E}[U_t]$, we have:
\[ Q(s_t, a_t; w) \approx R_t + \gamma\, Q(s_{t+1}, a_{t+1}; w) \]
This forms the foundation for bootstrapping — updating current estimates using future estimates.
🧠 The Principle of Temporal Difference (TD) Learning
1️⃣ The Need for Early Updates
In a traditional supervised setup, a model $Q(w)$ estimating travel time (e.g., NYC → Atlanta) can only be updated after the trip is complete, once the actual total duration is known.
For example:
- Predicted total time: $q = 1000$ minutes
- Actual total time: $y = 860$ minutes
The loss is computed after completion:
$\mathcal{L} = (q - y)^2$
However, Temporal Difference (TD) Learning asks:
Can we update earlier, before finishing the trip?
Yes — TD allows updating $w$ after each intermediate milestone, making learning more efficient.
2️⃣ The Recursive Identity (Bellman Equation)
TD learning is built on the recursive property of expected return:
\[U_t = R_t + \gamma \, U_{t+1}\]
Here:
- $U_t$ = cumulative discounted future reward
- $R_t$ = immediate reward
- $\gamma$ = discount factor $(0 < \gamma < 1)$
This identity says:
The total return equals the immediate reward plus the discounted expected return of the next step.
3️⃣ TD Target (Bootstrapping)
TD constructs a TD Target $y$ — a better estimate of the true return:
\[y_t = R_t + \gamma \, Q(s_{t+1}, a_{t+1}; w)\]
This mixes:
- the actual reward ($R_t$), and
- the predicted value of the next state ($Q(s_{t+1}, a_{t+1}; w)$).
This method is called bootstrapping because it “pulls itself up” using its own future estimate.
🚗 TD Example: Driving from NYC to Atlanta
We can model this as:
\[T_{\text{NYC} \to \text{Atlanta}} \approx T_{\text{NYC} \to \text{DC}} + T_{\text{DC} \to \text{Atlanta}}\]
| Element | Prediction (Model $Q(w)$) | Actual Experience |
|---|---|---|
| Initial Prediction ($q$) | $1000$ min (NYC→Atlanta) | N/A |
| Intermediate Step | $600$ min (DC→Atlanta) | $r = 300$ min (NYC→DC) |
Now that we reached DC, we can compute a TD Target:
\[y = (\text{actual NYC} \to \text{DC}) + (\text{predicted DC} \to \text{Atlanta}) = 300 + 600 = 900 \text{ minutes}\]
Thus, the TD target $y = 900$ is a more reliable estimate than the initial $1000$ minutes.
4️⃣ The TD Error ($\delta$)
The TD error measures how wrong the model was:
Method A: Prediction vs. TD Target
\[\delta = q - y = 1000 - 900 = 100\]
This 100-minute error guides the gradient update:
\[\nabla_w \mathcal{L} = (q - y) \, \nabla_w Q(w)\]
Method B: Implied Segment Comparison
- Model implied NYC→DC = $1000 - 600 = 400$ minutes
- Actual NYC→DC = $300$ minutes
- $\delta = 400 - 300 = 100$
Both perspectives agree — the model overestimated by 100 minutes and must adjust downward.
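The same arithmetic in a few lines of Python (the numbers mirror the table above):

```python
q_total   = 1000   # initial prediction: NYC -> Atlanta (minutes)
q_rest    = 600    # model's prediction for DC -> Atlanta
r_segment = 300    # actual NYC -> DC time

# Method A: prediction vs. TD target
y = r_segment + q_rest            # TD target = 900
delta_a = q_total - y             # 100

# Method B: implied segment vs. actual segment
delta_b = (q_total - q_rest) - r_segment   # 400 - 300 = 100

assert delta_a == delta_b == 100   # both views give the same 100-minute error
```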
4. 🤖 TD Learning in Deep Q-Networks (DQN)
In DQN, the neural network $Q(s, a; w)$ approximates the optimal function $Q^\ast(s, a)$.
| Component | Formula | Description |
|---|---|---|
| Prediction | $q_t = Q(s_t, a_t; w)$ | Network’s output for current state-action |
| TD Target | $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$ | From Bellman optimality |
| TD Error | $\delta_t = q_t - y_t$ | Difference between prediction and target |
| Loss | $L_t = \tfrac{1}{2} (q_t - y_t)^2$ | Used for gradient descent |
Weight update rule:
\[w_{t+1} = w_t - \alpha \, (q_t - y_t) \, \nabla_w Q(s_t, a_t; w_t)\]
This iterative process — predicting, bootstrapping, and correcting via the TD error — enables DQN to progressively refine its estimate of $Q^\ast(s, a)$ across experience.
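The table's quantities, computed on made-up numbers ($\gamma$, the reward, and the Q-values below are illustrative assumptions):

```python
import numpy as np

gamma  = 0.99                          # discount factor
r_t    = 1.0                           # observed reward
q_t    = 4.0                           # prediction: Q(s_t, a_t; w)
q_next = np.array([3.0, 5.0, 2.0])     # Q(s_{t+1}, a; w) for every next action

y_t     = r_t + gamma * q_next.max()   # TD target:  1 + 0.99 * 5 = 5.95
delta_t = q_t - y_t                    # TD error:  -1.95
loss    = 0.5 * delta_t ** 2           # L_t = 1/2 (q_t - y_t)^2
```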
🔍 Practical Intuition (Common Questions)
Does the Q-function output one score per action?
Yes. For a given state $s_t$, DQN outputs a vector of values: \([Q(s_t, a_1; w), Q(s_t, a_2; w), \dots, Q(s_t, a_n; w)]\) Each value estimates the expected discounted return of taking that action now, then following the learned policy.
Does each action produce a reward $r_t$?
Yes. After every action, the environment returns an immediate reward $r_t$ (possibly positive, zero, or negative).
- Taxi (Gym-style example):
- step penalty: $-1$
- illegal pickup/drop-off: $-10$
- successful drop-off: $+20$
- Robotics (typical shaping):
- progress term (e.g., closer to goal): positive
- control effort penalty (large torques): negative
- collision/safety penalty: negative
- task completion bonus: positive terminal reward
What is the TD target $y_t$?
\(y_t = r_t + \gamma \max_{a'} Q_{\text{target}}(s_{t+1}, a')\)
- $r_t$: immediate reward at current step
- $\gamma \max_{a'} Q_{\text{target}}(s_{t+1}, a')$: discounted estimate of the best future return from the next state
So $y_t$ is not just “current return.” It is a one-step Bellman target: reward now + estimated future value.
Why can reward be added to a Q score?
Because both represent return in the same units. By the Bellman recursion: \(U_t = R_t + \gamma U_{t+1}\), and Q-values are expectations of this return. So adding the immediate reward to the discounted future value is exactly the correct decomposition.
Is $Q_{\text{target}}$ the same as $Q$?
Same network architecture, different parameters.
- Online network $Q(s,a;w)$: updated every gradient step.
- Target network $Q_{\text{target}}(s,a;w^-)$: a delayed copy, updated periodically (or softly).
This delay stabilizes training and reduces feedback loops from chasing a rapidly moving target.
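A minimal sketch of maintaining such a target network, reusing the `QNetwork` sketch from earlier (the `tau` value and the helper names are common conventions, not details from the text):

```python
import copy

# Start the target network as an exact copy of the online network.
target_net = copy.deepcopy(q_net)

def hard_update(target, online):
    """Periodically copy the online weights w into the target weights w^-."""
    target.load_state_dict(online.state_dict())

def soft_update(target, online, tau: float = 0.005):
    """Polyak averaging: w^- <- tau * w + (1 - tau) * w^-."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```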
What does $\max_{a'}$ mean?
It means: evaluate all possible next actions in $s_{t+1}$ and pick the largest predicted Q-value: \(\max_{a'} Q_{\text{target}}(s_{t+1}, a')\). This is “the value of the best next action,” not necessarily the sampled exploratory action.
For example, suppose $r_t = 10$, $\gamma = 0.9$, and $\max_{a'} Q_{\text{target}}(s_{t+1}, a') = 100$. Then:
\[ y_t = 10 + 0.9 \times 100 = 100 \]
If the current prediction is $Q(s_t, a_t; w) = 80$,
the TD error is $\delta_t = 80 - 100 = -20$,
meaning the agent underestimated the value of this action.
⚙️ Training Algorithm
Each training step updates the neural network parameters $w$ using gradient descent on the TD loss:
- Observe & Predict: compute $q_t = Q(s_t, a_t; w_t)$.
- Environment Response: observe reward $r_t$ and next state $s_{t+1}$.
- Compute TD Target: $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w_t)$.
- Compute Loss: $L_t = \frac{1}{2}(q_t - y_t)^2$.
- Update Weights: $w_{t+1} = w_t - \alpha (q_t - y_t) \nabla_w Q(s_t, a_t; w_t)$, where $\alpha$ is the learning rate.
This process repeats as the agent interacts with the environment, gradually improving its estimate of $Q^\ast(s, a)$.
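A sketch of one such training step in PyTorch, assuming the `QNetwork` sketch above, a float `done` flag, and plain SGD to match the update rule written out in the list (real implementations also use the target network described earlier and the replay buffer described below):

```python
import torch

optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)   # lr plays the role of alpha
gamma = 0.99

def td_update(s, a, r, s_next, done):
    """One gradient step on the TD loss for a single transition."""
    q_t = q_net(s)[0, a]                                    # prediction q_t
    with torch.no_grad():                                   # target treated as a constant
        y_t = r + gamma * (1.0 - done) * q_net(s_next).max()
    loss = 0.5 * (q_t - y_t) ** 2                           # L_t = 1/2 (q_t - y_t)^2
    optimizer.zero_grad()
    loss.backward()                                         # gradient = (q_t - y_t) * grad_w Q
    optimizer.step()                                        # w <- w - alpha * gradient
    return loss.item()
```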
Why the TD Target Uses the Maximum Operator
| Concept | Formula | Explanation |
|---|---|---|
| TD Target | $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$ | From Bellman optimality |
- The network predicts $Q(s_{t+1}, a; w)$ for all actions.
- The maximum represents the best possible future decision: \(\max_a Q(s_{t+1}, a; w)\)
- This expresses the Bellman Optimality Equation, meaning the update always assumes the agent acts optimally in the future.
Thus, DQN learns toward $Q^\ast(s,a)$ — the optimal action-value function — rather than merely evaluating the current policy’s $Q_\pi(s,a)$.
Why DQN Only Looks One Step Ahead but Learns the Whole Path
DQN bootstraps from a one-step lookahead: \(y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)\). Even though it updates using only the next state, this recursive update propagates future knowledge backward.
Over many episodes:
- Reward information moves backward through earlier states.
- The Q-values across time become consistent.
- The network converges toward the global optimum $Q^\ast(s,a)$.
This is why a one-step TD update can still produce a long-horizon optimal policy.
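A tiny tabular illustration of this backward propagation (the chain environment, learning rate, and episode count are all invented for the demo; DQN replaces the table with a network, but the one-step update is the same idea):

```python
import numpy as np

# Chain MDP: states 0..4, actions 0 = left / 1 = right, reward +1 on reaching state 4.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.5
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(200):
    s, done = 0, False
    while not done:
        a = int(rng.integers(n_actions))                   # purely random exploration
        s_next, r, done = step(s, a)
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])              # one-step TD update
        s = s_next

# Reward information has propagated backward through the chain:
print(np.round(Q[:, 1], 2))   # approximately [0.73, 0.81, 0.9, 1.0, 0.0]
```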
Training Dynamics and Number of Steps
If you have 100 episodes with 10 time steps each:
- Total transitions: $100 \times 10 = 1000$
- Each transition provides one TD update.
During training:
- Each $(s_t, a_t, r_t, s_{t+1})$ forms a sample.
- Compute the target $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$.
- Update $Q(s_t, a_t; w)$ via gradient descent.
Hence, one episode contributes 10 updates, and 100 episodes → 1000 TD updates. However, these transitions can be reused multiple times through a Replay Buffer.
Replay Buffer (Experience Replay)
The Replay Buffer stores past experiences to improve stability and efficiency.
Each experience tuple: \((s_t, a_t, r_t, s_{t+1}, \text{done})\) is appended to a buffer of fixed capacity (e.g., 100,000 samples).
During training:
- Random mini-batches are drawn from this buffer.
- Each batch is used to compute TD targets and perform gradient descent.
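A minimal replay buffer sketch (the capacity and tuple layout follow the description above; the class name is my own):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)      # oldest experiences evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)   # uniform random mini-batch
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```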
Benefits
| Benefit | Explanation |
|---|---|
| Breaks correlation | Random sampling removes temporal dependence between transitions. |
| Improves stability | Training resembles i.i.d. supervised data. |
| Boosts sample efficiency | Experiences can be reused multiple times. |
| Enables off-policy learning | Allows training with experiences from older policies. |
Prioritized Replay
Not all experiences are equally useful.
Prioritized Experience Replay samples transitions based on their TD error magnitude.
Using priority weights $P_i$ for each experience $i$:
\(P_i = \frac{|\delta_i|^\alpha}{\sum_k |\delta_k|^\alpha}\)
Larger errors → more surprising transitions → more frequent updates.
This focuses learning on states where the network is most uncertain or wrong.
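A sketch of the sampling rule (the small constant added to each priority is a common practical tweak to keep probabilities nonzero; it is not part of the formula above):

```python
import numpy as np

def prioritized_sample(td_errors, batch_size: int, alpha: float = 0.6):
    """P_i = |delta_i|^alpha / sum_k |delta_k|^alpha, then sample indices by P."""
    priorities = np.abs(td_errors) ** alpha + 1e-6
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Illustrative TD errors: the transition with |delta| = 100 dominates the sampling.
indices = prioritized_sample(np.array([0.1, 2.0, 100.0, 0.5]), batch_size=3)
```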
🧭 Summary of DQN + TD Learning
| Component | Concept | Key Equation |
|---|---|---|
| Bootstrapping | One-step update using current network | $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$ |
| Error Signal | Measures model inaccuracy | $\delta_t = q_t - y_t$ |
| Optimization | Gradient descent on squared error | $L_t = \frac{1}{2}(q_t - y_t)^2$ |
| Replay Buffer | Reuses experience, improves stability | Sample random mini-batches |
| Goal | Learn $Q^\ast(s,a)$ to act optimally | $a_t^\ast = \arg\max_a Q(s_t, a; w)$ |
Through many one-step TD updates — sampled from diverse replay experiences —
DQN gradually propagates reward information backward, learning the optimal long-term policy even though each update looks ahead just one step.
5. 🎯 Why DQN Learns $Q^\ast(s, a)$ (Optimal Action–Value Function) — Not $V(s)$ or $\pi(a \mid s)$
Let’s break down this important distinction between DQN, Value Function, and Policy Learning — since they all aim to “choose good actions,” but in different ways.
The Core Difference: What Is Being Learned
| Method Type | Learns… | Output Meaning | Goal |
|---|---|---|---|
| Value-Based (DQN) | $Q^\ast(s, a)$ | Expected discounted return for each possible action in state $s$ | Pick $\arg\max_a Q^\ast(s, a)$ |
| Policy-Based (PG / Actor) | $\pi(a \mid s; \theta)$ | Probability distribution over actions | Sample $a \sim \pi(a \mid s; \theta)$ |
| Value Function (V) | $V(s)$ | Expected return under the current policy | Evaluates how good a state is (but not which action to take) |
So:
- $V(s)$ → evaluates states.
- $Q(s,a)$ → evaluates state–action pairs.
- $\pi(a \mid s)$ → directly defines the action probabilities.
Why DQN Learns $Q^\ast(s,a)$, Not $V(s)$
The Bellman optimality equation defines the optimal action-value function: \(Q^\ast(s,a) = \mathbb{E} \big[ r_t + \gamma \max_{a'} Q^\ast(s',a') \mid s,a \big]\)
- The $\max_{a'}$ term ensures the function encodes the best possible future actions.
- $Q^\ast(s,a)$ directly represents “the total future reward if I take this action now and then act optimally.”
Hence, if you know $Q^\ast(s,a)$, you can derive both the value function and the optimal policy:
\[V^\ast(s) = \max_{a} Q^\ast(s,a)\]
\[\pi^\ast(a \mid s) = \begin{cases} 1, & \text{if } a = \arg\max_{a'} Q^\ast(s,a') \\ 0, & \text{otherwise} \end{cases}\]
That’s why DQN doesn’t need a separate policy network —
the policy emerges automatically as a greedy function over $Q^\ast$.
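In code form (reusing the illustrative Mario numbers from earlier), both quantities fall out of the same Q* vector:

```python
import numpy as np

q_star = np.array([2500.0, 1000.0, 3000.0])   # Q*(s, a) for actions [jump, left, right]

v_star  = q_star.max()       # V*(s)  = max_a Q*(s, a)     -> 3000.0
pi_star = q_star.argmax()    # pi*(s) = argmax_a Q*(s, a)  -> 2, i.e. "right"
```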
Does DQN Output the “Best Action”?
Yes — the output layer of the DQN predicts a Q-value (score) for each possible action in the current state: \([Q(s, a_1; w), Q(s, a_2; w), \dots, Q(s, a_n; w)]\)
- The input is the state (e.g., an image in Atari or a vector in control tasks).
- The output is a vector of Q-scores, one for each action.
- The chosen action is: \(a_t = \arg\max_a Q(s_t, a; w)\)
This is how DQN acts greedily based on predicted Q-values.
⚙️ Is DQN Deterministic? What About State Transitions?
Let’s clarify two separate aspects here — the agent’s policy (DQN) and the environment’s dynamics (state transitions).
DQN’s Policy Is Deterministic (by Design)
In standard Deep Q-Networks (DQN):
- The network outputs a Q-value for each discrete action: $Q(s,a;w)$.
- The agent chooses the action with the highest Q-value: \(a_t = \arg\max_a Q(s_t,a;w)\)
This means the policy is deterministic once the network parameters $w$ are fixed — it always picks the same action for the same state.
✅ However, during training we add randomness for exploration:
- The most common strategy is ε-greedy:
- With probability ε, pick a random action (exploration).
- With probability 1 − ε, pick the best action (exploitation).
So the behavior policy during training is stochastic, but the target policy (the one being learned) is deterministic: \(\pi^\ast(s) = \arg\max_a Q^\ast(s,a)\)
DQN Works Best for Discrete Action Spaces
- Each output neuron of the Q-network corresponds to one possible action.
- Computing $\max_a Q(s,a)$ is easy when $a$ is discrete.
But when the action space is continuous (like steering angles or forces),
you cannot take a simple max over all $a$ — it’s infinite.
For that reason, vanilla DQN cannot handle continuous control.
➡️ Solutions for continuous actions:
- DDPG (Deep Deterministic Policy Gradient) — uses an actor–critic structure where the actor outputs continuous actions.
- TD3 and SAC — improved variants for stability and exploration.
The Environment (State Transitions) Can Be Stochastic
The state transition function is defined as \(p(s_{t+1} \mid s_t, a_t)\), which represents the probability distribution of the next state given the current state and action.
- In deterministic environments, the same action in the same state always leads to the same next state.
- In stochastic environments, transitions are sampled from this distribution.
Thus:
- DQN’s policy is deterministic (it picks a fixed action per state).
- The environment can still be stochastic — DQN learns the expected return under that randomness.
🧭 Summary
| Aspect | Deterministic? | Notes |
|---|---|---|
| DQN Policy | ✅ Yes (greedy $\arg\max_a Q$) | Stochastic only during exploration (ε-greedy) |
| Action Space | 🚫 Discrete only | Continuous actions require DDPG/TD3/SAC |
| Environment Transition $p(s' \mid s,a)$ | ❌ Usually stochastic | DQN learns expected Q-values over these outcomes |
| Q-function target | Deterministic estimate of an expectation | $Q^\ast(s,a) = \mathbb{E}\big[R_t + \gamma \max_{a'} Q^\ast(s',a')\big]$ |
💡 Intuition:
DQN itself acts deterministically once trained — but it learns in a probabilistic world, averaging over random transitions and rewards to approximate the optimal deterministic action for each state.
So Is It “Like” Policy-Based RL?
Yes — conceptually, both aim to find the best action.
But they differ fundamentally in what they learn and how they update:
| Aspect | Value-Based (DQN) | Policy-Based (PG, Actor–Critic) |
|---|---|---|
| What is learned | Deterministic value function $Q(s, a)$ | Stochastic policy $\pi(a \mid s; \theta)$ |
| Output | Scores (Q-values) for each action | Probability distribution over actions |
| How to pick actions | $\arg\max_a Q(s, a)$ | Sample from $\pi(a \mid s)$ |
| Gradient signal | TD error $(q_t - y_t)$ | Policy gradient $Q(s, a) \nabla_\theta \log \pi(a \mid s; \theta)$ |
| Exploration | $\epsilon$-greedy (add randomness externally) | Built-in stochasticity in $\pi(a \mid s)$ |
| Advantage | Stable, off-policy, efficient | Works for continuous actions |
| Limitation | Discrete actions only | Higher variance, harder to optimize |
So although both methods choose actions that maximize expected return,
- DQN does so indirectly (via $\max_a Q$).
- Policy-based RL does so directly (via parameterizing $\pi(a \mid s)$).