Value Function Methods in Reinforcement Learning
This lecture is about one core idea:
- Learn a value function that scores actions.
- Act by choosing the action with the highest score.
If the lecture feels heavy, use this reading order:
- Section 2 (tiny tabular example)
- Sections 6 and 7 (FQI and online Q-learning)
- Section 8 (why convergence is hard with deep networks)
1. Motivation: Why Value-Only Methods Exist
In actor-critic, we have:
- Actor: policy network $\pi_\theta(a \mid s)$
- Critic: value estimate (usually $V$ or $Q$)
Value-based methods ask:
If we already know how good each action is, do we still need a separate actor network?
The value-based answer is:
\[\pi(s) = \arg\max_a Q(s,a)\]
Step-by-step mini example
At one state $s$:
- $Q(s,\text{left}) = 1.2$
- $Q(s,\text{right}) = 2.1$
Step 1: compare scores.
Step 2: choose action with larger score.
Step 3: policy at this state is right.
That is the full value-based control rule.
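The rule is literally an argmax; a minimal sketch in Python (the Q-values are the ones from the example):

```python
# Hypothetical Q-values at one state, as in the mini example above.
q_values = {"left": 1.2, "right": 2.1}

def greedy_action(q: dict) -> str:
    """Value-based control rule: pick the action with the highest score."""
    return max(q, key=q.get)

print(greedy_action(q_values))  # right
```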
2. Tabular Dynamic Programming (Using the Same Example as Lecture 7 Slides)
The lecture’s dynamic programming example assumes:
- 16 discrete states (a 4x4 gridworld)
- 4 actions per state (up, down, left, right)
- Known transition model $p(s' \mid s,a)$
In slide notation, transitions are stored in a tensor:
\[T[s,s',a] = p(s' \mid s,a), \quad T \in \mathbb{R}^{16 \times 16 \times 4}\]
This is exactly why DP is possible: we can enumerate all states and actions and compute exact expectations.
2.1 Policy Evaluation: “Given a policy, compute its value table”
For a deterministic policy $\pi(s)$:
\[V^\pi(s) \leftarrow r(s,\pi(s)) + \gamma \sum_{s'} p(s' \mid s,\pi(s))V^\pi(s')\]
Step-by-step update on one grid state (slide-style DP)
Assume:
- discount $\gamma=0.9$
- current state is $s=10$
- current policy chooses right at state 10
- reward $r(10,\text{right})=-1$
- transition row from tensor:
- $T[10,11,\text{right}] = 0.8$
- $T[10,10,\text{right}] = 0.2$
- current value estimates:
- $V(11)=4.0$
- $V(10)=2.0$
Policy-evaluation backup:
\[V_{\text{new}}(10)= -1 + 0.9\left(0.8\cdot V(11)+0.2\cdot V(10)\right)\]
\[= -1 + 0.9\left(0.8\cdot 4.0 + 0.2\cdot 2.0\right) = -1 + 0.9\cdot 3.6 = 2.24\]
So one DP backup updates state 10 from 2.0 to 2.24.
Doing this for all 16 states repeatedly gives $V^\pi$.
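The single backup above can be reproduced directly (a sketch; the probabilities, reward, and values are the numbers assumed in the example):

```python
gamma = 0.9
reward = -1.0               # r(10, right)
trans = {11: 0.8, 10: 0.2}  # transition row T[10, ., right] from the example
V = {10: 2.0, 11: 4.0}      # current value estimates

# Policy-evaluation backup: V(10) <- r + gamma * sum_s' p(s'|s,a) V(s')
v_new = reward + gamma * sum(p * V[s2] for s2, p in trans.items())
print(v_new)  # ~2.24
```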
2.2 Policy Improvement: “Greedify using the value table”
Once we have $V^\pi$, improve policy by one-step lookahead:
\[\pi_{\text{new}}(s)=\arg\max_a \left[r(s,a)+\gamma\sum_{s'}p(s' \mid s,a)V^\pi(s')\right]\]
Step-by-step improvement at one state
Suppose at state $s=10$, from current value table:
- up: one-step return = $1.1$
- down: one-step return = $1.7$
- left: one-step return = $0.9$
- right: one-step return = $2.24$
Then:
\[\pi_{\text{new}}(10) = \text{right}\]
Do this at all 16 states and you complete one policy-improvement step.
2.3 Policy Iteration Loop on the 16-State Example
Policy iteration alternates:
- Evaluate current policy to get $V^\pi$
- Improve policy with greedy argmax
Concrete loop
- Start with a simple policy (for example, random or “always right”).
- Run many policy-evaluation backups over all 16 states until values stabilize.
- Greedify all states using one-step lookahead.
- Repeat until policy stops changing.
When policy no longer changes, you have reached an optimal policy for this tabular MDP.
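A minimal sketch of this loop on a hypothetical 2-state, 2-action MDP (all transition and reward numbers below are invented for illustration; `T` uses the slide-style tensor layout):

```python
import numpy as np

# Hypothetical toy MDP, invented for illustration.
n_s, n_a, gamma = 2, 2, 0.9
T = np.zeros((n_s, n_s, n_a))   # T[s, s2, a] = p(s2 | s, a)
T[0, :, 0] = [1.0, 0.0]
T[0, :, 1] = [0.2, 0.8]
T[1, :, 0] = [1.0, 0.0]
T[1, :, 1] = [0.0, 1.0]
R = np.array([[0.0, 0.0],
              [1.0, 3.0]])      # R[s, a]

pi = np.zeros(n_s, dtype=int)   # start from an arbitrary policy
while True:
    # 1) policy evaluation: sweep the backup until values stabilize
    V = np.zeros(n_s)
    for _ in range(300):
        V = np.array([R[s, pi[s]] + gamma * T[s, :, pi[s]] @ V for s in range(n_s)])
    # 2) policy improvement: greedy one-step lookahead at every state
    Q = np.array([[R[s, a] + gamma * T[s, :, a] @ V for a in range(n_a)]
                  for s in range(n_s)])
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):   # policy stopped changing: done
        break
    pi = pi_new
print(pi)  # optimal policy for this toy MDP
```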
2.4 Value Iteration on the Same 16-State Example
Value iteration merges evaluation + improvement into one update:
\[V_{k+1}(s)=\max_a \left[r(s,a)+\gamma\sum_{s'}p(s' \mid s,a)V_k(s')\right]\]
One explicit value-iteration update (same state 10)
Suppose from current table:
- up: $1.1$
- down: $1.7$
- left: $0.9$
- right: $2.24$
Then:
\[V_{k+1}(10)=\max(1.1,1.7,0.9,2.24)=2.24\]
Perform this at all 16 states each iteration.
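A sketch of one synchronous sweep in the slide's tensor notation, plus the single-state max from the example (the `vi_sweep` helper is a hypothetical name, not lecture code):

```python
import numpy as np

def vi_sweep(T, R, V, gamma=0.9):
    """One synchronous value-iteration sweep over all states.
    T[s, s2, a] = p(s2 | s, a), shape (S, S, A); R[s, a], shape (S, A)."""
    lookahead = R + gamma * np.einsum("sta,t->sa", T, V)  # one-step returns, shape (S, A)
    return lookahead.max(axis=1)                          # max over actions per state

# The single-state update from the example is just a max over one-step returns:
one_step_returns = [1.1, 1.7, 0.9, 2.24]  # up, down, left, right at state 10
v_new_10 = max(one_step_returns)
print(v_new_10)  # 2.24
```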
After convergence, derive the policy with the same greedy one-step lookahead:
\[\pi(s)=\arg\max_a \left[r(s,a)+\gamma\sum_{s'}p(s' \mid s,a)V(s')\right]\]
3. Bellman Operator and Contraction (Why Tabular Algorithms Converge)
Define Bellman optimality operator:
\[(\mathcal{B}V)(s)=\max_a \left[r(s,a)+\gamma\sum_{s'}p(s' \mid s,a)V(s')\right]\]
Contraction statement:
\[\lVert \mathcal{B}V-\mathcal{B}\bar V \rVert_\infty \le \gamma \lVert V-\bar V \rVert_\infty\]
Step-by-step numeric check (toy MDP)
Take two value vectors:
- $V=[0,0]$ for $(s_0,s_1)$
- $\bar V=[10,10]$
Distance before backup:
\[\lVert V-\bar V \rVert_\infty = 10\]
Apply Bellman backup:
- $\mathcal{B}V = [1,3]$
- $\mathcal{B}\bar V = [9,3]$
Distance after backup:
\[\lVert \mathcal{B}V-\mathcal{B}\bar V \rVert_\infty = 8\]
And $8 \le 0.9 \cdot 10 = 9$, so distance shrank. Repeating this drives values to one fixed point $V^\star$.
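The numeric check is easy to verify in code (a sketch using the toy vectors above):

```python
import numpy as np

gamma = 0.9
V, V_bar = np.array([0.0, 0.0]), np.array([10.0, 10.0])
BV, BV_bar = np.array([1.0, 3.0]), np.array([9.0, 3.0])  # backed-up values in the toy MDP

dist_before = np.abs(V - V_bar).max()   # infinity norm = 10
dist_after = np.abs(BV - BV_bar).max()  # infinity norm = 8
print(dist_before, dist_after)          # distance shrank by at least a factor gamma
```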
4. Why Tabular Methods Break in Real Problems
Tabular means “store one value per state or per state-action pair.”
That fails when state spaces are huge.
Concrete scale example
Suppose input is one grayscale image of size $84\times84$.
- Number of pixels: $7056$
- Each pixel has 256 possibilities
- Number of possible observations: $256^{7056}$
That is astronomically large, so you cannot keep a lookup table.
What function approximation changes
Instead of a table, learn:
\[V_\phi(s) \quad \text{or} \quad Q_\phi(s,a)\]
A neural network shares parameters across states, so similar states can share learned structure.
5. Fitted Value Iteration (FVI)
FVI keeps Bellman logic but replaces exact tabular updates with supervised regression.
5.1 Update Pattern
\[y_i=\max_a\left[r(s_i,a)+\gamma \mathbb{E}_{s'\sim p(\cdot \mid s_i,a)}[V_\phi(s')]\right]\]
Then fit:
\[\min_\phi \frac{1}{2}\sum_i \left(V_\phi(s_i)-y_i\right)^2\]
Step-by-step example
Suppose current predictions are:
- $V_\phi(s_0)=1.2$
- $V_\phi(s_1)=2.0$
Targets from toy MDP:
- At $s_1$: $y_1=\max(0,3)=3$
- At $s_0$: $y_0=\max(1,0+0.9\cdot2.0)=1.8$
Regression wants to move:
- $V_\phi(s_1): 2.0 \to 3.0$
- $V_\phi(s_0): 1.2 \to 1.8$
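One regression step toward these targets, sketched with a tabular parameterization of $V_\phi$ so the gradient is just the residual (the learning rate of 0.5 is an arbitrary choice):

```python
import numpy as np

V_phi = np.array([1.2, 2.0])  # current predictions for s0, s1
y = np.array([1.8, 3.0])      # regression targets from the toy MDP
lr = 0.5

# Gradient of 0.5 * sum_i (V_phi[i] - y[i])^2 w.r.t. V_phi is (V_phi - y),
# so one gradient step moves each prediction partway toward its target.
V_phi = V_phi - lr * (V_phi - y)
print(V_phi)  # halfway to the targets: [1.5, 2.5]
```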
5.2 Practical Limitations
FVI still needs the expectation over next states for each action, so you usually need either:
- known transition model, or
- expensive simulation access per action at chosen states
That is why model-free RL often prefers Q-based methods.
6. Fitted Q-Iteration (FQI)
FQI avoids the model expectation by learning $Q(s,a)$ from sampled transitions.
6.1 Core Target
Given replay data $\mathcal{D}=\{(s_i,a_i,r_i,s'_i)\}$:
\[y_i=r_i+\gamma\max_{a'}Q_\phi(s'_i,a')\]
Then fit:
\[\min_\phi \frac{1}{2}\sum_i\left(Q_\phi(s_i,a_i)-y_i\right)^2\]
6.2 Why this is useful
- model-free
- off-policy (reuse old data)
- no separate actor needed (policy is greedy over $Q$)
6.3 Step-by-step batch example
Use this batch:
\[\mathcal{D}=\{(s_0,\text{right},0,s_1), (s_0,\text{left},1,\text{terminal}), (s_1,\text{right},3,\text{terminal}), (s_1,\text{left},0,\text{terminal})\}\]
Assume current network at $s_1$ gives:
- $Q_\phi(s_1,\text{right})=2.2$
- $Q_\phi(s_1,\text{left})=0.4$
Compute targets:
- $(s_0,\text{right},0,s_1)$: \(y=0+0.9\cdot\max(2.2,0.4)=1.98\)
- $(s_0,\text{left},1,\text{terminal})$: \(y=1\)
- $(s_1,\text{right},3,\text{terminal})$: \(y=3\)
- $(s_1,\text{left},0,\text{terminal})$: \(y=0\)
Then one supervised update pushes each $Q_\phi(s,a)$ toward these labels.
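The four targets can be computed mechanically (a sketch; `None` marks terminal transitions here, and the Q-values are the assumed network outputs):

```python
gamma = 0.9

# Batch of (state, action, reward, next_state); None marks terminal.
batch = [("s0", "right", 0.0, "s1"),
         ("s0", "left", 1.0, None),
         ("s1", "right", 3.0, None),
         ("s1", "left", 0.0, None)]

# Current network outputs at s1, as assumed in the example.
Q = {("s1", "right"): 2.2, ("s1", "left"): 0.4}

def fqi_target(r, s_next):
    """y = r + gamma * max_a' Q(s', a'); terminal states contribute no bootstrap."""
    if s_next is None:
        return r
    return r + gamma * max(Q[(s_next, a)] for a in ("left", "right"))

targets = [fqi_target(r, s2) for (_, _, r, s2) in batch]
print(targets)  # approximately [1.98, 1.0, 3.0, 0.0]
```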
7. Online Q-Learning and Exploration
Online Q-learning is the per-transition version of FQI.
7.1 Single-step update
For transition $(s_t,a_t,r_t,s_{t+1})$:
\[y_t=r_t+\gamma\max_{a'}Q_\phi(s_{t+1},a')\]
\[\delta_t = Q_\phi(s_t,a_t)-y_t\]
Then do one gradient step to reduce $\delta_t^2$.
7.2 Step-by-step propagation example
Initialize all $Q=0$. Let $\alpha=0.5$.
Observed transitions in order:
- $s_1, \text{right}, r=3, \text{terminal}$
- $s_0, \text{right}, r=0, s_1$
Update 1 on $(s_1,\text{right})$:
\[y=3, \quad Q_{\text{new}}(s_1,\text{right})=0+0.5(3-0)=1.5\]
Update 2 on $(s_0,\text{right})$:
\[y=0+0.9\cdot\max(Q(s_1,\text{left}),Q(s_1,\text{right}))=0.9\cdot1.5=1.35\]
\[Q_{\text{new}}(s_0,\text{right})=0+0.5(1.35-0)=0.675\]
You can see reward information moving backward from $s_1$ to $s_0$.
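The two updates above, as a tabular sketch (`q_update` is a hypothetical helper implementing the single-step rule):

```python
gamma, alpha = 0.9, 0.5
Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def q_update(s, a, r, s_next):
    """One tabular Q-learning step: Q <- Q + alpha * (y - Q)."""
    y = r if s_next is None else r + gamma * max(Q[(s_next, b)] for b in ("left", "right"))
    Q[(s, a)] += alpha * (y - Q[(s, a)])

q_update("s1", "right", 3.0, None)  # update 1: Q(s1,right) -> 1.5
q_update("s0", "right", 0.0, "s1")  # update 2: Q(s0,right) -> 0.675
print(Q[("s1", "right")], Q[("s0", "right")])
```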
7.3 Why exploration is necessary
If you always pick current argmax early, you may never try better actions.
Epsilon-greedy rule
- with probability $1-\epsilon$: greedy action
- with probability $\epsilon$: random action
Boltzmann rule
\[\pi(a \mid s) \propto \exp\left(\frac{Q_\phi(s,a)}{\tau}\right)\]
- lower $\tau$: more greedy
- higher $\tau$: more exploratory
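Both exploration rules in a minimal sketch (the max-subtraction in the Boltzmann weights is a standard numerical-stability trick, not something the lecture specifies):

```python
import math
import random

def epsilon_greedy(q: dict, epsilon: float) -> str:
    """With probability 1-epsilon take the argmax action, else a random one."""
    if random.random() < epsilon:
        return random.choice(list(q))
    return max(q, key=q.get)

def boltzmann_probs(q: dict, tau: float) -> dict:
    """pi(a|s) proportional to exp(Q(s,a)/tau); lower tau -> greedier."""
    m = max(q.values())  # subtract the max for numerical stability
    w = {a: math.exp((v - m) / tau) for a, v in q.items()}
    z = sum(w.values())
    return {a: wa / z for a, wa in w.items()}

q = {"left": 1.2, "right": 2.1}
print(boltzmann_probs(q, tau=0.5))   # right gets most of the mass
print(boltzmann_probs(q, tau=10.0))  # nearly uniform
```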
8. Why Deep Value Methods Can Be Unstable
This is the mathematically hard part of the lecture.
8.1 Two operators in deep value learning
Think of each iteration as two steps:
- Bellman backup $\mathcal{B}$: compute ideal targets
- Projection $\Pi$: fit those targets with limited model class (network)
Combined update is roughly:
\[V \leftarrow \Pi\,\mathcal{B}V\]
8.2 Concrete intuition for “mixed norms” problem
Suppose true optimum is $V^\star=[10,0]$.
- Bellman-like step gives candidate $[9,1]$ (very close to optimum)
- Your model class only allows constant vectors, so projection gives $[5,5]$
What happened:
- The projection is optimal in Euclidean ($\ell_2$) distance: $[5,5]$ is the best constant fit to the candidate $[9,1]$.
- But the Bellman contraction guarantee is in the max ($\infty$) norm, and in that norm $[5,5]$ is farther from $V^\star=[10,0]$ than $[9,1]$ was ($5$ versus $1$).
So “each substep looks reasonable” does not imply global convergence: the two operators contract in different norms, and their composition need not contract in either.
8.3 Why Q-learning is not full gradient descent
Target depends on the same parameters:
\[y_i=r_i+\gamma\max_{a'}Q_\phi(s'_i,a')\]
As $\phi$ changes, $y_i$ changes too. Standard Q-learning treats $y_i$ as fixed within each step, so this is called a semi-gradient method.
8.4 Step-by-step moving target example
Let $r=1$, $\gamma=0.9$.
If max next-Q estimate changes during training:
- A: max next-Q $=4.0$ gives $y=4.6$
- B: max next-Q $=5.0$ gives $y=5.5$
- C: max next-Q $=3.5$ gives $y=4.15$
The target itself moves. That is why target networks and replay buffers are so important in practice.
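A tabular sketch of the moving target and the target-network fix (the dictionaries stand in for networks; the numbers are the A/B/C estimates above):

```python
gamma, r = 0.9, 1.0

# The online estimate of max_a' Q(s', a') drifts during training (A, B, C above),
# so the regression target y = r + gamma * max_next_q drifts with it.
for max_next_q in (4.0, 5.0, 3.5):
    print(r + gamma * max_next_q)  # the target moves as the estimate moves

# Fix: compute targets from a frozen copy, synced only occasionally.
Q_online = {"a1": 4.0}
Q_target = dict(Q_online)          # frozen target-network copy
Q_online["a1"] = 5.0               # online network keeps training...
y = r + gamma * max(Q_target.values())
print(y)                           # ...but the target stays put until the next sync
```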
9. End-to-End Mini Trace (From Data to Policy)
This ties all sections together in one short run.
Step 1: collect transitions
$(s_0,\text{right},0,s_1)$ and $(s_1,\text{right},3,\text{terminal})$.
Step 2: update $Q(s_1,\text{right})$ upward from reward 3.
Step 3: each later update of $(s_0,\text{right})$ bootstraps from the growing value at $s_1$.
Step 4: after enough such updates, $Q(s_0,\text{right})$ exceeds $Q(s_0,\text{left})=1$.
Step 5: greedy policy flips to right at $s_0$.
This is the core mechanism of value-based RL:
- learn values from data
- propagate future reward backward
- choose actions via argmax
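The whole trace can be run with the tabular rule from Section 7 (a sketch; 30 replay sweeps is an arbitrary choice, and $Q(s_0,\text{left})=1$ is taken as given):

```python
gamma, alpha = 0.9, 0.5
Q = {("s0", "left"): 1.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def update(s, a, r, s_next):
    """Single-step Q-learning update with bootstrapped target."""
    y = r if s_next is None else r + gamma * max(Q[(s_next, b)] for b in ("left", "right"))
    Q[(s, a)] += alpha * (y - Q[(s, a)])

# Replay the two transitions many times; reward propagates s1 -> s0.
for _ in range(30):
    update("s1", "right", 3.0, None)  # Q(s1,right) -> 3
    update("s0", "right", 0.0, "s1")  # Q(s0,right) -> 0.9 * 3 = 2.7

best = max(("left", "right"), key=lambda a: Q[("s0", a)])
print(best)  # greedy policy at s0 has flipped to right
```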
10. Practical Reading Guide
If Section 8 feels hard, this sequence usually works better:
- Understand Section 2 toy MDP completely.
- Practice Section 6 target calculations by hand.
- Follow Section 7 online updates step by step.
- Only then read Section 8 theory.
11. Final Takeaways
- Value-based RL turns control into repeated Bellman target fitting.
- Tabular methods are clean and convergent because Bellman backup is a contraction.
- FQI and online Q-learning make the method model-free and data-efficient.
- Exploration is mandatory for discovering high-value actions.
- Deep function approximation is powerful but introduces instability and moving-target issues.
