π§ Imitation Learning via Privileged Teachers and Generative Models like Diffusions
This lecture builds upon previous discussions on Imitation Learning (IL) and delves into advanced techniques and current research areas.
π 1. Recap of Previous Lecture (Imitation Learning Part 1)
Lecture 6 begins with a brief review of key concepts from Part 1:
- Notations, MDP, POMDP: Reinforcement learning foundations.
- What is imitation learning? Learning from expert demonstrations.
- Difference from supervised learning: Highlights ILβs sequential nature and distribution shift challenges.
- Does behavior cloning (BC) work? BC is the simplest IL method but has major limitations.
- Fixing BC:
- Better data: Improve expert demonstrations or augment them.
- Better algorithm: DAgger (Dataset Aggregation) helps fix distribution shift.
β 2. Key Themes of Imitation Learning in this lecture
π§ 2.1 Imitation Learning with Powerful Models
π§ Using History
- Problem: Naive IL models act based only on current observation: $p(a_t \mid o_t)$.
- Limitation: Human behavior is history-dependent. Reacting identically to the same frame is unnatural.
- Solution: Model $p(a_t \mid o_0, \dots, o_t)$ using time-series techniques from Lecture 4.
π Modeling Multi-modal Behavior
- Motivation: Expert behavior is often multi-modal (multiple correct actions).
- Methods:
- Gaussian Mixture Models
- Latent Variable Models (e.g., conditional VAE)
- Diffusion Models (details in later lectures)
- GANs and other generative techniques
π§βπ« 2.2 Imitation Learning with Privileged Teachers

What is the Privileged Teacher Approach?
The Privileged Teacher approach is a strategy in Imitation Learning (IL) where:
- A teacher agent is trained in a simulation environment using privileged state information (
s_t), which may not be available in the real world. - A student agent is then trained to imitate the teacher using only realistic observations (
o_t) β such as camera images or proprioception β to reflect what it would experience in the real world.
π‘ Why Use a Privileged Teacher?
| Component | Description |
|---|---|
| Teacher Policy | $\pi_{\text{teacher}}(a_t \mid s_t)$, trained using full ground-truth state. |
| Student Policy | $\pi_{\text{student}}(a_t \mid o_t)$, trained using partial/realistic observation. |
| Training Flow | Teacher runs in sim β collects $(o_t, a_t)$ β student imitates this. |
| Environment | Simulation gives access to true state $s_t$, which is not available in real-world deployment. |
β Benefits of This Approach
- White-box Supervision: Teacher policy is transparent and modifiable.
- Efficient Optimization: Easier to learn optimal behavior from ground truth.
- Invariance: Enables generalization across varied conditions (e.g. day/night driving).
- Data Augmentation: Easier to simulate different conditions and viewpoints.
- Safer and Cheaper: No need to collect risky real-world data for student training.
π Applications of Privileged Teacher
π Driving
- Stage 1: Train a privileged teacher using full information (e.g., traffic lights, other carsβ positions, velocities).
- Stage 2: Student only sees sensor inputs like images or LiDAR and learns to imitate the teacher.
- Advantage: Safer than letting students learn directly in the real world with limited perception.
π¦Ώ Locomotion
- Privileged Information: Contact forces, terrain maps, external disturbances, etc. (available in simulation).
- Student Inputs: Proprioceptive sensors β IMU, joint angles, velocities.
- Training: Privileged agent (teacher) is trained using RL (e.g., PPO), then student learns to imitate using only partial observation.
- Example: Widely used in simulated quadrupeds and humanoids.
β¨ Student Learning Not in Action Space
- Variants:
- Latent Space Learning: Student learns to match latent representations instead of raw actions.
- Example: RMA (Recurrent Modulation Agent).
- Perception Space Learning: Student learns to predict intermediate observations (like sensor rays).
- Example: ABS (Adaptive Behavior Shaping).
- Latent Space Learning: Student learns to match latent representations instead of raw actions.
- Why?: Useful when action space is too complex, or when alignment in latent/perception space is more robust.
π Drones
- Teacher: MPC controller (Model Predictive Control) trained with access to full state (e.g., exact position, velocity, wind).
- Student: Learns from camera or IMU data only.
- Benefit: Allows training high-performance policies in simulation and transferring them safely to drones with only onboard sensors.
π§ Summary
Privileged teacher methods elegantly combine:
- Powerful policy learning (via RL + full state) in simulation,
- With safe and realistic student learning (via IL) using real-world-like observations.
This method is a key tool in sim-to-real transfer, data-efficient learning, and robust control under partial observability.
π¨ 2.3 Deep Imitation Learning via Generative Models

βA very active research area!β
Deep Imitation Learning (IL) via Generative Models is a highly active research area in robot learning. This approach integrates generative models into imitation learning to help robots learn complex behaviors from expert demonstrations.
π² Generative Models in Imitation Learning
Generative models are characterized by their ability to:
- Learn a distribution $( p_\theta )$ that approximates a given data distribution $( p_{\text{data}} )$.
In imitation learning, $( p_{\text{data}} )$ comes from expert demonstrations. These models are often conditional. - Generate novel samples $( \tilde{x} \sim p_\theta )$ once the distribution is learned.
The learned distribution can represent different levels of expert behavior:
- Action level (conditional): $( p(a_t \mid o_t) )$
- Observation-action pair: $( p(o_t, a_t) )$ or state-action: $( p(s_t, a_t) )$
- A sequence of actions
- A full state-action trajectory
Key Benefit: These models can capture multi-modal behavior, common in expert demonstrations (e.g., multiple valid actions in a driving scenario).
βοΈ Generative Adversarial Imitation Learning (GAIL)
GAIL integrates Generative Adversarial Networks (GANs) with imitation learning.
Core Loop:
- Sample trajectories from the student policy.
- Update the discriminator to distinguish expert vs. student trajectories.
- Train the student policy to fool the discriminator: $[ \pi_{\text{student}} \approx \pi_{\text{expert}} \Rightarrow D(\tau_{\text{student}}) \approx D(\tau_{\text{expert}}) ]$
Goal: The student learns to imitate the expert by minimizing discriminator accuracy.
π‘ Open question: Can GAIL model multi-modal behavior effectively?
ποΈ VAE + Imitation Learning (VAE+IL)
Variational Autoencoders (VAEs) are also used in IL to model latent expert behavior.
Example: Action Chunking with Transformers (ACT)
- Based on Conditional VAE (CVAE).
- Encoder: $[ z = \text{Encoder}(a_{1:T}, o) ]$
- Decoder: $[ \hat{a}_{1:T} = \text{Decoder}(z, oβ) ]$
- Supports action chunking and temporal ensemble, useful for long-horizon planning.
π«οΈ Diffusion Models + Imitation Learning
Diffusion models are powerful generative models recently adopted for IL.
Two Main Steps:
-
Forward Diffusion Process: Gradually corrupt real data $( x_0 )$ with Gaussian noise: $[ q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) ]$ repeated over steps $( t = 1, \dots, T )$, until pure noise $( x_T )$.
-
Reverse Diffusion Process: Learn reverse process: $[ p_\theta(x_{t-1} \mid x_t) ]$ that reconstructs $( x_0 )$ from noise $( x_T )$ by denoising step-by-step.
π§ͺ Theoretical Insights:
- Related to stochastic gradient Langevin dynamics: $[ x_{t+1} = x_t + \frac{\epsilon}{2} \nabla \log p(x_t) + \sqrt{\epsilon} \cdot \eta, \quad \eta \sim \mathcal{N}(0, I) ]$
- In the limit $( t \to \infty, D \to 0 )$, the generated samples match the true data distribution.
Applications:
- Trajectory-level modeling: Learn entire action trajectories (e.g., diffuser).
- Enables high-fidelity and diverse imitation, capturing subtle expert nuances.
π§ Summary: Generative Models + IL
Generative models integrate well with IL, offering tools for uncertainty, diversity, and realistic sequence generation. Generative models provide a flexible, expressive framework to enhance imitation learning β especially in robotics, where behavior is structured, dynamic, and multi-modal.
| Model Type | Learn Distribution Over | Example |
|---|---|---|
| GAN | $( P(o_t, a_t) )$ or full trajectories | GAIL |
| VAE | $P(a_t \mid o_t)$, action chunks | ACT |
| Diffusion | Trajectories, sequences | Diffuser |
- IL + GANs: Learns to imitate via adversarial feedback; matches expert distribution.
- IL + VAEs: Captures latent structure of expert behavior; supports sequence prediction.
- IL + Diffusion: Enables realistic and diverse trajectory generation; well-suited for robotics.
3. Diffusion Policy in Imitation Learning More Details
Diffusion Policy treats trajectory generation as a generative modeling problem:
- It models the full distribution of action sequences conditioned on observations.
- Uses a denoising diffusion process to sample high-quality trajectories that imitate expert behavior.
π§© Model Inputs and Outputs
Input to the model:
- $( o_{1:T} )$: Observation history (e.g., images, joint positions, proprioception) β usually fixed throughout the denoising process.
- $( a_{1:T}^{(T)} )$: A fully noised version of the action trajectory (sampled from Gaussian noise).
- (Optional) time step $( t )$ β used as a conditioning variable to guide denoising steps.
Output of the model:
- $( \hat{a}_{1:T}^{(t-1)} )$: The predicted denoised action sequence at step $( t-1 )$, given the noisy sequence at step $( t )$.
π Training Procedure
- Data Collection:
- Collect expert trajectories: pairs of $( (o_{1:T}, a_{1:T}) )$
- Forward Process (Noise Addition):
- Sample noise level $( t \sim {1, β¦, T} )$
- Corrupt expert actions: \(a_{1:T}^{(t)} = \sqrt{\bar{\alpha}_t} \cdot a_{1:T} + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon,\quad \epsilon \sim \mathcal{N}(0, I)\)
- Reverse Process (Training Objective):
- Train the model to predict the added noise $( \epsilon )$ or the clean action: \(\mathcal{L} = \mathbb{E}_{a_{1:T}, \epsilon, t} \left[ \left\| \hat{\epsilon}_\theta(o_{1:T}, a_{1:T}^{(t)}, t) - \epsilon \right\|^2 \right]\)
π Inference (Trajectory Generation)
At test time:
- Sample pure Gaussian noise: $( a_{1:T}^{(T)} \sim \mathcal{N}(0, I) )$
-
Iteratively denoise using the trained model: $[ a_{1:T}^{(t-1)} = f_\theta(a_{1:T}^{(t)}, o_{1:T}, t) ]$ from $( t = T \rightarrow 0 )$
- Final output: $( \hat{a}_{1:T}^{(0)} )$, the predicted action trajectory.
π¦ Summary Table
| Component | Description |
|---|---|
| Input | Observation $( o_{1:T} )$, noisy trajectory $( a_{1:T}^{(t)} )$ |
| Output | Denoised trajectory $( \hat{a}_{1:T}^{(t-1)} )$ or noise estimate |
| Conditioned On | Observation sequence and timestep $( t )$ |
| Training Target | Minimize noise prediction or denoising error |
| Inference Output | High-quality full trajectory $( a_{1:T} )$ |
| Benefit | Models multi-modal, high-fidelity expert behavior across time |
π§ Why Diffusion Policy is Useful in Robotics
- Multi-modality: Can represent many valid behaviors (e.g., different grasps).
- Stability: Better than autoregressive models which accumulate error.
- Trajectory-level control: Avoids myopic step-by-step prediction.
