π0 Architecture Anatomy
π0 is a Vision-Language-Action (VLA) model for robot control. The key idea is to combine a pretrained vision-language backbone with a continuous action generator. Instead of asking a language model to emit robot actions as text tokens, π0 predicts a velocity field that transforms random action noise into a smooth chunk of future robot actions.
This post explains the architecture as implemented in the Physical Intelligence OpenPI repository, drawing mainly on:
- src/openpi/models/model.py
- src/openpi/models/pi0_config.py
- src/openpi/models/pi0.py
- src/openpi/models/gemma.py
The OpenPI README describes π0 as a flow-based vision-language-action model with base checkpoints pretrained on 10k+ hours of robot data. The code implements three related families: π0, π0-FAST, and π0.5. This post focuses on the original π0 flow-matching model.
1. The Mental Model
π0 has two main parts:
| Part | Role | Typical width |
|---|---|---|
| PaliGemma VLM expert | Encodes images and language | 2048 |
| Action expert | Encodes robot state, noisy actions, and flow timestep | 1024 |
The model is not a simple pipeline where the VLM produces a sentence and another module converts that sentence into motion. It is closer to a multi-expert transformer:
- Image tokens and language tokens go through the PaliGemma expert.
- Robot state and noisy action tokens go through the action expert.
- Both experts participate in the same transformer attention layers.
- The action-token outputs are projected into continuous action velocities.
In code, the idea is roughly:
paligemma_config = get_config("gemma_2b")
action_config = get_config("gemma_300m")
llm = GemmaModule(
    configs=[paligemma_config, action_config],
    embed_dtype="bfloat16",
)
The first expert loads the PaliGemma/Gemma-style weights. The second expert is an action expert with the same transformer depth but smaller hidden width.
2. Default Input and Output Shapes
The default π0 config in OpenPI is:
| Config field | Value | Meaning |
|---|---|---|
| action_dim | 32 | Maximum robot action/state dimension |
| action_horizon | 50 | Number of future action steps predicted per chunk |
| max_token_len | 48 | Maximum prompt token length for π0 |
| IMAGE_RESOLUTION | 224 x 224 | Model image resolution |
| IMAGE_KEYS | 3 camera names | base, left wrist, right wrist |
The expected observation dictionary looks like this:
example = {
    "image": {
        "base_0_rgb": float32[B, 224, 224, 3],
        "left_wrist_0_rgb": float32[B, 224, 224, 3],
        "right_wrist_0_rgb": float32[B, 224, 224, 3],
    },
    "image_mask": {
        "base_0_rgb": bool[B],
        "left_wrist_0_rgb": bool[B],
        "right_wrist_0_rgb": bool[B],
    },
    "state": float32[B, 32],
    "tokenized_prompt": int32[B, 48],
    "tokenized_prompt_mask": bool[B, 48],
    "actions": float32[B, 50, 32],
}
For inference, the model receives the observation without the actions entry (which is only a training target) and returns:
predicted_actions = float32[B, 50, 32]
For a particular robot, only the leading entries of the 32-dimensional vector may be meaningful. A 7-DoF arm, for example, uses a subset of the action vector, while unused dimensions are padded or ignored by the data transforms.
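The padding idea can be sketched in a few lines of NumPy. This is an illustration rather than OpenPI's transform code: pad_last_dim is a hypothetical helper, and zero-filling the unused dimensions is an assumption.

```python
import numpy as np

MODEL_ACTION_DIM = 32  # default action_dim from the config table


def pad_last_dim(x: np.ndarray, target_dim: int = MODEL_ACTION_DIM) -> np.ndarray:
    """Zero-pad the last axis of x up to target_dim (hypothetical helper)."""
    current = x.shape[-1]
    if current > target_dim:
        raise ValueError(f"last dim {current} exceeds target {target_dim}")
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, target_dim - current)]
    return np.pad(x, pad_width)  # default mode pads with zeros


# A 7-DoF arm: 50 future steps of 7-dimensional actions for a batch of 2.
native_actions = np.ones((2, 50, 7), dtype=np.float32)
padded = pad_last_dim(native_actions)

# After inference, keep only the dimensions the robot actually uses.
recovered = padded[..., :7]
```

The same helper would apply to the state vector, since state and action share the 32-dimensional envelope.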
3. Observation Preprocessing
OpenPI standardizes observations before the model sees them.
Images are expected as RGB arrays. If they arrive as uint8, they are converted from [0, 255] into [-1, 1]. During training, the code can apply image augmentation:
- crop and resize for non-wrist cameras;
- small rotations;
- color jitter;
- resize-with-padding to 224 x 224.
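The uint8 conversion described above amounts to a single affine rescale. A minimal sketch, assuming NumPy arrays; to_model_range is a hypothetical name, and float inputs are assumed to be in [-1, 1] already:

```python
import numpy as np


def to_model_range(image: np.ndarray) -> np.ndarray:
    """Map a uint8 RGB image in [0, 255] to float32 in [-1, 1]."""
    if image.dtype == np.uint8:
        return image.astype(np.float32) / 255.0 * 2.0 - 1.0
    # Assumption: non-uint8 inputs are already scaled to [-1, 1].
    return image.astype(np.float32)


raw = np.array([[[0, 128, 255]]], dtype=np.uint8)  # one pixel, H = W = 1
scaled = to_model_range(raw)
```

The augmentations in the list are training-only and are left out of this sketch.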
The model expects three image keys:
IMAGE_KEYS = (
    "base_0_rgb",
    "left_wrist_0_rgb",
    "right_wrist_0_rgb",
)
This means a custom dataset must either provide these keys or use transforms that rename/fill them. Missing cameras can be represented with image masks, but the model input schema still expects the standard names.
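One way a dataset transform could satisfy the schema is sketched below. fill_missing_cameras is a hypothetical helper, and using an all-zero placeholder image with a False mask for an absent camera is an assumption about a reasonable convention, not a statement of what OpenPI's transforms do.

```python
import numpy as np

IMAGE_KEYS = ("base_0_rgb", "left_wrist_0_rgb", "right_wrist_0_rgb")


def fill_missing_cameras(images: dict, batch_size: int, resolution=(224, 224)):
    """Return (images, image_masks) covering every key in IMAGE_KEYS.

    Missing cameras get an all-zero placeholder and a False mask (assumption).
    """
    filled, masks = {}, {}
    for key in IMAGE_KEYS:
        if key in images:
            filled[key] = images[key]
            masks[key] = np.ones(batch_size, dtype=bool)
        else:
            filled[key] = np.zeros((batch_size, *resolution, 3), dtype=np.float32)
            masks[key] = np.zeros(batch_size, dtype=bool)
    return filled, masks


# A single-arm robot with a base camera and one wrist camera only.
obs_images = {
    "base_0_rgb": np.zeros((4, 224, 224, 3), dtype=np.float32),
    "left_wrist_0_rgb": np.zeros((4, 224, 224, 3), dtype=np.float32),
}
images, image_masks = fill_missing_cameras(obs_images, batch_size=4)
```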
4. Prefix Tokens: Images and Language
OpenPI splits the transformer input into two conceptual regions:
- prefix: scene and instruction tokens;
- suffix: robot state and action tokens.
The prefix is built by embed_prefix.
Image Tokens
Each 224 x 224 image is passed through the SigLIP image encoder. With a 14 x 14 patch size, one image becomes roughly:
image_tokens: float[B, 256, 2048]
With three cameras:
all_image_tokens: float[B, 768, 2048]
Each image also has a mask:
image_mask: bool[B]
expanded_image_mask: bool[B, 256]
The mask tells the transformer whether a camera view is real or padded/missing.
Language Tokens
The prompt is already tokenized before entering the model:
tokenized_prompt: int32[B, 48]
tokenized_prompt_mask: bool[B, 48]
The Gemma embedder maps token ids to vectors:
text_tokens: float[B, 48, 2048]
So the full π0 prefix is usually:
prefix_tokens = concat([3 camera token groups, text_tokens], axis=1)
prefix_tokens: float[B, 816, 2048] # 768 image + 48 text
prefix_mask: bool[B, 816]
The prefix is handled by the PaliGemma expert, whose width is 2048.
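The prefix length follows directly from the image and patch sizes. The arithmetic, using only the numbers quoted in this section:

```python
IMAGE_SIZE = 224      # pixels per side
PATCH_SIZE = 14       # pixels per patch side
NUM_CAMERAS = 3
MAX_TOKEN_LEN = 48    # prompt tokens

patches_per_side = IMAGE_SIZE // PATCH_SIZE          # 16
tokens_per_image = patches_per_side ** 2             # 256
image_tokens = NUM_CAMERAS * tokens_per_image        # 768
prefix_len = image_tokens + MAX_TOKEN_LEN            # 816

print(tokens_per_image, image_tokens, prefix_len)    # 256 768 816
```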
5. Suffix Tokens: State, Noisy Actions, and Time
The suffix is built by embed_suffix.
For original π0, the suffix contains:
- one state token;
- 50 noisy action tokens.
State Token
The low-dimensional robot state has shape:
state: float[B, 32]
The state projection maps it into the action expert width:
state_token = Linear(32 -> 1024)(state)
state_token: float[B, 1, 1024]
This token gives the action expert the current proprioceptive state: joint positions, gripper state, base state, or whatever the dataset transform packs into the 32-dimensional state vector.
Noisy Action Tokens
During training, the model receives a corrupted version of the ground-truth action chunk:
noisy_actions: float[B, 50, 32]
Each action vector is projected into the action expert width:
action_tokens = Linear(32 -> 1024)(noisy_actions)
action_tokens: float[B, 50, 1024]
Flow Timestep Embedding
π0 is a flow-matching model, so it also needs to know the noise level t.
OpenPI uses a sinusoidal embedding for the scalar timestep:
time: float[B]
time_emb: float[B, 1024]
Then the same time embedding is repeated across the 50 action positions:
time_tokens: float[B, 50, 1024]
The action token and time token are concatenated:
action_plus_time: float[B, 50, 2048]
Then a small MLP compresses it back to the action expert width:
action_time_tokens = MLP(2048 -> 1024 -> 1024)(action_plus_time)
action_time_tokens: float[B, 50, 1024]
So the full suffix for π0 is:
suffix_tokens = concat([state_token, action_time_tokens], axis=1)
suffix_tokens: float[B, 51, 1024]
suffix_mask: bool[B, 51]
The suffix is handled by the action expert, whose width is 1024.
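To make the suffix shapes concrete, here is a NumPy sketch with random weights standing in for the learned layers. It is not the OpenPI code: the helper names, the period range of the sinusoidal embedding, and the swish activation inside the MLP are all assumptions chosen for illustration.

```python
import numpy as np

B, HORIZON, ACTION_DIM, WIDTH = 2, 50, 32, 1024
rng = np.random.default_rng(0)


def sincos_embedding(t, dim, min_period=4e-3, max_period=4.0):
    """Sinusoidal embedding of scalar timesteps: [B] -> [B, dim].

    The period range is an illustrative assumption.
    """
    fraction = np.linspace(0.0, 1.0, dim // 2)
    period = min_period * (max_period / min_period) ** fraction
    angles = t[:, None] * (2.0 * np.pi / period)[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


def linear(x, out_dim):
    """Randomly initialised linear layer, a stand-in for learned weights."""
    w = rng.normal(scale=x.shape[-1] ** -0.5, size=(x.shape[-1], out_dim))
    return x @ w


def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)


state = rng.normal(size=(B, ACTION_DIM))
noisy_actions = rng.normal(size=(B, HORIZON, ACTION_DIM))
t = rng.uniform(size=(B,))

state_token = linear(state, WIDTH)[:, None, :]                 # [B, 1, 1024]
action_tokens = linear(noisy_actions, WIDTH)                   # [B, 50, 1024]
time_emb = sincos_embedding(t, WIDTH)                          # [B, 1024]
time_tokens = np.repeat(time_emb[:, None, :], HORIZON, axis=1)  # [B, 50, 1024]
action_plus_time = np.concatenate([action_tokens, time_tokens], axis=-1)
hidden = swish(linear(action_plus_time, WIDTH))                # 2048 -> 1024
action_time_tokens = linear(hidden, WIDTH)                     # 1024 -> 1024
suffix_tokens = np.concatenate([state_token, action_time_tokens], axis=1)
```

Every action position receives the same time embedding, so the only per-position difference entering the MLP is the noisy action itself.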
6. The Attention Mask: How Prefix and Suffix Communicate
The most important implementation detail is the attention mask.
π0 does not use a separate encoder-decoder cross-attention layer. Instead, it uses a shared transformer stack with a blockwise causal self-attention mask.
Conceptually:
| Query token | Can attend to image/language prefix? | Can attend to state? | Can attend to action tokens? |
|---|---|---|---|
| Image/language token | Yes, bidirectionally inside prefix | No | No |
| State token | Yes | Yes | No |
| Action token | Yes | Yes | Yes, within the action block |
This gives the action tokens access to the visual scene, language instruction, and current robot state, while preventing robot-action tokens from changing the prefix representation.
The code builds this behavior with two masks:
input_mask: bool[B, sequence_length]
ar_mask: bool[sequence_length]
Then it turns them into:
attn_mask: bool[B, sequence_length, sequence_length]
For a default π0 forward pass:
prefix_length = 816
suffix_length = 51
total_length = 867
attn_mask: bool[B, 867, 867]
The prefix uses False entries in ar_mask, meaning prefix tokens share the same attention block and can attend bidirectionally. The state token starts a new block. The first action token starts another block, and the remaining action tokens share that action block.
That is why this is better described as masked self-attention across two experts, not ordinary encoder-decoder cross-attention.
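The block structure can be reproduced with a cumulative sum over ar_mask: each True entry opens a new block, and a query may attend to any key whose block index is not later than its own. The NumPy sketch below follows that idea; blockwise_attn_mask is a name chosen here, and the code is a re-sketch rather than a copy of the OpenPI function.

```python
import numpy as np


def blockwise_attn_mask(input_mask: np.ndarray, ar_mask: np.ndarray) -> np.ndarray:
    """input_mask: bool[B, N], True for real tokens.
    ar_mask: bool[N], True where a token starts a new attention block.
    Returns bool[B, N, N]; entry [b, q, k] is True if query q may attend to key k.
    """
    block_id = np.cumsum(np.broadcast_to(ar_mask, input_mask.shape), axis=1)
    # A query sees every key whose block index is not later than its own.
    causal = block_id[:, None, :] <= block_id[:, :, None]
    valid = input_mask[:, None, :] & input_mask[:, :, None]
    return causal & valid


prefix_len, horizon = 816, 50
# Prefix: one shared block. State: new block. Actions: one new shared block.
ar_mask = np.array([False] * prefix_len + [True] + [True] + [False] * (horizon - 1))
input_mask = np.ones((1, prefix_len + 1 + horizon), dtype=bool)
attn_mask = blockwise_attn_mask(input_mask, ar_mask)

state_idx, first_action = prefix_len, prefix_len + 1
```

With these inputs the prefix rows are False in every suffix column, the state row is False in every action column, and the action rows are True everywhere, which matches the table above.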
7. The Multi-Expert Gemma Transformer
OpenPI’s Gemma module supports multiple experts. In π0, there are two:
| Expert index | Token group | Width | Depth | MLP dim | Heads | KV heads | Head dim |
|---|---|---|---|---|---|---|---|
| 0 | PaliGemma prefix | 2048 | 18 | 16384 | 8 | 1 | 256 |
| 1 | Action suffix | 1024 | 18 | 4096 | 8 | 1 | 256 |
The two experts have different widths and MLPs, but they share attention compatibility:
- same number of attention heads;
- same number of KV heads;
- same head dimension;
- same transformer depth.
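The compatibility requirement can be written down as a small check. ExpertConfig and can_share_attention are hypothetical names that mirror the columns of the table; they are not OpenPI classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertConfig:
    """Hypothetical container mirroring the columns of the expert table."""
    width: int
    depth: int
    mlp_dim: int
    num_heads: int
    num_kv_heads: int
    head_dim: int

    @property
    def attn_dim(self) -> int:
        # Size of the concatenated query heads, independent of width.
        return self.num_heads * self.head_dim


def can_share_attention(a: ExpertConfig, b: ExpertConfig) -> bool:
    """True when tokens from both experts can sit in one attention call."""
    return (a.depth, a.num_heads, a.num_kv_heads, a.head_dim) == (
        b.depth, b.num_heads, b.num_kv_heads, b.head_dim)


paligemma = ExpertConfig(width=2048, depth=18, mlp_dim=16384,
                         num_heads=8, num_kv_heads=1, head_dim=256)
action_expert = ExpertConfig(width=1024, depth=18, mlp_dim=4096,
                             num_heads=8, num_kv_heads=1, head_dim=256)

print(can_share_attention(paligemma, action_expert))   # True
print(paligemma.attn_dim, action_expert.attn_dim)      # 2048 2048
```

Note that the action expert projects its 1024-wide tokens up to the same 8 x 256 query space as the wider expert, which is what makes the concatenation possible.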
Inside attention:
prefix: float[B, prefix_len, 2048] -> q/k/v heads
suffix: float[B, suffix_len, 1024] -> q/k/v heads
concat q/k/v along sequence dimension
apply masked attention
split outputs back into prefix and suffix streams
The attention mechanism is therefore the place where visual-language context and action tokens interact.
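A toy NumPy version of this joint attention step is sketched below. It uses small made-up sizes, random weights, and a single shared K/V head, and it omits rotary position embeddings, normalization, and residuals, so it illustrates the data flow only and should not be read as the OpenPI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
HEADS, HEAD_DIM = 2, 4   # toy sizes; the table lists 8 heads of dim 256
WIDTHS = (16, 8)         # toy widths standing in for 2048 and 1024


def init_expert(width):
    """Per-expert q/k/v/output projections (one shared K/V head)."""
    s = width ** -0.5
    return {
        "q": rng.normal(scale=s, size=(width, HEADS, HEAD_DIM)),
        "k": rng.normal(scale=s, size=(width, HEAD_DIM)),
        "v": rng.normal(scale=s, size=(width, HEAD_DIM)),
        "o": rng.normal(scale=s, size=(HEADS, HEAD_DIM, width)),
    }


def joint_attention(streams, params, mask):
    """streams: list of [B, T_i, width_i]; mask: bool[B, T, T], T = sum(T_i)."""
    pairs = list(zip(streams, params))
    q = np.concatenate([np.einsum("btd,dhk->bthk", x, p["q"]) for x, p in pairs], axis=1)
    k = np.concatenate([x @ p["k"] for x, p in pairs], axis=1)
    v = np.concatenate([x @ p["v"] for x, p in pairs], axis=1)
    logits = np.einsum("bqhk,bsk->bhqs", q, k) / np.sqrt(HEAD_DIM)
    logits = np.where(mask[:, None, :, :], logits, -1e9)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attended = np.einsum("bhqs,bsk->bqhk", weights, v)   # [B, T, HEADS, HEAD_DIM]
    # Split back per expert and apply each expert's own output projection.
    outputs, start = [], 0
    for x, p in pairs:
        end = start + x.shape[1]
        outputs.append(np.einsum("bthk,hkd->btd", attended[:, start:end], p["o"]))
        start = end
    return outputs


B, P, S = 1, 5, 3
prefix = rng.normal(size=(B, P, WIDTHS[0]))
suffix = rng.normal(size=(B, S, WIDTHS[1]))
params = [init_expert(w) for w in WIDTHS]

# Simplified mask: prefix attends to prefix only; suffix attends to everything.
mask = np.zeros((B, P + S, P + S), dtype=bool)
mask[:, :P, :P] = True
mask[:, P:, :] = True

prefix_out, suffix_out = joint_attention([prefix, suffix], params, mask)
# Changing the suffix leaves the prefix output untouched under this mask.
prefix_out_2, _ = joint_attention([prefix, suffix + 1.0], params, mask)
```

The last two lines show the property that later makes prefix caching valid: the prefix stream never depends on suffix tokens.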
What Happens in One Transformer Block
Each block follows the usual transformer pattern:
- RMSNorm.
- Multi-head attention.
- Residual connection.
- RMSNorm.
- Feed-forward network.
- Residual connection.
For π0, the prefix and suffix use different parameters for projections and MLPs, but attention is computed over the combined sequence. In the source code, later expert parameters receive suffixes like _1; this lets the first expert load PaliGemma weights while the action expert is initialized separately.
8. Output Head: Predicting a Velocity Field
After the transformer, π0 only uses the final hidden states for the action tokens:
suffix_out: float[B, 51, 1024]
action_hidden = suffix_out[:, -50:]
action_hidden: float[B, 50, 1024]
Then a final linear layer predicts the flow velocity:
v_t = Linear(1024 -> 32)(action_hidden)
v_t: float[B, 50, 32]
This is the output of the network during training: not actions directly, but a vector field telling the model how to move the current noisy action chunk.
9. Flow Matching Loss
Flow matching trains the model to denoise action trajectories by predicting a velocity field.
Let:
- a be the clean expert action chunk;
- epsilon be Gaussian noise;
- t be the sampled noise level;
- x_t be the interpolated noisy action;
- u_t be the target velocity.
OpenPI uses:
epsilon = Normal(0, 1)
t ~ Beta(1.5, 1)
x_t = t * epsilon + (1 - t) * a
u_t = epsilon - a
Shapes:
a: float[B, 50, 32]
epsilon: float[B, 50, 32]
t: float[B]
x_t: float[B, 50, 32]
u_t: float[B, 50, 32]
The model receives (observation, x_t, t) and predicts:
v_theta(x_t, t, observation): float[B, 50, 32]
The loss is mean squared error over the action dimension:
loss_per_step = mean((v_theta - u_t) ** 2, axis=-1)
loss_per_step: float[B, 50]
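The training target can be built and checked in NumPy without any network. The sketch below uses the interpolation and velocity definitions from this section; the function names are hypothetical, and a perfect and a zero predictor stand in for the model.

```python
import numpy as np

rng = np.random.default_rng(0)
B, HORIZON, ACTION_DIM = 4, 50, 32


def flow_matching_batch(actions, rng):
    """Build (x_t, t, u_t) from clean actions using the interpolation above."""
    eps = rng.normal(size=actions.shape)
    t = rng.beta(1.5, 1.0, size=actions.shape[0])
    t_b = t[:, None, None]                       # broadcast over steps and dims
    x_t = t_b * eps + (1.0 - t_b) * actions
    u_t = eps - actions
    return x_t, t, u_t


def flow_matching_loss(v_pred, u_t):
    """Mean squared error over the action dimension: returns [B, 50]."""
    return np.mean((v_pred - u_t) ** 2, axis=-1)


actions = rng.normal(size=(B, HORIZON, ACTION_DIM))
x_t, t, u_t = flow_matching_batch(actions, rng)

# Stand-ins for the network: a perfect predictor and a zero predictor.
loss_perfect = flow_matching_loss(u_t, u_t)
loss_zero = flow_matching_loss(np.zeros_like(u_t), u_t)

# Sanity check: stepping from x_t by -t along u_t recovers the clean actions.
reconstructed = x_t - t[:, None, None] * u_t
```

The reconstruction identity x_t - t * u_t = a is what makes the straight-line path easy to integrate at inference time.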
So π0 is trained to answer:
Given the scene, instruction, robot state, noisy future actions, and noise level, what velocity, followed backward in time from noise, carries this noisy action chunk to the clean demonstrated action chunk?
This is why π0 can generate continuous actions without discretizing the action space.
10. Inference: Euler Integration from Noise to Actions
At inference time, there is no clean action chunk. The model starts from Gaussian noise:
x_1 = Normal(0, 1): float[B, 50, 32]
Then it integrates backward from noise toward action space. OpenPI uses 10 Euler steps by default:
dt = -1.0 / num_steps
time = 1.0
x_t = x_1  # start from pure noise
for step in range(num_steps):
    v_t = model(observation, x_t, time)
    x_t = x_t + dt * v_t
    time = time + dt
return x_t
With num_steps = 10, the model repeatedly refines the full 50-step action chunk.
The convention in OpenPI is:
- t = 1: pure noise;
- t = 0: clean action.
The source code notes that this is the opposite time convention from the π0 paper, but the math is equivalent as long as the implementation is consistent.
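The Euler loop can be tested in isolation by swapping the network for an oracle. In the sketch below, euler_sample is a hypothetical helper and the oracle returns the exact straight-line velocity epsilon - a for one known (noise, target) pair; a trained model only approximates such a field, so this demonstrates the integrator and the t = 1 to t = 0 direction, not model quality.

```python
import numpy as np


def euler_sample(velocity_fn, x_start, num_steps=10):
    """Integrate dx/dt = v from t = 1 (noise) down to t = 0 (actions)."""
    dt = -1.0 / num_steps
    x, t = x_start, 1.0
    for _ in range(num_steps):
        x = x + dt * velocity_fn(x, t)
        t += dt
    return x


rng = np.random.default_rng(0)
target = rng.normal(size=(1, 50, 32))   # pretend demonstrated action chunk
noise = rng.normal(size=(1, 50, 32))    # x_1


def oracle(x, t):
    # Exact velocity for this single pair: u = epsilon - a (constant in t).
    return noise - target


sampled = euler_sample(oracle, noise)
```

Because the oracle velocity is constant, ten steps of size -0.1 land on the target up to floating-point error; a single step would do the same here, which is not true of a learned field.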
11. KV Cache: Why Inference Is Efficient
A naive implementation would recompute image and language features at every denoising step. π0 avoids that.
During inference:
- Run the prefix once:
prefix_tokens = image + language tokens
kv_cache = transformer(prefix_tokens)
- For each Euler step, only recompute the suffix:
suffix_tokens = state + noisy_action_tokens + time
v_t = transformer(suffix_tokens, kv_cache=kv_cache)
The prefix does not change across the 10 refinement steps: the camera images and the language prompt are fixed while the model denoises one action chunk. With the default 10 steps, reusing the prefix KV cache avoids 9 of the 10 prefix passes that a naive loop would run.
In shape terms:
cached_prefix_kv: stores prefix_len = 816 tokens
each suffix pass: processes suffix_len = 51 tokens
The suffix still attends to the prefix through the cached keys and values. This is the practical equivalent of cross-attending to the observation, but implemented inside the same self-attention stack.
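A rough token count shows the scale of the saving. This back-of-the-envelope sketch counts how many tokens pass through the transformer stack per sampled chunk; it is not a FLOP estimate, since the suffix still pays for attention over the cached prefix keys and the two experts have different widths.

```python
PREFIX_LEN, SUFFIX_LEN, NUM_STEPS = 816, 51, 10

# Without caching: every Euler step pushes the full sequence through the stack.
naive_tokens = NUM_STEPS * (PREFIX_LEN + SUFFIX_LEN)

# With caching: one prefix pass, then only the suffix on each step.
cached_tokens = PREFIX_LEN + NUM_STEPS * SUFFIX_LEN

print(naive_tokens, cached_tokens, round(naive_tokens / cached_tokens, 1))
# 8670 1326 6.5
```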
12. End-to-End Shape Trace
Here is the full shape trace for batch size B under the default config.
| Stage | Tensor | Shape |
|---|---|---|
| Input image, each camera | image[name] | [B, 224, 224, 3] |
| SigLIP output, each camera | image_tokens | [B, 256, 2048] |
| All image tokens | concat camera tokens | [B, 768, 2048] |
| Prompt ids | tokenized_prompt | [B, 48] |
| Prompt embeddings | text_tokens | [B, 48, 2048] |
| Prefix tokens | images + text | [B, 816, 2048] |
| Robot state | state | [B, 32] |
| State token | linear projection | [B, 1, 1024] |
| Action noise | x_t | [B, 50, 32] |
| Action token projection | action_tokens | [B, 50, 1024] |
| Time embedding | time_emb | [B, 1024] |
| Time tokens | repeated time | [B, 50, 1024] |
| Action + time | concat | [B, 50, 2048] |
| Action-time MLP output | action_time_tokens | [B, 50, 1024] |
| Suffix tokens | state + action-time | [B, 51, 1024] |
| Attention mask | blockwise self-attention | [B, 867, 867] |
| Suffix transformer output | suffix_out | [B, 51, 1024] |
| Action hidden states | last 50 suffix tokens | [B, 50, 1024] |
| Velocity output | v_t | [B, 50, 32] |
| Training loss | MSE over action dim | [B, 50] |
| Inference output | denoised action chunk | [B, 50, 32] |
13. Minimal Pseudocode Version
This pseudocode is not copied from OpenPI, but it captures the architecture.
class Pi0:
    def encode_prefix(self, obs):
        image_tokens = []
        for camera in ["base_0_rgb", "left_wrist_0_rgb", "right_wrist_0_rgb"]:
            image_tokens.append(siglip(obs.image[camera]))  # [B, 256, 2048]
        text_tokens = gemma_embed(obs.tokenized_prompt)  # [B, 48, 2048]
        return concat(image_tokens + [text_tokens], axis=1)  # [B, 816, 2048]

    def encode_suffix(self, obs, noisy_actions, t):
        state_token = linear_state(obs.state)[:, None, :]  # [B, 1, 1024]
        action_tokens = linear_action(noisy_actions)  # [B, 50, 1024]
        time_emb = sincos(t, dim=1024)  # [B, 1024]
        time_tokens = repeat(time_emb, length=50)  # [B, 50, 1024]
        action_time = concat([action_tokens, time_tokens], axis=-1)  # [B, 50, 2048]
        action_time = mlp(action_time)  # [B, 50, 1024]
        return concat([state_token, action_time], axis=1)  # [B, 51, 1024]

    def velocity(self, obs, noisy_actions, t):
        prefix = self.encode_prefix(obs)
        suffix = self.encode_suffix(obs, noisy_actions, t)
        mask = blockwise_attention_mask(prefix, suffix)
        prefix_out, suffix_out = multi_expert_gemma([prefix, suffix], mask)
        return linear_out(suffix_out[:, -50:])  # [B, 50, 32]
Training:
def pi0_loss(obs, clean_actions):
    B = clean_actions.shape[0]
    eps = normal_like(clean_actions)  # Gaussian noise, [B, 50, 32]
    t = sample_beta(shape=[B])  # noise level per example, [B]
    t_b = t[:, None, None]  # broadcast over steps and action dims
    x_t = t_b * eps + (1 - t_b) * clean_actions
    target_velocity = eps - clean_actions
    pred_velocity = model.velocity(obs, x_t, t)
    return mean_square_error(pred_velocity, target_velocity)
Inference:
def sample_actions(obs, steps=10):
    B = obs.state.shape[0]
    x = normal([B, 50, 32])  # x_1: pure noise
    cache = model.encode_prefix_once(obs)  # prefix KV cache, computed once
    t = 1.0
    dt = -1.0 / steps
    for _ in range(steps):
        v = model.velocity_with_prefix_cache(obs, x, t, cache)
        x = x + dt * v
        t = t + dt
    return x
14. π0 versus π0.5 in the Same Code
The OpenPI Pi0 class also supports π0.5 behavior through the pi05 config flag. The code comments list two implementation differences:
- π0.5 puts state into the discrete language-token side rather than using the continuous state suffix token.
- π0.5 injects the flow timestep through adaptive RMSNorm instead of concatenating time embeddings to action tokens.
In π0:
state -> continuous suffix token
time -> concat with action token -> MLP
In π0.5:
state -> discrete input path
time -> adaRMSNorm conditioning
That is why the same class has branches like:
if pi05:
    use_time_for_adaptive_rmsnorm()
else:
    concatenate_time_with_action_tokens()
For understanding original π0, the important path is the non-π0.5 branch.
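For contrast, the adaptive-normalization style of conditioning can be sketched generically. The code below shows one common formulation, where a conditioning vector predicts a per-channel scale and shift applied after RMS normalization. Whether OpenPI's adaRMSNorm matches this formulation in detail is not established by this post, so treat the function names, shapes, and the (1 + scale) parameterization as assumptions.

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    """Plain RMS normalization over the last axis (no learned scale)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


def adaptive_rms_norm(x, cond, w_scale, w_shift):
    """x: [B, T, D] tokens; cond: [B, C] conditioning (e.g. a time embedding).

    Scale and shift are linear functions of cond, shared across the T tokens.
    """
    scale = cond @ w_scale            # [B, D]
    shift = cond @ w_shift            # [B, D]
    return rms_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 50, 8))       # toy action tokens
cond = rng.normal(size=(2, 4))        # toy time embedding

zero = np.zeros((4, 8))
out_zero = adaptive_rms_norm(x, cond, zero, zero)   # reduces to plain RMSNorm
out = adaptive_rms_norm(x, cond, rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

In this style the timestep modulates every normalized layer input instead of being mixed into the action tokens once at the input, which is the design difference the two branches express.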
15. Why This Architecture Works
π0 addresses four hard robotics problems at once.
It preserves semantic knowledge
The PaliGemma expert carries visual-language knowledge from web-scale pretraining. Images and instructions are encoded in a semantic space before action generation happens.
It keeps actions continuous
Robot actions are not words. A robot arm needs smooth continuous values for joints, end-effector poses, grippers, or base commands. Flow matching lets π0 generate continuous action chunks directly.
It supports high-frequency control
The model predicts a 50-step chunk instead of one action at a time. This makes behavior smoother and reduces compounding error. During inference, the prefix KV cache avoids recomputing image and language context at every flow step.
It can adapt across robots
The fixed 32-dimensional action/state interface acts as a common envelope. Different robots can map their native control spaces into this vector, with unused dimensions padded or masked by transforms.
16. Key Takeaway
The most compact description of π0 is:
π0 is a two-expert transformer where PaliGemma encodes image-language context, a smaller action expert encodes state and noisy action chunks, blockwise self-attention lets action tokens attend to the observation, and flow matching trains the model to turn Gaussian action noise into a 50-step continuous robot trajectory.
So when reading the OpenPI code, follow this path:
- Pi0Config defines the default shapes: 3 images, 48 prompt tokens, 32 action dimensions, 50 action steps.
- Observation defines the input schema.
- embed_prefix turns images and language into PaliGemma tokens.
- embed_suffix turns state, noisy actions, and timestep into action-expert tokens.
- make_attn_mask builds blockwise attention.
- Gemma.Module runs both experts through the shared transformer depth.
- action_out_proj predicts the flow velocity.
- compute_loss trains with flow-matching MSE.
- sample_actions integrates from noise to actions with Euler steps and prefix KV caching.
That is the anatomy of π0 in OpenPI.
