π0 Architecture Anatomy

π0 is a Vision-Language-Action (VLA) model for robot control. The key idea is to combine a pretrained vision-language backbone with a continuous action generator. Instead of asking a language model to emit robot actions as text tokens, π0 predicts a velocity field that transforms random action noise into a smooth chunk of future robot actions.

This post walks through the architecture as implemented in the Physical Intelligence OpenPI repository.

The OpenPI README describes π0 as a flow-based vision-language-action model with base checkpoints pretrained on 10k+ hours of robot data. The code implements three related families: π0, π0-FAST, and π0.5. This post focuses on the original π0 flow-matching model.

1. The Mental Model

π0 has two main parts:

Part	Role	Typical width
PaliGemma VLM expert	Encodes images and language	2048
Action expert	Encodes robot state, noisy actions, and flow timestep	1024

The model is not a simple pipeline where the VLM produces a sentence and another module converts that sentence into motion. It is closer to a multi-expert transformer:

Image tokens and language tokens go through the PaliGemma expert.
Robot state and noisy action tokens go through the action expert.
Both experts participate in the same transformer attention layers.
The action-token outputs are projected into continuous action velocities.

In code, the idea is roughly:

paligemma_config = get_config("gemma_2b")
action_config = get_config("gemma_300m")

llm = GemmaModule(
    configs=[paligemma_config, action_config],
    embed_dtype="bfloat16",
)

The first expert loads the PaliGemma/Gemma-style weights. The second expert is an action expert with the same transformer depth but smaller hidden width.

2. Default Input and Output Shapes

The default π0 config in OpenPI is:

Config field	Value	Meaning
`action_dim`	32	Maximum robot action/state dimension
`action_horizon`	50	Number of future action steps predicted per chunk
`max_token_len`	48	Maximum prompt token length for π0
`IMAGE_RESOLUTION`	224 x 224	Model image resolution
`IMAGE_KEYS`	3 camera names	base, left wrist, right wrist

The expected observation dictionary looks like this:

example = {
    "image": {
        "base_0_rgb": float32[B, 224, 224, 3],
        "left_wrist_0_rgb": float32[B, 224, 224, 3],
        "right_wrist_0_rgb": float32[B, 224, 224, 3],
    },
    "image_mask": {
        "base_0_rgb": bool[B],
        "left_wrist_0_rgb": bool[B],
        "right_wrist_0_rgb": bool[B],
    },
    "state": float32[B, 32],
    "tokenized_prompt": int32[B, 48],
    "tokenized_prompt_mask": bool[B, 48],
    "actions": float32[B, 50, 32],
}

For inference, the model receives the observation and returns:

predicted_actions = float32[B, 50, 32]

For a particular robot, only the first part of the 32-dimensional vector may be meaningful. A 7-DoF arm, for example, uses a subset of the action vector, and the data transforms pad or ignore the unused dimensions.
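As a rough illustration of that padding, the sketch below zero-pads a 7-dimensional action chunk up to the 32-dimensional model interface and slices it back afterwards. The names `pad_to_dim`, `MODEL_DIM`, and `ROBOT_DIM` are hypothetical stand-ins for whatever the data transforms actually do, and NumPy is assumed.

```python
import numpy as np

MODEL_DIM = 32   # action_dim from the default config
ROBOT_DIM = 7    # hypothetical 7-DoF arm

def pad_to_dim(x: np.ndarray, target_dim: int) -> np.ndarray:
    """Zero-pad the last axis of x up to target_dim."""
    current = x.shape[-1]
    if current > target_dim:
        raise ValueError(f"last axis {current} exceeds target {target_dim}")
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, target_dim - current)]
    return np.pad(x, pad_width)

robot_actions = np.ones((2, 50, ROBOT_DIM), dtype=np.float32)  # [B, horizon, 7]
model_actions = pad_to_dim(robot_actions, MODEL_DIM)           # [B, horizon, 32]
recovered = model_actions[..., :ROBOT_DIM]                     # back to the robot's space
```

After sampling, only the leading `ROBOT_DIM` entries of each predicted action would be sent to the robot.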

3. Observation Preprocessing

OpenPI standardizes observations before the model sees them.

Images are expected as RGB arrays. If they arrive as uint8, they are converted from [0, 255] into [-1, 1]. During training, the code can apply image augmentation:

crop and resize for non-wrist cameras;
small rotations;
color jitter;
resize-with-padding to 224 x 224.
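The uint8 conversion described above is a simple affine rescale. A minimal sketch, assuming NumPy and a hypothetical helper name `to_model_range` (this is not the OpenPI function itself):

```python
import numpy as np

def to_model_range(image: np.ndarray) -> np.ndarray:
    """Map uint8 pixels in [0, 255] to float32 in [-1, 1]; pass floats through."""
    if image.dtype == np.uint8:
        return image.astype(np.float32) / 255.0 * 2.0 - 1.0
    return image.astype(np.float32)

frame = np.array([[[0, 128, 255]]], dtype=np.uint8)  # a single RGB pixel
scaled = to_model_range(frame)                       # same shape, float32
```

Black maps to -1, white maps to 1, and mid-gray lands near 0.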

The model expects three image keys:

IMAGE_KEYS = (
    "base_0_rgb",
    "left_wrist_0_rgb",
    "right_wrist_0_rgb",
)

This means a custom dataset must either provide these keys or use transforms that rename/fill them. Missing cameras can be represented with image masks, but the model input schema still expects the standard names.
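One way a dataset transform might satisfy that schema is sketched below. The helper `fill_cameras` is hypothetical, and filling missing views with zeros is an assumption made for illustration; the point is only that every expected key is present and the mask records which views are real.

```python
import numpy as np

IMAGE_KEYS = ("base_0_rgb", "left_wrist_0_rgb", "right_wrist_0_rgb")

def fill_cameras(available: dict, batch_size: int, resolution=(224, 224)):
    """Return (images, masks) with every expected key present."""
    images, masks = {}, {}
    for key in IMAGE_KEYS:
        if key in available:
            images[key] = available[key]
            masks[key] = np.ones(batch_size, dtype=bool)
        else:
            # Placeholder view; the False mask marks it as missing.
            images[key] = np.zeros((batch_size, *resolution, 3), dtype=np.float32)
            masks[key] = np.zeros(batch_size, dtype=bool)
    return images, masks

# A robot with only a base camera.
base = np.zeros((2, 224, 224, 3), dtype=np.float32)
images, masks = fill_cameras({"base_0_rgb": base}, batch_size=2)
```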

4. Prefix Tokens: Images and Language

OpenPI splits the transformer input into two conceptual regions:

prefix: scene and instruction tokens;
suffix: robot state and action tokens.

The prefix is built by embed_prefix.

Image Tokens

Each 224 x 224 image is passed through the SigLIP image encoder. With a 14 x 14 patch size, one image becomes roughly:

image_tokens: float[B, 256, 2048]

With three cameras:

all_image_tokens: float[B, 768, 2048]

Each image also has a mask:

image_mask: bool[B]
expanded_image_mask: bool[B, 256]

The mask tells the transformer whether a camera view is real or padded/missing.
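The token count and the mask expansion are both easy to check by hand. A small sketch, assuming NumPy and a hypothetical helper `tokens_per_image`:

```python
import numpy as np

def tokens_per_image(resolution: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    per_side = resolution // patch_size
    return per_side * per_side

n_tokens = tokens_per_image(224, 14)   # 16 patches per side

# Per-camera mask [B] broadcast to one entry per image token [B, n_tokens].
image_mask = np.array([True, False])   # second sample lacks this camera
expanded = np.repeat(image_mask[:, None], n_tokens, axis=1)
```

Every token from a missing camera inherits that camera's False mask, so attention can skip the whole view.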

Language Tokens

The prompt is already tokenized before entering the model:

tokenized_prompt: int32[B, 48]
tokenized_prompt_mask: bool[B, 48]

The Gemma embedder maps token ids to vectors:

text_tokens: float[B, 48, 2048]

So the full π0 prefix is usually:

prefix_tokens = concat([3 camera token groups, text_tokens], axis=1)
prefix_tokens: float[B, 816, 2048]  # 768 image + 48 text
prefix_mask: bool[B, 816]

The prefix is handled by the PaliGemma expert, whose width is 2048.

5. Suffix Tokens: State, Noisy Actions, and Time

The suffix is built by embed_suffix.

For original π0, the suffix contains:

one state token;
50 noisy action tokens.

State Token

The low-dimensional robot state has shape:

state: float[B, 32]

The state projection maps it into the action expert width:

state_token = Linear(32 -> 1024)(state)
state_token: float[B, 1, 1024]

This token gives the action expert the current proprioceptive state: joint positions, gripper state, base state, or whatever the dataset transform packs into the 32-dimensional state vector.

Noisy Action Tokens

During training, the model receives a corrupted version of the ground-truth action chunk:

noisy_actions: float[B, 50, 32]

Each action vector is projected into the action expert width:

action_tokens = Linear(32 -> 1024)(noisy_actions)
action_tokens: float[B, 50, 1024]

Flow Timestep Embedding

π0 is a flow-matching model, so it also needs to know the noise level t.

OpenPI uses a sinusoidal embedding for the scalar timestep:

time: float[B]
time_emb: float[B, 1024]

Then the same time embedding is repeated across the 50 action positions:

time_tokens: float[B, 50, 1024]

The action token and time token are concatenated:

action_plus_time: float[B, 50, 2048]

Then a small MLP compresses it back to the action expert width:

action_time_tokens = MLP(2048 -> 1024 -> 1024)(action_plus_time)
action_time_tokens: float[B, 50, 1024]

So the full suffix for π0 is:

suffix_tokens = concat([state_token, action_time_tokens], axis=1)
suffix_tokens: float[B, 51, 1024]
suffix_mask: bool[B, 51]

The suffix is handled by the action expert, whose width is 1024.
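The suffix construction above can be sketched end to end in NumPy. This is an illustration with random weights, not the OpenPI code: the embedding periods (`min_period`, `max_period`), the swish nonlinearity, the omission of biases, and all function names are assumptions.

```python
import numpy as np

def sincos_embedding(t, dim, min_period=4e-3, max_period=4.0):
    """Sine/cosine features of a scalar timestep; t has shape [B]."""
    fraction = np.linspace(0.0, 1.0, dim // 2)
    period = min_period * (max_period / min_period) ** fraction
    angles = t[:, None] * (2.0 * np.pi / period)[None, :]            # [B, dim/2]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def swish(x):
    return x / (1.0 + np.exp(-x))

def embed_suffix_sketch(state, noisy_actions, t, w):
    width = w["state"].shape[1]
    horizon = noisy_actions.shape[1]
    state_token = (state @ w["state"])[:, None, :]                        # [B, 1, W]
    action_tokens = noisy_actions @ w["action_in"]                        # [B, H, W]
    time_emb = sincos_embedding(t, width)                                 # [B, W]
    time_tokens = np.repeat(time_emb[:, None, :], horizon, axis=1)        # [B, H, W]
    action_time = np.concatenate([action_tokens, time_tokens], axis=-1)   # [B, H, 2W]
    hidden = swish(action_time @ w["mlp_in"])                             # [B, H, W]
    action_time_tokens = hidden @ w["mlp_out"]                            # [B, H, W]
    return np.concatenate([state_token, action_time_tokens], axis=1)      # [B, 1+H, W]

rng = np.random.default_rng(0)
B, H, D, W = 2, 50, 32, 1024
weights = {
    "state": rng.normal(size=(D, W)) * 0.02,
    "action_in": rng.normal(size=(D, W)) * 0.02,
    "mlp_in": rng.normal(size=(2 * W, W)) * 0.02,
    "mlp_out": rng.normal(size=(W, W)) * 0.02,
}
state = rng.normal(size=(B, D))
noisy_actions = rng.normal(size=(B, H, D))
t = rng.uniform(size=B)
suffix = embed_suffix_sketch(state, noisy_actions, t, weights)
```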

6. The Attention Mask: How Prefix and Suffix Communicate

The most important implementation detail is the attention mask.

π0 does not use a separate encoder-decoder cross-attention layer. Instead, it uses a shared transformer stack with a blockwise causal self-attention mask.

Conceptually:

Query token	Can attend to image/language prefix?	Can attend to state?	Can attend to action tokens?
Image/language token	Yes, bidirectionally inside prefix	No	No
State token	Yes	Yes	No
Action token	Yes	Yes	Yes, within the action block

This gives the action tokens access to the visual scene, language instruction, and current robot state, while preventing robot-action tokens from changing the prefix representation.

The code builds this behavior with two masks:

input_mask: bool[B, sequence_length]
ar_mask: bool[sequence_length]

Then it turns them into:

attn_mask: bool[B, sequence_length, sequence_length]

For a default π0 forward pass:

prefix_length = 816
suffix_length = 51
total_length = 867

attn_mask: bool[B, 867, 867]

The prefix uses False entries in ar_mask, meaning prefix tokens share the same attention block and can attend bidirectionally. The state token has a True entry, which starts a new block. The first action token has another True entry and starts the action block; the remaining action tokens have False entries and share that block.

That is why this is better described as masked self-attention across two experts, not ordinary encoder-decoder cross-attention.
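A cumulative sum over ar_mask is enough to produce this blockwise pattern. The NumPy sketch below is an illustration of that idea on a toy sequence (4 prefix tokens, 1 state token, 3 action tokens); treat it as an approximation of the behavior described above rather than the exact OpenPI function.

```python
import numpy as np

def make_attn_mask(input_mask: np.ndarray, ar_mask: np.ndarray) -> np.ndarray:
    """input_mask: bool[B, N] (real vs padded); ar_mask: bool[N] (True starts a block)."""
    ar = np.broadcast_to(ar_mask, input_mask.shape)
    block_id = np.cumsum(ar, axis=1)                          # [B, N]
    # Query i may attend to key j when j's block is not later than i's block.
    allowed = block_id[:, None, :] <= block_id[:, :, None]    # [B, N, N]
    valid = input_mask[:, None, :] & input_mask[:, :, None]   # drop padded tokens
    return allowed & valid

# Toy layout: 4 prefix tokens, 1 state token, 3 action tokens.
ar_mask = np.array([False] * 4 + [True] + [True, False, False])
input_mask = np.ones((1, 8), dtype=bool)
attn = make_attn_mask(input_mask, ar_mask)[0]   # attn[query, key]
```

Rows 0 to 3 (prefix) see only columns 0 to 3, row 4 (state) sees columns 0 to 4, and rows 5 to 7 (actions) see every column, which matches the table above.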

7. The Multi-Expert Gemma Transformer

OpenPI’s Gemma module supports multiple experts. In π0, there are two:

Expert index	Token group	Width	Depth	MLP dim	Heads	KV heads	Head dim
0	PaliGemma prefix	2048	18	16384	8	1	256
1	Action suffix	1024	18	4096	8	1	256

The two experts have different widths and MLPs, but they share attention compatibility:

same number of attention heads;
same number of KV heads;
same head dimension;
same transformer depth.

Inside attention:

prefix: float[B, prefix_len, 2048] -> q/k/v heads
suffix: float[B, suffix_len, 1024] -> q/k/v heads
concat q/k/v along sequence dimension
apply masked attention
split outputs back into prefix and suffix streams

The attention mechanism is therefore the place where visual-language context and action tokens interact.
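To make the "separate projections, shared attention" idea concrete, here is a toy NumPy sketch with two streams of different widths. It uses small made-up sizes, random weights, a single shared KV head, and no rotary position embedding or normalization; every name in it is hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_expert(rng, width, num_heads, head_dim):
    """Random projections for one expert, with a single KV head."""
    s = 0.02
    return {
        "wq": rng.normal(size=(width, num_heads * head_dim)) * s,
        "wk": rng.normal(size=(width, head_dim)) * s,
        "wv": rng.normal(size=(width, head_dim)) * s,
        "wo": rng.normal(size=(num_heads * head_dim, width)) * s,
    }

def two_expert_attention(streams, experts, mask, num_heads, head_dim):
    """streams: list of [B, T_i, width_i]; mask: bool[B, T, T] with T = sum(T_i)."""
    # Each expert projects its own tokens, then the results are joined on the sequence axis.
    q = np.concatenate([x @ p["wq"] for x, p in zip(streams, experts)], axis=1)
    k = np.concatenate([x @ p["wk"] for x, p in zip(streams, experts)], axis=1)
    v = np.concatenate([x @ p["wv"] for x, p in zip(streams, experts)], axis=1)
    B, T, _ = q.shape
    q = q.reshape(B, T, num_heads, head_dim)
    logits = np.einsum("bqhd,bkd->bhqk", q, k) / np.sqrt(head_dim)
    logits = np.where(mask[:, None, :, :], logits, -1e9)
    weights = softmax(logits)
    out = np.einsum("bhqk,bkd->bqhd", weights, v).reshape(B, T, num_heads * head_dim)
    # Split back per expert and project to each expert's own width.
    outputs, start = [], 0
    for x, p in zip(streams, experts):
        end = start + x.shape[1]
        outputs.append(out[:, start:end] @ p["wo"])
        start = end
    return outputs

rng = np.random.default_rng(0)
heads, head_dim = 8, 16   # toy sizes, not the real config
experts = [init_expert(rng, 64, heads, head_dim), init_expert(rng, 32, heads, head_dim)]
prefix = rng.normal(size=(1, 6, 64))
suffix = rng.normal(size=(1, 3, 32))
mask = np.ones((1, 9, 9), dtype=bool)
mask[:, :6, 6:] = False   # prefix queries cannot see suffix keys
prefix_out, suffix_out = two_expert_attention([prefix, suffix], experts, mask, heads, head_dim)
```

The two streams can only be concatenated because both experts project to the same head dimension and head count, which is the compatibility requirement listed above.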

What Happens in One Transformer Block

Each block follows the usual transformer pattern:

RMSNorm.
Multi-head attention.
Residual connection.
RMSNorm.
Feed-forward network.
Residual connection.

For π0, the prefix and suffix use different parameters for projections and MLPs, but attention is computed over the combined sequence. In the source code, later expert parameters receive suffixes like _1; this lets the first expert load PaliGemma weights while the action expert is initialized separately.

8. Output Head: Predicting a Velocity Field

After the transformer, π0 only uses the final hidden states for the action tokens:

suffix_out: float[B, 51, 1024]
action_hidden = suffix_out[:, -50:]
action_hidden: float[B, 50, 1024]

Then a final linear layer predicts the flow velocity:

v_t = Linear(1024 -> 32)(action_hidden)
v_t: float[B, 50, 32]

This is the output of the network during training: not actions directly, but a velocity that tells the sampler how to move the current noisy action chunk.

9. Flow Matching Loss

Flow matching trains the model to denoise action trajectories by predicting a velocity field.

Let:

a be the clean expert action chunk;
epsilon be Gaussian noise;
t be the sampled noise level;
x_t be the interpolated noisy action;
u_t be the target velocity.

OpenPI uses:

epsilon = Normal(0, 1)
t ~ Beta(1.5, 1)

x_t = t * epsilon + (1 - t) * a
u_t = epsilon - a

Shapes:

a:       float[B, 50, 32]
epsilon: float[B, 50, 32]
t:       float[B]
x_t:     float[B, 50, 32]
u_t:     float[B, 50, 32]

The model receives (observation, x_t, t) and predicts:

v_theta(x_t, t, observation): float[B, 50, 32]

The per-step loss is the squared error averaged over the action dimension, which leaves one value per batch element and action step:

loss_per_step = mean((v_theta - u_t) ** 2, axis=-1)
loss_per_step: float[B, 50]

So π0 is trained to answer:

Given the scene, instruction, robot state, noisy future actions, and noise level, what velocity moves this noisy action chunk toward the clean demonstrated action chunk?

This is why π0 can generate continuous actions without discretizing the action space.
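The training target is short enough to write out in full. The NumPy sketch below follows the formulas in this section; the function names are hypothetical, and a zero prediction stands in for the network output.

```python
import numpy as np

def flow_matching_batch(clean_actions, rng):
    """Return (x_t, t, u_t) for one training batch of shape [B, H, D]."""
    batch = clean_actions.shape[0]
    eps = rng.normal(size=clean_actions.shape)
    t = rng.beta(1.5, 1.0, size=batch)     # one noise level per sample
    t_b = t[:, None, None]                 # broadcast over horizon and action dim
    x_t = t_b * eps + (1.0 - t_b) * clean_actions
    u_t = eps - clean_actions
    return x_t, t, u_t

def flow_matching_loss(v_pred, u_t):
    """Squared error averaged over the action dimension only, giving [B, H]."""
    return np.mean((v_pred - u_t) ** 2, axis=-1)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 50, 32))            # stand-in for clean action chunks
x_t, t, u_t = flow_matching_batch(a, rng)
loss = flow_matching_loss(np.zeros_like(u_t), u_t)   # zero prediction as a placeholder
```

Note that `x_t - t * u_t` recovers `a` exactly: following the target velocity from time t back to 0 lands on the clean action chunk, which is what the sampler relies on at inference time.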

10. Inference: Euler Integration from Noise to Actions

At inference time, there is no clean action chunk. The model starts from Gaussian noise:

x_1 = Normal(0, 1): float[B, 50, 32]

Then it integrates backward from noise toward action space. OpenPI uses 10 Euler steps by default:

x_t = x_1
dt = -1.0 / num_steps
time = 1.0

for step in range(num_steps):
    v_t = model(observation, x_t, time)
    x_t = x_t + dt * v_t
    time = time + dt

return x_t

With num_steps = 10, dt is -0.1, so the model evaluates the velocity ten times and each evaluation refines the full 50-step action chunk.

The convention in OpenPI is:

t = 1: pure noise;
t = 0: clean action.

The source code notes that this is the opposite time convention from the π0 paper, where t = 0 is noise and t = 1 is the clean action. The math is equivalent as long as the implementation is consistent.
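The integration loop can be tested without a trained network by plugging in a velocity field whose answer is known. In the toy sketch below (NumPy, hypothetical names), every demonstration is the same chunk `target`. Substituting x_t = t * epsilon + (1 - t) * target into u_t = epsilon - target gives the exact velocity (x - target) / t, so Euler integration from t = 1 should land on `target`.

```python
import numpy as np

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate x from t = 1 (noise) to t = 0 with fixed-size Euler steps."""
    x = noise
    t = 1.0
    dt = -1.0 / num_steps
    for _ in range(num_steps):
        x = x + dt * velocity_fn(x, t)
        t = t + dt
    return x

# Toy stand-in for the network: the data distribution is a single chunk.
target = np.full((1, 50, 32), 0.25)

def toy_velocity(x, t):
    return (x - target) / t

rng = np.random.default_rng(0)
actions = euler_sample(toy_velocity, rng.normal(size=target.shape))
```

A real model replaces `toy_velocity` with the network conditioned on the observation; the loop itself stays the same.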

11. KV Cache: Why Inference Is Efficient

A naive implementation would recompute image and language features at every denoising step. OpenPI's sampler avoids that.

During inference:

Run the prefix once:

prefix_tokens = image + language tokens
kv_cache = transformer(prefix_tokens)

For each Euler step, only recompute the suffix:

suffix_tokens = state + noisy_action_tokens + time
v_t = transformer(suffix_tokens, kv_cache=kv_cache)

The prefix does not change across the 10 refinement steps: the camera images and the prompt are fixed while the model denoises one action chunk. Reusing the prefix KV cache saves substantial compute.

In shape terms:

cached_prefix_kv: stores prefix_len = 816 tokens
each suffix pass: processes suffix_len = 51 tokens

The suffix still attends to the prefix through the cached keys and values. This is the practical equivalent of cross-attending to the observation, but implemented inside the same self-attention stack.
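A back-of-the-envelope count shows the size of the saving. The helper below is hypothetical and counts only how many tokens pass through the transformer layers per sampled chunk, using the default lengths from this post.

```python
def token_passes(prefix_len: int, suffix_len: int, num_steps: int) -> dict:
    """Tokens pushed through the transformer for one sampled action chunk."""
    naive = num_steps * (prefix_len + suffix_len)   # recompute everything each step
    cached = prefix_len + num_steps * suffix_len    # prefix once, suffix per step
    return {"naive": naive, "cached": cached, "ratio": naive / cached}

counts = token_passes(prefix_len=816, suffix_len=51, num_steps=10)
# naive: 8670 token passes, cached: 1326 token passes
```

Token count is only a rough proxy. Each suffix pass still attends over all 867 keys, so the real speedup depends on the hardware and on how attention cost compares with the projection and MLP cost.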

12. End-to-End Shape Trace

Here is the full shape trace for the default config, with batch size B.

Stage	Tensor	Shape
Input image, each camera	`image[name]`	`[B, 224, 224, 3]`
SigLIP output, each camera	`image_tokens`	`[B, 256, 2048]`
All image tokens	concat camera tokens	`[B, 768, 2048]`
Prompt ids	`tokenized_prompt`	`[B, 48]`
Prompt embeddings	`text_tokens`	`[B, 48, 2048]`
Prefix tokens	images + text	`[B, 816, 2048]`
Robot state	`state`	`[B, 32]`
State token	linear projection	`[B, 1, 1024]`
Action noise	`x_t`	`[B, 50, 32]`
Action token projection	`action_tokens`	`[B, 50, 1024]`
Time embedding	`time_emb`	`[B, 1024]`
Time tokens	repeated time	`[B, 50, 1024]`
Action + time	concat	`[B, 50, 2048]`
Action-time MLP output	`action_time_tokens`	`[B, 50, 1024]`
Suffix tokens	state + action-time	`[B, 51, 1024]`
Attention mask	blockwise self-attention	`[B, 867, 867]`
Suffix transformer output	`suffix_out`	`[B, 51, 1024]`
Action hidden states	last 50 suffix tokens	`[B, 50, 1024]`
Velocity output	`v_t`	`[B, 50, 32]`
Training loss	MSE over action dim	`[B, 50]`
Inference output	denoised action chunk	`[B, 50, 32]`

13. Minimal Pseudocode Version

This pseudocode is not copied from OpenPI, but it captures the architecture.

class Pi0:
    def encode_prefix(self, obs):
        image_tokens = []
        for camera in ["base_0_rgb", "left_wrist_0_rgb", "right_wrist_0_rgb"]:
            image_tokens.append(siglip(obs.image[camera]))   # [B, 256, 2048]

        text_tokens = gemma_embed(obs.tokenized_prompt)      # [B, 48, 2048]
        return concat(image_tokens + [text_tokens], axis=1)  # [B, 816, 2048]

    def encode_suffix(self, obs, noisy_actions, t):
        state_token = linear_state(obs.state)[:, None, :]    # [B, 1, 1024]
        action_tokens = linear_action(noisy_actions)         # [B, 50, 1024]
        time_emb = sincos(t, dim=1024)                       # [B, 1024]
        time_tokens = repeat(time_emb, length=50)            # [B, 50, 1024]

        action_time = concat([action_tokens, time_tokens], axis=-1)
        action_time = mlp(action_time)                       # [B, 50, 1024]
        return concat([state_token, action_time], axis=1)    # [B, 51, 1024]

    def velocity(self, obs, noisy_actions, t):
        prefix = self.encode_prefix(obs)
        suffix = self.encode_suffix(obs, noisy_actions, t)
        mask = blockwise_attention_mask(prefix, suffix)
        prefix_out, suffix_out = multi_expert_gemma([prefix, suffix], mask)
        return linear_out(suffix_out[:, -50:])               # [B, 50, 32]

Training:

def pi0_loss(obs, clean_actions):
    B = clean_actions.shape[0]
    eps = normal_like(clean_actions)
    t = sample_beta(1.5, 1.0, shape=[B])   # t ~ Beta(1.5, 1), as in section 9
    x_t = t[:, None, None] * eps + (1 - t[:, None, None]) * clean_actions
    target_velocity = eps - clean_actions

    pred_velocity = model.velocity(obs, x_t, t)
    return mean_square_error(pred_velocity, target_velocity)

Inference:

def sample_actions(obs, steps=10):
    x = normal([B, 50, 32])
    cache = model.encode_prefix_once(obs)

    t = 1.0
    dt = -1.0 / steps
    for _ in range(steps):
        v = model.velocity_with_prefix_cache(obs, x, t, cache)
        x = x + dt * v
        t = t + dt

    return x

14. π0 versus π0.5 in the Same Code

The OpenPI Pi0 class also supports π0.5 behavior through the pi05 config flag. The code comments list two implementation differences:

π0.5 puts state into the discrete language-token side rather than using the continuous state suffix token.
π0.5 injects the flow timestep through adaptive RMSNorm instead of concatenating time embeddings to action tokens.

In π0:

state -> continuous suffix token
time -> concat with action token -> MLP

In π0.5:

state -> discrete input path
time -> adaRMSNorm conditioning

That is why the same class has branches like:

if pi05:
    use_time_for_adaptive_rmsnorm()
else:
    concatenate_time_with_action_tokens()

For understanding original π0, the important path is the non-π0.5 branch.
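For readers who have not met adaptive RMSNorm, the sketch below shows one generic form of the idea: a conditioning vector (here a time embedding) is projected to a per-channel scale and shift that modulate the normalized activations. This is an assumption-laden illustration in NumPy with hypothetical names; OpenPI's exact modulation may differ, for example by also producing a gate.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def adaptive_rms_norm(x, cond, w_mod):
    """x: [B, T, W]; cond: [B, C], e.g. a time embedding; w_mod: [C, 2 * W]."""
    scale, shift = np.split(cond @ w_mod, 2, axis=-1)   # each [B, W]
    return rms_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 50, 64))
time_emb = rng.normal(size=(2, 64))
conditioned = adaptive_rms_norm(x, time_emb, rng.normal(size=(64, 128)) * 0.02)
unconditioned = adaptive_rms_norm(x, time_emb, np.zeros((64, 128)))
```

With zero modulation weights the layer reduces to plain RMSNorm, so the timestep only changes the output through the learned projection.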

15. Why This Architecture Works

π0 addresses four hard robotics problems at once.

It preserves semantic knowledge

The PaliGemma expert carries visual-language knowledge from web-scale pretraining. Images and instructions are encoded in a semantic space before action generation happens.

It keeps actions continuous

Robot actions are not words. A robot arm needs smooth continuous values for joints, end-effector poses, grippers, or base commands. Flow matching lets π0 generate continuous action chunks directly.

It supports high-frequency control

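The model predicts a 50-step chunk instead of one action at a time, so a single sampling call yields many control steps and the controller can run at a higher rate than the model. Chunking also tends to make behavior smoother and reduce compounding error. During inference, the prefix KV cache avoids recomputing image and language context at every flow step.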

It can adapt across robots

The fixed 32-dimensional action/state interface acts as a common envelope. Different robots can map their native control spaces into this vector, with unused dimensions padded or masked by transforms.

16. Key Takeaway

The most compact description of π0 is:

π0 is a two-expert transformer where PaliGemma encodes image-language context, a smaller action expert encodes state and noisy action chunks, blockwise self-attention lets action tokens attend to the observation, and flow matching trains the model to turn Gaussian action noise into a 50-step continuous robot trajectory.

So when reading the OpenPI code, follow this path:

Pi0Config defines the default shapes: 3 images, 48 prompt tokens, 32 action dimensions, 50 action steps.
Observation defines the input schema.
embed_prefix turns images and language into PaliGemma tokens.
embed_suffix turns state, noisy actions, and timestep into action-expert tokens.
make_attn_mask builds blockwise attention.
Gemma.Module runs both experts through the shared transformer depth.
action_out_proj predicts the flow velocity.
compute_loss trains with flow-matching MSE.
sample_actions integrates from noise to actions with Euler steps and prefix KV caching.

That is the anatomy of π0 in OpenPI.