Polished version of your question:
> “In the rollout loop, you said: ‘Feed the current state (image + robot state + language) into the VLA model.’ At that step, do I get different rollouts? How and why?”
## What “feed current state into the VLA model” really means
At time step \(t\), you build an input:
- **Image** \(I_t\) (camera frame(s))
- **Robot state** \(q_t\) (joint positions/velocities, gripper state, end-effector pose, etc.)
- **Language instruction** \(x\) (usually fixed for the whole episode)
Then the VLA model outputs a **distribution over the next action** (or the next action *chunk*):
\[
\pi_\theta(a_t \mid I_t, q_t, x)
\]
That’s it. It’s just “compute what to do next given what you see, your current robot state, and the instruction.”
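As a minimal sketch of that per-step call: the function `vla_policy` below and its argument names are invented stand-ins, not a real TinyVLA/OpenVLA API. It only mimics the interface \(\pi_\theta(a_t \mid I_t, q_t, x)\): same inputs in, same distribution out.

```python
import math

# Hypothetical sketch of the per-step VLA call; `vla_policy` and its
# arguments are invented names, not a real TinyVLA/OpenVLA API.
def vla_policy(image, robot_state, instruction, temperature=1.0):
    """Stand-in for pi_theta(a_t | I_t, q_t, x): returns a probability
    distribution over a tiny discrete action set."""
    # Fake logits computed deterministically from the inputs, so the same
    # (image, state, instruction) always yields the same distribution.
    c = sum(image) + sum(robot_state) + len(instruction)
    logits = [math.sin(c + k) for k in range(4)]          # 4 toy actions
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = vla_policy(image=(0.1, 0.2), robot_state=(0.0,),
                   instruction="pick up the cube")
assert abs(sum(probs) - 1.0) < 1e-9                       # valid distribution
assert probs == vla_policy((0.1, 0.2), (0.0,), "pick up the cube")  # deterministic
```

Note the forward pass itself is deterministic here; any rollout diversity has to come from what you do *with* this distribution.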
---
## Do you get different rollouts *at that step*?
### Not automatically. You only get different rollouts if the *action selection* is stochastic.
The “feed state into model” part is usually deterministic **given fixed weights and fixed numeric settings**.
The divergence happens in the next micro-step:
1) **Model forward pass**: produce a distribution (or a diffusion denoising trajectory / logits)
2) **Sampling/decoding**: choose an action from that distribution (this can be random)
3) **Environment step**: execute action → new state \(s_{t+1}\)
So the answer is:
- **Different rollouts happen because step (2) or (3) introduces randomness.**
- Feeding the same state into the model can still lead to different actions if your decoding is sampling-based.
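The three micro-steps can be sketched as a toy closed loop. Everything here is invented for illustration (a 1-D “environment,” a three-action set, hand-written logits); the point is only *where* the randomness sits — step (2), not step (1).

```python
import math
import random

# Toy closed-loop rollout showing the three micro-steps. The 1-D
# "environment" and the action set are invented for illustration.
def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rollout(seed, steps=5, sample=True):
    rng = random.Random(seed)            # randomness lives only in step (2)
    state = 0.0                          # s_0: identical for every rollout
    actions = [-1.0, 0.0, 1.0]
    trajectory = [state]
    for _ in range(steps):
        # (1) forward pass: deterministic function of the current state
        logits = [-(state - a) ** 2 for a in actions]
        probs = softmax(logits)
        # (2) decoding: sample (stochastic) or argmax (deterministic)
        if sample:
            a = rng.choices(actions, weights=probs)[0]
        else:
            a = actions[max(range(len(actions)), key=probs.__getitem__)]
        # (3) environment step: s_{t+1} = f(s_t, a_t)
        state += 0.1 * a
        trajectory.append(state)
    return trajectory

# Argmax decoding ignores the seed, so every rollout is identical:
assert rollout(seed=1, sample=False) == rollout(seed=2, sample=False)
# With sample=True, different seeds will usually (not always) diverge.
```

With `sample=False` the seed is dead weight and all rollouts coincide; flipping `sample=True` is the single switch that creates rollout diversity.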
---
## Exactly how different rollouts arise (mechanism)
Assume we reset to the same initial state \(s_0\). The language \(x\) is the same.
### Case A — Token / categorical actions (OpenVLA-like)
The model outputs logits \(z_t\), converted to probabilities by a softmax with temperature \(T\):
\[
p(a_t) = \text{softmax}(z_t / T)
\]
- If you do **argmax**, action is deterministic.
- If you **sample** from \(p(a_t)\), action is stochastic.
- Different random seeds ⇒ different sampled \(a_t\).
Once \(a_0\) differs, the next state differs:
\[
s_1 = f(s_0, a_0)
\]
and from there observations diverge, so the entire trajectory diverges.
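Here is Case A in miniature, with made-up logit values. It contrasts the two decoding rules and shows the effect of temperature: lowering \(T\) sharpens the distribution toward the argmax, which shrinks rollout diversity.

```python
import math
import random

# Case A decoding on made-up logits z_t; the values are illustrative only.
def softmax(z, T=1.0):
    exps = [math.exp(v / T) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]                 # z_t from the forward pass

# argmax: deterministic, always picks index 0 here
greedy = max(range(len(logits)), key=logits.__getitem__)
assert greedy == 0

# sampling: the chosen index depends on the RNG seed
p = softmax(logits, T=1.0)
a_seed1 = random.Random(1).choices(range(3), weights=p)[0]
a_seed2 = random.Random(2).choices(range(3), weights=p)[0]
# a_seed1 and a_seed2 may or may not coincide; both are valid draws from p

# lowering T sharpens p toward the argmax, reducing rollout diversity
p_cold = softmax(logits, T=0.1)
assert p_cold[0] > p[0]
```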
### Case B — Diffusion actions (TinyVLA-like)
Diffusion policies generate actions by starting from noise and denoising:
- sample noise \(\epsilon \sim \mathcal{N}(0, I)\)
- denoise conditioned on \((I_t, q_t, x)\)
- output action (or action chunk)
Different noise seed ⇒ different \(\epsilon\) ⇒ different generated action \(a_t\) even if \((I_t,q_t,x)\) is identical.
Then the same “branching” happens:
- different \(a_0\) → different \(s_1\) → different \(I_1, q_1\) → even more divergence.
This is why **one small randomness at one step** can create **completely different rollouts**.
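Case B can be sketched the same way. The “denoiser” below is a hand-written stand-in for a learned diffusion model (a real one would run many conditioned network evaluations); only the seed-dependence structure mirrors real diffusion policies.

```python
import random

# Toy sketch of Case B. The "denoiser" is a hand-written stand-in for a
# learned diffusion model; only the seed-dependence structure is realistic.
def sample_action(observation, seed, n_denoise_steps=10):
    rng = random.Random(seed)
    a = rng.gauss(0.0, 1.0)              # epsilon ~ N(0, 1): the only randomness
    for _ in range(n_denoise_steps):
        # deterministic "denoising" update conditioned on the observation
        a = a + 0.3 * (observation - a)
    return a

obs = 0.5                                # stands in for (I_t, q_t, x)
a1 = sample_action(obs, seed=1)
a2 = sample_action(obs, seed=2)

assert a1 == sample_action(obs, seed=1)  # same noise seed -> same action
assert a1 != a2                          # different seeds -> different actions
```

The denoising loop itself is deterministic; the entire stochasticity of the policy is the initial \(\epsilon\), which is exactly why fixing (or varying) the noise seed fixes (or varies) the rollout.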
---
## Why this works even if you think your policy is “deterministic”
Many people say “my model is deterministic” meaning:
- same input gives same output tensor
But in an RL rollout, what matters is:
- do you **sample** actions or always take the single best?
- does your diffusion sampling use random noise?
- does your physics engine have small nondeterminism?
If *any* of these are stochastic, rollouts can differ.
**In practice with TinyVLA-style diffusion, the policy is not truly deterministic unless you fix seeds and force deterministic kernels.** So yes, you can get multiple rollouts from the same initial state.
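A minimal illustration of “stochastic unless you fix seeds,” using an invented `sample_chunk` as a stand-in for one diffusion draw of an action chunk: pinning the RNG state reproduces the draw exactly, while a fresh seed changes it.

```python
import random

# Pinning RNG state makes a "stochastic" policy reproducible. `sample_chunk`
# is an invented stand-in for one diffusion draw of an action chunk.
def sample_chunk(rng, chunk_len=8):
    return [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]

chunk_a = sample_chunk(random.Random(42))
chunk_b = sample_chunk(random.Random(42))   # same seed -> identical chunk
chunk_c = sample_chunk(random.Random(7))    # fresh seed -> a different chunk

assert chunk_a == chunk_b
assert chunk_a != chunk_c
```

In a PyTorch stack, the analogous switches are `torch.manual_seed(...)` for the noise and `torch.use_deterministic_algorithms(True)` for the kernels; without both, a diffusion policy remains stochastic run to run.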
---
## Short example (1-step branching)
Same start:
- \(s_0\) (same image, same pose, same instruction)
Rollout #1:
- sample noise seed=1 → action \(a_0^{(1)}\) → next state \(s_1^{(1)}\)
Rollout #2:
- sample noise seed=2 → action \(a_0^{(2)}\) → next state \(s_1^{(2)}\)
Now \(s_1^{(1)} \neq s_1^{(2)}\), so the camera image differs, and the policy reacts differently from step 1 onward → two distinct trajectories.
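The same two-rollout branch as a toy script. The linear dynamics and the noise-driven “policy” are invented; the structure (one noisy action at \(t=0\), then compounding divergence) is the point.

```python
import random

# The two-rollout branch above as a toy script; the linear dynamics and the
# noise-driven "policy" are invented for illustration.
def policy(state, rng):
    return 0.5 * state + rng.gauss(0.0, 0.2)   # noise is the only randomness

def step(state, action):
    return state + action                      # s_{t+1} = f(s_t, a_t)

s0 = 1.0                                       # identical start for both

rng1, rng2 = random.Random(1), random.Random(2)
s1_a = step(s0, policy(s0, rng1))              # rollout #1, noise seed 1
s1_b = step(s0, policy(s0, rng2))              # rollout #2, noise seed 2

# One differing action at t=0 already separates the states; from here the
# policy sees different inputs, so the divergence compounds step by step.
assert s1_a != s1_b
```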
---
## The most important takeaway
You do **not** get different rollouts because the input includes “current state.”
You get different rollouts because **the policy produces a distribution and you sample from it**, and once one action differs, the closed-loop environment causes the rest of the trajectory to diverge.
---
If you tell me which of these your TinyVLA-based controller uses during inference:
- **(i)** diffusion action chunk (multi-step),
- **(ii)** diffusion one-step,
- **(iii)** discrete token actions,
I’ll show the *exact* point in your inference code where you “turn on” rollout diversity (temperature / noise scale / seed / sampling mode) without changing the model weights.