content
## What “all tokens” includes
If one rollout produces action tokens like:
- time 1: tokens \((u_{1,1}, u_{1,2}, \ldots)\)
- time 2: tokens \((u_{2,1}, u_{2,2}, \ldots)\)
- …
- time T: tokens \((u_{T,1}, u_{T,2}, \ldots)\)
Then:
- if the rollout succeeds → every \(u_{t,k}\) gets reward 1
- if it fails → every \(u_{t,k}\) gets reward 0