This section explains two tricks used in **SimpleVLA-RL** to improve exploration and stabilize training when rewards are **binary** (success=1 / fail=0).
---
## 1) Dynamic Sampling
### The problem it solves (vanishing gradients)
GRPO computes advantages using **group-relative normalization**: it compares each trajectory’s reward to the **mean/std of rewards within the group**. If all trajectories in a group have the **same reward**, then every trajectory’s normalized advantage becomes **0**, so the policy gradient vanishes and the group contributes **no learning signal**.
**Example**
- Group rewards: `[0, 0, 0, 0]` → mean=0, std≈0 → all advantages ≈ 0
- Group rewards: `[1, 1, 1, 1]` → mean=1, std≈0 → all advantages ≈ 0
No contrast → GRPO cannot “prefer” any trajectory → no learning signal.
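A tiny numerical check of this failure mode. This is a sketch, not SimpleVLA-RL's code: `grpo_advantages` is a hypothetical helper implementing the standard group-relative form, and `eps` is the small constant added to the std for numerical stability.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative normalization: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Uniform groups: every advantage collapses to 0 -> no gradient signal.
print(grpo_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.]
print(grpo_advantages([1, 1, 1, 1]))  # [0. 0. 0. 0.]
```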
### The solution (what Dynamic Sampling does)
During rollout, they **drop groups** where:
- all trajectories succeed, or
- all trajectories fail,
and keep sampling until the batch contains only **mixed-outcome groups** (some success, some failure).
They formalize it as requiring the number of successful trajectories in a group to satisfy:
\[
0 < \#\text{success} < G
\]
where \(G\) is group size.
### Why it works (intuition)
GRPO needs **within-group contrast**:
- successful rollouts get **positive** relative advantage
- failed rollouts get **negative** relative advantage
So the policy learns *which sampled behaviors* to increase/decrease.
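By contrast, a mixed-outcome group produces exactly this signed signal. A minimal sketch using the same group-relative normalization as above (the `1e-8` stabilizer is an assumption):

```python
import numpy as np

# Mixed group: two successes, two failures.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv)  # successes get ~+1, failures get ~-1
```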
### Pseudocode intuition
```text
repeat:
    sample a group of G rollouts from the same initial condition
    compute rewards (0/1) for each rollout
until 0 < num_success < G   # keep only mixed-outcome groups

use these groups to compute the GRPO loss and update the policy
```
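The rejection loop above can be sketched in Python. This is a minimal illustration, not the SimpleVLA-RL implementation: `rollout_reward` is a stand-in for running the policy and scoring the episode.

```python
import random

G = 4  # group size

def rollout_reward(success_prob=0.5):
    """Stand-in for one rollout: returns a binary task reward."""
    return 1 if random.random() < success_prob else 0

def sample_mixed_group(max_tries=1000):
    """Resample until the group contains both successes and failures."""
    for _ in range(max_tries):
        rewards = [rollout_reward() for _ in range(G)]
        if 0 < sum(rewards) < G:  # the 0 < #success < G condition
            return rewards
    raise RuntimeError("no mixed-outcome group found")

group = sample_mixed_group()
print(group)  # guaranteed to contain at least one 0 and one 1
```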
## Clipping bound (PPO/GRPO)
Define the probability ratio at time step \(t\):
\[
r_t(\theta)=\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]
PPO/GRPO uses a **clipping bound** to limit how much the policy can change in one update:
\[
\text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)
\]
### Standard example (symmetric clipping)
If \(\epsilon = 0.2\), then:
\[
r_t(\theta) \in [0.8,\;1.2]
\]
### “Clip higher” example (raise only the upper bound)
A common modification is to keep the lower bound but increase only the upper bound, i.e. use separate \(\epsilon_{\text{low}} = 0.2\) and \(\epsilon_{\text{high}} = 0.28\):
\[
r_t(\theta) \in [0.8,\;1.28]
\]
---
## Example of clipping bound (PPO/GRPO)
Assume the clipping range is \([1-\epsilon, 1+\epsilon] = [0.8, 1.2]\).
### Step 1: Define the ratio
\[
r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}
\]
Suppose for one sampled action \(a_t\):
- Old policy probability: \(\pi_{\text{old}}(a_t\mid s_t)=0.10\)
- New policy probability: \(\pi_{\theta}(a_t\mid s_t)=0.15\)
Then:
\[
r_t = \frac{0.15}{0.10} = 1.5
\]
### Step 2: Apply clipping
Clipping range is \([0.8, 1.2]\), so:
\[
\text{clip}(r_t,0.8,1.2)=\text{clip}(1.5,0.8,1.2)=1.2
\]
So even though the ratio is \(1.5\), PPO/GRPO treats it as **at most 1.2** in the clipped term.
---
## Another example (ratio too small)
Suppose:
- \(\pi_{\text{old}}(a_t\mid s_t)=0.20\)
- \(\pi_{\theta}(a_t\mid s_t)=0.10\)
Then:
\[
r_t=\frac{0.10}{0.20}=0.5
\]
Apply clipping:
\[
\text{clip}(0.5,0.8,1.2)=0.8
\]
So PPO/GRPO treats the ratio as **at least 0.8** in the clipped term.
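Both worked examples can be checked numerically; the clip is just an element-wise clamp:

```python
def clip(r, lo, hi):
    """Clamp r into [lo, hi]."""
    return max(lo, min(r, hi))

r1 = 0.15 / 0.10  # ratio too large: 1.5
r2 = 0.10 / 0.20  # ratio too small: 0.5
print(clip(r1, 0.8, 1.2))  # 1.2 -- upper bound caps the update
print(clip(r2, 0.8, 1.2))  # 0.8 -- lower bound floors the update
```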