Dynamic Sampling and Clipping Bound

By ihsumlee, 2 March 2026

This section explains two tricks used in **SimpleVLA-RL** to improve exploration and stabilize training when rewards are **binary** (success=1 / fail=0).

---

## 1) Dynamic Sampling

### The problem it solves (vanishing gradients)
GRPO computes advantages using **group-relative normalization**: each trajectory’s reward is compared to the **mean and std of rewards within its group**. If all trajectories in a group have the **same reward**, every normalized advantage becomes **0**, which yields **zero gradients** and no learning signal from that group.

**Example**
- Group rewards: `[0, 0, 0, 0]` → mean=0, std≈0 → all advantages ≈ 0  
- Group rewards: `[1, 1, 1, 1]` → mean=1, std≈0 → all advantages ≈ 0  

No contrast → GRPO cannot “prefer” any trajectory → no learning signal.
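This failure mode is easy to check numerically. A minimal sketch of group-relative normalization (the function name and the `eps` guard against division by zero are my additions, not from the paper):

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """GRPO-style advantage: normalize rewards by the group's mean and std.

    `eps` guards against division by zero when all rewards are identical.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Uniform groups give (near-)zero advantages -> zero gradient contribution.
print(group_relative_advantage([0, 0, 0, 0]))  # all ~0
print(group_relative_advantage([1, 1, 1, 1]))  # all ~0

# A mixed-outcome group produces non-zero contrast.
print(group_relative_advantage([1, 0, 0, 0]))  # positive for success, negative for failures
```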

### The solution (what Dynamic Sampling does)
During rollout, they **drop groups** where:
- all trajectories succeed, or
- all trajectories fail,

and keep sampling until the batch contains only **mixed-outcome groups** (some success, some failure).  

They formalize it as requiring the number of successful trajectories in a group to satisfy:  
\[
0 < \#\text{success} < G
\]
where \(G\) is group size.  

### Why it works (intuition)
GRPO needs **within-group contrast**:
- successful rollouts get **positive** relative advantage
- failed rollouts get **negative** relative advantage  
So the policy learns *which sampled behaviors* to increase/decrease.

### Pseudocode intuition
```text
repeat:
  sample a group of G rollouts from the same initial condition
  compute rewards (0/1)
until (0 < num_success < G)   # keep only mixed-outcome groups
use these groups to compute GRPO loss and update the policy
```
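The loop above can be sketched in Python. Everything here is illustrative: `rollout_group` is a hypothetical stand-in for the environment interaction, and `max_tries` is an assumed safeguard rather than part of SimpleVLA-RL:

```python
import random

def rollout_group(policy, init_state, G):
    """Hypothetical stand-in: run G rollouts and return their binary rewards."""
    return [policy(init_state) for _ in range(G)]

def sample_mixed_group(policy, init_state, G, max_tries=100):
    """Resample until the group has both successes and failures (0 < #success < G)."""
    for _ in range(max_tries):
        rewards = rollout_group(policy, init_state, G)
        if 0 < sum(rewards) < G:
            return rewards
    return None  # give up: this initial condition is too easy or too hard

# Toy policy succeeding ~50% of the time.
random.seed(0)
toy_policy = lambda s: random.random() < 0.5
print(sample_mixed_group(toy_policy, init_state=None, G=4))
```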

## 2) Clipping Bound (PPO/GRPO)

Define the probability ratio at time step \(t\):
\[
r_t(\theta)=\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]

PPO/GRPO uses a **clipping bound** to limit how much the policy can change in one update:
\[
\text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)
\]

### Standard example (symmetric clipping)
If \(\epsilon = 0.2\), then:
\[
r_t(\theta) \in [0.8,\;1.2]
\]

### “Clip higher” example (raise only the upper bound)
A common modification (sometimes called **clip-higher**) keeps the lower bound \(1-\epsilon\) but raises the upper bound, e.g. \(\epsilon_{\text{high}} = 0.28\):
\[
r_t(\theta) \in [0.8,\;1.28]
\]
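Both variants reduce to a single clamp with separately chosen bounds. A minimal sketch (the function name and defaults are mine; `eps_high=0.28` matches the clip-higher bound above):

```python
def clipped_ratio(r, eps_low=0.2, eps_high=0.2):
    """Clamp the probability ratio r to [1 - eps_low, 1 + eps_high]."""
    return max(1.0 - eps_low, min(1.0 + eps_high, r))

print(clipped_ratio(1.5))                  # symmetric clipping: capped at 1.2
print(clipped_ratio(1.25, eps_high=0.28))  # clip-higher: 1.25 passes through unclipped
print(clipped_ratio(0.5))                  # floored at 0.8
```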


### Worked example (ratio too large)

Assume the clipping range is \([1-\epsilon, 1+\epsilon] = [0.8, 1.2]\).

### Step 1: Define the ratio
\[
r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)}
\]

Suppose for one sampled action \(a_t\):
- Old policy probability: \(\pi_{\text{old}}(a_t\mid s_t)=0.10\)
- New policy probability: \(\pi_{\theta}(a_t\mid s_t)=0.15\)

Then:
\[
r_t = \frac{0.15}{0.10} = 1.5
\]

### Step 2: Apply clipping
Clipping range is \([0.8, 1.2]\), so:
\[
\text{clip}(r_t,0.8,1.2)=\text{clip}(1.5,0.8,1.2)=1.2
\]

So even though the ratio is \(1.5\), PPO/GRPO treats it as **at most 1.2** in the clipped term.

---

### Another example (ratio too small)

Suppose:
- \(\pi_{\text{old}}(a_t\mid s_t)=0.20\)
- \(\pi_{\theta}(a_t\mid s_t)=0.10\)

Then:
\[
r_t=\frac{0.10}{0.20}=0.5
\]

Apply clipping:
\[
\text{clip}(0.5,0.8,1.2)=0.8
\]

So PPO/GRPO treats the ratio as **at least 0.8** in the clipped term.
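To see how the clipped ratio enters the loss, here is a sketch of the standard PPO/GRPO per-step surrogate term, \(\min\big(r_t A_t,\; \text{clip}(r_t, 0.8, 1.2)\, A_t\big)\), applied to the two worked examples (the advantage values are illustrative):

```python
def clip(r, lo=0.8, hi=1.2):
    """Clamp the ratio to [lo, hi]."""
    return max(lo, min(hi, r))

def ppo_term(r, adv, lo=0.8, hi=1.2):
    """Per-step PPO/GRPO surrogate: min(r * A, clip(r) * A)."""
    return min(r * adv, clip(r, lo, hi) * adv)

# Example 1: r = 1.5 with positive advantage.
# The gain is capped at 1.2 * A, so the update cannot over-reinforce the action.
print(ppo_term(1.5, adv=1.0))    # 1.2

# Example 2: r = 0.5 with negative advantage.
# The pessimistic min selects the clipped (constant) term 0.8 * A = -0.8,
# which contributes zero gradient and stops pushing the ratio further down.
print(ppo_term(0.5, adv=-1.0))   # -0.8
```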
