# GRPO (Group Relative Policy Optimization) — a clear explanation
GRPO is a **PPO-style policy optimization method** that **does not train a value critic**.
Instead of estimating advantages with a learned value function, it computes a **relative advantage inside a group of rollouts** generated from the **same prompt / same initial condition**.
---
## 1) What problem GRPO solves
In standard PPO you often need a **critic** \(V(s)\) to compute advantages:
\[
A_t = R_t - V(s_t)
\]
But in outcome-only robotics RL (success/fail), training a stable critic can be difficult and expensive.
GRPO avoids that by:
- sampling **multiple trajectories** for the same input
- scoring them with a reward (e.g., success = 1, fail = 0)
- using **within-group comparison** to decide which trajectories are better
---
## 2) GRPO setup: “group” of trajectories
Fix one input \(x\) (typically: instruction + initial observation / initial env state).
Generate a group of \(N\) trajectories:
\[
\tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(N)}
\]
Each trajectory has a final reward \(r^{(i)}\) (often binary: 0 or 1).
---
## 3) Convert trajectory rewards into **relative advantages**
Compute the group mean and standard deviation:
\[
\mu = \frac{1}{N}\sum_{i=1}^N r^{(i)},
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (r^{(i)}-\mu)^2 + \epsilon}
\]
Then define a **group-relative advantage** for each trajectory:
\[
\hat{A}^{(i)} = \frac{r^{(i)} - \mu}{\sigma}
\]
### Intuition
- If a trajectory is better than the group average → \(\hat{A}^{(i)} > 0\) (increase its probability)
- If worse than average → \(\hat{A}^{(i)} < 0\) (decrease its probability)
This is why you need **multiple rollouts of the same input**: GRPO learns by comparing them.
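The normalization above is a few lines of numpy. A minimal sketch (the function name and `eps` value are illustrative, not from any particular codebase):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group's raw rewards to zero-mean, unit-std advantages."""
    r = np.asarray(rewards, dtype=np.float64)
    mu = r.mean()
    # eps keeps sigma finite when every rollout gets the same reward
    sigma = np.sqrt(((r - mu) ** 2).mean() + eps)
    return (r - mu) / sigma

# Example: 8 rollouts from the same prompt, binary success rewards
adv = group_relative_advantages([1, 1, 1, 0, 0, 0, 0, 0])
```

Note the degenerate case: if all rollouts succeed (or all fail), every advantage is ≈ 0 and the group contributes no gradient, which is why sampling enough rollouts per input matters.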
---
## 4) Turn the advantage into a PPO-style update (but critic-free)
For each trajectory \(\tau^{(i)}\), you have a sequence of action choices across time:
\[
(a_1^{(i)}, a_2^{(i)}, \ldots, a_T^{(i)})
\]
and the corresponding states/observations \((s_t^{(i)})\).
Define the PPO probability ratio at each step:
\[
\rho_t^{(i)}(\theta)=\frac{\pi_\theta(a_t^{(i)}\mid s_t^{(i)})}{\pi_{\theta_{\text{old}}}(a_t^{(i)}\mid s_t^{(i)})}
\]
Then GRPO uses the **clipped surrogate objective** (same spirit as PPO):
\[
L(\theta)=
\mathbb{E}_{i,t}\Big[
\min\big(
\rho_t^{(i)}(\theta)\hat{A}^{(i)},
\text{clip}(\rho_t^{(i)}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}^{(i)}
\big)
\Big]
\]
### Key detail
Notice that the advantage is \(\hat{A}^{(i)}\):
it is computed from the **group-relative reward**, not from a critic. (Implementations in the style of the original GRPO paper typically also add a KL penalty toward a reference policy, but the critic-free advantage is the core idea.)
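The objective can be sketched in a few lines of numpy, assuming the policy exposes per-step log-probabilities (the function and argument names here are illustrative):

```python
import numpy as np

def grpo_clipped_objective(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate objective, critic-free.

    logp_new, logp_old: (N, T) per-step action log-probs under the
        current and old policies.
    adv: (N,) group-relative advantage, one scalar per trajectory.
    """
    ratio = np.exp(logp_new - logp_old)        # rho_t^(i)
    adv = adv[:, None]                         # broadcast A^(i) to every step t
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()   # E_{i,t}[...]
```

In a real training loop you would maximize this (or minimize its negative) with respect to the parameters that produced `logp_new`; the clipping keeps any single update from moving the policy too far from `pi_old`.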
---
## 5) Why “propagate reward to every token/action step”?
If reward is outcome-only (success/fail at the end), GRPO treats the whole trajectory as good/bad.
So the same \(\hat{A}^{(i)}\) is applied to **every time step** in that trajectory:
- all actions in a successful trajectory are reinforced
- all actions in a failed trajectory are suppressed
This is **trajectory-level credit assignment** (simple but effective when you have group comparison).
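Concretely, "same advantage at every step" is just a broadcast of one scalar per trajectory across its time dimension (the numbers below are illustrative):

```python
import numpy as np

# One scalar group-relative advantage per trajectory...
adv_traj = np.array([+1.29, -0.77])   # e.g. one success-ish, one fail-ish rollout

# ...repeated across every time step of that trajectory:
T = 4
adv_per_step = np.repeat(adv_traj[:, None], T, axis=1)   # shape (2, 4)
```

Every action in trajectory 0 gets pushed up equally, and every action in trajectory 1 gets pushed down equally, regardless of which steps actually caused the outcome.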
---
## 6) A concrete toy example (binary rewards)
Suppose you sample \(N=8\) rollouts from the same initial state:
Rewards:
\[
[1, 1, 1, 0, 0, 0, 0, 0]
\]
Mean:
\[
\mu = 3/8 = 0.375
\]
A success rollout has advantage proportional to (dividing by \(\sigma\) only rescales):
\[
\hat{A}_{\text{succ}} \propto 1 - 0.375 = +0.625
\]
A fail rollout:
\[
\hat{A}_{\text{fail}} \propto 0 - 0.375 = -0.375
\]
So GRPO will:
- increase the probability of actions taken in the successful rollouts
- decrease the probability of actions taken in the failed rollouts
No critic needed.
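Running the numbers (dropping \(\epsilon\) for simplicity):

```python
import numpy as np

rewards = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=np.float64)
mu = rewards.mean()                        # 0.375
sigma = np.sqrt(((rewards - mu) ** 2).mean())

adv_succ = (1.0 - mu) / sigma              # ~ +1.29
adv_fail = (0.0 - mu) / sigma              # ~ -0.77
```

Because successes are the minority here (3 of 8), each success gets a larger-magnitude advantage than each failure; the signed, normalized values are what enter the clipped objective.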
---
## 7) GRPO vs PPO (one-line comparison)
- **PPO**: needs a value function (critic) to estimate advantage
- **GRPO**: estimates advantage by **comparing multiple rollouts within the same group**
That’s why GRPO is attractive for outcome-only robotics RL.
---