GRPO

By ihsumlee, 2 March 2026

# GRPO (Group Relative Policy Optimization) — a clear explanation

GRPO is a **PPO-style policy optimization method** that **does not train a value critic**.  
Instead of estimating advantages with a learned value function, it computes a **relative advantage inside a group of rollouts** generated from the **same prompt / same initial condition**.

---

## 1) What problem GRPO solves

In standard PPO you often need a **critic** \(V(s)\) to compute advantages:
\[
A_t = R_t - V(s_t)
\]
But in outcome-only robotics RL (success/fail), training a stable critic can be difficult and expensive.

GRPO avoids that by:
- sampling **multiple trajectories** for the same input
- scoring them with a reward (e.g., success = 1, fail = 0)
- using **within-group comparison** to decide which trajectories are better

---

## 2) GRPO setup: “group” of trajectories

Fix one input \(x\) (typically: instruction + initial observation / initial env state).

Generate a group of \(N\) trajectories:
\[
\tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(N)}
\]
Each trajectory has a final reward \(r^{(i)}\) (often binary: 0 or 1).

---

## 3) Convert trajectory rewards into **relative advantages**

Compute the group mean and standard deviation:
\[
\mu = \frac{1}{N}\sum_{i=1}^N r^{(i)}, 
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (r^{(i)}-\mu)^2 + \epsilon}
\]

Then define a **group-relative advantage** for each trajectory:
\[
\hat{A}^{(i)} = \frac{r^{(i)} - \mu}{\sigma}
\]

### Intuition
- If a trajectory is better than the group average → \(\hat{A}^{(i)} > 0\) (increase its probability)
- If worse than average → \(\hat{A}^{(i)} < 0\) (decrease its probability)

This is why you need **multiple rollouts of the same input**: GRPO learns by comparing them.
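The mean/std normalization above can be sketched in a few lines of numpy. This is a minimal illustration of the formulas, not any particular library's implementation; the function name and `eps` placement (inside the square root, as written above) follow this post's notation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Map a group of trajectory rewards to group-relative advantages.

    rewards: 1-D sequence of final rewards, one per rollout in the group.
    eps: small constant keeping the division stable when all rewards tie.
    """
    r = np.asarray(rewards, dtype=np.float64)
    mu = r.mean()
    # Population standard deviation over the group, with eps for stability.
    sigma = np.sqrt(((r - mu) ** 2).mean() + eps)
    return (r - mu) / sigma

# Two successes, two failures: advantages come out roughly [1, 1, -1, -1].
adv = group_relative_advantages([1, 1, 0, 0])
```

Note that if every rollout in the group gets the same reward (all success or all fail), the numerator is zero for each trajectory and the group contributes no gradient signal.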

---

## 4) Turn the advantage into a PPO-style update (but critic-free)

For each trajectory \(\tau^{(i)}\), you have a sequence of action choices across time:
\[
(a_1^{(i)}, a_2^{(i)}, \ldots, a_T^{(i)})
\]
and the corresponding states/observations \((s_t^{(i)})\).

Define the PPO probability ratio at each step:
\[
\rho_t^{(i)}(\theta)=\frac{\pi_\theta(a_t^{(i)}\mid s_t^{(i)})}{\pi_{\theta_{\text{old}}}(a_t^{(i)}\mid s_t^{(i)})}
\]

Then GRPO uses the **clipped surrogate objective** (same spirit as PPO):
\[
L(\theta)=
\mathbb{E}_{i,t}\Big[
\min\big(
\rho_t^{(i)}(\theta)\hat{A}^{(i)},
\text{clip}(\rho_t^{(i)}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}^{(i)}
\big)
\Big]
\]

### Key detail
Notice the advantage is \(\hat{A}^{(i)}\):  
it is computed from **group-relative reward**, not from a critic.

---

## 5) Why “propagate reward to every token/action step”?

If reward is outcome-only (success/fail at the end), GRPO treats the whole trajectory as good/bad.
So the same \(\hat{A}^{(i)}\) is applied to **every time step** in that trajectory:
- all actions in a successful trajectory are reinforced
- all actions in a failed trajectory are suppressed

This is **trajectory-level credit assignment** (simple but effective when you have group comparison).

---

## 6) A concrete toy example (binary rewards)

Suppose you sample \(N=8\) rollouts from the same initial state:

Rewards:
\[
[1, 1, 1, 0, 0, 0, 0, 0]
\]

Mean:
\[
\mu = 3/8 = 0.375
\]

A success rollout has an (unnormalized) advantage of
\[
\hat{A}_{\text{succ}} \propto 1 - 0.375 = +0.625
\]
and a fail rollout
\[
\hat{A}_{\text{fail}} \propto 0 - 0.375 = -0.375
\]
(the \(\propto\) is a reminder that each is still divided by the group \(\sigma\)).

So GRPO will:
- increase the probability of actions taken in the successful rollouts
- decrease the probability of actions taken in the failed rollouts

No critic needed.
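The toy numbers above, including the \(\sigma\) normalization (with a negligibly small \(\epsilon\)), can be checked in a few lines:

```python
import numpy as np

rewards = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=np.float64)
mu = rewards.mean()          # 0.375
sigma = rewards.std()        # population std over the group, ~0.484
adv = (rewards - mu) / sigma

print(mu)                        # 0.375
print(round(float(adv[0]), 3))   # advantage of a success: 1.291
print(round(float(adv[-1]), 3))  # advantage of a failure: -0.775
```

After dividing by \(\sigma \approx 0.484\), each success carries advantage \(\approx +1.29\) and each failure \(\approx -0.77\): the rarer outcome (success, here) gets the larger-magnitude signal.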

---

## 7) GRPO vs PPO (one-line comparison)

- **PPO**: needs a value function (critic) to estimate advantage  
- **GRPO**: estimates advantage by **comparing multiple rollouts within the same group**

That’s why GRPO is attractive for outcome-only robotics RL.

---
