Here’s a polished version of your question:
“In Step C (policy update with GRPO), how can I apply policy updates if my TinyVLA-based model was already trained with supervised learning (SL/SFT)? I don’t want to train from scratch—only fine-tune. Can I fine-tune only the adapters, or only the linear layers?”
Yes — GRPO is typically used as an RL fine-tuning stage on top of an SFT-pretrained VLA, not training from scratch. In practice, updating only adapters (or a small subset of layers) is not only allowed, it’s often the best first attempt for stability and compute.
Below is the clear “how + what to tune” picture.
1) What GRPO “policy update” means in your case
Even though your model was trained by SL/SFT, it still defines a stochastic policy during rollout: in TinyVLA-style diffusion, the stochasticity comes from sampling noise in the diffusion process (unless you force fully deterministic sampling).
GRPO then adjusts parameters $\theta$ to increase the probability of action sequences that led to success, and decrease those that led to failure, using group-relative advantages.
So GRPO is simply:
“fine-tune the existing policy using outcome-based RL.”
No scratch training needed.
2) What you must have to do GRPO (regardless of adapters vs full FT)
GRPO/PPO-style updates require you to compute (explicitly or implicitly):
$$
r_t(\theta)=\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}
$$
Meaning you need:
- the probability (or log-prob) of the sampled actions under the current policy
- the probability (or log-prob) of the same actions under the old (behavior) policy snapshot
This is the key engineering requirement.
Once you can compute (or approximate) these log-probs, you can apply GRPO to any parameter subset: adapters, heads, linear layers, etc.
For diffusion policies, this “log-prob” is trickier than for discrete tokens, but it’s still doable (many diffusion-policy RL implementations compute likelihood surrogates from the noise prediction / score model).
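Once those log-probs are available, the group-relative advantage and the clipped ratio objective are simple arithmetic. Here is a minimal pure-Python sketch (function names are illustrative, not from any particular library):

```python
import math

def grpo_advantages(rewards):
    # Group-relative advantage: each trajectory's reward minus the group
    # mean, normalized by the group's standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

def grpo_loss_term(logp_new, logp_old, advantage, clip_eps=0.2):
    # PPO-style clipped surrogate for one sampled action:
    # r_t = pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped)  # negated: we minimize this loss
```

In a real implementation the log-probs come from your model under autograd, but the advantage/clipping logic is exactly this.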
3) Can you fine-tune only adapters? Yes. Here are the main options.
Option A (recommended first): Tune only task adapters / LoRA adapters
What you update:
- your incremental task adapters (or a new “RL adapter”)
- optionally the action head (final MLP) + LayerNorm scale/bias

What you freeze:
- vision backbone
- language backbone
- most transformer blocks

Why this is a good default:
- stable (RL gradients can be noisy)
- low memory + fast iteration
- preserves SFT behavior (less catastrophic drift)

When it might be insufficient:
- if the RL improvements require changing “representations,” not just the final mapping to actions
- if the reward is very sparse and the adapters are too small to shift behavior meaningfully
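In PyTorch, Option A amounts to flipping `requires_grad` by parameter name. A minimal sketch, assuming your modules contain "adapter", "action_head", or "norm" in their names (adjust the keywords to your actual model):

```python
import torch.nn as nn

def freeze_except(model: nn.Module,
                  trainable_keywords=("adapter", "action_head", "norm")):
    # Freeze every parameter, then re-enable those whose name contains one
    # of the keywords. The keywords are assumptions about how your TinyVLA
    # variant names its adapters / head / LayerNorms -- verify against
    # model.named_parameters() before trusting them.
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Then pass only `[p for p in model.parameters() if p.requires_grad]` to the optimizer so frozen weights never receive updates.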
Option B: Tune only the action head / last linear layers
What you update:
- final action projection layers / action MLP head
- sometimes the last 1–2 transformer blocks

Pros:
- simplest
- least risk of destroying vision-language alignment

Cons:
- can be too weak: if failures are due to perception/grounding, head-only tuning may not fix it
Option C: Tune adapters + a small “policy subnetwork”
A strong practical middle ground:
- tune adapters
- plus the last K blocks (e.g., last 2–4 transformer layers)
- plus the action head
This gives more capacity while still staying far from full fine-tuning.
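Selecting the "last K blocks" is just a naming exercise on top of the same freezing logic. A sketch, assuming the (hypothetical) convention that transformer blocks are registered as `blocks.{i}` — the real prefix depends on your model:

```python
def last_k_block_prefixes(num_layers: int, k: int) -> list:
    # Parameter-name prefixes for the last k transformer blocks, under the
    # assumed naming scheme "blocks.{i}.". Add these prefixes to your
    # adapter/head keywords when deciding which parameters stay trainable.
    return [f"blocks.{i}." for i in range(num_layers - k, num_layers)]
```

For a 12-layer backbone with K=2, this selects `blocks.10.` and `blocks.11.`.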
Option D: Full fine-tuning (least recommended early)
This can work but is higher risk:
- more compute
- more instability
- more forgetting of SFT skills
Usually you try this only after A/B/C plateau.
4) How this maps to your “incremental task adapters” design
You have two clean strategies:
Strategy 1: Per-task RL fine-tuning of each task adapter
- keep the base frozen
- pick the adapter for task i
- collect rollouts for task i
- GRPO-update only that adapter (and maybe the head)
This matches your incremental adapter philosophy and keeps tasks isolated.
Strategy 2: Train a shared “RL improvement adapter”
- a single adapter trained with GRPO across multiple tasks
- can improve general robustness/exploration without modifying task-specific adapters
This is useful if you want one “robustness adapter” that you can turn on.
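Strategy 1 reduces to a thin loop over your own rollout and update machinery. A sketch where `collect_rollouts` and `grpo_update` are placeholder hooks (not real library calls) for your environment interface and optimizer step:

```python
def rl_finetune_task(task_id, adapters, collect_rollouts, grpo_update,
                     num_iters=100):
    # Strategy 1: the base model stays frozen; only the adapter selected
    # for this task (and optionally the head) receives GRPO updates.
    adapter = adapters[task_id]
    for _ in range(num_iters):
        group = collect_rollouts(task_id)  # group of trajectories, same init state
        grpo_update(adapter, group)
    return adapter
```

Because each task touches only its own adapter, tasks remain isolated and an RL failure on one task cannot corrupt the others.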
5) Practical recommendation for your first implementation
If your goal is: “Try SimpleVLA-RL ideas without destabilizing TinyVLA”, do this:
1. Start from your SFT TinyVLA checkpoint.
2. Freeze everything except:
   - the task adapter (or LoRA modules)
   - the action head
   - (optional) LayerNorm parameters
3. Implement GRPO using group rollouts (same initial state, multiple sampled trajectories).
4. If it plateaus, expand the trainable set to the last 1–2 transformer blocks + adapters.
That’s the most common “safe scaling” path.
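Putting the pieces together, one GRPO step over a single group of rollouts might look like the sketch below (pure-Python math for clarity; `sample_group`, `logprob_fn`, and `apply_grads` are hypothetical hooks into your rollout and autograd code):

```python
import math

def grpo_step(sample_group, logprob_fn, apply_grads,
              group_size=8, clip_eps=0.2):
    # One GRPO update: roll out a group of trajectories from the same
    # initial state under the frozen "old" policy, compute group-relative
    # advantages, and average the clipped surrogate loss over the group.
    trajs, rewards = sample_group(group_size)
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-8
    advs = [(r - mean) / std for r in rewards]

    losses = []
    for traj, adv in zip(trajs, advs):
        logp_new, logp_old = logprob_fn(traj)  # current vs. snapshot policy
        ratio = math.exp(logp_new - logp_old)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))

    loss = sum(losses) / len(losses)
    apply_grads(loss)  # backprop through the trainable subset only
    return loss
```

Note that gradients flow only through whatever parameter subset you left trainable (adapters, head, last K blocks), so the same step works unchanged across Options A–D.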