Here’s a polished version of your question:
“In Step C (policy update with GRPO), how can I apply policy updates if my TinyVLA-based model was already trained with supervised learning (SL/SFT)? I don’t want to train from scratch—only fine-tune. Can I fine-tune only the adapters, or only the linear layers?”
Yes — GRPO is typically used as an RL fine-tuning stage on top of an SFT-pretrained VLA, not training from scratch. In practice, updating only adapters (or a small subset of layers) is not only allowed, it’s often the best first attempt for stability and compute.
Below is the clear “how + what to tune” picture.
1) What GRPO “policy update” means in your case
Even though your model was trained by SL/SFT, it still defines a stochastic policy during rollout: in TinyVLA-style diffusion, the stochasticity comes from sampling noise in the diffusion process (unless you force fully deterministic sampling).
GRPO then adjusts parameters $\theta$ to increase the probability of action sequences that led to success, and decrease those that led to failure, using group-relative advantages.
So GRPO is simply:
“fine-tune the existing policy using outcome-based RL.”
No scratch training needed.
2) What you must have to do GRPO (regardless of adapters vs full FT)
GRPO/PPO-style updates require you to compute (explicitly or implicitly):
$$
r_t(\theta)=\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}
$$
Meaning you need:
- the probability (or log-prob) of the sampled actions under the current policy
- the probability (or log-prob) of the same actions under the old (behavior) policy snapshot
This is the key engineering requirement.
Once you can compute (or approximate) these log-probs, you can apply GRPO to any parameter subset: adapters, heads, linear layers, etc.
For diffusion policies, this “log-prob” is trickier than for discrete tokens, but it’s still doable (many diffusion-policy RL implementations compute likelihood surrogates from the noise prediction / score model).
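Once those log-probs are available, the group-relative advantage and the clipped ratio objective are simple arithmetic. Here is a minimal pure-Python sketch (function names are illustrative, not from any particular library):

```python
import math

def grpo_advantages(rewards):
    # Group-relative advantage: each trajectory's reward minus the group
    # mean, normalized by the group's standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

def grpo_loss_term(logp_new, logp_old, advantage, clip_eps=0.2):
    # PPO-style clipped surrogate for one sampled action:
    # r_t = pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped)  # negated: we minimize this loss
```

In a real implementation the log-probs come from your model under autograd, but the advantage/clipping logic is exactly this.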
3) Can you fine-tune only adapters? Yes. Here are the main options.
Option A (recommended first): Tune only task adapters / LoRA adapters
What you update:
- your incremental task adapters (or a new “RL adapter”)
- optionally the action head (final MLP) + LayerNorm scale/bias

What you freeze:
- vision backbone
- language backbone
- most transformer blocks

Why this is a good default:
- stable (RL gradients can be noisy)
- low memory + fast iteration
- preserves SFT behavior (less catastrophic drift)

When it might be insufficient:
- if the RL improvements require changing “representations,” not just the final mapping to actions
- if the reward is very sparse and the adapters are too small to shift behavior meaningfully
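In PyTorch, Option A amounts to flipping `requires_grad` by parameter name. A minimal sketch, assuming your modules contain "adapter", "action_head", or "norm" in their names (adjust the keywords to your actual model):

```python
import torch.nn as nn

def freeze_except(model: nn.Module,
                  trainable_keywords=("adapter", "action_head", "norm")):
    # Freeze every parameter, then re-enable those whose name contains one
    # of the keywords. The keywords are assumptions about how your TinyVLA
    # variant names its adapters / head / LayerNorms -- verify against
    # model.named_parameters() before trusting them.
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Then pass only `[p for p in model.parameters() if p.requires_grad]` to the optimizer so frozen weights never receive updates.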
Option B: Tune only the action head / last linear layers
What you update:
- final action projection layers / action MLP head
- sometimes the last 1–2 transformer blocks

Pros:
- simplest
- least risk of destroying vision-language alignment

Cons:
- can be too weak: if failures are due to perception/grounding, head-only tuning may not fix it
Option C: Tune adapters + a small “policy subnetwork”
A strong practical middle ground:
- tune adapters
- plus the last K blocks (e.g., last 2–4 transformer layers)
- plus the action head
This gives more capacity while still staying far from full fine-tuning.
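Selecting the "last K blocks" is just a naming exercise on top of the same freezing logic. A sketch, assuming the (hypothetical) convention that transformer blocks are registered as `blocks.{i}` — the real prefix depends on your model:

```python
def last_k_block_prefixes(num_layers: int, k: int) -> list:
    # Parameter-name prefixes for the last k transformer blocks, under the
    # assumed naming scheme "blocks.{i}.". Add these prefixes to your
    # adapter/head keywords when deciding which parameters stay trainable.
    return [f"blocks.{i}." for i in range(num_layers - k, num_layers)]
```

For a 12-layer backbone with K=2, this selects `blocks.10.` and `blocks.11.`.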
Option D: Full fine-tuning (least recommended early)
This can work but is higher risk:
- more compute
- more instability
- more forgetting of SFT skills
Usually you try this only after A/B/C plateau.
4) How this maps to your “incremental task adapters” design
You have two clean strategies:
Strategy 1: Per-task RL fine-tuning of each task adapter
- keep the base frozen
- pick the adapter for task i
- collect rollouts for task i
- GRPO-update only that adapter (and maybe the head)
This matches your incremental adapter philosophy and keeps tasks isolated.
Strategy 2: Train a shared “RL improvement adapter”
- a single adapter trained with GRPO across multiple tasks
- can improve general robustness/exploration without modifying task-specific adapters
This is useful if you want one “robustness adapter” that you can turn on.
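Strategy 1 reduces to a thin loop over your own rollout and update machinery. A sketch where `collect_rollouts` and `grpo_update` are placeholder hooks (not real library calls) for your environment interface and optimizer step:

```python
def rl_finetune_task(task_id, adapters, collect_rollouts, grpo_update,
                     num_iters=100):
    # Strategy 1: the base model stays frozen; only the adapter selected
    # for this task (and optionally the head) receives GRPO updates.
    adapter = adapters[task_id]
    for _ in range(num_iters):
        group = collect_rollouts(task_id)  # group of trajectories, same init state
        grpo_update(adapter, group)
    return adapter
```

Because each task touches only its own adapter, tasks remain isolated and an RL failure on one task cannot corrupt the others.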
5) Practical recommendation for your first implementation
If your goal is: “Try SimpleVLA-RL ideas without destabilizing TinyVLA”, do this:
1. Start from your SFT TinyVLA checkpoint.
2. Freeze everything except:
   - the task adapter (or LoRA modules)
   - the action head
   - (optional) LayerNorm parameters
3. Implement GRPO using group rollouts (same initial state, multiple sampled trajectories).
4. If it plateaus, expand the trainable set to the last 1–2 transformer blocks + adapters.
That’s the most common “safe scaling” path.
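Putting the pieces together, one GRPO step over a single group of rollouts might look like the sketch below (pure-Python math for clarity; `sample_group`, `logprob_fn`, and `apply_grads` are hypothetical hooks into your rollout and autograd code):

```python
import math

def grpo_step(sample_group, logprob_fn, apply_grads,
              group_size=8, clip_eps=0.2):
    # One GRPO update: roll out a group of trajectories from the same
    # initial state under the frozen "old" policy, compute group-relative
    # advantages, and average the clipped surrogate loss over the group.
    trajs, rewards = sample_group(group_size)
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-8
    advs = [(r - mean) / std for r in rewards]

    losses = []
    for traj, adv in zip(trajs, advs):
        logp_new, logp_old = logprob_fn(traj)  # current vs. snapshot policy
        ratio = math.exp(logp_new - logp_old)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))

    loss = sum(losses) / len(losses)
    apply_grads(loss)  # backprop through the trainable subset only
    return loss
```

Note that gradients flow only through whatever parameter subset you left trainable (adapters, head, last K blocks), so the same step works unchanged across Options A–D.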