Fine-tune a trained model via GRPO

By ihsumlee, 3 March 2026

Here’s a polished version of your question:

“In Step C (policy update with GRPO), how can I apply policy updates if my TinyVLA-based model was already trained with supervised learning (SL/SFT)? I don’t want to train from scratch—only fine-tune. Can I fine-tune only the adapters, or only the linear layers?”

Yes — GRPO is typically used as an RL fine-tuning stage on top of an SFT-pretrained VLA, not training from scratch. In practice, updating only adapters (or a small subset of layers) is not only allowed, it’s often the best first attempt for stability and compute.

Below is a clear picture of how it works and what to tune.


1) What GRPO “policy update” means in your case

Even though your model was trained by SL/SFT, it still defines a stochastic policy during rollout:

  • In TinyVLA-style diffusion, stochasticity comes from sampling noise in the diffusion process (unless you force fully deterministic sampling).

  • GRPO then adjusts parameters $\theta$ to increase the probability of action sequences that led to success, and decrease those that led to failure, using group-relative advantages.

So GRPO is simply:

“fine-tune the existing policy using outcome-based RL.”

No scratch training needed.
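Concretely, the "group-relative advantage" can be sketched in a few lines of plain Python: each rollout's outcome reward is normalized against its own group's statistics, so no learned value function (critic) is needed:

```python
# Group-relative advantage: for a group of G rollouts from the same initial
# state, normalize each trajectory's reward by the group's own mean and std.
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: list of scalar outcome rewards, one per rollout in the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 rollouts of the same task; two succeed (reward 1), two fail (0).
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Successful rollouts get positive advantage, failed ones negative.
```

If every rollout in a group gets the same reward, all advantages are zero and that group contributes no gradient, which is one reason sparse rewards can stall adapter-only tuning (see Option A below).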


2) What you must have to do GRPO (regardless of adapters vs full FT)

GRPO/PPO-style updates require you to compute (explicitly or implicitly):

$$
r_t(\theta)=\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}
$$

Meaning you need:

  1. the probability (or log-prob) of the sampled actions under the current policy

  2. the probability (or log-prob) of the same actions under the old (behavior) policy snapshot

This is the key engineering requirement.
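Given those two log-probs and a group-relative advantage, the per-sample clipped surrogate (the PPO-style objective that GRPO inherits) can be sketched as:

```python
import math

def grpo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """One-sample clipped surrogate term (averaged over the group in practice)."""
    ratio = math.exp(logp_new - logp_old)  # r_t(theta) from the equation above
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    # PPO/GRPO maximize the minimum of the two (a pessimistic lower bound),
    # which keeps the update close to the behavior policy snapshot.
    return min(unclipped, clipped)
```

Note that which parameters you later choose to train does not change this objective at all; it only changes which parameters receive its gradient.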

Once you can compute (or approximate) these log-probs, you can apply GRPO to any parameter subset: adapters, heads, linear layers, etc.

For diffusion policies, this “log-prob” is trickier than for discrete tokens, but it’s still doable (many diffusion-policy RL implementations compute likelihood surrogates from the noise prediction / score model).
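One common surrogate, assuming your sampler's stochastic denoising steps are Gaussian around the network's predicted mean (the exact parameterization depends on your diffusion implementation, so treat this as a sketch, not TinyVLA's actual likelihood), is to sum per-step Gaussian log-densities:

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def diffusion_action_logprob(samples, means, stds):
    """Surrogate log-prob of an action chunk under a diffusion policy:
    treat each stochastic denoising step as a Gaussian whose mean comes from
    the noise-prediction / score network, and sum per-step log-densities.
    samples/means/stds: flat lists aligned per denoising step and action dim."""
    return sum(gaussian_logpdf(x, m, s) for x, m, s in zip(samples, means, stds))
```

Evaluating the same function once with the frozen snapshot's predicted means and once with the current policy's gives the two log-probs needed for the ratio above.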


3) Can you fine-tune only adapters? Yes. Here are the main options.

Option A (recommended first):

Tune only task adapters / LoRA adapters

What you update

  • your incremental task adapters (or a new “RL adapter”)

  • optionally the action head (final MLP) + LayerNorm scale/bias

What you freeze

  • vision backbone

  • language backbone

  • most transformer blocks

Why this is a good default

  • stable (RL gradients can be noisy)

  • low memory + fast iteration

  • preserves SFT behavior (less catastrophic drift)

When it might be insufficient

  • if the RL improvements require changing “representations,” not just the final mapping to actions

  • if reward is very sparse and adapters are too small to shift behavior meaningfully


Option B:

Tune only action head / last linear layers

What you update

  • final action projection layers / action MLP head

  • sometimes the last 1–2 transformer blocks

Pros

  • simplest

  • least risk of destroying vision-language alignment

Cons

  • can be too weak: if failures are due to perception/grounding, head-only tuning may not fix it


Option C:

Tune adapters + a small “policy subnetwork”

A strong practical middle ground:

  • tune adapters

  • plus last K blocks (e.g., last 2–4 transformer layers)

  • plus action head

This gives more capacity while still staying far from full fine-tuning.
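Extending the same freezing idea, Option C additionally unfreezes the last K blocks. The sketch below assumes the transformer layers live in an ordered, indexable container; the attribute name varies by architecture:

```python
import torch.nn as nn

def unfreeze_last_k_blocks(blocks, k=2):
    """On top of an adapter-only trainable set, also unfreeze the last k
    transformer blocks (Option C). `blocks` is the model's ordered list of
    transformer layers -- locate it via your model's own attribute name."""
    for block in list(blocks)[-k:]:
        for p in block.parameters():
            p.requires_grad = True

# Illustrative: 4 stand-in "blocks", all frozen, then open the last 2.
blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(4))
for p in blocks.parameters():
    p.requires_grad = False
unfreeze_last_k_blocks(blocks, k=2)
```

Start with small k (2-4) and grow it only if the adapter-plus-head run plateaus; each extra block widens the trainable set and with it the risk of drifting from SFT behavior.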


Option D: Full fine-tuning (least recommended early)

This can work but is higher risk:

  • more compute

  • more instability

  • more forgetting of SFT skills

Usually you try this only after A/B/C plateau.


4) How this maps to your “incremental task adapters” design

You have two clean strategies:

Strategy 1:

Per-task RL fine-tune on each task adapter

  • keep the base frozen

  • pick adapter for task i

  • collect rollouts for task i

  • GRPO update only that adapter (and maybe head)

This matches your incremental adapter philosophy and keeps tasks isolated.
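Strategy 1 can be sketched with a hypothetical per-task adapter registry (names and sizes here are illustrative): the base stays frozen and exactly one task's adapter is trainable during that task's GRPO phase.

```python
import torch.nn as nn

class AdapterBank(nn.Module):
    """Hypothetical registry of incremental task adapters; the base model
    (not shown) stays frozen throughout."""
    def __init__(self, tasks, dim=8):
        super().__init__()
        self.adapters = nn.ModuleDict({t: nn.Linear(dim, dim) for t in tasks})

    def set_active_task(self, task):
        # GRPO gradients flow only into the selected task's adapter.
        for name, adapter in self.adapters.items():
            for p in adapter.parameters():
                p.requires_grad = (name == task)

bank = AdapterBank(["pick_cube", "open_drawer"])
bank.set_active_task("pick_cube")
```

Rebuilding the optimizer (or filtering its parameter list) after each `set_active_task` call keeps stale momentum from leaking across tasks.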

Strategy 2:

Train a shared “RL improvement adapter”

  • a single adapter trained with GRPO across multiple tasks

  • can improve general robustness/exploration without modifying task-specific adapters

This is useful if you want one “robustness adapter” that you can turn on.


5) Practical recommendation for your first implementation

If your goal is: “Try SimpleVLA-RL ideas without destabilizing TinyVLA”, do this:

  1. Start from your SFT TinyVLA checkpoint

  2. Freeze everything except:

    • task adapter (or LoRA modules)

    • action head

    • (optional) LayerNorm parameters

  3. Implement GRPO using group rollouts (same init state, multiple sampled trajectories)

  4. If it plateaus, expand trainable set to:

    • last 1–2 transformer blocks + adapters

That’s the most common “safe scaling” path.
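To make the whole loop concrete, here is a toy end-to-end GRPO step where a 2-action bandit stands in for "multiple sampled trajectories from the same init state". Everything is illustrative: a real run replaces the bandit with environment rollouts, the logit vector with your trainable adapter parameters, and the categorical log-probs with the diffusion surrogate discussed earlier.

```python
import statistics
import torch

# Toy policy: a single 2-action logit vector; action 1 is the "success" action.
torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.5)

def grpo_step(group_size=8, clip_eps=0.2):
    # 1) Snapshot the behavior policy and roll out a group from the same state.
    old_logp = torch.log_softmax(logits.detach(), dim=0)
    actions = [torch.multinomial(old_logp.exp(), 1).item() for _ in range(group_size)]
    rewards = [1.0 if a == 1 else 0.0 for a in actions]
    # 2) Group-relative advantages (no critic).
    mean, std = statistics.fmean(rewards), statistics.pstdev(rewards) + 1e-8
    advs = [(r - mean) / std for r in rewards]
    # 3) Clipped surrogate over the group, then one optimizer step.
    new_logp = torch.log_softmax(logits, dim=0)
    loss = 0.0
    for a, adv in zip(actions, advs):
        ratio = (new_logp[a] - old_logp[a]).exp()
        loss = loss - torch.min(ratio * adv, ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)
    loss = loss / group_size
    opt.zero_grad()
    loss.backward()
    opt.step()

for _ in range(50):
    grpo_step()
# After training, the policy should prefer the rewarded action (index 1).
```

The same three-phase shape (snapshot + group rollouts, group-relative advantages, clipped update on the trainable subset) carries over directly once the bandit is swapped for your simulator and the frozen/trainable split from step 2 above is in place.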

