Section 5 overall idea
Section 5 explains the move from hierarchical robotics systems to integrated end-to-end systems. In earlier hierarchical systems, the robot pipeline was split into perception, planning, and control. Section 5 says that this design is interpretable, but it suffers from symbolic bottlenecks and error propagation. In contrast, recent Vision-Language-Action (VLA) systems map sensory input and language commands directly to robot actions in a single model. In these systems, LLMs are no longer only “planners”; they become part of a joint policy that helps connect semantics to real control. Foundational examples named by the survey include RT-1, RT-2, and PaLM-E.
So the whole section is built around three questions:
How should actions be represented?
How should these models be trained?
What kinds of architectures are used?
If you remember only one sentence, remember this:
Section 5 is about how modern VLAs unify perception, language, and action into one trainable control system.
Step 1: Why do we need end-to-end approaches?
Before Section 5 explains the methods, it first explains the motivation.
In Section 4, hierarchical systems already became smarter by using LLMs for reasoning and planning. But the robot still operated through separate modules: the language model might output a good high-level plan, yet the low-level controller could still fail because the system was not jointly learned. The survey says these modular systems still suffer from symbolic bottlenecks, fragile interfaces, and error propagation. That is the main reason researchers moved toward end-to-end trainable VLA systems.
So the logic is:
Traditional systems: stable, but not semantic
Hierarchical LLM systems: semantic, but still modular
End-to-end VLA systems: try to learn semantics and action together
This is the conceptual bridge into Section 5.
Step 2: The first big topic is action representation
The survey says the first core issue is action representation. This is very important because, in end-to-end VLA, the model must output something the robot can actually execute. The action space affects low-level control fidelity, learning stability, and the model’s ability to ground language into physical behavior.
In simple words:
If the model cannot represent actions well, then even if it “understands” the instruction, it still cannot move correctly.
The survey divides action representation into three paradigms.
2.1 Discretized actions
This is the idea used by many token-based VLA models. Continuous robot actions are converted into discrete bins or tokens, so that the robot policy looks more like language modeling. Then the model can predict actions token by token, like text generation.
This idea is powerful because it lets robotics reuse the infrastructure of LLM/VLM training. The OpenVLA paper describes VLA models in exactly this spirit: robot control actions are fused directly into VLM backbones, and OpenVLA itself predicts 7-dimensional robot control actions as discrete tokens through a language-model-style backbone.
A very important advantage is that discretization makes robot control compatible with pretrained sequence models. But the downside is that it can be slow and sometimes unnatural, because continuous motor control is forced into discrete token prediction.
For your own intuition:
language models like discrete tokens
robots live in continuous motion
discretization is the bridge between these two worlds
OpenVLA is a representative example here.
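The binning idea above can be sketched in a few lines. This is an illustrative toy, not code from any paper: the bin count (256) and the symmetric action bounds are assumptions I chose for the example, and the round trip shows why reconstruction error is bounded by the bin width.

```python
import numpy as np

N_BINS = 256  # each action dimension is quantized into 256 bins (assumed)

def discretize(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer tokens."""
    action = np.clip(action, low, high)
    # normalize to [0, 1], then bucket into n_bins integer ids
    frac = (action - low) / (high - low)
    return np.minimum((frac * n_bins).astype(int), n_bins - 1)

def undiscretize(tokens, low, high, n_bins=N_BINS):
    """Map integer tokens back to bin-center continuous actions."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# Example: a 7-D action (e.g. end-effector deltas plus gripper)
low = np.array([-1.0] * 7)
high = np.array([1.0] * 7)
a = np.array([0.1, -0.3, 0.5, 0.0, 0.2, -0.9, 1.0])
tokens = discretize(a, low, high)        # 7 integer ids in [0, 255]
a_rec = undiscretize(tokens, low, high)  # error bounded by the bin width
```

Once actions look like integer ids, the policy can predict them with exactly the same autoregressive machinery a language model uses for text tokens.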
2.2 Continuous actions
Instead of converting actions into tokens, another idea is to predict continuous actions directly. This is more natural for physical control because robot joints, end-effector poses, and gripper values are continuous in the real world.
The benefit is better physical fidelity. The challenge is that continuous prediction is harder to combine directly with LLM-style generation frameworks. The survey treats this as another main design direction in VLA action representation.
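A continuous action head can be as simple as a small regression network on top of backbone features. The sketch below is a minimal stand-in: the layer sizes, the tanh squashing, and the random "features" are all assumptions for illustration, not any model's real architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HIDDEN, ACT_DIM = 512, 64, 7  # illustrative sizes

W1 = rng.normal(0, 0.02, (FEAT_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.02, (HIDDEN, ACT_DIM))
b2 = np.zeros(ACT_DIM)

def action_head(features):
    """Regress a continuous 7-D action; tanh keeps each value in [-1, 1]."""
    h = np.maximum(features @ W1 + b1, 0.0)  # ReLU hidden layer
    return np.tanh(h @ W2 + b2)

features = rng.normal(size=FEAT_DIM)  # stand-in for VLM features
action = action_head(features)        # real-valued, directly executable
```

Note the contrast with the discretized case: the output here is a real vector the controller can execute directly, with no token vocabulary and no quantization error, but also no reuse of the language model's next-token decoder.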
2.3 Generative trajectory modeling
The third idea is to generate not just one immediate action, but a whole chunk or trajectory of future actions. This often appears in diffusion-style or generative policies.
This is especially important because robot manipulation is inherently sequential. A single good action is not enough; the model needs a smooth, temporally coherent sequence. The survey therefore includes generative action modeling as a major paradigm under action representation.
This connects very strongly to TinyVLA. TinyVLA argues that many existing VLAs generate discrete action tokens autoregressively, which causes high latency. To solve this, TinyVLA attaches a diffusion-based head to a pretrained multimodal model for direct robot action output. That is why TinyVLA is a good example of moving away from slow token-by-token action prediction toward a more efficient generative action head.
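The key mechanical difference can be shown with a toy denoising loop. This is a heavily simplified stand-in for a diffusion-style head: the "denoiser" below just nudges noise toward a fixed target chunk, whereas a real head (as in TinyVLA or diffusion policies) would be a learned network conditioned on backbone features. The horizon, dimensions, and step count are all assumed for illustration.

```python
import numpy as np

HORIZON, ACT_DIM, STEPS = 8, 7, 20  # illustrative sizes
rng = np.random.default_rng(0)
target = np.zeros((HORIZON, ACT_DIM))  # pretend "true" action chunk

def denoise_step(x, t):
    # stand-in denoiser: moves the sample toward the target chunk
    return x + (target - x) / (STEPS - t)

def sample_chunk():
    x = rng.normal(size=(HORIZON, ACT_DIM))  # start from pure noise
    for t in range(STEPS):
        x = denoise_step(x, t)
    return x  # all HORIZON actions emerge together, not token by token

chunk = sample_chunk()
```

The point of the sketch is the shape of the computation: a fixed number of denoising passes produces the whole trajectory at once, so latency does not grow with the number of action tokens the way autoregressive decoding does.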
So, at the end of this step, you should remember:
discrete action tokens → easy to align with LLMs, but can be slow
continuous actions → physically natural
trajectory/generative outputs → better temporal coherence
Step 3: The second big topic is training strategy
After deciding how actions are represented, the next question is:
How do we train these VLA systems?
Section 5 says that scaling is essential. Foundational systems such as RT-1, RT-2, and PaLM-E show that increasing data scale and model capacity leads to stronger zero-shot generalization and cross-task transfer.
The key training idea is this:
large pretrained vision-language or language backbones already have broad semantic knowledge
robot datasets teach them how to convert that knowledge into physical action
This is why the section presents VLA as a form of multimodal transfer learning into control.
OpenVLA is a clear example. It is trained on 970k robot demonstrations from Open X-Embodiment, and its paper explains that one major benefit of VLAs is that robotics can directly benefit from rapid progress in VLMs.
So the basic training recipe is often:
start with a pretrained VLM or LLM backbone
add action prediction capability
train or fine-tune on robot demonstrations
hope that semantic priors from internet-scale data improve robot generalization
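The recipe above can be caricatured as behavior cloning on top of fixed features. The sketch below is a toy under stated assumptions: random vectors play the role of the frozen backbone's output, the "action head" is a single linear map, and the demonstrations are synthetic; no real VLA trains this simply.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, ACT, N = 32, 7, 256  # illustrative sizes

# frozen backbone output for N demonstration frames (stand-in)
feats = rng.normal(size=(N, FEAT))
true_W = rng.normal(size=(FEAT, ACT))
demo_actions = feats @ true_W  # actions the head must imitate

W = np.zeros((FEAT, ACT))  # trainable action head, initialized at zero
lr = 0.1
for step in range(500):
    pred = feats @ W
    grad = feats.T @ (pred - demo_actions) / N  # gradient of MSE loss
    W -= lr * grad

final_loss = float(np.mean((feats @ W - demo_actions) ** 2))
```

The design choice the toy highlights: all the "semantic" work is assumed to live in the frozen features, and the robot data only has to teach the comparatively small mapping from features to actions.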
That is one of the central messages of Section 5.
Step 4: The third big topic is architecture
This is the structural heart of Section 5. The survey divides end-to-end VLA architectures into two main types:
monolithic architectures
modular architectures
This distinction is extremely important.
4.1 Monolithic architectures
A monolithic VLA is trained as a single unified model: perception, language understanding, and action generation are jointly optimized in one large architecture.
The survey says this gives strong performance because of fully joint optimization. It names RT-2 as a major leap because it fine-tuned web-scale vision-language models on robotic data, transferring semantic and commonsense knowledge to robot control. It also says OpenVLA made monolithic designs more accessible through open-source release.
So the monolithic philosophy is:
one big brain, trained together
The main advantages are stronger integration and strong performance. But the survey also clearly states the drawbacks:
fine-tuning multi-billion-parameter backbones needs a lot of data and compute
end-to-end adaptation can cause catastrophic forgetting of pretrained knowledge
This is a very important trade-off.
OpenVLA is an excellent practical example. It directly fine-tunes a large pretrained vision-language model to generate robot actions by treating them as tokens in the language model vocabulary, and it is explicitly presented as a more end-to-end approach than stitched-together generalist policies.
4.2 Modular architectures
To reduce the cost of monolithic models, the survey says modular architectures use a foundation-model-as-service idea. A large vision-language backbone stays mostly frozen and serves as a perception-and-reasoning engine, while a smaller policy head or adapter learns to generate robot actions.
The survey names PaLM-E and RoboFlamingo as early representatives. Their idea is that a frozen backbone provides rich features, and a separate motor policy uses those features to control the robot.
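Structurally, the modular split looks like the sketch below. The class names (`FrozenBackbone`, `PolicyHead`) are hypothetical and the computations are stand-ins; the point is only the division of labor, with the backbone's weights never updated and only the small head trained.

```python
import numpy as np

class FrozenBackbone:
    """Stand-in for a pretrained VLM; its weights are never updated."""
    def __init__(self, rng, dim=512):
        self.W = rng.normal(0, 0.02, (dim, dim))
    def encode(self, obs):
        return np.tanh(obs @ self.W)  # "rich features" for the policy

class PolicyHead:
    """Small trainable head mapping features to a 7-D action."""
    def __init__(self, rng, dim=512, act=7):
        self.W = rng.normal(0, 0.02, (dim, act))
    def act(self, feat):
        return feat @ self.W
    def trainable_params(self):
        return self.W.size

rng = np.random.default_rng(0)
backbone, head = FrozenBackbone(rng), PolicyHead(rng)
obs = rng.normal(size=512)  # stand-in observation
action = head.act(backbone.encode(obs))
# only a small fraction of all parameters is ever trained
ratio = head.trainable_params() / backbone.W.size
```

Because only `PolicyHead` receives gradients, adapting to a new task or embodiment means retraining a tiny module while the expensive backbone, and the knowledge in it, stays intact.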
This approach has several advantages:
preserves pretrained knowledge
reduces adaptation cost
supports transfer across tasks and embodiments
The survey also says policy heads have evolved from simple linear/recurrent layers to more expressive diffusion-based heads, which can generate smooth, multimodal trajectories for fine manipulation.
This is exactly where TinyVLA fits very naturally. TinyVLA keeps pretrained parts, uses LoRA for parameter-efficient tuning, and adds a policy decoder / diffusion-based head to output executable actions. Its main goal is to build a fast, data-efficient VLA without relying on huge robot pretraining like OpenX.
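LoRA itself is easy to sketch. The idea, shown below as a minimal numpy toy with illustrative sizes, is to leave a pretrained weight matrix frozen and train only a low-rank correction; initializing the up-projection to zero means the adapted layer starts out exactly equal to the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK = 512, 512, 8  # illustrative sizes and rank

W = rng.normal(0, 0.02, (D_IN, D_OUT))  # frozen pretrained weight
A = rng.normal(0, 0.02, (D_IN, RANK))   # trainable down-projection
B = np.zeros((RANK, D_OUT))             # trainable up-projection, init 0

def lora_forward(x):
    # B is zero at init, so the adapted layer starts identical to W
    return x @ W + x @ A @ B

x = rng.normal(size=D_IN)
y = lora_forward(x)

ratio = (A.size + B.size) / W.size  # fraction of parameters trained
```

At rank 8 on a 512x512 layer, the trainable fraction is about 3% of the full matrix, which is why LoRA-style tuning makes fine-tuning large backbones affordable for approaches like TinyVLA.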
So if you want a clean comparison:
OpenVLA leans toward more monolithic end-to-end token-based VLA
TinyVLA is closer to an efficiency-oriented modular VLA with a diffusion action head and parameter-efficient tuning
Step 5: New trends inside modular VLA
Section 5 does not stop at simple modular design. It says modularity also enables more advanced cognitive structures.
The survey highlights dual-system architectures, inspired by fast “System 1” intuition and slow “System 2” deliberation. It names CogACT, Hume, and One-Two-VLA as examples. These systems first produce fast action proposals, then refine them using a slower reasoning module.
It also names WorldVLA, which adds a learned world model for predictive long-horizon control.
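The fast/slow control flow can be caricatured in a few lines. Everything below is a hypothetical stand-in, not how CogACT or Hume actually work: "System 1" just samples random candidates and "System 2" scores them against a known goal, whereas the real systems use learned proposal and refinement models.

```python
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([0.5, -0.2, 0.3])  # assumed target for the toy

def system1_propose(n=8):
    """Fast, cheap: sample several rough action candidates."""
    return rng.normal(0, 0.5, size=(n, 3))

def system2_refine(candidates):
    """Slow, deliberate: score candidates, then refine the winner."""
    scores = -np.linalg.norm(candidates - goal, axis=1)
    best = candidates[np.argmax(scores)]
    return 0.5 * best + 0.5 * goal  # one refinement step toward the goal

proposals = system1_propose()
action = system2_refine(proposals)  # at least as close to goal as any proposal
```

The shape of the loop is what matters: the cheap proposer keeps control frequency high, while the expensive refiner only has to improve on an already-plausible candidate.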
This part is important because it shows that end-to-end VLA is no longer just “one model predicts actions.” It is becoming richer:
some VLAs emphasize speed
some emphasize reasoning refinement
some add world prediction for longer horizons
So Section 5 is not describing one single architecture. It is describing an evolving family of integrated control models.
Step 6: What is the survey’s final judgment on monolithic vs modular?
The survey gives a balanced discussion.
It says monolithic and modular VLAs represent two different philosophies for general-purpose embodied intelligence. Monolithic systems offer strong performance through joint optimization, but they are expensive and inflexible. Modular systems are more reusable and easier to deploy, but may sacrifice global optimality and long-horizon consistency.
Then the survey makes an important conclusion:
neither paradigm alone is enough for robust real-world autonomy
Instead, the authors suggest that hybrid designs may be the practical middle ground. These hybrid systems try to combine:
strong pretrained semantic knowledge
selectively trainable control modules
efficient deployment
better adaptation to new tasks and robots
This is a very important conclusion for your own research, because your TinyVLA + LoRA + runtime-orchestration direction fits this practical hybrid philosophy very well.
Step 7: What challenges remain, according to Section 5?
Even though Section 5 is optimistic about VLAs, it does not claim the problem is solved. The survey says that real progress still depends on solving broader issues such as:
data quality
open-world generalization
cross-embodiment transfer
continual learning
resource-aware deployment
safety
This is also consistent with OpenVLA’s own discussion. OpenVLA notes limitations such as single-image observations, insufficient inference speed for very high-frequency control, and reliability still below ideal deployment quality.
So Section 5 is not just “VLA is better.” It is more careful:
VLA is the new direction, but it still faces serious deployment challenges.
Step 8: A very simple mental map of Section 5
You can remember the whole section like this:
Part A: Why move to end-to-end?
Because hierarchical systems still break at module boundaries.
Part B: What must an end-to-end VLA decide?
It must choose how to represent actions:
discrete
continuous
generative trajectory/chunking
Part C: How are these systems trained?
By transferring knowledge from large pretrained vision-language backbones into robot action learning using robot demonstration data.
Part D: What architectures exist?
monolithic VLA
modular VLA
emerging hybrids
Part E: What is the big conclusion?
VLA is powerful, but practical deployment requires balancing performance, efficiency, adaptability, and safety.
Step 9: If I translate Section 5 into your TinyVLA/OpenVLA language
Here is the most practical interpretation for you:
RT-2 / OpenVLA style: show the power of large pretrained VLM/LLM backbones directly generating robot actions, often with tokenized actions and more end-to-end fine-tuning
TinyVLA style: shows that you can keep the VLA idea, but make it faster and more data-efficient by using a smaller VLM, LoRA-style adaptation, and a diffusion action head instead of slow autoregressive action-token decoding
So your own research context sits very naturally inside the exact trade-off that Section 5 discusses.
Final one-paragraph summary
Section 5 explains that robotic manipulation is shifting from modular LLM-assisted pipelines to end-to-end VLA systems that jointly learn perception, language grounding, and control. The section says that the field revolves around three design questions: how to represent actions, how to train large multimodal policies, and how to structure the architecture. It then compares monolithic VLAs, which are strongly integrated but expensive, with modular VLAs, which are more efficient and adaptable but less globally optimized. The final message is that future progress will likely come from hybrid VLA designs that combine pretrained semantic backbones with efficient trainable control modules for practical deployment.