Course Introduction: Learning AI Robots — ROS, VLM, VLN, and VLA

Artificial intelligence is rapidly changing the way robots perceive, reason, navigate, and act in real-world environments. Traditional robots usually rely on carefully designed rules, fixed programs, and task-specific control pipelines. However, recent advances in Vision-Language Models (VLMs), Vision-Language Navigation (VLN), and Vision-Language-Action (VLA) models are moving robots toward a new generation of intelligent embodied systems that can understand visual scenes, interpret human language, and execute complex actions.

This course introduces the fundamental concepts and practical technologies required to build AI robots. It begins with the Robot Operating System 2 (ROS 2), which provides the software infrastructure for robot communication, sensing, control, visualization, and system integration. You will learn how different robot modules exchange information through topics, services, and actions, and how coordinate transformations and sensor data streams are handled across the system. These ROS 2 foundations are essential for building reliable robotic systems.
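
As a small illustration of the coordinate transformations mentioned above, the sketch below converts a point measured in a sensor frame into the robot base frame using a 2D homogeneous transform. It is plain Python rather than ROS 2 code: the frame names, the mounting offset, and the helper functions `make_transform` and `transform_point` are hypothetical, and in a real ROS 2 system this bookkeeping is typically handled by the tf2 library.

```python
import math


def make_transform(x, y, theta):
    """Build a 3x3 homogeneous transform for a 2D pose.

    x, y  : translation of the child frame origin in the parent frame (meters)
    theta : rotation of the child frame relative to the parent (radians)
    """
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c, -s, x],
        [s, c, y],
        [0.0, 0.0, 1.0],
    ]


def transform_point(T, point):
    """Express a 2D point given in the child frame in the parent frame."""
    px, py = point
    return (
        T[0][0] * px + T[0][1] * py + T[0][2],
        T[1][0] * px + T[1][1] * py + T[1][2],
    )


# Hypothetical mounting: the sensor sits 0.2 m ahead of the base origin
# and is rotated 90 degrees counterclockwise.
base_from_sensor = make_transform(0.2, 0.0, math.pi / 2)

# An obstacle seen 1 m straight ahead of the sensor.
point_in_sensor = (1.0, 0.0)
point_in_base = transform_point(base_from_sensor, point_in_sensor)
print(point_in_base)  # approximately (0.2, 1.0), up to floating-point rounding
```

Because the sensor is rotated by 90 degrees, a point straight ahead of the sensor ends up to the left of the base, shifted forward by the mounting offset.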

After covering the basic robotics framework, the course introduces perception and language understanding in robots. Vision-Language Models allow robots to connect images with natural language. For example, a robot can observe a scene, identify objects, understand spatial relationships, and answer questions about the environment. These capabilities matter because future robots should not only detect objects but also understand human instructions such as “pick up the white glue,” “find the toolbox,” or “move to the table.”

The course then extends from perception to navigation. Vision-Language Navigation focuses on how robots move through an environment by following language instructions. In VLN, a robot may receive a command such as “go through the corridor and stop beside the cabinet.” To complete this task, the robot must combine visual perception, spatial understanding, mapping, localization, and decision-making. This part of the course connects classical robotics methods, such as simultaneous localization and mapping (SLAM) and path planning, with modern AI-based language reasoning.
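
The path planning half of this pipeline can be sketched with a breadth-first search over an occupancy grid. This is a toy stand-in, not a method prescribed by the course: the grid, the start and goal cells, and the function name `plan_path` are all hypothetical, and the goal cell is assumed to have already been chosen from the language instruction (for example, a free cell beside the cabinet).

```python
from collections import deque


def plan_path(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid using breadth-first search.

    grid  : list of rows, where 0 is a free cell and 1 is an obstacle
    start : (row, col) of the robot
    goal  : (row, col) of the target cell
    Returns a list of cells from start to goal, or None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# Hypothetical map: a wall with a single gap separates the top and bottom rows.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = plan_path(grid, start=(0, 0), goal=(2, 0))
print(path)
```

In a full VLN system the map would come from SLAM and the goal from a language-grounding model; the search itself is the classical component.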

Finally, the course introduces Vision-Language-Action models, one of the most active research directions in embodied AI and intelligent robotics. VLA models unify visual input, language instruction, and robot action generation in one learning framework. Instead of only recognizing objects or planning paths, a VLA robot can directly generate executable actions from visual observations and human commands. Recent VLA studies report that these models can be fine-tuned for robotic manipulation tasks using demonstration learning, preference learning, and reinforcement learning. For example, a long-horizon manipulation task can be decomposed into meaningful stages such as Reach, Grasp, Transport, and Place, allowing the robot to learn each stage more effectively instead of treating the whole task as a single action sequence.
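
The stage decomposition described above can be pictured as a simple sequential state machine. The sketch below is only illustrative: the `Stage` enum mirrors the stage names in the text, but `run_task`, the retry limit, and the `fake_policy` stand-in are hypothetical, and a real VLA policy would output continuous robot actions rather than a success flag.

```python
from enum import Enum


class Stage(Enum):
    REACH = "Reach"
    GRASP = "Grasp"
    TRANSPORT = "Transport"
    PLACE = "Place"


def run_task(execute_stage, max_retries=2):
    """Run the stages in order, retrying a failed stage up to max_retries times.

    execute_stage : callable taking a Stage and returning True on success
    Returns (success, log), where log holds (stage name, attempt, result) tuples.
    """
    log = []
    for stage in Stage:  # Enum members iterate in definition order
        for attempt in range(1, max_retries + 2):
            ok = execute_stage(stage)
            log.append((stage.value, attempt, ok))
            if ok:
                break
        else:
            # Every attempt at this stage failed, so abort the whole task.
            return False, log
    return True, log


# Hypothetical stand-in for a learned policy: Grasp fails once, then succeeds.
attempts = {}


def fake_policy(stage):
    attempts[stage] = attempts.get(stage, 0) + 1
    return not (stage is Stage.GRASP and attempts[stage] == 1)


success, log = run_task(fake_policy)
print(success)
for entry in log:
    print(entry)
```

Tracking success per stage is what makes the decomposition useful: a failed grasp can be retried or trained on in isolation instead of repeating the whole task.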

The goal of this course is to help you understand both the engineering side and the research side of AI robots. From the engineering perspective, you will learn how to build robot software using ROS 2, connect sensors and actuators, and design modular robotic systems. From the AI perspective, you will learn how VLM, VLN, and VLA models enable robots to understand multimodal information and perform more flexible tasks. By the end of the course, you should be able to explain how an AI robot works, design a basic ROS 2-based robot system, and understand how modern vision-language models can be applied to future robotic intelligence.

This course is especially suitable for students who want to enter the fields of intelligent robotics, embodied AI, autonomous mobile robots, robot manipulation, and human-robot collaboration. Through lectures, demonstrations, and hands-on practice, you will gradually build the ability to design robots that can see, understand, navigate, and act.