Hi-Way: Hierarchical Framework for Continuous Vision-Language Navigation via Map Guidance and Waypoint Reasoning

European Conference on Computer Vision (ECCV 2026)

Example image

Highlights

  • We propose a hierarchical continuous-VLN framework with top-down modules for task planning, environment perception and waypoint reasoning, and low-level execution, which decouples motion control from vision-language reasoning and improves long-horizon stability.
  • We develop a lightweight multi-task VLN network that uses a Q-Former to unify RGB-D and language into shared queries for vision-language matching and route/waypoint reasoning, and introduce an adaptive memory decay to mitigate long-horizon forgetting (a small sketch of the decay idea follows this list).
  • We build a ROS/Gazebo-based evaluation setting by migrating Matterport3D scenes and integrating realistic execution, sensor models, and obstacle avoidance, reducing the sim-to-real gap.
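
To make the adaptive memory decay concrete, below is a minimal Python/PyTorch sketch of one way history-frame features could be down-weighted before being pooled into a memory representation. The exponential schedule, the `gamma` value, and the mean-style pooling are illustrative assumptions for exposition, not the exact formulation used in Hi-Way.

```python
import torch

def decay_weighted_memory(history_feats: torch.Tensor, gamma: float = 0.9) -> torch.Tensor:
    """Pool a sequence of history-frame features into one memory vector,
    giving older frames exponentially smaller weights.

    history_feats: (T, D) tensor, ordered oldest -> newest.
    gamma: per-step decay factor (a placeholder value, not Hi-Way's setting).
    """
    T = history_feats.shape[0]
    # Age of frame t (0 = newest): weight gamma ** age, so the newest frame gets weight 1.
    ages = torch.arange(T - 1, -1, -1, dtype=history_feats.dtype)
    weights = gamma ** ages
    weights = weights / weights.sum()          # normalize the weights to sum to 1
    return (weights.unsqueeze(-1) * history_feats).sum(dim=0)

# Example: 12 history frames with 256-d features
memory = decay_weighted_memory(torch.randn(12, 256), gamma=0.85)
```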

Real-World Demos

These demos show the real-world performance of Hi-Way on a physical robot.

Instruction: Turn around and go straight forward to the blue box.

Instruction: Turn around and go straight forward to the blue box.

Instruction: Go straight to the blue box and then turn right to exit the room.

Abstract

Vision-and-Language Navigation (VLN) is a key problem in Embodied AI, requiring agents to navigate unseen environments by following linguistic instructions. In continuous environments, tightly coupling low-level motion control with vision-language reasoning makes long-horizon navigation unstable and hard to transfer to real robots. In this paper, we propose Hi-Way, a hierarchical framework for continuous VLN via map guidance and waypoint reasoning. Hi-Way decomposes the task into three controllable modules: a high-level LLM-driven planner that splits the instruction into an ordered sequence of sub-tasks; a middle-level perception and waypoint-reasoning module that acquires a surround view, performs image-text matching to build an interest-obstacle map, and selects the optimal waypoint to refine the route; and a low-level ROS-aligned controller that executes waypoint navigation through velocity commands. At the core of the middle level is a lightweight multi-task VLN network that uses a Q-Former to unify RGB-D observations and language into shared queries for vision-language matching and route/waypoint reasoning, together with an adaptive memory decay that mitigates long-horizon forgetting. To train this network, we construct the VLN-Waypoint dataset by converting classic VLN action sequences into discrete waypoint labels and projecting traversable candidates, sampled via depth-based 3D ground segmentation, into the image. We further build a ROS/Gazebo-based evaluation setting by migrating Matterport3D scenes and integrating realistic execution, sensor models, and obstacle avoidance, reducing the sim-to-real gap. We evaluate Hi-Way in the Habitat simulator, in our Gazebo environment, and on a real robot, and compare it against video-based VLN baselines such as NaVid and Uni-NaVid in both single-task and multiple-task settings.

Method Overview

Hi-Way Pipeline

Pipeline of Hi-Way. Our hierarchical framework decomposes the complex VLN task into three controllable modules. (1) High-level planning: an LLM-driven planner decomposes the instruction into an ordered sequence of sub-tasks. (2) Middle-level perception and waypoint reasoning: the robot acquires a surround view via in-place rotation, performs image-text matching to build an interest-obstacle map, and selects the optimal waypoint to refine the route. (3) Low-level action execution: a ROS-aligned controller executes waypoint navigation through velocity commands (/cmd_vel).
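
As a rough illustration of the low-level execution step, the sketch below shows how a waypoint expressed in the robot frame could be turned into /cmd_vel velocity commands with ROS 1 (rospy). The proportional gains, speed limits, node name, and example waypoint are hypothetical placeholders, not the controller actually used in Hi-Way.

```python
import math
import rospy
from geometry_msgs.msg import Twist

def drive_to_waypoint(dx: float, dy: float,
                      k_lin: float = 0.5, k_ang: float = 1.0,
                      max_lin: float = 0.4, max_ang: float = 0.8) -> Twist:
    """Convert a waypoint (dx, dy) in the robot frame into a velocity command.

    Simple proportional law: turn toward the waypoint, drive forward once roughly aligned.
    Gains and limits here are illustrative, not tuned values.
    """
    dist = math.hypot(dx, dy)
    heading = math.atan2(dy, dx)          # bearing of the waypoint relative to the robot
    cmd = Twist()
    cmd.angular.z = max(-max_ang, min(max_ang, k_ang * heading))
    # Only drive forward when the waypoint is roughly ahead, to avoid spiralling.
    if abs(heading) < math.pi / 4:
        cmd.linear.x = max(0.0, min(max_lin, k_lin * dist))
    return cmd

if __name__ == "__main__":
    rospy.init_node("waypoint_follower")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rate = rospy.Rate(10)                 # publish at 10 Hz
    while not rospy.is_shutdown():
        # Hypothetical waypoint 1 m ahead, 0.3 m to the left (robot frame).
        pub.publish(drive_to_waypoint(1.0, 0.3))
        rate.sleep()
```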

VLN-Waypoint Dataset Construction

VLN-Waypoint Dataset Pipeline

VLN-Waypoint dataset construction pipeline. (a) Starting from classic VLN observations (RGB/depth) and multi-step action sequences. (b) An action-to-waypoint policy maps the next four actions to a discrete waypoint label (including special cases such as stop). (c) Using depth-based 3D ground segmentation, we sample traversable waypoint candidates and project them into the image to obtain waypoint-annotated views. (d) We package each sample as an instruction-route pair with the current and history frames, together with the projected waypoint candidates, forming our VLN-Waypoint dataset.
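
For step (c), projecting sampled 3D waypoint candidates into the image can be done with a standard pinhole camera model, as in the minimal NumPy sketch below. The intrinsics and example candidates are placeholders, and the ground-segmentation and sampling code that would produce real candidates is omitted.

```python
import numpy as np

def project_waypoints(points_cam: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Project 3D waypoint candidates (camera frame, meters) into pixel coordinates
    with a pinhole model. points_cam: (N, 3) with z = depth along the optical axis.

    Intrinsics here are placeholders; real values come from the sensor calibration.
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.stack([u, v], axis=1)

# Example: two hypothetical candidates on the segmented ground, 2 m and 3 m ahead.
candidates = np.array([[0.5, 0.4, 2.0],
                       [-0.3, 0.4, 3.0]])
pixels = project_waypoints(candidates, fx=320.0, fy=320.0, cx=320.0, cy=240.0)
```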

Habitat Model Inference Demos

These GIFs showcase our model inference demos on the Habitat simulator.

Habitat Demo 1
Habitat Demo 2
Habitat Demo 3
Habitat Demo 4

Comparative Experiments in the Real World

Comparison between our method and NaVid on visible and non-visible cases.

Our Method (Hi-Way)
Visible Case
Hi-Way Visible Case
Hi-Way: Walk forward and turn left. Wait by the first door on the left.
Our Method (Hi-Way)
Non-visible Case
Hi-Way Non-visible Case
Hi-Way: Walk forward and turn left. Wait near the first doorway.
NaVid
Visible Case
NaVid Visible Case
NaVid: Exit the bathroom. Turn left and enter the bedroom. Wait near the bed.
NaVid
Non-visible Case
NaVid Non-visible Case
NaVid: Walk through the doorway and turn left. Walk past the pool and wait by the first chair.

Single-task vs Multiple-task Comparison

Comparison of different methods on single-task and multiple-task navigation scenarios.

Our Method (Hi-Way)
Single-task
Hi-Way Single-task
Hi-Way: Turn left and walk to the chair.
NaVid
Single-task
NaVid Single-task
NaVid: Turn left and walk to the chair.
Uni-NaVid
Single-task
Uni-NaVid Single-task
Uni-NaVid: Turn left and walk to the chair.
Our Method (Hi-Way)
Multiple-task
Hi-Way Multiple-task
Hi-Way: Exit the room, turn right, and walk to the kitchen table.
NaVid
Multiple-task
NaVid Multiple-task
NaVid: Exit the room, turn right, and walk to the kitchen table.
Uni-NaVid
Multiple-task
Uni-NaVid Multiple-task
Uni-NaVid: Exit the room, turn right, and walk to the kitchen table.

Gazebo VLN Simulation

Tests conducted in our Gazebo-based VLN simulation environment.
Instruction: Walk into the restroom and find the toilet.

Robot Execution Visualization
Intermediate Results

BibTeX