Hi-Way: Hierarchical Framework for Continuous Vision-Language Navigation via Map Guidance and Waypoint Reasoning

European Conference on Computer Vision (ECCV 2026)

Example image

Highlights

  • We propose a hierarchical continuous-VLN framework with top-down modules for task planning, environment perception and waypoint reasoning, and low-level execution, which decouples motion control from vision-language reasoning and improves long-horizon stability.
  • We develop a lightweight multi-task VLN network that uses a Q-Former to unify RGB-D and language into shared queries for vision-language matching and route/waypoint reasoning, and introduce an adaptive memory decay to mitigate long-horizon forgetting (a small sketch of the decay idea follows this list).
  • We build a ROS/Gazebo-based evaluation setting by migrating Matterport3D scenes and integrating realistic execution, sensor models, and obstacle avoidance, reducing the sim-to-real gap.
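
To make the adaptive memory decay concrete, below is a minimal Python/PyTorch sketch of one way history-frame features could be down-weighted before being pooled into a memory representation. The exponential schedule, the `gamma` value, and the mean-style pooling are illustrative assumptions for exposition, not the exact formulation used in Hi-Way.

```python
import torch

def decay_weighted_memory(history_feats: torch.Tensor, gamma: float = 0.9) -> torch.Tensor:
    """Pool a sequence of history-frame features into one memory vector,
    giving older frames exponentially smaller weights.

    history_feats: (T, D) tensor, ordered oldest -> newest.
    gamma: per-step decay factor (a placeholder value, not Hi-Way's setting).
    """
    T = history_feats.shape[0]
    # Age of frame t (0 = newest): weight gamma ** age, so the newest frame gets weight 1.
    ages = torch.arange(T - 1, -1, -1, dtype=history_feats.dtype)
    weights = gamma ** ages
    weights = weights / weights.sum()          # normalize the weights to sum to 1
    return (weights.unsqueeze(-1) * history_feats).sum(dim=0)

# Example: 12 history frames with 256-d features
memory = decay_weighted_memory(torch.randn(12, 256), gamma=0.85)
```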

Real-World Demos

These demos show the real-world performance of Hi-Way on a physical robot.

Instruction: Turn around and go straight forward to the blue box.

Instruction: Turn around and go straight forward to the blue box.

Instruction: Go straight to the blue box and then turn right to exit the room.

Abstract

Vision-and-Language Navigation (VLN) is a key problem in Embodied AI, requiring agents to navigate unseen environments by following linguistic instructions. In continuous environments, tightly coupling low-level motion control with vision-language reasoning makes long-horizon navigation unstable and hard to transfer to real robots. In this paper, we propose Hi-Way, a hierarchical framework for continuous VLN via map guidance and waypoint reasoning. Hi-Way decomposes the task into three controllable modules: a high-level LLM-driven planner that splits the instruction into an ordered sequence of sub-tasks; a middle-level perception and waypoint-reasoning module that acquires a surround view, performs image-text matching to build an interest-obstacle map, and selects the optimal waypoint to refine the route; and a low-level ROS-aligned controller that executes waypoint navigation through velocity commands. At the core of the middle level is a lightweight multi-task VLN network that uses a Q-Former to unify RGB-D observations and language into shared queries for vision-language matching and route/waypoint reasoning, together with an adaptive memory decay that mitigates long-horizon forgetting. To train this network, we construct the VLN-Waypoint dataset by converting classic VLN action sequences into discrete waypoint labels and projecting traversable candidates, sampled via depth-based 3D ground segmentation, into the image. We further build a ROS/Gazebo-based evaluation setting by migrating Matterport3D scenes and integrating realistic execution, sensor models, and obstacle avoidance, reducing the sim-to-real gap. We evaluate Hi-Way in the Habitat simulator, in our Gazebo environment, and on a real robot, and compare it against video-based VLN baselines such as NaVid and Uni-NaVid in both single-task and multiple-task settings.

Method Overview

Hi-Way Pipeline

Pipeline of Hi-Way. Our hierarchical framework decomposes the complex VLN task into three controllable modules. (1) High-level planning: an LLM-driven planner decomposes the instruction into an ordered sequence of sub-tasks. (2) Middle-level perception and waypoint reasoning: the robot acquires a surround view via in-place rotation, performs image-text matching to build an interest-obstacle map, and selects the optimal waypoint to refine the route. (3) Low-level action execution: a ROS-aligned controller executes waypoint navigation through velocity commands (/cmd_vel).
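
As a rough illustration of the low-level execution step, the sketch below shows how a waypoint expressed in the robot frame could be turned into /cmd_vel velocity commands with ROS 1 (rospy). The proportional gains, speed limits, node name, and example waypoint are hypothetical placeholders, not the controller actually used in Hi-Way.

```python
import math
import rospy
from geometry_msgs.msg import Twist

def drive_to_waypoint(dx: float, dy: float,
                      k_lin: float = 0.5, k_ang: float = 1.0,
                      max_lin: float = 0.4, max_ang: float = 0.8) -> Twist:
    """Convert a waypoint (dx, dy) in the robot frame into a velocity command.

    Simple proportional law: turn toward the waypoint, drive forward once roughly aligned.
    Gains and limits here are illustrative, not tuned values.
    """
    dist = math.hypot(dx, dy)
    heading = math.atan2(dy, dx)          # bearing of the waypoint relative to the robot
    cmd = Twist()
    cmd.angular.z = max(-max_ang, min(max_ang, k_ang * heading))
    # Only drive forward when the waypoint is roughly ahead, to avoid spiralling.
    if abs(heading) < math.pi / 4:
        cmd.linear.x = max(0.0, min(max_lin, k_lin * dist))
    return cmd

if __name__ == "__main__":
    rospy.init_node("waypoint_follower")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rate = rospy.Rate(10)                 # publish at 10 Hz
    while not rospy.is_shutdown():
        # Hypothetical waypoint 1 m ahead, 0.3 m to the left (robot frame).
        pub.publish(drive_to_waypoint(1.0, 0.3))
        rate.sleep()
```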

VLN-Waypoint Dataset Construction

VLN-Waypoint Dataset Pipeline

VLN-Waypoint dataset construction pipeline. (a) Starting from classic VLN observations (RGB/depth) and multi-step action sequences. (b) An action-to-waypoint policy maps the next four actions to a discrete waypoint label (including special cases such as stop). (c) Using depth-based 3D ground segmentation, we sample traversable waypoint candidates and project them into the image to obtain waypoint-annotated views. (d) We package each sample as an instruction-route pair with the current and history frames, together with the projected waypoint candidates, forming our VLN-Waypoint dataset.
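
For step (c), projecting sampled 3D waypoint candidates into the image can be done with a standard pinhole camera model, as in the minimal NumPy sketch below. The intrinsics and example candidates are placeholders, and the ground-segmentation and sampling code that would produce real candidates is omitted.

```python
import numpy as np

def project_waypoints(points_cam: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Project 3D waypoint candidates (camera frame, meters) into pixel coordinates
    with a pinhole model. points_cam: (N, 3) with z = depth along the optical axis.

    Intrinsics here are placeholders; real values come from the sensor calibration.
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.stack([u, v], axis=1)

# Example: two hypothetical candidates on the segmented ground, 2 m and 3 m ahead.
candidates = np.array([[0.5, 0.4, 2.0],
                       [-0.3, 0.4, 3.0]])
pixels = project_waypoints(candidates, fx=320.0, fy=320.0, cx=320.0, cy=240.0)
```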

Habitat Model Inference Demos

These GIFs showcase our model inference demos on the Habitat simulator.

Habitat Demo 1
Habitat Demo 2
Habitat Demo 3
Habitat Demo 4

Comparative Experiments in the Real World

Comparison between our method and NaVid on visible and non-visible cases.

Our Method (Hi-Way)
Visible Case
Hi-Way Visible Case
Hi-Way: Walk forward and turn left. Wait by the first door on the left.
Our Method (Hi-Way)
Non-visible Case
Hi-Way Non-visible Case
Hi-Way: Walk forward and turn left. Wait near the first doorway.
NaVid
Visible Case
NaVid Visible Case
NaVid: Exit the bathroom. Turn left and enter the bedroom. Wait near the bed.
NaVid
Non-visible Case
NaVid Non-visible Case
NaVid: Walk through the doorway and turn left. Walk past the pool and wait by the first chair.

Single-task vs Multiple-task Comparison

Comparison of different methods on single-task and multiple-task navigation scenarios.

Our Method (Hi-Way)
Single-task
Hi-Way Single-task
Hi-Way: Turn left and walk to the chair.
NaVid
Single-task
NaVid Single-task
NaVid: Turn left and walk to the chair.
Uni-NaVid
Single-task
Uni-NaVid Single-task
Uni-NaVid: Turn left and walk to the chair.
Our Method (Hi-Way)
Multiple-task
Hi-Way Multiple-task
Hi-Way: Exit the room, turn right, and walk to the kitchen table.
NaVid
Multiple-task
NaVid Multiple-task
NaVid: Exit the room, turn right, and walk to the kitchen table.
Uni-NaVid
Multiple-task
Uni-NaVid Multiple-task
Uni-NaVid: Exit the room, turn right, and walk to the kitchen table.

Gazebo VLN Simulation

Tests conducted in our Gazebo-based VLN simulation environment.
Instruction: Walk into the restroom and find the toilet.

Robot Execution Visualization
Intermediate Results

BibTeX