Embodied Navigation Foundation Model

ICLR 2026 Conference Submission · Anonymous Authors
Embodied Navigation · Vision-Language-Action Model
Abstract

Navigation is a fundamental capability in embodied AI, representing the intelligence required to perceive and interact within physical environments. To achieve such intelligence, recent works leverage Vision-Language Models (VLMs), which demonstrate strong generalizability and offer a formulation well suited to navigation. However, these approaches remain largely confined to narrow task settings and embodiment-specific architectures. In this work, we introduce a cross-embodiment and cross-task Navigation Foundation Model (NavFoM), trained on eight million navigation samples that encompass quadrupeds, drones, wheeled robots, and vehicles, and that span diverse tasks such as vision-and-language navigation, object searching, target tracking, and autonomous driving. NavFoM employs a unified architecture that processes multimodal navigation inputs from varying camera configurations and navigation horizons. To accommodate these diverse camera setups and temporal horizons, NavFoM incorporates identifier tokens that embed the camera-view information of each embodiment and the temporal context of each task. Furthermore, to meet the demands of real-world deployment, NavFoM subsamples observation tokens with a dynamically adjusted sampling strategy under a limited token-length budget. Extensive evaluations on seven public benchmarks demonstrate that our model achieves state-of-the-art or highly competitive performance across different navigation tasks and embodiments without requiring task-specific fine-tuning. Additional real-world experiments further confirm the strong generalizability and practical applicability of our approach.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: cross-embodiment and cross-task embodied navigation. The field addresses how agents with diverse physical forms and capabilities can navigate and perform tasks across varied environments and objectives. The taxonomy reveals four main branches. Foundation Models and Generalist Agents for Embodied Navigation explores unified architectures that leverage large-scale pretraining and multimodal reasoning to handle multiple robot morphologies and task types, as seen in works like Universal Actions Embodied[1] and Robocat Self-Improving Agent[5]. Multi-Agent Navigation and Coordination focuses on scenarios where multiple agents must navigate shared spaces, often requiring collision avoidance, communication protocols, and cooperative strategies. Task-Specific Navigation Methods and Representations develops specialized techniques for particular problem settings, such as visual representations, hierarchical planning, or instruction following. Finally, Surveys and Overviews of Embodied AI and Navigation provide broad perspectives on the evolving landscape, synthesizing progress across these dimensions.

A central tension emerges between generalist foundation models that aim for broad transferability and specialized methods that optimize for particular embodiments or tasks. Recent efforts like Multimodal LLMs Embodied Agents[2] and Bio-Inspired Cross-Embodiment[25] push toward more flexible policies that can adapt across robot types, while works such as Cross-Embodiment Limits[44] critically examine the boundaries of such transfer.

Embodied Navigation Foundation[0] sits squarely within the Cross-Embodiment Foundation Models cluster, emphasizing scalable pretraining and policy adaptation mechanisms that bridge different morphologies and task specifications. Compared to Robocat Self-Improving Agent[5], which focuses on self-improvement through iterative data collection, and Bio-Inspired Cross-Embodiment[25], which draws on biological principles for morphology-agnostic control, Embodied Navigation Foundation[0] appears to prioritize unified representations that facilitate zero-shot or few-shot generalization across diverse navigation scenarios. This positioning reflects ongoing debates about whether cross-embodiment success depends more on architectural universality or on richer inductive biases tailored to embodied reasoning.

Claimed Contributions

Cross-embodiment and cross-task Navigation Foundation Model (NavFoM)

The authors propose NavFoM, a unified navigation foundation model trained on 8 million samples covering multiple embodiments (quadrupeds, drones, wheeled robots, vehicles) and diverse navigation tasks (vision-and-language navigation, object searching, target tracking, autonomous driving). The model uses a unified architecture that processes multimodal navigation inputs from varying camera configurations and navigation horizons without requiring task-specific fine-tuning.

Retrieved candidate papers compared: 10
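To make the unified-architecture idea concrete, here is a minimal sketch (not the authors' code) of how a single backbone could consume observations from an arbitrary number of cameras and timesteps by tagging every visual token with learned camera and timestep identifier embeddings before a shared transformer. All names (NavBackboneSketch, the embedding sizes, the 4-dimensional action head) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): flattening multi-camera, multi-step
# observations into one token sequence for a shared transformer backbone.
# Module and parameter names are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn


class NavBackboneSketch(nn.Module):
    def __init__(self, embed_dim: int = 256, max_cameras: int = 8, max_steps: int = 64):
        super().__init__()
        # Learned identifier embeddings for camera index and timestep
        # (stand-ins for the paper's identifier tokens).
        self.camera_id = nn.Embedding(max_cameras, embed_dim)
        self.time_id = nn.Embedding(max_steps, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(embed_dim, 4)  # e.g. (dx, dy, dyaw, stop logit)

    def forward(self, obs_tokens: torch.Tensor, instr_tokens: torch.Tensor) -> torch.Tensor:
        """
        obs_tokens:   (B, T, C, P, D) visual tokens for T steps, C cameras, P patches
        instr_tokens: (B, L, D) embedded language instruction
        """
        B, T, C, P, D = obs_tokens.shape
        t_idx = torch.arange(T).view(1, T, 1, 1).expand(B, T, C, P)
        c_idx = torch.arange(C).view(1, 1, C, 1).expand(B, T, C, P)
        # Tag every visual token with its timestep and camera identifiers.
        tagged = obs_tokens + self.time_id(t_idx) + self.camera_id(c_idx)
        seq = torch.cat([instr_tokens, tagged.reshape(B, T * C * P, D)], dim=1)
        hidden = self.encoder(seq)
        return self.action_head(hidden[:, -1])  # predict from the final token


if __name__ == "__main__":
    model = NavBackboneSketch()
    obs = torch.randn(2, 8, 4, 16, 256)   # 2 samples, 8 steps, 4 cameras, 16 patches
    instr = torch.randn(2, 12, 256)       # 12 instruction tokens
    print(model(obs, instr).shape)        # torch.Size([2, 4])
```

Because the camera and timestep tags travel with the tokens rather than being baked into the architecture, the same forward pass accepts a one-camera wheeled robot or a four-camera vehicle without structural changes.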
Temporal-Viewpoint Indicator (TVI) tokens

The authors introduce TVI tokens as a mechanism to organize visual tokens by encoding both viewpoint (camera angle) and temporal information. These tokens enable flexible processing of arbitrary camera arrangements and support unified training across image QA, video QA, and navigation tasks with different camera configurations.

Retrieved candidate papers compared: 10
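A minimal sketch of how such indicator tokens might be built, assuming sinusoidal encodings of each camera's yaw angle and of the frame's timestep; the report does not specify these details, and the function names (sinusoidal, tag_with_tvi) are hypothetical.

```python
# Minimal sketch (assumed details, not the paper's implementation): build an
# indicator token per camera that encodes its mounting angle and the frame's
# timestep, then prefix it to that camera's visual tokens.
import math
import torch


def sinusoidal(value: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding of a scalar value into `dim` channels."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = value[..., None] * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


def tag_with_tvi(patch_tokens: torch.Tensor,
                 camera_yaws_deg: torch.Tensor,
                 timestep: int) -> torch.Tensor:
    """
    patch_tokens:    (C, P, D) visual tokens for C cameras at one timestep
    camera_yaws_deg: (C,) yaw angle of each camera, e.g. [0, 90, 180, 270]
    Returns (C, P + 1, D): each camera's tokens prefixed by one indicator token
    that mixes its viewpoint encoding with the shared temporal encoding.
    """
    C, P, D = patch_tokens.shape
    view_enc = sinusoidal(camera_yaws_deg / 360.0, D)           # (C, D)
    time_enc = sinusoidal(torch.tensor([float(timestep)]), D)   # (1, D)
    tvi = (view_enc + time_enc).unsqueeze(1)                    # (C, 1, D)
    return torch.cat([tvi, patch_tokens], dim=1)


if __name__ == "__main__":
    tokens = torch.randn(4, 16, 128)                 # 4 cameras, 16 patches each
    yaws = torch.tensor([0.0, 90.0, 180.0, 270.0])   # an assumed rig layout
    print(tag_with_tvi(tokens, yaws, timestep=5).shape)  # torch.Size([4, 17, 128])
```

Encoding the yaw angle as a continuous value rather than a fixed camera index is one way such tokens could generalize to rigs whose cameras are mounted at arbitrary angles.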
Budget-Aware Temporal Sampling (BATS) strategy

The authors propose BATS, a token sampling strategy that dynamically samples navigation history tokens based on an exponential forgetting curve while respecting a fixed token budget. This approach balances navigation performance with inference efficiency and adapts to varying numbers of cameras, addressing practical deployment constraints.

Retrieved candidate papers compared: 2
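A minimal sketch of budget-aware history sampling under the stated idea (an exponential forgetting curve plus a fixed token budget). The decay rate, the always-keep-the-latest-frame rule, and the function name sample_history are assumptions, not details from the paper.

```python
# Minimal sketch (assumptions, not the authors' algorithm): choose which past
# frames to keep when the full history would exceed a fixed token budget.
# Older frames are kept with exponentially decaying probability.
import numpy as np


def sample_history(num_frames, tokens_per_frame, token_budget, decay=0.05, rng=None):
    """Return sorted indices of frames to keep; the newest frame is always kept."""
    rng = rng or np.random.default_rng()
    max_frames = max(1, token_budget // tokens_per_frame)
    if num_frames <= max_frames:
        return np.arange(num_frames)               # everything fits in the budget

    latest = num_frames - 1
    candidates = np.arange(num_frames - 1)         # all older frames
    age = latest - candidates
    weights = np.exp(-decay * age)                 # exponential forgetting curve
    weights /= weights.sum()
    kept = rng.choice(candidates, size=max_frames - 1, replace=False, p=weights)
    return np.sort(np.append(kept, latest))


if __name__ == "__main__":
    # 4 cameras x 64 tokens each = 256 tokens per frame; a 4096-token budget
    # keeps at most 16 of the 100 stored frames.
    idx = sample_history(num_frames=100, tokens_per_frame=256, token_budget=4096)
    print(len(idx), idx[-5:])
```

Because tokens_per_frame scales with the number of cameras, the same budget automatically retains fewer frames for multi-camera rigs, which is one plausible reading of how the strategy adapts to varying camera counts.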

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Cross-embodiment and cross-task Navigation Foundation Model (NavFoM)

Contribution 2: Temporal-Viewpoint Indicator (TVI) tokens

Contribution 3: Budget-Aware Temporal Sampling (BATS) strategy