Embodied Navigation Foundation Model

ICLR 2026 Conference Submission · Anonymous Authors
Embodied Navigation · Vision-Language-Action Model
Abstract

Navigation is a fundamental capability in embodied AI, representing the intelligence required to perceive and interact within physical environments. To achieve such intelligence, recent works leverage Vision-Language Models (VLMs), which demonstrate strong generalizability and offer a formulation well suited to navigation. However, these approaches remain largely confined to narrow task settings and embodiment-specific architectures. In this work, we introduce a cross-embodiment and cross-task Navigation Foundation Model (NavFoM), trained on eight million navigation samples that encompass quadrupeds, drones, wheeled robots, and vehicles, and that span diverse tasks such as vision-and-language navigation, object searching, target tracking, and autonomous driving. NavFoM employs a unified architecture that processes multimodal navigation inputs from varying camera configurations and navigation horizons. To accommodate these diverse camera setups and temporal horizons, NavFoM incorporates identifier tokens that embed the camera-view information of each embodiment and the temporal context of each task. Furthermore, to meet the demands of real-world deployment, NavFoM subsamples observation tokens with a dynamically adjusted sampling strategy under a limited token-length budget. Extensive evaluations on seven public benchmarks demonstrate that our model achieves state-of-the-art or highly competitive performance across different navigation tasks and embodiments without requiring task-specific fine-tuning. Additional real-world experiments further confirm the strong generalizability and practical applicability of our approach.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: cross-embodiment and cross-task embodied navigation. The field addresses how agents with diverse physical forms and capabilities can navigate and perform tasks across varied environments and objectives. The taxonomy reveals four main branches. Foundation Models and Generalist Agents for Embodied Navigation explores unified architectures that leverage large-scale pretraining and multimodal reasoning to handle multiple robot morphologies and task types, as seen in works like Universal Actions Embodied[1] and Robocat Self-Improving Agent[5]. Multi-Agent Navigation and Coordination focuses on scenarios where multiple agents must navigate shared spaces, often requiring collision avoidance, communication protocols, and cooperative strategies. Task-Specific Navigation Methods and Representations develops specialized techniques for particular problem settings, such as visual representations, hierarchical planning, or instruction following. Finally, Surveys and Overviews of Embodied AI and Navigation provide broad perspectives on the evolving landscape, synthesizing progress across these dimensions.

A central tension emerges between generalist foundation models that aim for broad transferability and specialized methods that optimize for particular embodiments or tasks. Recent efforts like Multimodal LLMs Embodied Agents[2] and Bio-Inspired Cross-Embodiment[25] push toward more flexible policies that can adapt across robot types, while works such as Cross-Embodiment Limits[44] critically examine the boundaries of such transfer.

Embodied Navigation Foundation[0] sits squarely within the Cross-Embodiment Foundation Models cluster, emphasizing scalable pretraining and policy adaptation mechanisms that bridge different morphologies and task specifications. Compared to Robocat Self-Improving Agent[5], which focuses on self-improvement through iterative data collection, and Bio-Inspired Cross-Embodiment[25], which draws on biological principles for morphology-agnostic control, Embodied Navigation Foundation[0] appears to prioritize unified representations that facilitate zero-shot or few-shot generalization across diverse navigation scenarios. This positioning reflects ongoing debates about whether cross-embodiment success depends more on architectural universality or on richer inductive biases tailored to embodied reasoning.

Claimed Contributions

Cross-embodiment and cross-task Navigation Foundation Model (NavFoM)

The authors propose NavFoM, a unified navigation foundation model trained on 8 million samples covering multiple embodiments (quadrupeds, drones, wheeled robots, vehicles) and diverse navigation tasks (vision-and-language navigation, object searching, target tracking, autonomous driving). The model uses a unified architecture that processes multimodal navigation inputs from varying camera configurations and navigation horizons without requiring task-specific fine-tuning.

Retrieved candidate papers compared: 10
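To make the unified-architecture idea concrete, here is a minimal sketch (not the authors' code) of how a single backbone could consume observations from an arbitrary number of cameras and timesteps by tagging every visual token with learned camera and timestep identifier embeddings before a shared transformer. All names (NavBackboneSketch, the embedding sizes, the 4-dimensional action head) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): flattening multi-camera, multi-step
# observations into one token sequence for a shared transformer backbone.
# Module and parameter names are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn


class NavBackboneSketch(nn.Module):
    def __init__(self, embed_dim: int = 256, max_cameras: int = 8, max_steps: int = 64):
        super().__init__()
        # Learned identifier embeddings for camera index and timestep
        # (stand-ins for the paper's identifier tokens).
        self.camera_id = nn.Embedding(max_cameras, embed_dim)
        self.time_id = nn.Embedding(max_steps, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(embed_dim, 4)  # e.g. (dx, dy, dyaw, stop logit)

    def forward(self, obs_tokens: torch.Tensor, instr_tokens: torch.Tensor) -> torch.Tensor:
        """
        obs_tokens:   (B, T, C, P, D) visual tokens for T steps, C cameras, P patches
        instr_tokens: (B, L, D) embedded language instruction
        """
        B, T, C, P, D = obs_tokens.shape
        t_idx = torch.arange(T).view(1, T, 1, 1).expand(B, T, C, P)
        c_idx = torch.arange(C).view(1, 1, C, 1).expand(B, T, C, P)
        # Tag every visual token with its timestep and camera identifiers.
        tagged = obs_tokens + self.time_id(t_idx) + self.camera_id(c_idx)
        seq = torch.cat([instr_tokens, tagged.reshape(B, T * C * P, D)], dim=1)
        hidden = self.encoder(seq)
        return self.action_head(hidden[:, -1])  # predict from the final token


if __name__ == "__main__":
    model = NavBackboneSketch()
    obs = torch.randn(2, 8, 4, 16, 256)   # 2 samples, 8 steps, 4 cameras, 16 patches
    instr = torch.randn(2, 12, 256)       # 12 instruction tokens
    print(model(obs, instr).shape)        # torch.Size([2, 4])
```

Because the camera and timestep tags travel with the tokens rather than being baked into the architecture, the same forward pass accepts a one-camera wheeled robot or a four-camera vehicle without structural changes.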
Temporal-Viewpoint Indicator (TVI) tokens

The authors introduce TVI tokens as a mechanism to organize visual tokens by encoding both viewpoint (camera angle) and temporal information. These tokens enable flexible processing of arbitrary camera arrangements and support unified training across image QA, video QA, and navigation tasks with different camera configurations.

Retrieved candidate papers compared: 10
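A minimal sketch of how such indicator tokens might be built, assuming sinusoidal encodings of each camera's yaw angle and of the frame's timestep; the report does not specify these details, and the function names (sinusoidal, tag_with_tvi) are hypothetical.

```python
# Minimal sketch (assumed details, not the paper's implementation): build an
# indicator token per camera that encodes its mounting angle and the frame's
# timestep, then prefix it to that camera's visual tokens.
import math
import torch


def sinusoidal(value: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding of a scalar value into `dim` channels."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = value[..., None] * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


def tag_with_tvi(patch_tokens: torch.Tensor,
                 camera_yaws_deg: torch.Tensor,
                 timestep: int) -> torch.Tensor:
    """
    patch_tokens:    (C, P, D) visual tokens for C cameras at one timestep
    camera_yaws_deg: (C,) yaw angle of each camera, e.g. [0, 90, 180, 270]
    Returns (C, P + 1, D): each camera's tokens prefixed by one indicator token
    that mixes its viewpoint encoding with the shared temporal encoding.
    """
    C, P, D = patch_tokens.shape
    view_enc = sinusoidal(camera_yaws_deg / 360.0, D)           # (C, D)
    time_enc = sinusoidal(torch.tensor([float(timestep)]), D)   # (1, D)
    tvi = (view_enc + time_enc).unsqueeze(1)                    # (C, 1, D)
    return torch.cat([tvi, patch_tokens], dim=1)


if __name__ == "__main__":
    tokens = torch.randn(4, 16, 128)                 # 4 cameras, 16 patches each
    yaws = torch.tensor([0.0, 90.0, 180.0, 270.0])   # an assumed rig layout
    print(tag_with_tvi(tokens, yaws, timestep=5).shape)  # torch.Size([4, 17, 128])
```

Encoding the yaw angle as a continuous value rather than a fixed camera index is one way such tokens could generalize to rigs whose cameras are mounted at arbitrary angles.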
Budget-Aware Temporal Sampling (BATS) strategy

The authors propose BATS, a token sampling strategy that dynamically samples navigation history tokens based on an exponential forgetting curve while respecting a fixed token budget. This approach balances navigation performance with inference efficiency and adapts to varying numbers of cameras, addressing practical deployment constraints.

Retrieved candidate papers compared: 2
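A minimal sketch of budget-aware history sampling under the stated idea (an exponential forgetting curve plus a fixed token budget). The decay rate, the always-keep-the-latest-frame rule, and the function name sample_history are assumptions, not details from the paper.

```python
# Minimal sketch (assumptions, not the authors' algorithm): choose which past
# frames to keep when the full history would exceed a fixed token budget.
# Older frames are kept with exponentially decaying probability.
import numpy as np


def sample_history(num_frames, tokens_per_frame, token_budget, decay=0.05, rng=None):
    """Return sorted indices of frames to keep; the newest frame is always kept."""
    rng = rng or np.random.default_rng()
    max_frames = max(1, token_budget // tokens_per_frame)
    if num_frames <= max_frames:
        return np.arange(num_frames)               # everything fits in the budget

    latest = num_frames - 1
    candidates = np.arange(num_frames - 1)         # all older frames
    age = latest - candidates
    weights = np.exp(-decay * age)                 # exponential forgetting curve
    weights /= weights.sum()
    kept = rng.choice(candidates, size=max_frames - 1, replace=False, p=weights)
    return np.sort(np.append(kept, latest))


if __name__ == "__main__":
    # 4 cameras x 64 tokens each = 256 tokens per frame; a 4096-token budget
    # keeps at most 16 of the 100 stored frames.
    idx = sample_history(num_frames=100, tokens_per_frame=256, token_budget=4096)
    print(len(idx), idx[-5:])
```

Because tokens_per_frame scales with the number of cameras, the same budget automatically retains fewer frames for multi-camera rigs, which is one plausible reading of how the strategy adapts to varying camera counts.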

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Cross-embodiment and cross-task Navigation Foundation Model (NavFoM)

Contribution 2: Temporal-Viewpoint Indicator (TVI) tokens

Contribution 3: Budget-Aware Temporal Sampling (BATS) strategy