Embodied Navigation Foundation Model
Research Landscape Overview
Claimed Contributions
The authors propose NavFoM, a unified navigation foundation model trained on eight million samples that span multiple embodiments (quadrupeds, drones, wheeled robots, and vehicles) and diverse navigation tasks (vision-and-language navigation, object searching, target tracking, and autonomous driving). A single architecture processes multimodal navigation inputs from varying camera configurations and navigation horizons without task-specific fine-tuning.
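To make the unified-architecture claim concrete, below is a minimal sketch of what such a forward pass could look like, assuming the model flattens instruction tokens and per-camera visual tokens (each view prefixed by an indicator embedding) into one transformer sequence. All module and parameter names (`UnifiedNavPolicy`, `action_head`, the layer counts) are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch of a unified navigation forward pass; not the paper's code.
import torch
import torch.nn as nn

class UnifiedNavPolicy(nn.Module):
    def __init__(self, dim: int = 256, num_actions: int = 6):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, text_tokens, visual_tokens, view_tokens):
        # text_tokens: (B, Lt, D); visual_tokens: (B, V, Lv, D) for V cameras;
        # view_tokens: (B, V, D), one indicator embedding per camera view.
        B, V, Lv, D = visual_tokens.shape
        # Prefix each view's visual tokens with its indicator, then flatten
        # all views into a single sequence after the instruction tokens.
        views = torch.cat([view_tokens[:, :, None, :], visual_tokens], dim=2)
        seq = torch.cat([text_tokens, views.reshape(B, V * (Lv + 1), D)], dim=1)
        h = self.backbone(seq)
        return self.action_head(h[:, -1])  # next-action logits from final state
```

Because the sequence length is just `Lt + V * (Lv + 1)`, the same module accepts any number of cameras, which is one plausible way to realize the "varying camera configurations" property.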
The authors introduce Temporal-Viewpoint Indicator (TVI) tokens, which organize visual tokens by encoding both viewpoint (camera orientation) and temporal information. These indicator tokens let the model process arbitrary camera arrangements and support unified training across image QA, video QA, and navigation tasks with different camera configurations.
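A minimal sketch of how such indicator tokens might be built follows, assuming each token fuses a sinusoidal encoding of the frame index with a sinusoidal encoding of the camera heading via a learned projection. The fusion scheme and names (`TVIToken`, `sinusoidal`) are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of Temporal-Viewpoint Indicator (TVI) token construction.
import math
import torch
import torch.nn as nn

def sinusoidal(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding of a scalar position; returns (N, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = pos[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class TVIToken(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # fuse time code with viewpoint code

    def forward(self, t: torch.Tensor, yaw: torch.Tensor) -> torch.Tensor:
        # t: frame indices, yaw: camera headings in radians, both shape (N,)
        dim = self.proj.in_features // 2
        code = torch.cat([sinusoidal(t.float(), dim),
                          sinusoidal(yaw, dim)], dim=-1)
        return self.proj(code)  # (N, dim) indicator tokens

# Usage: one TVI token per (frame, camera) pair, e.g. two timesteps of a
# two-camera rig (front and left), prepended to that view's visual tokens.
tvi = TVIToken(dim=256)
tokens = tvi(t=torch.tensor([0, 0, 1, 1]),
             yaw=torch.tensor([0.0, math.pi / 2, 0.0, math.pi / 2]))
```

Encoding the camera angle continuously, rather than indexing a fixed set of views, is what would allow the same tokens to describe rigs the model never saw at training time.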
The authors propose Budget-Aware Temporal Sampling (BATS), a token sampling strategy that dynamically samples navigation history tokens according to an exponential forgetting curve while respecting a fixed token budget. The approach trades off navigation performance against inference efficiency, adapts to varying numbers of cameras, and addresses practical deployment constraints.
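The sketch below illustrates one way such budget-aware sampling could work: compute how many frames fit in the budget given the camera count, then keep frames with probability proportional to an exponential forgetting curve. The decay rate, the sampling-without-replacement scheme, and the function name `bats_sample` are assumptions, not the authors' algorithm.

```python
# Hedged sketch of Budget-Aware Temporal Sampling (BATS); illustrative only.
import numpy as np

def bats_sample(num_frames: int, num_cameras: int, token_budget: int,
                tokens_per_frame: int, decay: float = 0.1) -> np.ndarray:
    """Select history frames so total tokens (frames x cameras x tokens
    per frame) stay within a fixed budget, biased toward recent frames."""
    # Maximum frames that fit in the budget for this camera rig.
    max_frames = max(1, token_budget // (num_cameras * tokens_per_frame))
    if num_frames <= max_frames:
        return np.arange(num_frames)  # everything fits; no sampling needed

    ages = np.arange(num_frames)[::-1]   # age 0 = most recent frame
    weights = np.exp(-decay * ages)      # exponential forgetting curve
    probs = weights / weights.sum()

    # Always keep the current frame; sample the rest without replacement.
    keep = {num_frames - 1}
    candidates = np.arange(num_frames - 1)
    extra = np.random.choice(candidates, size=max_frames - 1, replace=False,
                             p=probs[:-1] / probs[:-1].sum())
    keep.update(extra.tolist())
    return np.array(sorted(keep))

# Example: a 4-camera robot with 64 tokens per frame and a 4096-token budget
# keeps at most 16 of 100 history frames, weighted toward recent ones.
idx = bats_sample(num_frames=100, num_cameras=4,
                  token_budget=4096, tokens_per_frame=64)
```

Note how the budget is divided by the camera count before sampling, which is what makes the frame count, and hence inference cost, adapt automatically to rigs with more or fewer cameras.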
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Universal Actions for Enhanced Embodied Foundation Models
[2] From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
[5] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
[25] A Bio-Inspired Learning and Control Framework for Cross-Embodiment and Cross-Task Locomotion
[44] Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation
Contribution Analysis
Detailed comparisons for each claimed contribution
Cross-Embodiment and Cross-Task Navigation Foundation Model (NavFoM)
The authors propose NavFoM, a unified navigation foundation model trained on eight million samples that span multiple embodiments (quadrupeds, drones, wheeled robots, and vehicles) and diverse navigation tasks (vision-and-language navigation, object searching, target tracking, and autonomous driving). A single architecture processes multimodal navigation inputs from varying camera configurations and navigation horizons without task-specific fine-tuning.
[44] Pushing the Limits of Cross-Embodiment Learning for Manipulation and Navigation
[63] Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
[64] Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
[65] COMPASS: Cross-Embodiment Mobility Policy via Residual RL and Skill Synthesis
[66] X-Mobility: End-to-End Generalizable Navigation via World Modeling
[67] Autonomous Visual Navigation for Mobile Robots: A Systematic Literature Review
[68] AntCar: Simple Route Following Task with Ants-Inspired Vision and Neural Model
[69] X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots
[70] A Cross-Environment and Cross-Embodiment Path Planning Framework via a Conditional Diffusion Model
[71] Design of AI-Based Autonomous Navigation System Using Swarm Intelligence Techniques for Agriculture Application
Temporal-Viewpoint Indicator (TVI) tokens
The authors introduce TVI tokens, which organize visual tokens by encoding both viewpoint (camera orientation) and temporal information. These indicator tokens let the model process arbitrary camera arrangements and support unified training across image QA, video QA, and navigation tasks with different camera configurations.
[53] BEINGS: Bayesian Embodied Image-Goal Navigation with Gaussian Splatting
[54] HENet: Hybrid Encoding for End-to-End Multi-Task 3D Perception from Multi-View Cameras
[55] NaviFormer: A Spatio-Temporal Context-Aware Transformer for Object Navigation
[56] Simultaneous Multi-View Camera Pose Estimation and Object Tracking with Squared Planar Markers
[57] Active SLAM with Dynamic Viewpoint Optimization for Robust Visual Navigation
[58] Learning Multi-View Camera Relocalization with Graph Neural Networks
[59] Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-Road Terrains
[60] Learning View-Invariant and Novel Spatio-Temporal Features Under Uncertainty from Video
[61] Virtual Video Camera: Image-Based Viewpoint Navigation Through Space and Time
[62] Real-Time Vision-Aided Localization and Navigation Based on Three-View Geometry
Budget-Aware Temporal Sampling (BATS) strategy
The authors propose BATS, a token sampling strategy that dynamically samples navigation history tokens according to an exponential forgetting curve while respecting a fixed token budget. The approach trades off navigation performance against inference efficiency, adapts to varying numbers of cameras, and addresses practical deployment constraints.