$\pi^3$: Permutation-Equivariant Visual Geometry Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Permutation-Equivariance, 3D Reconstruction, Reference-Free, Camera Pose Estimation, Depth Estimation
Abstract:

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frame. This design not only makes our model inherently robust to input ordering, but also yields higher accuracy and stronger performance. These advantages enable our simple, bias-free approach to achieve state-of-the-art results on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models will be publicly available.
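The defining property claimed here, permutation equivariance, means that permuting the input views permutes the outputs identically: f(PX) = P f(X). The following is a minimal toy sketch of that property using a symmetric (mean-pooled) aggregation layer; `toy_equivariant_layer` is a hypothetical illustration, not the paper's architecture.

```python
import numpy as np

def toy_equivariant_layer(views: np.ndarray) -> np.ndarray:
    """A toy permutation-equivariant map over a set of per-view features.

    Each view's output combines its own features with an order-agnostic
    aggregate (the mean over views), so reordering the inputs simply
    reorders the outputs. Illustrative only, not the pi^3 model.
    """
    pooled = views.mean(axis=0, keepdims=True)  # symmetric aggregation over views
    return views - pooled                        # per-view update with shared context

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 "views", 8-dim features each
perm = rng.permutation(5)        # an arbitrary reordering of the views

out_then_perm = toy_equivariant_layer(x)[perm]   # P f(X)
perm_then_out = toy_equivariant_layer(x[perm])   # f(P X)

# The equivariance property: f(P X) == P f(X)
assert np.allclose(out_then_perm, perm_then_out)
```

A reference-anchored model breaks this property: if the first view is treated specially, permuting the inputs changes which view is the anchor and the outputs no longer commute with the permutation.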

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces π³, a feed-forward network for visual geometry reconstruction that eliminates reliance on fixed reference views through a fully permutation-equivariant architecture. It resides in the 'Reference-Free Multi-View Reconstruction' leaf, which contains four papers total including the original work. This leaf sits within the broader 'Permutation-Equivariant Visual Geometry Reconstruction' branch, indicating a moderately populated but focused research direction. The taxonomy reveals this is an active area with sibling work exploring similar permutation-invariant formulations, suggesting the paper enters a space with established momentum rather than pioneering entirely uncharted territory.

The taxonomy structure shows the paper's immediate neighbors include 'Deep Structure from Motion' methods that recover camera parameters from point tracks, and broader branches addressing 'Equivariant Representations for 3D Point Clouds' with SO(3)-equivariant registration and capsule networks. The scope notes clarify boundaries: the paper's reference-free approach explicitly excludes calibrated stereo methods and differs from point cloud processing tasks. Nearby work on 'Geometry-Aware Attention and Positional Encoding' incorporates 3D structure into attention mechanisms, representing a complementary direction that could intersect with permutation-equivariant architectures. The taxonomy reveals a field balancing theoretical equivariance foundations with practical multi-view reconstruction challenges.

Among 27 candidates examined across three contributions, the 'π³ permutation-equivariant architecture' contribution shows one refutable candidate from nine examined, indicating some prior work in permutation-equivariant designs for visual geometry. The 'fixed reference view bias identification' and 'state-of-the-art performance' contributions found zero refutable candidates among nine each, suggesting these aspects may be more novel or less directly addressed in the limited search scope. The statistics indicate a focused literature search rather than exhaustive coverage, with the single refutable match likely representing closely related architectural work within the same taxonomy leaf rather than definitive prior art.

Based on the limited search of 27 candidates, the work appears to offer meaningful contributions in eliminating reference frame dependencies, though the permutation-equivariant architecture concept has some precedent in the examined literature. The taxonomy positioning in a four-paper leaf suggests a maturing but not overcrowded research direction. The analysis covers top semantic matches and does not claim exhaustive field coverage, leaving open the possibility of additional related work beyond the examined scope.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Paper: 1

Research Landscape Overview

Core task: Permutation-equivariant visual geometry reconstruction without reference frames. This field addresses the challenge of reconstructing 3D geometry from multiple views while respecting the inherent symmetries of the input data, particularly permutation invariance, without relying on a fixed coordinate system or reference frame. The taxonomy reveals several interconnected branches: the central branch of Permutation-Equivariant Visual Geometry Reconstruction focuses on methods that explicitly encode permutation symmetries in multi-view settings, often building on structure-from-motion principles adapted to deep learning (e.g., Deep Permutation SfM[9]).

Adjacent branches explore Equivariant Representations for 3D Point Clouds, which develop group-theoretic architectures for point-set data (Quaternion Equivariant Capsules[11], Canonical Capsules[10]), and Geometry-Aware Attention mechanisms that incorporate spatial relationships into transformer-like models (Geometry Aware Attention[4]). Theoretical Foundations provide the mathematical underpinnings for equivariant learning, while Structured Reconstruction with Generative Models (PolyDiffuse[6]) and Visual Permutation Learning (Visual Permutation Learning[15], DeepPermNet[13]) address related symmetry-aware tasks in generation and ordering.

Recent work has intensified around scalable and reference-free reconstruction methods. Scalable Permutation Equivariant[1] and Permutation Equivariant Geometry[2] push toward handling larger view sets efficiently, while Pi Three[12] explores related permutation-invariant formulations. The original paper, Pi Cubed[0], sits squarely within the Reference-Free Multi-View Reconstruction cluster, emphasizing permutation equivariance without anchor frames, in contrast to earlier correspondence-based approaches like Correspondence Free Registration[5].
Compared to Scalable Permutation Equivariant[1], which prioritizes computational efficiency, Pi Cubed[0] appears to focus more directly on the theoretical and architectural implications of full permutation symmetry. Open questions remain around balancing expressiveness with scalability, integrating geometric priors from attention mechanisms, and extending these ideas to dynamic or non-rigid scenes.

Claimed Contributions

Identification of fixed reference view bias in visual geometry reconstruction

The authors systematically identify and demonstrate that the common practice of anchoring reconstructions to a fixed reference view introduces an unnecessary inductive bias that limits model robustness and performance in visual geometry reconstruction tasks.

9 retrieved papers
π³ permutation-equivariant architecture

The authors introduce π³, a fully permutation-equivariant neural network architecture that predicts affine-invariant camera poses and scale-invariant local pointmaps without requiring any reference frames, making it inherently robust to input ordering.

9 retrieved papers
Can Refute
State-of-the-art performance across multiple benchmarks

Through extensive experiments, the authors demonstrate that π³ achieves state-of-the-art performance across multiple tasks including camera pose estimation, monocular and video depth estimation, and dense pointmap reconstruction, outperforming prior leading methods.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification of fixed reference view bias in visual geometry reconstruction

The authors systematically identify and demonstrate that the common practice of anchoring reconstructions to a fixed reference view introduces an unnecessary inductive bias that limits model robustness and performance in visual geometry reconstruction tasks.

Contribution

π³ permutation-equivariant architecture

The authors introduce π³, a fully permutation-equivariant neural network architecture that predicts affine-invariant camera poses and scale-invariant local pointmaps without requiring any reference frames, making it inherently robust to input ordering.

Contribution

State-of-the-art performance across multiple benchmarks

Through extensive experiments, the authors demonstrate that π³ achieves state-of-the-art performance across multiple tasks including camera pose estimation, monocular and video depth estimation, and dense pointmap reconstruction, outperforming prior leading methods.
