Generative Universal Verifier as Multimodal Meta-Reasoner

ICLR 2026 Conference Submission · Anonymous Authors
Multimodal Large Language Models
Abstract:

We introduce the Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Paper: 3

Research Landscape Overview

Core task: visual outcome verification in multimodal reasoning. The field has organized itself around several complementary directions that address how models can produce, evaluate, and improve reasoning over visual and textual inputs. Chain-of-Thought Reasoning Enhancement explores structured prompting and intermediate step generation (e.g., Llava CoT[1], Zebra CoT[3]), while Reinforcement Learning-Based Reasoning applies policy optimization and reward signals to refine reasoning trajectories. Verification Mechanisms form a central pillar, encompassing approaches that explicitly check the correctness or consistency of generated outputs, including universal verifiers that operate across diverse reasoning tasks. Tool-Integrated Reasoning investigates how models can leverage external modules—such as code interpreters or symbolic solvers—to ground their predictions, and Benchmark Development provides standardized testbeds (e.g., Verify Benchmark[4], MAIA Benchmark[7]) for measuring progress. Analysis and Robustness Studies examine failure modes and biases, Survey and Theoretical Foundations synthesize emerging principles, and Architectural Innovations propose novel model designs to better fuse vision and language. Within this landscape, a particularly active line of work focuses on building verifiers that can assess reasoning quality without task-specific training, contrasting with methods that rely on heavy supervision or domain-tailored reward models. Generative Universal Verifier[0] sits squarely in this Universal Visual Verification cluster, emphasizing a flexible verification strategy that generalizes across problem types and modalities. This approach differs from works like Model Deliberation Safety[5], which targets safety-oriented verification in high-stakes settings, and from MM Verify[25], which may incorporate more specialized checks for particular reasoning patterns. The trade-off centers on breadth versus depth: universal verifiers aim for wide applicability but must balance that generality against the precision achievable by narrower, task-tuned methods. Open questions remain about how to scale verification signals efficiently and how to integrate them into iterative reasoning loops without prohibitive computational cost.

Claimed Contributions

ViVerBench: comprehensive benchmark for visual verification

The authors construct ViVerBench, a benchmark with 3,594 manually annotated questions across 16 subtasks in 6 categories to systematically evaluate multimodal models' ability to verify visual outcomes. The benchmark reveals substantial gaps between current VLMs and human-level visual verification capability.

10 retrieved papers
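
To make the evaluation setup concrete, the sketch below shows one plausible way a verification benchmark of this kind could be scored: each item pairs a visual outcome with a question and a gold verdict, and accuracy is aggregated overall, per category, and per subtask. The item schema and the `predict` interface are assumptions for illustration, not the authors' released harness.

```python
from collections import defaultdict

def evaluate_verifier(predict, items):
    """Score a verifier on benchmark items (hypothetical schema).

    predict(image, question) -> verdict string, e.g. "yes"/"no".
    Each item is assumed to carry "image", "question", "answer",
    "category", and "subtask" fields.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        verdict = predict(item["image"], item["question"])
        hit = verdict.strip().lower() == item["answer"].strip().lower()
        # Aggregate accuracy overall, per category, and per subtask.
        for key in ("overall", item["category"], item["subtask"]):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}
```

Breaking accuracy out by category and subtask is what would expose the per-capability gaps the benchmark is designed to reveal.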
OmniVerifier-7B: first omni-capable generative verifier

The authors develop two automated data construction pipelines and train OmniVerifier-7B, achieving notable gains on ViVerBench (+8.3). They identify three atomic capabilities in visual verification (explicit alignment, relational verification, and integrative reasoning) and demonstrate their generalization and synergistic interaction.

3 retrieved papers
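
The description suggests that verification labels can be obtained without manual annotation when visual outcomes are produced from a known specification: render an image from the spec, then pair it with either the true spec or a minimally corrupted one, so the consistent/inconsistent label is known by construction. The sketch below illustrates only that general idea; `render`, `corrupt`, and the claim verbalization are hypothetical stand-ins, not the paper's actual pipelines.

```python
import random

def build_verification_example(spec, render, corrupt):
    """Produce one (image, claim, label) triple with a label known by construction.

    spec    -- structured description (objects, counts, spatial relations, ...)
    render  -- assumed callable: spec -> image
    corrupt -- assumed callable: spec -> minimally perturbed spec
    """
    image = render(spec)
    if random.random() < 0.5:
        claim_spec, label = spec, "consistent"
    else:
        claim_spec, label = corrupt(spec), "inconsistent"
    claim = f"The image shows {claim_spec}."  # naive verbalization, illustration only
    return {"image": image, "claim": claim, "label": label}
```

Because the label follows directly from whether the spec was perturbed, data of this form scales without human annotation, which is what would make large-scale verifier training practical.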
OmniVerifier-TTS: sequential test-time scaling paradigm

The authors propose OmniVerifier-TTS, a sequential test-time scaling method that uses the universal verifier to iteratively refine generated images through verification and editing. This approach achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming parallel test-time scaling methods like Best-of-N.

10 retrieved papers
Can Refute
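
Read as described, the paradigm amounts to a verify-then-edit loop: produce one image, let the verifier report fine-grained mismatches against the prompt, and apply targeted edits, in contrast to Best-of-N, which samples whole candidates in parallel and keeps the best. The following is a minimal sketch of that contrast under assumed `generate`, `verify`, `edit`, and `score` interfaces; it is not the released implementation.

```python
def omniverifier_tts(prompt, generate, verify, edit, max_rounds=4):
    """Sequential test-time scaling: refine a single image via verify-then-edit."""
    image = generate(prompt)
    for _ in range(max_rounds):
        ok, feedback = verify(prompt, image)  # assumed to return (bool, mismatch list)
        if ok:
            break                             # verifier accepts the visual outcome
        image = edit(image, feedback)         # targeted fix instead of a full resample
    return image

def best_of_n(prompt, generate, score, n=4):
    """Parallel baseline: sample n independent candidates, keep the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda image: score(prompt, image))
```

The sequential variant reuses information from each failed attempt (the verifier's feedback), which is the intuition behind its reported edge over purely parallel sampling.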

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though the conclusion remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ViVerBench: comprehensive benchmark for visual verification

The authors construct ViVerBench, a benchmark with 3,594 manually annotated questions across 16 subtasks in 6 categories to systematically evaluate multimodal models' ability to verify visual outcomes. The benchmark reveals substantial gaps between current VLMs and human-level visual verification capability.

Contribution

OmniVerifier-7B: first omni-capable generative verifier

The authors develop two automated data construction pipelines and train OmniVerifier-7B, achieving notable gains on ViVerBench (+8.3). They identify three atomic capabilities in visual verification (explicit alignment, relational verification, and integrative reasoning) and demonstrate their generalization and synergistic interaction.

Contribution

OmniVerifier-TTS: sequential test-time scaling paradigm

The authors propose OmniVerifier-TTS, a sequential test-time scaling method that uses the universal verifier to iteratively refine generated images through verification and editing. This approach achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming parallel test-time scaling methods like Best-of-N.