Generative Universal Verifier as Multimodal Meta-Reasoner

ICLR 2026 Conference Submission · Anonymous Authors
Multimodal Large Language Models
Abstract:

We introduce the Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification, which achieves notable gains on ViVerBench (+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, raising the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend the universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming existing parallel test-time scaling methods such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Paper: 3

Research Landscape Overview

Core task: visual outcome verification in multimodal reasoning. The field has organized itself around several complementary directions that address how models can produce, evaluate, and improve reasoning over visual and textual inputs. Chain-of-Thought Reasoning Enhancement explores structured prompting and intermediate step generation (e.g., Llava CoT[1], Zebra CoT[3]), while Reinforcement Learning-Based Reasoning applies policy optimization and reward signals to refine reasoning trajectories. Verification Mechanisms form a central pillar, encompassing approaches that explicitly check the correctness or consistency of generated outputs, including universal verifiers that operate across diverse reasoning tasks. Tool-Integrated Reasoning investigates how models can leverage external modules—such as code interpreters or symbolic solvers—to ground their predictions, and Benchmark Development provides standardized testbeds (e.g., Verify Benchmark[4], MAIA Benchmark[7]) for measuring progress. Analysis and Robustness Studies examine failure modes and biases, Survey and Theoretical Foundations synthesize emerging principles, and Architectural Innovations propose novel model designs to better fuse vision and language. Within this landscape, a particularly active line of work focuses on building verifiers that can assess reasoning quality without task-specific training, contrasting with methods that rely on heavy supervision or domain-tailored reward models. Generative Universal Verifier[0] sits squarely in this Universal Visual Verification cluster, emphasizing a flexible verification strategy that generalizes across problem types and modalities. This approach differs from works like Model Deliberation Safety[5], which targets safety-oriented verification in high-stakes settings, and from MM Verify[25], which may incorporate more specialized checks for particular reasoning patterns. The trade-off centers on breadth versus depth: universal verifiers aim for wide applicability but must balance that generality against the precision achievable by narrower, task-tuned methods. Open questions remain about how to scale verification signals efficiently and how to integrate them into iterative reasoning loops without prohibitive computational cost.

Claimed Contributions

ViVerBench: comprehensive benchmark for visual verification

The authors construct ViVerBench, a benchmark with 3,594 manually annotated questions across 16 subtasks in 6 categories to systematically evaluate multimodal models' ability to verify visual outcomes. The benchmark reveals substantial gaps between current VLMs and human-level visual verification capability.

10 retrieved papers
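
To make the evaluation setup concrete, the sketch below shows one plausible way a verification benchmark of this kind could be scored: each item pairs a visual outcome with a question and a gold verdict, and accuracy is aggregated overall, per category, and per subtask. The item schema and the `predict` interface are assumptions for illustration, not the authors' released harness.

```python
from collections import defaultdict

def evaluate_verifier(predict, items):
    """Score a verifier on benchmark items (hypothetical schema).

    predict(image, question) -> verdict string, e.g. "yes"/"no".
    Each item is assumed to carry "image", "question", "answer",
    "category", and "subtask" fields.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        verdict = predict(item["image"], item["question"])
        hit = verdict.strip().lower() == item["answer"].strip().lower()
        # Aggregate accuracy overall, per category, and per subtask.
        for key in ("overall", item["category"], item["subtask"]):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}
```

Breaking accuracy out by category and subtask is what would expose the per-capability gaps the benchmark is designed to reveal.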
OmniVerifier-7B: first omni-capable generative verifier

The authors develop two automated data construction pipelines and train OmniVerifier-7B, achieving notable gains on ViVerBench (+8.3). They identify three atomic capabilities in visual verification (explicit alignment, relational verification, and integrative reasoning) and demonstrate their generalization and synergistic interaction.

3 retrieved papers
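
The description suggests that verification labels can be obtained without manual annotation when visual outcomes are produced from a known specification: render an image from the spec, then pair it with either the true spec or a minimally corrupted one, so the consistent/inconsistent label is known by construction. The sketch below illustrates only that general idea; `render`, `corrupt`, and the claim verbalization are hypothetical stand-ins, not the paper's actual pipelines.

```python
import random

def build_verification_example(spec, render, corrupt):
    """Produce one (image, claim, label) triple with a label known by construction.

    spec    -- structured description (objects, counts, spatial relations, ...)
    render  -- assumed callable: spec -> image
    corrupt -- assumed callable: spec -> minimally perturbed spec
    """
    image = render(spec)
    if random.random() < 0.5:
        claim_spec, label = spec, "consistent"
    else:
        claim_spec, label = corrupt(spec), "inconsistent"
    claim = f"The image shows {claim_spec}."  # naive verbalization, illustration only
    return {"image": image, "claim": claim, "label": label}
```

Because the label follows directly from whether the spec was perturbed, data of this form scales without human annotation, which is what would make large-scale verifier training practical.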
OmniVerifier-TTS: sequential test-time scaling paradigm

The authors propose OmniVerifier-TTS, a sequential test-time scaling method that uses the universal verifier to iteratively refine generated images through verification and editing. This approach achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming parallel test-time scaling methods like Best-of-N.

10 retrieved papers
Can Refute
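
Read as described, the paradigm amounts to a verify-then-edit loop: produce one image, let the verifier report fine-grained mismatches against the prompt, and apply targeted edits, in contrast to Best-of-N, which samples whole candidates in parallel and keeps the best. The following is a minimal sketch of that contrast under assumed `generate`, `verify`, `edit`, and `score` interfaces; it is not the released implementation.

```python
def omniverifier_tts(prompt, generate, verify, edit, max_rounds=4):
    """Sequential test-time scaling: refine a single image via verify-then-edit."""
    image = generate(prompt)
    for _ in range(max_rounds):
        ok, feedback = verify(prompt, image)  # assumed to return (bool, mismatch list)
        if ok:
            break                             # verifier accepts the visual outcome
        image = edit(image, feedback)         # targeted fix instead of a full resample
    return image

def best_of_n(prompt, generate, score, n=4):
    """Parallel baseline: sample n independent candidates, keep the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda image: score(prompt, image))
```

The sequential variant reuses information from each failed attempt (the verifier's feedback), which is the intuition behind its reported edge over purely parallel sampling.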

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though the conclusion remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ViVerBench: comprehensive benchmark for visual verification

The authors construct ViVerBench, a benchmark with 3,594 manually annotated questions across 16 subtasks in 6 categories to systematically evaluate multimodal models' ability to verify visual outcomes. The benchmark reveals substantial gaps between current VLMs and human-level visual verification capability.

Contribution

OmniVerifier-7B: first omni-capable generative verifier

The authors develop two automated data construction pipelines and train OmniVerifier-7B, achieving notable gains on ViVerBench (+8.3). They identify three atomic capabilities in visual verification (explicit alignment, relational verification, and integrative reasoning) and demonstrate their generalization and synergistic interaction.

Contribution

OmniVerifier-TTS: sequential test-time scaling paradigm

The authors propose OmniVerifier-TTS, a sequential test-time scaling method that uses the universal verifier to iteratively refine generated images through verification and editing. This approach achieves improvements on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming parallel test-time scaling methods like Best-of-N.