Generative Universal Verifier as Multimodal Meta-Reasoner
Research Landscape Overview
Claimed Contributions
The authors construct ViVerBench, a benchmark of 3,594 manually annotated questions spanning 16 subtasks in 6 categories, to systematically evaluate multimodal models' ability to verify visual outcomes. The benchmark reveals a substantial gap between current VLMs and human-level visual verification capability.
The authors develop two automated data construction pipelines and use them to train OmniVerifier-7B, which achieves notable gains on ViVerBench (+8.3). They identify three atomic capabilities underlying visual verification (explicit alignment, relational verification, and integrative reasoning) and demonstrate that these capabilities generalize and interact synergistically.
The authors propose OmniVerifier-TTS, a sequential test-time scaling method in which the universal verifier iteratively refines generated images by alternating verification and editing. The approach yields gains on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming parallel test-time scaling methods such as Best-of-N.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
ViVerBench: comprehensive benchmark for visual verification
The authors construct ViVerBench, a benchmark of 3,594 manually annotated questions spanning 16 subtasks in 6 categories, to systematically evaluate multimodal models' ability to verify visual outcomes. The benchmark reveals a substantial gap between current VLMs and human-level visual verification capability. A minimal scoring sketch for such a benchmark appears after the comparison list below.
[8] Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
[13] VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
[30] MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
[51] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning
[52] FakeBench: Probing Explainable Fake Image Detection via Large Multimodal Models
[53] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
[54] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[55] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
[56] Grounded Reinforcement Learning for Visual Reasoning
[57] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
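To make the evaluation protocol concrete, here is a minimal sketch of how a ViVerBench-style verification benchmark could be scored. The dataset fields (question, images, answer, category) and the vlm.ask interface are illustrative assumptions, not the released ViVerBench format.

```python
# Minimal sketch of scoring a VLM on a ViVerBench-style verification benchmark.
# The JSON schema and the `vlm.ask` interface are assumptions for illustration.
import json
from collections import defaultdict

def evaluate(vlm, benchmark_path: str) -> dict:
    """Return per-category accuracy for a model answering verification questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(benchmark_path) as f:
        questions = json.load(f)  # assumed: list of manually annotated items
    for item in questions:
        # The model sees the visual outcome(s) plus the verification question
        # and must judge correctness (e.g., "yes"/"no" or an option letter).
        prediction = vlm.ask(images=item["images"], question=item["question"])
        total[item["category"]] += 1
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per category, rather than a single aggregate, is what lets a benchmark of this shape localize which verification skills a model lacks.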
OmniVerifier-7B: first omni-capable generative verifier
The authors develop two automated data construction pipelines and use them to train OmniVerifier-7B, which achieves notable gains on ViVerBench (+8.3). They identify three atomic capabilities underlying visual verification (explicit alignment, relational verification, and integrative reasoning) and demonstrate that these capabilities generalize and interact synergistically. A hedged sketch of such a verifier interface appears after the comparison list below.
[58] Generative Hierarchical Features from Synthesizing Images
[59] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs
[60] MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
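As a rough illustration of what a generative (rather than scalar) verifier does, the sketch below prompts a model for a verdict plus a natural-language critique covering the three atomic capabilities. The prompt template, the verifier.generate interface, and the output parsing are assumptions, not OmniVerifier-7B's actual specification.

```python
# Illustrative sketch of querying a generative verifier. The prompt wording and
# structured output are assumptions about what an OmniVerifier-style model returns.
VERIFY_PROMPT = """You are a visual verifier.
Instruction: {instruction}
Given the generated image, verify:
1. Explicit alignment: does each stated object/attribute appear as described?
2. Relational verification: do the spatial/logical relations among elements hold?
3. Integrative reasoning: is the image globally consistent with the instruction?
Answer with a verdict (pass/fail) on the first line, then a brief critique."""

def verify(verifier, image, instruction: str) -> tuple[bool, str]:
    """Query a generative verifier; return its verdict and free-form critique."""
    response = verifier.generate(
        image=image, prompt=VERIFY_PROMPT.format(instruction=instruction)
    )
    verdict = "pass" in response.lower().split("\n")[0]  # naive first-line parse
    return verdict, response
```

Because the verifier emits a critique rather than only a score, its output can be consumed downstream as an editing instruction, which is what the sequential test-time scaling paradigm below exploits.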
OmniVerifier-TTS: sequential test-time scaling paradigm
The authors propose OmniVerifier-TTS, a sequential test-time scaling method in which the universal verifier iteratively refines generated images by alternating verification and editing. The approach yields gains on T2I-ReasonBench (+3.7) and GenEval++ (+4.3), outperforming parallel test-time scaling methods such as Best-of-N.
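The following is a minimal sketch of that sequential loop: generate once, then alternate verification and editing instead of sampling N candidates in parallel. The generator.generate and editor.edit interfaces, the reuse of the verify helper sketched earlier, and the round budget are assumptions, not the paper's released implementation.

```python
# Minimal sketch of sequential test-time scaling with a universal verifier.
# `generator`, `editor`, and `verify` (from the sketch above) are assumed interfaces.
def omniverifier_tts(generator, editor, verifier, prompt: str, max_rounds: int = 4):
    image = generator.generate(prompt)
    for _ in range(max_rounds):
        ok, critique = verify(verifier, image, prompt)
        if ok:
            break  # the verifier accepts the current image
        # Sequential refinement: the verifier's critique conditions the next edit,
        # so each round builds on (rather than discards) the previous attempt.
        image = editor.edit(image, instruction=critique)
    return image
```

Unlike Best-of-N, which draws independent samples and keeps the verifier's favorite, this loop feeds the critique back into an editing model, so later rounds can correct fine-grained errors that fresh resampling is unlikely to fix.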