Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: benchmark, agents, RLVR, multi-agent systems, reasoning, large language models
Abstract:

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 features scenarios in which the environment evolves independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks; Claude-4 Sonnet trades accuracy and speed for cost; and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs among reasoning, efficiency, and robustness, and expose challenges in closing the “sim2real” gap. Gaia2 is built on a consumer environment within the open-source Agents Research Environments (ARE) platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Gaia2, a benchmark for evaluating LLM agents in asynchronous, dynamic environments, alongside the ARE (Agents Research Environments) framework and an action-level verifier. Within the taxonomy, it resides in the 'Asynchronous and Dynamic Environment Benchmarks' leaf under 'Benchmark Design and Evaluation Methodologies'. This leaf contains only two papers total, including Gaia2 itself, indicating a relatively sparse research direction. The sibling paper (MOASEI Competition) shares the focus on dynamic multi-agent scenarios but emphasizes competitive coordination rather than asynchronous temporal constraints and RL-ready verification.

The taxonomy reveals that neighboring leaves address complementary concerns: 'Domain-Specific Agent Evaluation' (healthcare, travel planning) and 'Task Decomposition and Tool Integration Evaluation' focus on specialized or multi-step reasoning without emphasizing temporal dynamics. Meanwhile, the 'Time-Sensitive and Rapidly Changing Environments' leaf under 'Dynamic Environment Adaptation' explores real-time decision-making but lacks the benchmark-centric evaluation infrastructure that Gaia2 provides. The scope note for Gaia2's leaf explicitly excludes static task benchmarks and domain-specific evaluations, positioning it at the intersection of temporal realism and general-purpose assessment.

Among 26 candidates examined across three contributions, no refutable prior work was identified. For the ARE framework (10 candidates examined, 0 refutable), the Gaia2 benchmark (10 candidates, 0 refutable), and the ARE Verifier (6 candidates, 0 refutable), the analysis found no overlapping systems that combine asynchronous environment simulation, action-level verification, and RL-ready reward signals. This suggests that within the limited search scope—focused on top-K semantic matches and citation expansion—the specific combination of features appears novel, though the search does not claim exhaustive coverage of all agent benchmarking literature.

Given the sparse population of the 'Asynchronous and Dynamic Environment Benchmarks' leaf and the absence of refutable candidates among 26 examined papers, Gaia2 appears to occupy a relatively underexplored niche. However, the limited search scope means that closely related work outside the top-26 semantic matches may exist. The analysis captures the paper's positioning within a structured taxonomy and its immediate neighborhood, but does not constitute a comprehensive survey of all agent evaluation frameworks or asynchronous simulation platforms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating language model agents in asynchronous, dynamic environments. The field has organized itself around several complementary perspectives. At the highest level, researchers distinguish between architectural innovations, such as asynchronous and parallel agent designs (e.g., Autogen[2], Async Planner[9]), and methodological concerns around benchmark design and evaluation (e.g., MOASEI Competition[41], Gaia2[0]). A third major branch focuses on dynamic environment adaptation and real-time decision-making, where agents must respond to shifting conditions (e.g., Dynamic Strategy Adaptation[22], Rapid-Reflex Agent[7]). Meanwhile, domain-specific applications (Clinical LLM Agents[1], ProtAgents[4]) and general-purpose frameworks (AgentScope[31], APPL[40]) reflect the tension between specialized performance and broad reusability. Additional branches address sequential planning enhancements, multimodal context integration, and specialized technical contributions, collectively mapping out a landscape that balances foundational infrastructure with task-driven innovation.

Within this ecosystem, a particularly active line of work centers on creating benchmarks that capture the unpredictability and temporal complexity of real-world settings. Gaia2[0] exemplifies this direction by emphasizing asynchronous dynamics and rigorous evaluation protocols, positioning itself alongside efforts like the MOASEI Competition[41] that stress multi-agent coordination under time pressure. In contrast, works such as CAT[3] and TP-RAG[5] prioritize context-aware reasoning and retrieval mechanisms, trading off some environmental realism for deeper semantic understanding. The interplay between these themes (whether to foreground temporal fidelity or cognitive depth) remains an open question, with Gaia2[0] leaning toward the former by prioritizing asynchronous event handling and dynamic task arrival. This choice situates it closer to benchmark-centric studies that probe agent robustness in fluid scenarios, rather than purely architectural or domain-specific explorations.

Claimed Contributions

ARE (Agents Research Environments) framework

The authors introduce ARE, a research platform providing abstractions (apps, events, notifications, scenarios) for creating simulated, asynchronous environments that evolve independently of agent actions. This framework enables reproducible benchmarking and supports reinforcement learning from verifiable rewards (RLVR).

10 retrieved papers
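
The ARE description above is architectural, so the following minimal sketch illustrates what app/event/scenario-style abstractions for an environment that evolves independently of the agent could look like. All class and method names here (`Event`, `Environment`, `schedule`, `advance`) are hypothetical illustrations under that reading, not the actual ARE API.

```python
# Illustrative sketch of an asynchronous environment whose state changes on a
# schedule, independent of whether the agent has acted. Not the real ARE API.
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    """An environment event scheduled at a simulated time."""
    time: float
    description: str
    apply: Callable[["Environment"], None]

    def __lt__(self, other: "Event") -> bool:
        return self.time < other.time


@dataclass
class Environment:
    """Minimal asynchronous environment: state mutates as scheduled events fire."""
    clock: float = 0.0
    state: dict = field(default_factory=dict)
    notifications: List[str] = field(default_factory=list)
    _queue: List[Event] = field(default_factory=list)

    def schedule(self, event: Event) -> None:
        heapq.heappush(self._queue, event)

    def advance(self, until: float) -> None:
        """Fire every event scheduled up to `until`, whether or not the agent acted."""
        while self._queue and self._queue[0].time <= until:
            event = heapq.heappop(self._queue)
            event.apply(self)
            self.notifications.append(f"[t={event.time:.0f}] {event.description}")
        self.clock = until


if __name__ == "__main__":
    env = Environment()
    env.schedule(Event(5.0, "New email arrived",
                       apply=lambda e: e.state.update(unread_emails=1)))
    env.schedule(Event(12.0, "Calendar invite updated",
                       apply=lambda e: e.state.update(meeting_moved=True)))

    # The agent "thinks" until t=10; the environment keeps evolving regardless.
    env.advance(until=10.0)
    print(env.state)          # {'unread_emails': 1}
    print(env.notifications)  # only the t=5 event has fired
```
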
Gaia2 benchmark

The authors present Gaia2, a benchmark consisting of 1,120 human-annotated scenarios in a smartphone-like environment. It evaluates agents on capabilities including temporal awareness, adaptability to dynamic events, robustness to noise, ambiguity resolution, and multi-agent collaboration, with action-level verification suitable for RLVR training.

10 retrieved papers
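
As a complement to the description above, here is an illustrative sketch of how a Gaia2-style scenario record and a per-capability pass@1 aggregate could be represented. The field names (`scenario_id`, `capability`, `oracle_write_actions`) and capability labels are assumptions for illustration, not the released data format.

```python
# Hypothetical scenario schema and pass@1 aggregation; field names are assumed.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Scenario:
    """One benchmark scenario: a user instruction plus annotated oracle write actions."""
    scenario_id: str
    capability: str                   # e.g. "time", "adaptability", "noise", "ambiguity", "multi_agent"
    instruction: str                  # the user request the agent must fulfil
    oracle_write_actions: List[dict]  # annotated state-changing actions (tool, args, timing)


def pass_at_1(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of scenarios whose single attempt was verified correct, per capability."""
    return {cap: mean(1.0 if ok else 0.0 for ok in results)
            for cap, results in outcomes.items()}


if __name__ == "__main__":
    # Verifier verdicts (True = all oracle write actions matched), grouped by capability.
    verdicts = {"time": [True, False, False], "ambiguity": [True, True, False]}
    print(pass_at_1(verdicts))  # {'time': 0.333..., 'ambiguity': 0.666...}
```
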
ARE Verifier for action-level evaluation

The authors develop a verifier that evaluates every state-changing write action against oracle annotations, checking consistency, causality, timing, and turn-level correctness. This mechanism achieves high agreement with human annotations (0.98) and provides fine-grained credit assignment for RLVR, serving as a reusable component beyond Gaia2.

6 retrieved papers
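
Finally, a minimal sketch of what an action-level check in this spirit could look like: each agent write action is matched in order against an oracle annotation on tool name, argument consistency, and timing. The greedy matching rule and the timing tolerance are simplified assumptions, not the actual ARE Verifier logic.

```python
# Simplified action-level verification: oracle write actions must be matched
# in order by agent actions with the same tool, compatible args, and timing.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WriteAction:
    tool: str    # e.g. "calendar.create_event" (illustrative tool name)
    args: dict
    time: float  # simulated timestamp at which the action was taken


def verify(agent_actions: List[WriteAction],
           oracle_actions: List[WriteAction],
           time_tolerance: float = 60.0) -> bool:
    """Greedy in-order match: every oracle write action must be matched by a later
    agent action with the same tool, compatible arguments, and acceptable timing."""
    i = 0
    for oracle in oracle_actions:
        match: Optional[WriteAction] = None
        while i < len(agent_actions):
            candidate = agent_actions[i]
            i += 1
            if (candidate.tool == oracle.tool
                    and all(candidate.args.get(k) == v for k, v in oracle.args.items())
                    and abs(candidate.time - oracle.time) <= time_tolerance):
                match = candidate
                break
        if match is None:
            return False  # missing, mis-ordered, or mistimed write action
    return True
```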

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ARE (Agents Research Environments) framework

The authors introduce ARE, a research platform providing abstractions (apps, events, notifications, scenarios) for creating simulated, asynchronous environments that evolve independently of agent actions. This framework enables reproducible benchmarking and supports reinforcement learning from verifiable rewards (RLVR).

Contribution

Gaia2 benchmark

The authors present Gaia2, a benchmark consisting of 1,120 human-annotated scenarios in a smartphone-like environment. It evaluates agents on capabilities including temporal awareness, adaptability to dynamic events, robustness to noise, ambiguity resolution, and multi-agent collaboration, with action-level verification suitable for RLVR training.

Contribution

ARE Verifier for action-level evaluation

The authors develop a verifier that evaluates every state-changing write action against oracle annotations, checking consistency, causality, timing, and turn-level correctness. This mechanism achieves high agreement with human annotations (0.98) and provides fine-grained credit assignment for RLVR, serving as a reusable component beyond Gaia2.