Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: benchmark, agents, RLVR, multi-agent systems, reasoning, large language models
Abstract:

We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 features scenarios in which the environment evolves independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks; Claude-4 Sonnet trades accuracy and speed for cost; and Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs among reasoning, efficiency, and robustness, and expose challenges in closing the “sim2real” gap. Gaia2 is built on a consumer environment within the open-source Agents Research Environments (ARE) platform and is designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Gaia2, a benchmark for evaluating LLM agents in asynchronous, dynamic environments, alongside the ARE (Agents Research Environments) framework and an action-level verifier. Within the taxonomy, it resides in the 'Asynchronous and Dynamic Environment Benchmarks' leaf under 'Benchmark Design and Evaluation Methodologies'. This leaf contains only two papers total, including Gaia2 itself, indicating a relatively sparse research direction. The sibling paper (MOASEI Competition) shares the focus on dynamic multi-agent scenarios but emphasizes competitive coordination rather than asynchronous temporal constraints and RL-ready verification.

The taxonomy reveals that neighboring leaves address complementary concerns: 'Domain-Specific Agent Evaluation' (healthcare, travel planning) and 'Task Decomposition and Tool Integration Evaluation' focus on specialized or multi-step reasoning without emphasizing temporal dynamics. Meanwhile, the 'Time-Sensitive and Rapidly Changing Environments' leaf under 'Dynamic Environment Adaptation' explores real-time decision-making but lacks the benchmark-centric evaluation infrastructure that Gaia2 provides. The scope note for Gaia2's leaf explicitly excludes static task benchmarks and domain-specific evaluations, positioning it at the intersection of temporal realism and general-purpose assessment.

Among 26 candidates examined across three contributions, no refutable prior work was identified. For the ARE framework (10 candidates examined, 0 refutable), the Gaia2 benchmark (10 candidates, 0 refutable), and the ARE Verifier (6 candidates, 0 refutable), the analysis found no overlapping systems that combine asynchronous environment simulation, action-level verification, and RL-ready reward signals. This suggests that within the limited search scope—focused on top-K semantic matches and citation expansion—the specific combination of features appears novel, though the search does not claim exhaustive coverage of all agent benchmarking literature.

Given the sparse population of the 'Asynchronous and Dynamic Environment Benchmarks' leaf and the absence of refutable candidates among 26 examined papers, Gaia2 appears to occupy a relatively underexplored niche. However, the limited search scope means that closely related work outside the top-26 semantic matches may exist. The analysis captures the paper's positioning within a structured taxonomy and its immediate neighborhood, but does not constitute a comprehensive survey of all agent evaluation frameworks or asynchronous simulation platforms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating language model agents in asynchronous, dynamic environments. The field has organized itself around several complementary perspectives. At the highest level, researchers distinguish between architectural innovations, such as asynchronous and parallel agent designs (e.g., Autogen[2], Async Planner[9]), and methodological concerns around benchmark design and evaluation (e.g., MOASEI Competition[41], Gaia2[0]). A third major branch focuses on dynamic environment adaptation and real-time decision-making, where agents must respond to shifting conditions (e.g., Dynamic Strategy Adaptation[22], Rapid-Reflex Agent[7]). Meanwhile, domain-specific applications (Clinical LLM Agents[1], ProtAgents[4]) and general-purpose frameworks (AgentScope[31], APPL[40]) reflect the tension between specialized performance and broad reusability. Additional branches address sequential planning enhancements, multimodal context integration, and specialized technical contributions, collectively mapping out a landscape that balances foundational infrastructure with task-driven innovation.

Within this ecosystem, a particularly active line of work centers on creating benchmarks that capture the unpredictability and temporal complexity of real-world settings. Gaia2[0] exemplifies this direction by emphasizing asynchronous dynamics and rigorous evaluation protocols, positioning itself alongside efforts like the MOASEI Competition[41] that stress multi-agent coordination under time pressure. In contrast, works such as CAT[3] and TP-RAG[5] prioritize context-aware reasoning and retrieval mechanisms, trading off some environmental realism for deeper semantic understanding. The interplay between these themes (whether to foreground temporal fidelity or cognitive depth) remains an open question, with Gaia2[0] leaning toward the former by prioritizing asynchronous event handling and dynamic task arrival. This choice situates it closer to benchmark-centric studies that probe agent robustness in fluid scenarios, rather than purely architectural or domain-specific explorations.

Claimed Contributions

ARE (Agents Research Environments) framework

The authors introduce ARE, a research platform providing abstractions (apps, events, notifications, scenarios) for creating simulated, asynchronous environments that evolve independently of agent actions. This framework enables reproducible benchmarking and supports reinforcement learning from verifiable rewards (RLVR).

10 retrieved papers
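
The ARE description above is architectural, so the following minimal sketch illustrates what app/event/scenario-style abstractions for an environment that evolves independently of the agent could look like. All class and method names here (`Event`, `Environment`, `schedule`, `advance`) are hypothetical illustrations under that reading, not the actual ARE API.

```python
# Illustrative sketch of an asynchronous environment whose state changes on a
# schedule, independent of whether the agent has acted. Not the real ARE API.
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    """An environment event scheduled at a simulated time."""
    time: float
    description: str
    apply: Callable[["Environment"], None]

    def __lt__(self, other: "Event") -> bool:
        return self.time < other.time


@dataclass
class Environment:
    """Minimal asynchronous environment: state mutates as scheduled events fire."""
    clock: float = 0.0
    state: dict = field(default_factory=dict)
    notifications: List[str] = field(default_factory=list)
    _queue: List[Event] = field(default_factory=list)

    def schedule(self, event: Event) -> None:
        heapq.heappush(self._queue, event)

    def advance(self, until: float) -> None:
        """Fire every event scheduled up to `until`, whether or not the agent acted."""
        while self._queue and self._queue[0].time <= until:
            event = heapq.heappop(self._queue)
            event.apply(self)
            self.notifications.append(f"[t={event.time:.0f}] {event.description}")
        self.clock = until


if __name__ == "__main__":
    env = Environment()
    env.schedule(Event(5.0, "New email arrived",
                       apply=lambda e: e.state.update(unread_emails=1)))
    env.schedule(Event(12.0, "Calendar invite updated",
                       apply=lambda e: e.state.update(meeting_moved=True)))

    # The agent "thinks" until t=10; the environment keeps evolving regardless.
    env.advance(until=10.0)
    print(env.state)          # {'unread_emails': 1}
    print(env.notifications)  # only the t=5 event has fired
```
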
Gaia2 benchmark

The authors present Gaia2, a benchmark consisting of 1,120 human-annotated scenarios in a smartphone-like environment. It evaluates agents on capabilities including temporal awareness, adaptability to dynamic events, robustness to noise, ambiguity resolution, and multi-agent collaboration, with action-level verification suitable for RLVR training.

10 retrieved papers
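
As a complement to the description above, here is an illustrative sketch of how a Gaia2-style scenario record and a per-capability pass@1 aggregate could be represented. The field names (`scenario_id`, `capability`, `oracle_write_actions`) and capability labels are assumptions for illustration, not the released data format.

```python
# Hypothetical scenario schema and pass@1 aggregation; field names are assumed.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Scenario:
    """One benchmark scenario: a user instruction plus annotated oracle write actions."""
    scenario_id: str
    capability: str                   # e.g. "time", "adaptability", "noise", "ambiguity", "multi_agent"
    instruction: str                  # the user request the agent must fulfil
    oracle_write_actions: List[dict]  # annotated state-changing actions (tool, args, timing)


def pass_at_1(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of scenarios whose single attempt was verified correct, per capability."""
    return {cap: mean(1.0 if ok else 0.0 for ok in results)
            for cap, results in outcomes.items()}


if __name__ == "__main__":
    # Verifier verdicts (True = all oracle write actions matched), grouped by capability.
    verdicts = {"time": [True, False, False], "ambiguity": [True, True, False]}
    print(pass_at_1(verdicts))  # {'time': 0.333..., 'ambiguity': 0.666...}
```
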
ARE Verifier for action-level evaluation

The authors develop a verifier that evaluates every state-changing write action against oracle annotations, checking consistency, causality, timing, and turn-level correctness. This mechanism achieves high agreement with human annotations (0.98) and provides fine-grained credit assignment for RLVR, serving as a reusable component beyond Gaia2.

6 retrieved papers
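
Finally, a minimal sketch of what an action-level check in this spirit could look like: each agent write action is matched in order against an oracle annotation on tool name, argument consistency, and timing. The greedy matching rule and the timing tolerance are simplified assumptions, not the actual ARE Verifier logic.

```python
# Simplified action-level verification: oracle write actions must be matched
# in order by agent actions with the same tool, compatible args, and timing.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WriteAction:
    tool: str    # e.g. "calendar.create_event" (illustrative tool name)
    args: dict
    time: float  # simulated timestamp at which the action was taken


def verify(agent_actions: List[WriteAction],
           oracle_actions: List[WriteAction],
           time_tolerance: float = 60.0) -> bool:
    """Greedy in-order match: every oracle write action must be matched by a later
    agent action with the same tool, compatible arguments, and acceptable timing."""
    i = 0
    for oracle in oracle_actions:
        match: Optional[WriteAction] = None
        while i < len(agent_actions):
            candidate = agent_actions[i]
            i += 1
            if (candidate.tool == oracle.tool
                    and all(candidate.args.get(k) == v for k, v in oracle.args.items())
                    and abs(candidate.time - oracle.time) <= time_tolerance):
                match = candidate
                break
        if match is None:
            return False  # missing, mis-ordered, or mistimed write action
    return True
```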

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ARE (Agents Research Environments) framework

The authors introduce ARE, a research platform providing abstractions (apps, events, notifications, scenarios) for creating simulated, asynchronous environments that evolve independently of agent actions. This framework enables reproducible benchmarking and supports reinforcement learning from verifiable rewards (RLVR).

Contribution

Gaia2 benchmark

The authors present Gaia2, a benchmark consisting of 1,120 human-annotated scenarios in a smartphone-like environment. It evaluates agents on capabilities including temporal awareness, adaptability to dynamic events, robustness to noise, ambiguity resolution, and multi-agent collaboration, with action-level verification suitable for RLVR training.

Contribution

ARE Verifier for action-level evaluation

The authors develop a verifier that evaluates every state-changing write action against oracle annotations, checking consistency, causality, timing, and turn-level correctness. This mechanism achieves high agreement with human annotations (0.98) and provides fine-grained credit assignment for RLVR, serving as a reusable component beyond Gaia2.