Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reinforcement Learning, Reinforcement Learning from offline data
Abstract:

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: i) experience rehearsal and ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL’s sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
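The abstract does not state how the 102.8% figure is computed; assuming the conventional definition of relative improvement in aggregate score, it would correspond to:

```latex
\text{relative improvement} \;=\; \frac{S_{\text{method}} - S_{\text{scratch}}}{S_{\text{scratch}}} \times 100\%
```

Under that reading, the aggregate score with non-curated offline data would be slightly more than double that of the learning-from-scratch baselines.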

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes leveraging non-curated, reward-free, multi-embodiment offline data to improve online RL sample efficiency through world model pre-training with experience rehearsal and execution guidance. It resides in the 'World Model Pre-Training and Fine-Tuning' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers across 27 leaf nodes. This leaf focuses specifically on pre-training generalist world models from diverse offline data and adapting them for downstream tasks, distinguishing it from model-free offline-to-online methods and skill-based approaches.

The taxonomy reveals neighboring work in adjacent branches: 'Personalized and Heterogeneous Agent Methods' addresses agent-specific simulators rather than generalist models, while 'Offline-to-Online Reinforcement Learning' encompasses policy constraint methods and generative approaches that do not rely on world models. The sibling papers in this leaf—Generalist World Model and Guiding Generalist Models—share the core paradigm of pre-training large-scale dynamics models, but the taxonomy's scope notes clarify that methods without world models or those focused on skill extraction belong elsewhere. This positioning suggests the paper operates in a conceptually distinct but sparsely populated niche.

Among 21 candidates examined, two contributions show potential overlap. Contribution A (realistic setting for non-curated data) examined 6 candidates with 1 refutable match, while Contribution B (experience rehearsal and execution guidance) examined 5 candidates with 1 refutable match. Contribution C (the full NCRL method) examined 10 candidates with no refutations found. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a targeted rather than exhaustive literature review. The two refutable pairs suggest some prior exploration of related problem settings or techniques, though the majority of examined work does not directly overlap.

Given the sparse leaf occupancy and the limited search scope, the work appears to occupy a relatively underexplored intersection of world model pre-training and non-curated data utilization. The analysis covers semantic neighbors and citations but does not claim comprehensive field coverage. The two refutable contributions indicate partial precedent, while the core integrated method shows no direct refutation among the candidates examined, suggesting meaningful novelty within the constraints of this targeted literature search.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 2

Research Landscape Overview

Core task: leveraging non-curated offline data for sample-efficient reinforcement learning. The field addresses how agents can exploit large, unstructured datasets—often collected without task-specific intent—to accelerate learning when online interaction is costly or risky.

The taxonomy reveals a rich landscape spanning multiple methodological branches. Offline-to-Online Reinforcement Learning focuses on warm-starting policies with offline data before fine-tuning online, while Model-Based Methods with Non-Curated Data emphasize learning world models or dynamics from diverse experiences and then planning or adapting policies within those models. Skill and Primitive Discovery extracts reusable behaviors from unstructured trajectories, and Learning from Unstructured Play Data tackles the challenge of goal-agnostic exploration logs. Imitation Learning from Diverse Data and Purely Offline Reinforcement Learning represent contrasting paradigms—one leveraging demonstrations of varying quality, the other operating entirely without environment interaction. Cross-Domain and Transfer Learning, Specialized Offline RL Settings, Policy Evaluation and Selection, Meta-Learning and Multi-Task Offline RL, and Theoretical Foundations and Benchmarking round out the taxonomy, highlighting concerns around generalization, safety constraints, policy choice under uncertainty, and rigorous evaluation.

Within Model-Based Methods, a particularly active line centers on pre-training generalist world models from broad offline corpora and then fine-tuning or guiding them for specific tasks. Guiding World Models[0] exemplifies this approach by steering a pre-trained dynamics model toward task-relevant rollouts, closely aligning with Generalist World Model[1] and Guiding Generalist Models[7], which similarly emphasize adapting large-scale learned simulators. These works contrast with earlier efforts like Opal[2], which focused on narrower model-based offline RL without the generalist pre-training paradigm, and with diffusion-based guidance methods such as Energy Guided Diffusion[3] and Score Based Diffusion Policies[4], which apply similar steering ideas to policy rather than world-model space. The central trade-off across these branches involves balancing the richness and coverage of non-curated data against the risk of distribution shift and the computational cost of large model training, with Guiding World Models[0] positioned among methods that exploit pre-trained generalist representations to achieve sample efficiency in downstream tasks.

Claimed Contributions

Contribution A: Realistic setting for leveraging non-curated offline data

The authors introduce a problem setting where offline data is reward-free, of mixed quality, and collected across multiple embodiments, expanding beyond prior work that assumes curated, reward-labeled, or expert-only data. A schematic sketch of this setting follows this entry.

Retrieved papers: 5; can refute.
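To make the claimed setting concrete, here is a minimal, hypothetical schema for one record of such data. The class name, fields, and embodiment tags are invented for illustration and are not taken from the paper.

```python
# Hypothetical schema for one non-curated trajectory: reward-free, no quality
# or expert annotation, tagged only with the embodiment that produced it.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class NonCuratedTrajectory:
    observations: List[np.ndarray]   # e.g. image frames; behavior quality is unknown
    actions: List[np.ndarray]        # action spaces may differ across embodiments
    embodiment: str                  # e.g. "quadruped" or "arm_7dof" (made-up tags)
    # Deliberately absent: rewards, success labels, expert/novice annotations.


def group_by_embodiment(trajectories: List[NonCuratedTrajectory]) -> Dict[str, List[NonCuratedTrajectory]]:
    """The only curation available here is grouping records by embodiment tag."""
    groups: Dict[str, List[NonCuratedTrajectory]] = {}
    for traj in trajectories:
        groups.setdefault(traj.embodiment, []).append(traj)
    return groups
```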
Contribution B: Experience rehearsal and execution guidance techniques

The authors develop experience rehearsal (retrieving task-relevant trajectories from offline data to reduce distributional shift) and execution guidance (using a prior policy trained on retrieved data to steer exploration) to address the failure of naive world model fine-tuning. A sketch of both mechanisms follows this entry.

Retrieved papers: 5; can refute.
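The description above is concrete enough to sketch. The following is a minimal illustration, assuming a trajectory object with an `observations` field, a latent encoder `encode` (e.g., the world model's representation), a mean-latent distance as the retrieval criterion, and a simple probabilistic action mix; none of these choices are claimed to match the authors' actual design.

```python
from typing import Callable, List, Sequence

import numpy as np


def rehearsal_retrieve(
    offline_trajs: Sequence,
    encode: Callable[[np.ndarray], np.ndarray],
    task_obs: List[np.ndarray],
    top_k: int = 100,
) -> List:
    """Experience rehearsal (as described in this report): keep the offline
    trajectories whose encoded observations lie closest to the online task's
    observations, so replaying them during fine-tuning counters the
    distributional shift between offline and online data."""
    task_center = np.mean([encode(o) for o in task_obs], axis=0)

    def distance(traj) -> float:
        # Assumes trajectory objects expose an `observations` list.
        latents = np.stack([encode(o) for o in traj.observations])
        return float(np.linalg.norm(latents.mean(axis=0) - task_center))

    return sorted(offline_trajs, key=distance)[:top_k]


def guided_action(
    policy_action: np.ndarray,
    prior_action: np.ndarray,
    guidance_prob: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Execution guidance: with probability `guidance_prob`, execute the action
    of a prior policy trained on the retrieved data, steering exploration
    toward behavior supported by the offline dataset."""
    return prior_action if rng.random() < guidance_prob else policy_action
```

The design point this sketch highlights is that retrieval conditions on the online task's own observations, so the rehearsed data is meant to track the fine-tuning distribution rather than the full non-curated corpus.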
Contribution C: NCRL method leveraging non-curated data in both stages

The authors propose NCRL, a two-stage approach that pre-trains a task-agnostic world model on non-curated data and then fine-tunes it using the proposed techniques, demonstrating substantial improvements in sample efficiency across 72 visuomotor tasks. A control-flow sketch of the two stages follows this entry.

Retrieved papers: 10; no refutable match found.
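Combining the pieces, here is a control-flow outline of how a two-stage recipe of this kind could be organized. All objects (world model, policy, prior policy, environment) are duck-typed stand-ins, the environment's 3-tuple step interface is invented for brevity, and the batch sizes and mixing probabilities are placeholders rather than values from the paper.

```python
# Control-flow sketch of a two-stage pipeline in the spirit of NCRL; all
# components are hypothetical stand-ins, not the authors' implementation.
import random


def pretrain_world_model(world_model, non_curated_data):
    """Stage 1: task-agnostic pre-training on reward-free, multi-embodiment data."""
    for traj in non_curated_data:
        world_model.update_dynamics(traj)   # no reward model can be learned here


def finetune_online(world_model, policy, prior_policy, env, retrieved_trajs,
                    steps=1000, guidance_prob=0.3):
    """Stage 2: online fine-tuning with experience rehearsal and execution guidance."""
    online_buffer = []
    obs = env.reset()
    for _ in range(steps):
        # Execution guidance: occasionally follow the prior policy trained on
        # retrieved offline data to steer exploration.
        action = prior_policy(obs) if random.random() < guidance_prob else policy(obs)
        obs, reward, done = env.step(action)          # hypothetical 3-tuple interface
        online_buffer.append((obs, action, reward))
        # Experience rehearsal: mix retrieved offline trajectories into each
        # update batch to counteract distributional shift during fine-tuning.
        batch = random.sample(online_buffer, min(8, len(online_buffer))) \
              + random.sample(retrieved_trajs, min(8, len(retrieved_trajs)))
        world_model.update(batch)
        policy.improve(world_model)
        if done:
            obs = env.reset()
```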

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Realistic setting for leveraging non-curated offline data
Contribution B: Experience rehearsal and execution guidance techniques
Contribution C: NCRL method leveraging non-curated data in both stages