Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reinforcement Learning, Reinforcement Learning from offline data
Abstract:

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: i) experience rehearsal and ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL’s sample efficiency. Under limited sample budgets, our method achieves a 102.8% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.
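The abstract does not state how the 102.8% figure is computed; assuming the conventional definition of relative improvement in aggregate score, it would correspond to:

```latex
\text{relative improvement} \;=\; \frac{S_{\text{method}} - S_{\text{scratch}}}{S_{\text{scratch}}} \times 100\%
```

Under that reading, the aggregate score with non-curated offline data would be slightly more than double that of the learning-from-scratch baselines.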

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes leveraging non-curated, reward-free, multi-embodiment offline data to improve online RL sample efficiency through world model pre-training with experience rehearsal and execution guidance. It resides in the 'World Model Pre-Training and Fine-Tuning' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers across 27 leaf nodes. This leaf focuses specifically on pre-training generalist world models from diverse offline data and adapting them for downstream tasks, distinguishing it from model-free offline-to-online methods and skill-based approaches.

The taxonomy reveals neighboring work in adjacent branches: 'Personalized and Heterogeneous Agent Methods' addresses agent-specific simulators rather than generalist models, while 'Offline-to-Online Reinforcement Learning' encompasses policy constraint methods and generative approaches that do not rely on world models. The sibling papers in this leaf—Generalist World Model and Guiding Generalist Models—share the core paradigm of pre-training large-scale dynamics models, but the taxonomy's scope notes clarify that methods without world models or those focused on skill extraction belong elsewhere. This positioning suggests the paper operates in a conceptually distinct but sparsely populated niche.

Among 21 candidates examined, two contributions show potential overlap. Contribution A (realistic setting for non-curated data) examined 6 candidates with 1 refutable match, while Contribution B (experience rehearsal and execution guidance) examined 5 candidates with 1 refutable match. Contribution C (the full NCRL method) examined 10 candidates with no refutations found. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a targeted rather than exhaustive literature review. The two refutable pairs suggest some prior exploration of related problem settings or techniques, though the majority of examined work does not directly overlap.

Given the sparse leaf occupancy and the limited search scope, the work appears to occupy a relatively underexplored intersection of world model pre-training and non-curated data utilization. The analysis covers semantic neighbors and citations but does not claim comprehensive field coverage. The two refutable contributions indicate partial precedent, while the core integrated method shows no direct refutation among the candidates examined, suggesting meaningful novelty within the constraints of this targeted literature search.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 2

Research Landscape Overview

Core task: leveraging non-curated offline data for sample-efficient reinforcement learning. The field addresses how agents can exploit large, unstructured datasets—often collected without task-specific intent—to accelerate learning when online interaction is costly or risky.

The taxonomy reveals a rich landscape spanning multiple methodological branches. Offline-to-Online Reinforcement Learning focuses on warm-starting policies with offline data before fine-tuning online, while Model-Based Methods with Non-Curated Data emphasize learning world models or dynamics from diverse experiences and then planning or adapting policies within those models. Skill and Primitive Discovery extracts reusable behaviors from unstructured trajectories, and Learning from Unstructured Play Data tackles the challenge of goal-agnostic exploration logs. Imitation Learning from Diverse Data and Purely Offline Reinforcement Learning represent contrasting paradigms—one leveraging demonstrations of varying quality, the other operating entirely without environment interaction. Cross-Domain and Transfer Learning, Specialized Offline RL Settings, Policy Evaluation and Selection, Meta-Learning and Multi-Task Offline RL, and Theoretical Foundations and Benchmarking round out the taxonomy, highlighting concerns around generalization, safety constraints, policy choice under uncertainty, and rigorous evaluation.

Within Model-Based Methods, a particularly active line centers on pre-training generalist world models from broad offline corpora and then fine-tuning or guiding them for specific tasks. Guiding World Models[0] exemplifies this approach by steering a pre-trained dynamics model toward task-relevant rollouts, closely aligning with Generalist World Model[1] and Guiding Generalist Models[7], which similarly emphasize adapting large-scale learned simulators. These works contrast with earlier efforts like Opal[2], which focused on narrower model-based offline RL without the generalist pre-training paradigm, and with diffusion-based guidance methods such as Energy Guided Diffusion[3] and Score Based Diffusion Policies[4], which apply similar steering ideas to policy rather than world-model space. The central trade-off across these branches involves balancing the richness and coverage of non-curated data against the risk of distribution shift and the computational cost of large model training, with Guiding World Models[0] positioned among methods that exploit pre-trained generalist representations to achieve sample efficiency in downstream tasks.

Claimed Contributions

Contribution A: Realistic setting for leveraging non-curated offline data

The authors introduce a problem setting where offline data is reward-free, of mixed quality, and collected across multiple embodiments, expanding beyond prior work that assumes curated, reward-labeled, or expert-only data. A schematic sketch of this setting follows this entry.

Retrieved papers: 5; can refute.
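To make the claimed setting concrete, here is a minimal, hypothetical schema for one record of such data. The class name, fields, and embodiment tags are invented for illustration and are not taken from the paper.

```python
# Hypothetical schema for one non-curated trajectory: reward-free, no quality
# or expert annotation, tagged only with the embodiment that produced it.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class NonCuratedTrajectory:
    observations: List[np.ndarray]   # e.g. image frames; behavior quality is unknown
    actions: List[np.ndarray]        # action spaces may differ across embodiments
    embodiment: str                  # e.g. "quadruped" or "arm_7dof" (made-up tags)
    # Deliberately absent: rewards, success labels, expert/novice annotations.


def group_by_embodiment(trajectories: List[NonCuratedTrajectory]) -> Dict[str, List[NonCuratedTrajectory]]:
    """The only curation available here is grouping records by embodiment tag."""
    groups: Dict[str, List[NonCuratedTrajectory]] = {}
    for traj in trajectories:
        groups.setdefault(traj.embodiment, []).append(traj)
    return groups
```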
Contribution B: Experience rehearsal and execution guidance techniques

The authors develop experience rehearsal (retrieving task-relevant trajectories from offline data to reduce distributional shift) and execution guidance (using a prior policy trained on retrieved data to steer exploration) to address the failure of naive world model fine-tuning. A sketch of both mechanisms follows this entry.

Retrieved papers: 5; can refute.
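The description above is concrete enough to sketch. The following is a minimal illustration, assuming a trajectory object with an `observations` field, a latent encoder `encode` (e.g., the world model's representation), a mean-latent distance as the retrieval criterion, and a simple probabilistic action mix; none of these choices are claimed to match the authors' actual design.

```python
from typing import Callable, List, Sequence

import numpy as np


def rehearsal_retrieve(
    offline_trajs: Sequence,
    encode: Callable[[np.ndarray], np.ndarray],
    task_obs: List[np.ndarray],
    top_k: int = 100,
) -> List:
    """Experience rehearsal (as described in this report): keep the offline
    trajectories whose encoded observations lie closest to the online task's
    observations, so replaying them during fine-tuning counters the
    distributional shift between offline and online data."""
    task_center = np.mean([encode(o) for o in task_obs], axis=0)

    def distance(traj) -> float:
        # Assumes trajectory objects expose an `observations` list.
        latents = np.stack([encode(o) for o in traj.observations])
        return float(np.linalg.norm(latents.mean(axis=0) - task_center))

    return sorted(offline_trajs, key=distance)[:top_k]


def guided_action(
    policy_action: np.ndarray,
    prior_action: np.ndarray,
    guidance_prob: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Execution guidance: with probability `guidance_prob`, execute the action
    of a prior policy trained on the retrieved data, steering exploration
    toward behavior supported by the offline dataset."""
    return prior_action if rng.random() < guidance_prob else policy_action
```

The design point this sketch highlights is that retrieval conditions on the online task's own observations, so the rehearsed data is meant to track the fine-tuning distribution rather than the full non-curated corpus.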
Contribution C: NCRL method leveraging non-curated data in both stages

The authors propose NCRL, a two-stage approach that pre-trains a task-agnostic world model on non-curated data and then fine-tunes it using the proposed techniques, demonstrating substantial improvements in sample efficiency across 72 visuomotor tasks. A control-flow sketch of the two stages follows this entry.

Retrieved papers: 10; no refutable match found.
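Combining the pieces, here is a control-flow outline of how a two-stage recipe of this kind could be organized. All objects (world model, policy, prior policy, environment) are duck-typed stand-ins, the environment's 3-tuple step interface is invented for brevity, and the batch sizes and mixing probabilities are placeholders rather than values from the paper.

```python
# Control-flow sketch of a two-stage pipeline in the spirit of NCRL; all
# components are hypothetical stand-ins, not the authors' implementation.
import random


def pretrain_world_model(world_model, non_curated_data):
    """Stage 1: task-agnostic pre-training on reward-free, multi-embodiment data."""
    for traj in non_curated_data:
        world_model.update_dynamics(traj)   # no reward model can be learned here


def finetune_online(world_model, policy, prior_policy, env, retrieved_trajs,
                    steps=1000, guidance_prob=0.3):
    """Stage 2: online fine-tuning with experience rehearsal and execution guidance."""
    online_buffer = []
    obs = env.reset()
    for _ in range(steps):
        # Execution guidance: occasionally follow the prior policy trained on
        # retrieved offline data to steer exploration.
        action = prior_policy(obs) if random.random() < guidance_prob else policy(obs)
        obs, reward, done = env.step(action)          # hypothetical 3-tuple interface
        online_buffer.append((obs, action, reward))
        # Experience rehearsal: mix retrieved offline trajectories into each
        # update batch to counteract distributional shift during fine-tuning.
        batch = random.sample(online_buffer, min(8, len(online_buffer))) \
              + random.sample(retrieved_trajs, min(8, len(retrieved_trajs)))
        world_model.update(batch)
        policy.improve(world_model)
        if done:
            obs = env.reset()
```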

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: Realistic setting for leveraging non-curated offline data
Contribution B: Experience rehearsal and execution guidance techniques
Contribution C: NCRL method leveraging non-curated data in both stages