Scaling with Collapse: Efficient and Predictable Training of LLM Families

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Training loss curve collapse; Compute-efficient LLM pre-training; Tokens-per-parameter (TPP); AdamW EMA timescale; Learning-rate schedules; Scale-stable dynamics (μP); Early stopping for hyperparameter tuning
Abstract:

Effective LLM training relies on consistency, meaning that key quantities—such as final losses and optimal hyperparameters—scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can collapse onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under practical scaling recipes, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, Celerity, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.
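
To make the abstract's "simple normalization" concrete, here is a minimal sketch of one plausible way to put loss curves from differently sized models on common axes and quantify their deviation from collapse. The normalization choices (training fraction on the x-axis, loss rescaled by its final value) and the deviation metric are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def normalize_curve(steps, losses, grid):
    """Map a raw training loss curve onto normalized axes.

    x-axis: fraction of the total training budget (step / final step).
    y-axis: loss divided by its final value, so every curve ends at 1.0.
    Both choices are illustrative, not necessarily the paper's exact ones.
    """
    frac = np.asarray(steps, dtype=float) / float(steps[-1])
    rel_loss = np.asarray(losses, dtype=float) / float(losses[-1])
    return np.interp(grid, frac, rel_loss)

def collapse_deviation(curves, n_grid=200):
    """Quantify how far a family of runs is from perfect collapse.

    `curves` is a list of (steps, losses) pairs from models of different
    sizes. Returns the maximum pointwise spread across the normalized
    curves; a value near zero indicates collapse onto a single trajectory.
    """
    grid = np.linspace(0.05, 1.0, n_grid)  # skip the noisy first few percent
    stacked = np.stack([normalize_curve(s, l, grid) for s, l in curves])
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())
```

Tracking such a deviation statistic online, with reference curves taken from already trained smaller models, is one way the deviation-from-collapse diagnostic mentioned in the abstract could be operationalized.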

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper demonstrates that training loss curves collapse onto a universal trajectory when optimization hyperparameters are scaled optimally with model size and data budget. It resides in the 'Compute-Optimal Collapse Phenomena' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader 'Loss Curve Collapse and Universality' branch. The work extends recent findings on loss curve collapse by showing the phenomenon holds under practical joint scaling of width, depth, learning rate, batch size, and weight decay—a setting closer to real-world LLM training than prior studies.

The taxonomy reveals that this work sits at the intersection of multiple research threads. Its parent branch 'Loss Curve Collapse and Universality' is adjacent to 'Scaling Laws and Loss Prediction', which contains foundational power-law studies across seven papers in four sub-categories. Neighboring branches include 'Training Dynamics and Phase Transitions' (examining temporal behavior) and 'Optimization and Hyperparameter Scaling' (studying how hyperparameters adapt with scale). The paper bridges these areas by linking collapse phenomena to compute-optimal hyperparameter choices, connecting geometric curve structure to optimization efficiency in ways that prior scaling law formulations did not emphasize.

Among the 29 candidates examined, the core collapse demonstration (Contribution 1) has one refutable candidate out of 9 examined, suggesting some prior work on collapse exists but coverage is limited. For the Celerity LLM family (Contribution 2), 10 candidates were examined and none were refutable, indicating that the specific model instantiation appears novel within this search scope. The early stopping method (Contribution 3) has 2 refutable candidates among 10 examined, suggesting related hyperparameter tuning approaches exist. Because the search is limited in scale, these statistics reflect top-semantic-match coverage rather than an exhaustive field survey, and the sparse taxonomy leaf suggests this research direction remains relatively unexplored.

Given the small sibling set and limited candidate pool examined, the work appears to occupy a genuinely sparse area where loss curve collapse meets compute-optimal training. The analysis covers top-30 semantic matches and does not claim exhaustive coverage of all hyperparameter tuning or scaling law literature. The taxonomy structure suggests the field is actively fragmenting into specialized sub-problems, with this paper carving out a niche at the intersection of collapse phenomena and practical scaling recipes.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: training loss curve prediction and collapse across model scales. The field investigates how neural network training loss evolves as a function of model size, data, and compute, seeking predictable patterns that generalize across scales.

The taxonomy organizes this landscape into several major branches. Scaling Laws and Loss Prediction focuses on empirical power-law relationships that forecast final performance from resource budgets, exemplified by foundational work like Scaling Laws[1] and compute-optimal studies such as Chinchilla[2]. Loss Curve Collapse and Universality examines whether training curves from different model sizes can be mapped onto a single master curve, revealing universal structure in optimization dynamics. Training Dynamics and Phase Transitions studies abrupt changes in learning behavior, while Training Instabilities and Pathologies addresses phenomena like loss spikes and divergence. Optimization and Hyperparameter Scaling explores how learning rates and batch sizes should adapt with model scale, and Theoretical Foundations seeks mechanistic explanations for observed regularities. Additional branches cover data efficiency, model compression, implicit bias effects on downstream tasks, and domain-specific applications ranging from protein language models to reinforcement learning.

Recent work has intensified focus on whether loss curves truly collapse in a universal manner and what this implies for predicting large-scale behavior from small-scale proxies. Studies like Small-scale Proxies[6] and Loss-to-loss Prediction[9] explore whether cheaper pilot runs can reliably forecast expensive training outcomes, while Feature-Learning Consistency[8] investigates whether internal representations evolve similarly across scales.

The original paper, Scaling with Collapse[0], sits squarely within the Loss Curve Collapse and Universality branch, specifically addressing compute-optimal collapse phenomena. It shares thematic ground with Scaling Collapse Universal[27], which also examines universal collapse properties, but Scaling with Collapse[0] emphasizes how collapse behavior manifests under compute-optimal training regimes where model size and data are jointly scaled. This contrasts with earlier scaling law studies like Scaling Laws Precision[4] that focused primarily on predictive accuracy of power laws rather than the geometric structure of curve families, highlighting an evolving interest in deeper invariances beyond simple extrapolation formulas.
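
As background for the scaling-law branch described above, the following is a minimal sketch of fitting a saturating power law L(C) = E + A·C^(-alpha) to (compute, final loss) pairs, in the spirit of the cited scaling-law studies. The data points, initial guesses, and the choice to work in units of 1e18 FLOPs are hypothetical and purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(c, E, A, alpha):
    """L(C) = E + A * C^(-alpha): irreducible loss plus a decaying power-law term."""
    return E + A * np.power(c, -alpha)

# Hypothetical (compute, final loss) measurements; compute is expressed
# in units of 1e18 FLOPs to keep the fit numerically well behaved.
compute_1e18 = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
final_loss   = np.array([3.10, 2.91, 2.75, 2.64, 2.55])

params, _ = curve_fit(saturating_power_law, compute_1e18, final_loss,
                      p0=(2.0, 1.0, 0.3), maxfev=20000)
E, A, alpha = params
print(f"E={E:.2f}, A={A:.2f}, alpha={alpha:.2f}")
# Extrapolate to a 10x larger budget (1e21 FLOPs = 1000 in these units).
print("predicted loss at 1e21 FLOPs:",
      round(float(saturating_power_law(1000.0, E, A, alpha)), 3))
```

Extrapolating from cheap pilot runs to expensive ones in this way is exactly the capability that the proxy- and loss-to-loss-prediction work cited above tries to make more reliable.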

Claimed Contributions

Demonstration that training loss curves collapse under optimal hyperparameter scaling

The authors show that normalized training loss curves (TLCs) collapse onto a universal trajectory across different model sizes when the AdamW timescale τ, tokens-per-parameter ratio (TPP), and learning rate schedule are properly aligned. This collapse emerges as a signature of compute-efficient training.

Retrieved papers: 9 (1 refutable candidate found).
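
As a concrete reading of the quantities named in this contribution, the sketch below computes a data budget at fixed tokens-per-parameter and chooses weight decay so that the AdamW EMA timescale stays at a fixed fraction of training. It assumes the common reading tau ≈ 1 / (learning rate × weight decay) optimizer steps; whether this matches the paper's exact tau, and the placeholder numbers (TPP of 20, a 2M-token batch, target fraction 0.4), are illustrative assumptions.

```python
def training_budget(n_params, tpp, batch_tokens):
    """Data budget implied by a fixed tokens-per-parameter (TPP) ratio."""
    total_tokens = tpp * n_params
    total_steps = total_tokens / batch_tokens
    return total_tokens, total_steps

def weight_decay_for_timescale(lr, total_steps, tau_fraction):
    """Choose weight decay so the AdamW EMA timescale is a fixed fraction of training.

    Assumes the common reading tau ~= 1 / (lr * wd) optimizer steps, so
    holding tau / total_steps = tau_fraction across sizes gives
    wd = 1 / (lr * tau_fraction * total_steps). Both the definition of
    tau and the target fraction are illustrative assumptions.
    """
    return 1.0 / (lr * tau_fraction * total_steps)

# Hypothetical family at a fixed TPP of 20 with a 2M-token batch.
for n_params, lr in [(300e6, 3e-3), (1.3e9, 1.5e-3), (3.9e9, 9e-4)]:
    tokens, steps = training_budget(n_params, tpp=20, batch_tokens=2e6)
    wd = weight_decay_for_timescale(lr, steps, tau_fraction=0.4)
    print(f"{n_params:.1e} params: {tokens:.2e} tokens, "
          f"{steps:.0f} steps, weight decay {wd:.3f}")
```

Under this reading, the learning-rate schedule's shape is the remaining piece of the alignment the contribution describes; it is not modeled here.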
Celerity LLM family trained with collapse-inducing hyperparameter scaling

The authors introduce Celerity, the first large-scale LLM family (300M–3.9B parameters) explicitly trained in fixed-TPP bands with optimal τ scaling to achieve training loss curve collapse. This family demonstrates compute-efficiency and provides practical validation of collapse principles at scale.

Retrieved papers: 10 (no refutable candidates found).
Early stopping method for hyperparameter tuning using collapse predictions

The authors propose a functional form for normalized TLCs that can be fit on small-scale runs and used to extrapolate final loss from partial trajectories. This enables selecting optimal hyperparameters after only 10–30% of training, significantly reducing tuning compute costs.

Retrieved papers: 10 (2 refutable candidates found).
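
To illustrate how such an extrapolation-based early stop might look in practice, the sketch below fits a generic power-law-plus-offset form to the first 30% of each candidate's loss trajectory and extrapolates to the end of training. The functional form, the 30% cutoff, and fitting each candidate's own prefix (rather than reusing a form fit on small-scale runs, as the contribution describes) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def curve_form(t, a, b, c):
    """Generic normalized-loss ansatz: L(t) = c + a * t^(-b).

    t is the fraction of the training budget consumed (0 < t <= 1).
    This form is a placeholder for whatever form is fit to collapsed curves.
    """
    return c + a * np.power(t, -b)

def extrapolate_final_loss(frac, loss, observe_up_to=0.3):
    """Fit on the observed prefix and predict the loss at t = 1."""
    mask = frac <= observe_up_to
    params, _ = curve_fit(curve_form, frac[mask], loss[mask],
                          p0=(1.0, 0.5, 2.0), maxfev=20000)
    return curve_form(1.0, *params)

# Hypothetical tuning sweep: pick the candidate whose extrapolated
# final loss is lowest after seeing only 30% of each run.
rng = np.random.default_rng(0)
frac = np.linspace(0.01, 1.0, 200)
candidates = {
    "lr=3e-3": 2.6 + 0.9 * frac ** -0.45 + 0.005 * rng.standard_normal(200),
    "lr=1e-3": 2.7 + 1.1 * frac ** -0.40 + 0.005 * rng.standard_normal(200),
}
for name, loss in candidates.items():
    print(name, "extrapolated final loss:",
          round(float(extrapolate_final_loss(frac, loss)), 3))
```

Ranking candidates by their extrapolated final loss after 10-30% of training is the kind of compute saving the contribution claims; the reliability of that ranking depends on how well the fitted form captures the collapsed trajectory.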

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Demonstration that training loss curves collapse under optimal hyperparameter scaling

Contribution 2: Celerity LLM family trained with collapse-inducing hyperparameter scaling

Contribution 3: Early stopping method for hyperparameter tuning using collapse predictions

Each contribution is described in full under Claimed Contributions above.