Scaling with Collapse: Efficient and Predictable Training of LLM Families

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Training loss curve collapse; Compute-efficient LLM pre-training; Tokens-per-parameter (TPP); AdamW EMA timescale; Learning-rate schedules; Scale-stable dynamics (μP); Early stopping for hyperparameter tuning
Abstract:

Effective LLM training relies on consistency, meaning that key quantities—such as final losses and optimal hyperparameters—scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can collapse onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under practical scaling recipes, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, Celerity, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.
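
To make the abstract's "simple normalization" concrete, here is a minimal sketch of one plausible way to put loss curves from differently sized models on common axes and quantify their deviation from collapse. The normalization choices (training fraction on the x-axis, loss rescaled by its final value) and the deviation metric are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def normalize_curve(steps, losses, grid):
    """Map a raw training loss curve onto normalized axes.

    x-axis: fraction of the total training budget (step / final step).
    y-axis: loss divided by its final value, so every curve ends at 1.0.
    Both choices are illustrative, not necessarily the paper's exact ones.
    """
    frac = np.asarray(steps, dtype=float) / float(steps[-1])
    rel_loss = np.asarray(losses, dtype=float) / float(losses[-1])
    return np.interp(grid, frac, rel_loss)

def collapse_deviation(curves, n_grid=200):
    """Quantify how far a family of runs is from perfect collapse.

    `curves` is a list of (steps, losses) pairs from models of different
    sizes. Returns the maximum pointwise spread across the normalized
    curves; a value near zero indicates collapse onto a single trajectory.
    """
    grid = np.linspace(0.05, 1.0, n_grid)  # skip the noisy first few percent
    stacked = np.stack([normalize_curve(s, l, grid) for s, l in curves])
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())
```

Tracking such a deviation statistic online, with reference curves taken from already trained smaller models, is one way the deviation-from-collapse diagnostic mentioned in the abstract could be operationalized.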

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper demonstrates that training loss curves collapse onto a universal trajectory when optimization hyperparameters are scaled optimally with model size and data budget. It resides in the 'Compute-Optimal Collapse Phenomena' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader 'Loss Curve Collapse and Universality' branch. The work extends recent findings on loss curve collapse by showing the phenomenon holds under practical joint scaling of width, depth, learning rate, batch size, and weight decay—a setting closer to real-world LLM training than prior studies.

The taxonomy reveals that this work sits at the intersection of multiple research threads. Its parent branch 'Loss Curve Collapse and Universality' is adjacent to 'Scaling Laws and Loss Prediction', which contains foundational power-law studies across seven papers in four sub-categories. Neighboring branches include 'Training Dynamics and Phase Transitions' (examining temporal behavior) and 'Optimization and Hyperparameter Scaling' (studying how hyperparameters adapt with scale). The paper bridges these areas by linking collapse phenomena to compute-optimal hyperparameter choices, connecting geometric curve structure to optimization efficiency in ways that prior scaling law formulations did not emphasize.

Among the 29 candidates examined, the core collapse demonstration (Contribution 1) has one refutable candidate out of 9 examined, suggesting some prior work on collapse exists but coverage is limited. For the Celerity LLM family (Contribution 2), 10 candidates were examined and none were refutable, indicating that the specific model instantiation appears novel within this search scope. The early stopping method (Contribution 3) has 2 refutable candidates among 10 examined, suggesting related hyperparameter tuning approaches exist. Because the search is limited in scale, these statistics reflect top-semantic-match coverage rather than an exhaustive field survey, and the sparse taxonomy leaf suggests this research direction remains relatively unexplored.

Given the small sibling set and limited candidate pool examined, the work appears to occupy a genuinely sparse area where loss curve collapse meets compute-optimal training. The analysis covers top-30 semantic matches and does not claim exhaustive coverage of all hyperparameter tuning or scaling law literature. The taxonomy structure suggests the field is actively fragmenting into specialized sub-problems, with this paper carving out a niche at the intersection of collapse phenomena and practical scaling recipes.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: training loss curve prediction and collapse across model scales. The field investigates how neural network training loss evolves as a function of model size, data, and compute, seeking predictable patterns that generalize across scales.

The taxonomy organizes this landscape into several major branches. Scaling Laws and Loss Prediction focuses on empirical power-law relationships that forecast final performance from resource budgets, exemplified by foundational work like Scaling Laws[1] and compute-optimal studies such as Chinchilla[2]. Loss Curve Collapse and Universality examines whether training curves from different model sizes can be mapped onto a single master curve, revealing universal structure in optimization dynamics. Training Dynamics and Phase Transitions studies abrupt changes in learning behavior, while Training Instabilities and Pathologies addresses phenomena like loss spikes and divergence. Optimization and Hyperparameter Scaling explores how learning rates and batch sizes should adapt with model scale, and Theoretical Foundations seeks mechanistic explanations for observed regularities. Additional branches cover data efficiency, model compression, implicit bias effects on downstream tasks, and domain-specific applications ranging from protein language models to reinforcement learning.

Recent work has intensified focus on whether loss curves truly collapse in a universal manner and what this implies for predicting large-scale behavior from small-scale proxies. Studies like Small-scale Proxies[6] and Loss-to-loss Prediction[9] explore whether cheaper pilot runs can reliably forecast expensive training outcomes, while Feature-Learning Consistency[8] investigates whether internal representations evolve similarly across scales.

The original paper, Scaling with Collapse[0], sits squarely within the Loss Curve Collapse and Universality branch, specifically addressing compute-optimal collapse phenomena. It shares thematic ground with Scaling Collapse Universal[27], which also examines universal collapse properties, but Scaling with Collapse[0] emphasizes how collapse behavior manifests under compute-optimal training regimes where model size and data are jointly scaled. This contrasts with earlier scaling law studies like Scaling Laws Precision[4] that focused primarily on predictive accuracy of power laws rather than the geometric structure of curve families, highlighting an evolving interest in deeper invariances beyond simple extrapolation formulas.
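
As background for the scaling-law branch described above, the following is a minimal sketch of fitting a saturating power law L(C) = E + A·C^(-alpha) to (compute, final loss) pairs, in the spirit of the cited scaling-law studies. The data points, initial guesses, and the choice to work in units of 1e18 FLOPs are hypothetical and purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(c, E, A, alpha):
    """L(C) = E + A * C^(-alpha): irreducible loss plus a decaying power-law term."""
    return E + A * np.power(c, -alpha)

# Hypothetical (compute, final loss) measurements; compute is expressed
# in units of 1e18 FLOPs to keep the fit numerically well behaved.
compute_1e18 = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
final_loss   = np.array([3.10, 2.91, 2.75, 2.64, 2.55])

params, _ = curve_fit(saturating_power_law, compute_1e18, final_loss,
                      p0=(2.0, 1.0, 0.3), maxfev=20000)
E, A, alpha = params
print(f"E={E:.2f}, A={A:.2f}, alpha={alpha:.2f}")
# Extrapolate to a 10x larger budget (1e21 FLOPs = 1000 in these units).
print("predicted loss at 1e21 FLOPs:",
      round(float(saturating_power_law(1000.0, E, A, alpha)), 3))
```

Extrapolating from cheap pilot runs to expensive ones in this way is exactly the capability that the proxy- and loss-to-loss-prediction work cited above tries to make more reliable.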

Claimed Contributions

Demonstration that training loss curves collapse under optimal hyperparameter scaling

The authors show that normalized training loss curves (TLCs) collapse onto a universal trajectory across different model sizes when the AdamW timescale τ, tokens-per-parameter ratio (TPP), and learning rate schedule are properly aligned. This collapse emerges as a signature of compute-efficient training.

Retrieved papers: 9 (1 refutable candidate found).
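
As a concrete reading of the quantities named in this contribution, the sketch below computes a data budget at fixed tokens-per-parameter and chooses weight decay so that the AdamW EMA timescale stays at a fixed fraction of training. It assumes the common reading tau ≈ 1 / (learning rate × weight decay) optimizer steps; whether this matches the paper's exact tau, and the placeholder numbers (TPP of 20, a 2M-token batch, target fraction 0.4), are illustrative assumptions.

```python
def training_budget(n_params, tpp, batch_tokens):
    """Data budget implied by a fixed tokens-per-parameter (TPP) ratio."""
    total_tokens = tpp * n_params
    total_steps = total_tokens / batch_tokens
    return total_tokens, total_steps

def weight_decay_for_timescale(lr, total_steps, tau_fraction):
    """Choose weight decay so the AdamW EMA timescale is a fixed fraction of training.

    Assumes the common reading tau ~= 1 / (lr * wd) optimizer steps, so
    holding tau / total_steps = tau_fraction across sizes gives
    wd = 1 / (lr * tau_fraction * total_steps). Both the definition of
    tau and the target fraction are illustrative assumptions.
    """
    return 1.0 / (lr * tau_fraction * total_steps)

# Hypothetical family at a fixed TPP of 20 with a 2M-token batch.
for n_params, lr in [(300e6, 3e-3), (1.3e9, 1.5e-3), (3.9e9, 9e-4)]:
    tokens, steps = training_budget(n_params, tpp=20, batch_tokens=2e6)
    wd = weight_decay_for_timescale(lr, steps, tau_fraction=0.4)
    print(f"{n_params:.1e} params: {tokens:.2e} tokens, "
          f"{steps:.0f} steps, weight decay {wd:.3f}")
```

Under this reading, the learning-rate schedule's shape is the remaining piece of the alignment the contribution describes; it is not modeled here.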
Celerity LLM family trained with collapse-inducing hyperparameter scaling

The authors introduce Celerity, the first large-scale LLM family (300M–3.9B parameters) explicitly trained in fixed-TPP bands with optimal τ scaling to achieve training loss curve collapse. This family demonstrates compute-efficiency and provides practical validation of collapse principles at scale.

Retrieved papers: 10 (no refutable candidates found).
Early stopping method for hyperparameter tuning using collapse predictions

The authors propose a functional form for normalized TLCs that can be fit on small-scale runs and used to extrapolate final loss from partial trajectories. This enables selecting optimal hyperparameters after only 10–30% of training, significantly reducing tuning compute costs.

Retrieved papers: 10 (2 refutable candidates found).
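
To illustrate how such an extrapolation-based early stop might look in practice, the sketch below fits a generic power-law-plus-offset form to the first 30% of each candidate's loss trajectory and extrapolates to the end of training. The functional form, the 30% cutoff, and fitting each candidate's own prefix (rather than reusing a form fit on small-scale runs, as the contribution describes) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def curve_form(t, a, b, c):
    """Generic normalized-loss ansatz: L(t) = c + a * t^(-b).

    t is the fraction of the training budget consumed (0 < t <= 1).
    This form is a placeholder for whatever form is fit to collapsed curves.
    """
    return c + a * np.power(t, -b)

def extrapolate_final_loss(frac, loss, observe_up_to=0.3):
    """Fit on the observed prefix and predict the loss at t = 1."""
    mask = frac <= observe_up_to
    params, _ = curve_fit(curve_form, frac[mask], loss[mask],
                          p0=(1.0, 0.5, 2.0), maxfev=20000)
    return curve_form(1.0, *params)

# Hypothetical tuning sweep: pick the candidate whose extrapolated
# final loss is lowest after seeing only 30% of each run.
rng = np.random.default_rng(0)
frac = np.linspace(0.01, 1.0, 200)
candidates = {
    "lr=3e-3": 2.6 + 0.9 * frac ** -0.45 + 0.005 * rng.standard_normal(200),
    "lr=1e-3": 2.7 + 1.1 * frac ** -0.40 + 0.005 * rng.standard_normal(200),
}
for name, loss in candidates.items():
    print(name, "extrapolated final loss:",
          round(float(extrapolate_final_loss(frac, loss)), 3))
```

Ranking candidates by their extrapolated final loss after 10-30% of training is the kind of compute saving the contribution claims; the reliability of that ranking depends on how well the fitted form captures the collapsed trajectory.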

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Demonstration that training loss curves collapse under optimal hyperparameter scaling

Contribution 2: Celerity LLM family trained with collapse-inducing hyperparameter scaling

Contribution 3: Early stopping method for hyperparameter tuning using collapse predictions

Each contribution is described in full under Claimed Contributions above.