Scaling with Collapse: Efficient and Predictable Training of LLM Families
Overview
Overall Novelty Assessment
The paper demonstrates that training loss curves collapse onto a universal trajectory when optimization hyperparameters are scaled optimally with model size and data budget. It resides in the 'Compute-Optimal Collapse Phenomena' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader 'Loss Curve Collapse and Universality' branch. The work extends recent findings on loss curve collapse by showing the phenomenon holds under practical joint scaling of width, depth, learning rate, batch size, and weight decay—a setting closer to real-world LLM training than prior studies.
The taxonomy reveals that this work sits at the intersection of multiple research threads. Its parent branch 'Loss Curve Collapse and Universality' is adjacent to 'Scaling Laws and Loss Prediction', which contains foundational power-law studies across seven papers in four sub-categories. Neighboring branches include 'Training Dynamics and Phase Transitions' (examining temporal behavior) and 'Optimization and Hyperparameter Scaling' (studying how hyperparameters adapt with scale). The paper bridges these areas by linking collapse phenomena to compute-optimal hyperparameter choices, connecting geometric curve structure to optimization efficiency in ways that prior scaling law formulations did not emphasize.
Of the 29 candidates examined in total, the core collapse demonstration (Contribution 1) had one refutable candidate among its 9, suggesting that some prior work on collapse exists but coverage is limited. The Celerity LLM family (Contribution 2) had none refutable among its 10, indicating that the specific model instantiation appears novel within this search scope. The early stopping method (Contribution 3) had 2 refutable candidates among its 10, suggesting that related hyperparameter-tuning approaches exist. Because the search is limited in scale, these statistics reflect coverage of the top semantic matches rather than an exhaustive field survey, and the sparse taxonomy leaf suggests this research direction remains relatively unexplored.
Given the small sibling set and limited candidate pool examined, the work appears to occupy a genuinely sparse area where loss curve collapse meets compute-optimal training. The analysis covers the top 30 semantic matches and does not claim exhaustive coverage of the hyperparameter tuning or scaling law literature. The taxonomy structure suggests the field is actively fragmenting into specialized sub-problems, with this paper carving out a niche at the intersection of collapse phenomena and practical scaling recipes.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors show that normalized training loss curves (TLCs) collapse onto a universal trajectory across different model sizes when the AdamW timescale τ, tokens-per-parameter ratio (TPP), and learning rate schedule are properly aligned. This collapse emerges as a signature of compute-efficient training.
The authors introduce Celerity, the first large-scale LLM family (300M–3.9B parameters) explicitly trained in fixed-TPP bands with optimal τ scaling to achieve training loss curve collapse. This family demonstrates compute efficiency and provides practical validation of collapse principles at scale.
The authors propose a functional form for normalized TLCs that can be fit on small-scale runs and used to extrapolate final loss from partial trajectories. This enables selecting optimal hyperparameters after only 10–30% of training, significantly reducing tuning compute costs.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[27] Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Contribution Analysis
Detailed comparisons for each claimed contribution
Demonstration that training loss curves collapse under optimal hyperparameter scaling
The authors show that normalized training loss curves (TLCs) collapse onto a universal trajectory across different model sizes when the AdamW timescale τ, the tokens-per-parameter ratio (TPP), and the learning rate schedule are properly aligned; the collapse emerges as a signature of compute-efficient training. A hedged code sketch of such a collapse check appears after the comparison list below.
[27] Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
[1] Scaling laws for neural language models
[35] nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales
[47] Resolving discrepancies in compute-optimal scaling of language models
[48] Simplifying DINO via Coding Rate Regularization
[49] Exploring molecular pretraining model at scale
[50] Warmstarting for scaling language models
[51] Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
[52] Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
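To make the collapse check concrete, the following is a minimal sketch of overlaying normalized training loss curves from runs of different sizes and measuring their spread. The normalization (loss divided by its final value, plotted against the fraction of the token budget consumed) and the synthetic curves are assumptions for illustration, not the paper's exact procedure.

```python
"""Minimal sketch: testing whether training loss curves 'collapse' after
normalization. Assumed normalization (not taken from the paper): plot loss
relative to its final value against the fraction of the token budget
consumed. All curve data below is synthetic and purely illustrative."""

import numpy as np

def normalize_curve(tokens, loss):
    """Map a raw (tokens, loss) trajectory to (training fraction, relative loss)."""
    frac = tokens / tokens[-1]    # fraction of the total token budget
    rel_loss = loss / loss[-1]    # loss relative to the final loss
    return frac, rel_loss

def collapse_deviation(curves, grid=np.linspace(0.1, 1.0, 91)):
    """Interpolate all normalized curves onto a common grid and report the
    maximum spread across curves; a small value indicates collapse."""
    interpolated = []
    for tokens, loss in curves:
        frac, rel = normalize_curve(np.asarray(tokens), np.asarray(loss))
        interpolated.append(np.interp(grid, frac, rel))
    stacked = np.stack(interpolated)
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())

if __name__ == "__main__":
    # Two synthetic power-law-like curves standing in for different model sizes.
    t = np.linspace(1e9, 1e11, 200)
    curves = [
        (t, 2.0 + 8.0 * (t / t[-1]) ** -0.3),
        (t, 1.7 + 6.8 * (t / t[-1]) ** -0.3),
    ]
    print(f"max deviation across normalized curves: {collapse_deviation(curves):.3f}")
```

By construction the two synthetic curves collapse exactly after normalization, so the reported deviation is near zero; curves trained with misaligned hyperparameters would show a larger spread.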
Celerity LLM family trained with collapse-inducing hyperparameter scaling
The authors introduce Celerity, the first large-scale LLM family (300M–3.9B parameters) explicitly trained in fixed-TPP bands with optimal τ scaling to achieve training loss curve collapse. The family demonstrates compute efficiency and provides practical validation of collapse principles at scale. An illustrative recipe-sizing sketch appears after the comparison list below.
[37] Scaling data-constrained language models
[38] Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer
[39] Tuning large neural networks via zero-shot hyperparameter transfer
[40] Minicpm: Unveiling the potential of small language models with scalable training strategies
[41] Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster
[42] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
[43] A system for massively parallel hyperparameter tuning
[44] Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
[45] Scaling laws for generative mixed-modal language models
[46] Scaling laws for differentially private language models
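As an illustration of what "fixed-TPP bands with optimal τ scaling" could look like operationally, the sketch below sizes a hypothetical model family. The AdamW-timescale definition (τ in optimizer steps as 1/(η·λ)) and the rule of holding τ at a fixed fraction of the run are assumptions drawn from the broader timescale literature; the 20-TPP band, learning rates, and batch sizes are placeholders, not Celerity's published recipe.

```python
"""Minimal sketch: sizing a model family in a fixed tokens-per-parameter (TPP)
band while holding an assumed AdamW timescale at a fixed fraction of training.
Assumptions (not taken from the paper): tau_iters = 1 / (lr * weight_decay)
optimizer steps, and 'optimal tau scaling' means keeping tau_iters / total_steps
constant across model sizes. All numeric settings are illustrative."""

from dataclasses import dataclass

@dataclass
class TrainingRecipe:
    params: float        # model size N (parameters)
    tokens: float        # token budget D
    steps: int           # optimizer steps
    lr: float            # peak learning rate
    weight_decay: float  # AdamW weight decay

def make_recipe(params, tpp, batch_tokens, lr, tau_fraction=0.1):
    """Derive the token budget from a fixed TPP band and pick weight decay so
    the assumed AdamW timescale is `tau_fraction` of the run."""
    tokens = tpp * params
    steps = int(tokens / batch_tokens)
    # tau_iters = 1 / (lr * wd) = tau_fraction * steps  =>  wd = 1 / (lr * tau_fraction * steps)
    weight_decay = 1.0 / (lr * tau_fraction * steps)
    return TrainingRecipe(params, tokens, steps, lr, weight_decay)

if __name__ == "__main__":
    # Illustrative family spanning roughly the Celerity size range at an assumed 20 TPP.
    for n, lr, batch_tokens in [(3e8, 3e-3, 0.5e6), (1e9, 2e-3, 1e6), (3.9e9, 1.2e-3, 2e6)]:
        r = make_recipe(n, tpp=20, batch_tokens=batch_tokens, lr=lr)
        print(f"N={r.params:.1e}  D={r.tokens:.1e}  steps={r.steps}  "
              f"lr={r.lr:.1e}  wd={r.weight_decay:.3f}")
```

The point of the sketch is only the coupling: once the TPP band, batch size, and learning rate are fixed, holding the timescale fraction constant pins down the remaining AdamW hyperparameter for every model in the family.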
Early stopping method for hyperparameter tuning using collapse predictions
The authors propose a functional form for normalized TLCs that can be fit on small-scale runs and used to extrapolate final loss from partial trajectories. This enables selecting optimal hyperparameters after only 10–30% of training, significantly reducing tuning compute costs.
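A minimal sketch of the extrapolation idea follows, assuming a generic shifted power law in the training fraction rather than the paper's fitted functional form: fit the form on the first 30% of a synthetic loss trajectory with scipy.optimize.curve_fit and predict the final loss.

```python
"""Minimal sketch: fitting a parametric form to the early part of a training
loss curve and extrapolating the final loss, in the spirit of early stopping
for hyperparameter sweeps. The form L(f) = L_inf + a * f**(-b), with f the
fraction of the token budget, is an assumed stand-in, not the paper's form."""

import numpy as np
from scipy.optimize import curve_fit

def loss_model(frac, l_inf, a, b):
    """Shifted power law in the training fraction."""
    return l_inf + a * frac ** (-b)

def extrapolate_final_loss(frac, loss, observed_fraction=0.3):
    """Fit the model on the observed prefix of the run and predict loss at f=1."""
    mask = frac <= observed_fraction
    popt, _ = curve_fit(loss_model, frac[mask], loss[mask],
                        p0=(loss[mask].min(), 1.0, 0.5), maxfev=10_000)
    return loss_model(1.0, *popt), popt

if __name__ == "__main__":
    # Synthetic run: a known curve plus a little observation noise.
    rng = np.random.default_rng(0)
    frac = np.linspace(0.01, 1.0, 500)
    true = 1.9 + 0.9 * frac ** (-0.35)
    noisy = true + rng.normal(scale=0.01, size=frac.shape)

    predicted, params = extrapolate_final_loss(frac, noisy, observed_fraction=0.3)
    print(f"predicted final loss: {predicted:.3f}  (true: {true[-1]:.3f})")
```

In a sweep, one would fit each candidate configuration's partial curve this way and keep only the configurations whose extrapolated final loss is competitive, rather than running every configuration to completion.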