In January 2020, a team at OpenAI published a paper that would fundamentally change how the AI industry allocates billions of dollars. "Scaling Laws for Neural Language Models" by Jared Kaplan and colleagues established something that seemed almost too clean to be true: language model performance scales as power-laws with model size, dataset size, and training compute, with trends spanning more than seven orders of magnitude.
This wasn't a vague observation. It was a precise mathematical relationship. And it meant that for the first time, researchers could predict how well a model would perform before spending the compute to train it.
The Kaplan Scaling Laws
The paper's central finding was elegantly simple. Loss — the standard measure of how well a language model predicts text — decreases predictably as you increase any of three factors: the number of parameters (N), the size of the training dataset (D), or the total compute budget (C). The relationship follows a power law, meaning that a plot on log-log axes yields a straight line.
For model size alone, the loss scales according to L(N) = (Nc/N)^0.076, where Nc is a fitted constant; analogous power laws govern dataset size and compute. Architectural details like network width or depth had minimal effects within wide ranges. What mattered was scale — raw, simple scale.
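The "straight line on log-log axes" observation is also what makes these laws usable in practice: fitting the exponent is just linear regression in log space. The sketch below generates losses from a Kaplan-style power law and recovers its parameters; the constants are stand-ins near the paper's reported fit, not authoritative values.

```python
import numpy as np

def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan-style power law L(N) = (Nc / N)**alpha.
    n_c and alpha here are illustrative stand-ins near the paper's fit."""
    return (n_c / n_params) ** alpha

def fit_power_law(sizes, losses):
    """Recover (alpha, Nc) from (N, L) pairs: on log-log axes the power
    law is a straight line, log L = -alpha * log N + alpha * log Nc."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    alpha = -slope                    # loss falls as N grows
    n_c = np.exp(intercept / alpha)   # back out the constant
    return alpha, n_c

sizes = np.logspace(6, 11, 20)        # 1M to 100B parameters
alpha, n_c = fit_power_law(sizes, kaplan_loss(sizes))
```

With real training runs the points would scatter around the line, but the same two-parameter fit applies.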
Perhaps the most consequential finding was about efficiency: larger models are significantly more sample-efficient. This meant that the optimal strategy for a fixed compute budget was to train a very large model on a relatively modest amount of data, stopping significantly before convergence. More parameters per training token was better than more tokens per parameter.
When training within a fixed compute budget C, the paper predicted that most of the budget should go to model size: the optimal parameter count grows roughly as C^0.73, while batch size and the number of serial training steps grow far more slowly, and dataset size grows only sublinearly. This gave research labs a concrete formula for how to spend their GPU hours.
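As a toy version of that allocation rule: only the roughly-0.73 exponent below comes from Kaplan et al.'s fits; the function itself is an illustration, not the paper's code.

```python
def kaplan_optimal_growth(compute_multiple, exponent=0.73):
    """How much the compute-optimal model size should grow when the
    compute budget grows by `compute_multiple`, per Kaplan et al.'s
    fitted exponent of roughly 0.73 for model size versus compute."""
    return compute_multiple ** exponent

# 10x the compute budget -> roughly 5.4x the parameters
growth = kaplan_optimal_growth(10.0)
```

The asymmetry is the key point: under this fit, a tenfold compute increase mostly buys parameters, not data.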
The Chinchilla Correction
For two years, the AI industry followed Kaplan's prescription: build bigger models, don't worry as much about data. GPT-3 was 175 billion parameters. Gopher was 280 billion. Megatron-Turing NLG was 530 billion. The race was on for the largest model.
Then in March 2022, DeepMind published a paper that forced a dramatic course correction.
Jordan Hoffmann and colleagues trained over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens. Their analysis revealed that contemporary large language models were significantly undertrained — they were too large for the amount of data they were trained on.
To prove their point, DeepMind trained Chinchilla: a 70 billion parameter model using the same compute budget as their own 280 billion parameter Gopher, but with 4x more training data. The result was decisive. Chinchilla uniformly and significantly outperformed Gopher, GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on downstream evaluation tasks — despite being 4 to 7.5 times smaller.
On the MMLU benchmark, Chinchilla reached a state-of-the-art accuracy of 67.5%, a greater than 7% improvement over Gopher. The paper, now cited over 3,700 times, established a new rule: for compute-optimal training, models should be trained on approximately 20 tokens per parameter. Both model size and training data should scale proportionally with computational budget.
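The 20-tokens-per-parameter rule turns into a quick sizing calculator when combined with the standard approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6ND). This is a back-of-envelope sketch, not the paper's full fitting procedure.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal model and data size under the Chinchilla rule of
    thumb: C ~ 6 * N * D training FLOPs, with D ~ 20 * N, so
    N = sqrt(C / (6 * 20))."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget (~5.9e23 FLOPs):
n, d = chinchilla_optimal(5.88e23)   # about 70B parameters, 1.4T tokens
```

Plugging in Chinchilla's approximate budget recovers its actual configuration of about 70 billion parameters trained on about 1.4 trillion tokens, which is a useful sanity check on the rule of thumb.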
The implications rippled across the industry. Billions of dollars had been spent building models that were too large and undertrained. The optimal path wasn't just more parameters — it was the right balance between parameters and data.
Independent Verification
The significance of these scaling laws demanded rigorous verification. Epoch AI, a respected AI research organization, published a careful replication attempt of Chinchilla scaling in April 2024. Their work validated the core methodology while noting that optimal ratios can vary based on data quality and model architecture — an important nuance that acknowledged the laws as strong guidelines rather than universal constants.
Meanwhile, the Kempner Institute at Harvard developed a dynamical model of neural scaling laws, approximating test loss as a linear combination of model size and time bottleneck scalings. This theoretical framework provided mathematical foundations for understanding why scaling laws hold across diverse architectures and training regimes — suggesting something deeper than empirical coincidence.
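One way to write such a bottleneck decomposition is sketched below; the specific symbols and exponents are illustrative placeholders, not the Kempner paper's fitted form.

```latex
% Illustrative bottleneck decomposition: test loss as a residual floor
% plus a model-size-limited term plus a training-time-limited term.
% L_infty, a, b, alpha, beta are placeholder symbols, not fitted values.
L(N, t) \;\approx\; L_{\infty} \;+\; a\, N^{-\alpha} \;+\; b\, t^{-\beta}
```

The intuition is that whichever term decays more slowly at a given scale acts as the bottleneck on loss.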
The 2025 Supercollapse Discovery
The most remarkable finding came in July 2025. A paper titled "Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks" demonstrated something physicists had seen in phase transitions but no one expected in neural networks.
When compute-optimally trained models of varying sizes had their loss curves normalized, the curves collapsed onto a single universal function. The differences between individual curves fell below the noise floor of random seeds — a phenomenon the authors termed "supercollapse."
This means that compute-optimally trained neural networks don't just follow scaling laws — they follow the same underlying dynamics regardless of size. It's as if there's a single master curve that all models trace, just at different scales. This level of universality suggests that the scaling behavior of neural networks is governed by fundamental principles we're only beginning to understand.
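To make the collapse idea concrete, here is a minimal sketch of overlaying normalized curves and measuring their spread. Dividing each curve by its final loss is one simple normalization choice assumed here for illustration; the paper's exact rescaling may differ.

```python
import numpy as np

def normalized_curve(losses, final_loss):
    """Rescale a loss curve so curves from different model sizes can be
    overlaid. Normalizing by the final loss is an assumed, simple choice."""
    return np.asarray(losses, dtype=float) / final_loss

def collapse_spread(curves):
    """Maximum pointwise spread across normalized curves. A 'collapse'
    means this spread is tiny, comparable to random-seed noise."""
    stacked = np.stack(curves)
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())
```

If every model traces the same master curve up to scale, the normalized curves coincide and the spread sits at the noise floor; distinct dynamics would show up as a visibly larger spread.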
What Scaling Laws Mean for Builders
For teams building AI products, scaling laws provide something invaluable: predictability. Before Kaplan and Chinchilla, training a large model was a bet — you spent millions on compute and hoped the result would be good enough. Now, teams can predict performance with reasonable accuracy before committing resources.
But scaling laws also carry a warning. They tell us that raw scale has diminishing returns — each order of magnitude of compute buys a smaller improvement. The future of AI capability may depend less on building ever-larger models and more on finding new scaling dimensions: better data quality, more efficient architectures, improved training algorithms, and novel approaches like inference-time compute scaling.
At Promethic Labs, understanding these trade-offs is central to how we build. We don't chase parameter counts. We chase compute-optimal configurations that deliver the best performance for every dollar of infrastructure spend. Because in production, efficiency isn't just about cost — it's about being able to iterate, experiment, and improve faster than anyone else.