q0: Primitives for Hyper-Epoch Pretraining

June 14, 2026 · arXiv

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions.

The framing comes from Solomonoff induction [1], which says to keep many hypotheses for the data and weight them by a prior instead of betting everything on one. We do two things here. The first is the methodology. q0 turns a multi-epoch budget into a population through three primitives. A cyclic schedule collects diverse models quickly, chain distillation compounds their quality, and a learned prior weights them for any inference budget. The second is allocation. Building on our methodology, we work out how best to spend a given epoch budget, which shifts with its size, across the whole range from a single epoch up to the largest.

Primitive 1. Fast Exploration of Weight Space via Snapshot Ensembling

left: standard training restarts from scratch for every model, reaching one solution per run. right: a cyclic LR explores then exploits over and over, collecting several snapshots in a single run.

The obvious way to build a population is to train several models independently, each from its own random seed. But every run starts from scratch with no memory of what the others found, so you pay the full training cost again for each model.

Snapshot ensembling [2] gets the population of models from a single run, with no wasted restarts. The trick is a cyclic schedule. Training is divided into short cycles, and at each cycle end the learning rate has decayed to its floor while weight decay is at its peak, so the model has settled hard into a basin. Save that as a snapshot. Then the LR resets, giving the model enough energy to escape and settle somewhere new.

Primitive 2. Model Capability Compounding via Chain Distillation

high
loss

low
loss

chain distillation compounds quality. independent models train with CE only, no knowledge sharing

Snapshot ensembling gives you a population, but every member has still only seen one-hot labels. How good any single snapshot can get is capped by what cross-entropy can pull out of those labels, so a larger ensemble is mostly just more members of roughly the same quality. Chain distillation lifts that per-snapshot quality directly, by training on a richer signal than the labels alone.

After a few warmup cycles, each new snapshot trains against the one just before it, used as a frozen teacher and refreshed at every cycle boundary. Its loss keeps the usual next-token cross-entropy but adds a temperature-scaled KL term that pulls its predictions toward the teacher's:

\mathcal{L} \;=\; (1-\alpha)\,\mathcal{L}_{\text{CE}} \;+\; \alpha\,T^{2}\,\mathrm{KL}\big(\sigma(z'/T)\,\|\,\sigma(z/T)\big)

$\mathcal{L}_{\text{CE}}$ is cross-entropy against the true next token
$z,\ z'$ are the student's logits and the frozen teacher's (the previous snapshot)
$\alpha$ sets how much to lean on the teacher versus the ground truth
$T$ is the softmax temperature, with $T^2$ keeping the KL gradient on the CE term's scale

The teacher's soft distribution carries more than a one-hot label does. Its probabilities on the wrong tokens reveal which ones it treats as similar, the dark knowledge that distillation [3] relies on, and its confidence on the correct token reweights how much each example contributes. In the spirit of born-again networks [4], where a model is retrained against an earlier copy, both effects make the supervision strictly richer than the labels.

Because each model conditions on its predecessor and improves on it, the population's capability compounds along the chain rather than flattening into members of similar quality. This does cost some diversity, since consecutive snapshots end up more alike. But the per-model quality gain outweighs that loss, and the diversity a single chain gives up is recovered by running several trajectories rather than one.

Primitive 3. Learned Generalization Prior

standard averages every model equally. the learned prior weights the population and keeps only the models that help

The first two primitives give us a pool of models, trained quickly and each better than a plain snapshot. Standard ensembling simply averages all of them with equal weight. But not every model helps the ensemble the same amount, and if your inference budget only lets you run $K$ of the pool, you have to choose which $K$.

The naive choice is the $K$ with the lowest individual loss. But the individually-best models aren't always the best ensemble. A slightly worse model can still correct mistakes the others share, so instead of ranking by individual loss, we learn which members to keep and how to weight them, fitting weights on a small held-out set that minimize the ensemble's loss:

\mathcal{L}(\beta) = -\frac{1}{|\mathcal{F}|} \sum_{t \in \mathcal{F}} \log\!\left(\sum_{i=1}^{M} w_i \cdot p_{i,t}\right), \quad w = \mathrm{softmax}(\beta)

$\mathcal{F}$ is a held-out fitness set the weights are fit on
$p_{i,t}$ is the probability member $i$ assigns to the true token at position $t$
$w$ are the mixture weights, a softmax over learned logits $\beta$

Because the weights mix the probabilities inside the log, they favor models that complement each other rather than the ones that merely look best on their own. The same fit is reused at any inference budget $K$.

How to allocate the budget

Chain distillation makes each model individually better, but it has a side effect. The models start to agree more, which removes some of the diversity between them. That matters, because an ensemble only helps when its members are both good and disagree.

We get the diversity back by running a few separate trajectories (base models) instead of one, each from a different random initialization. A base model is one run of the cyclic schedule with chain distillation, producing its own chain of snapshots. And it stays cheap. Rather than train many models from scratch to get diversity, we only run a few trajectories, and each one gives us many snapshots.

But you can't just run many tiny trajectories either. Given a fixed budget of total training epochs $E$, do you spend it all on one long trajectory, or split it across several shorter ones? Neither extreme wins. Every fresh trajectory pays a warm-up cost before its snapshots are any good, so splitting a small budget across many bases wastes most of it warming up. A single trajectory stretched too long does the opposite, with its late cycles drifting away from good minima. The sweet spot moves with the budget.

32 960

each curve is a fixed number of base models. drag the epoch budget and the lowest curve is the best number to train, climbing as the budget grows

The optimal number of bases climbs in a rough staircase. One base wins up to around 128 epochs, two through about 256, three through about 512, and only past that does a fourth or fifth start to help. The gains flatten out after about three. Roughly speaking, each additional base wants a doubling of the epoch budget, so in most cases more cycles per trajectory is a better use of compute than more trajectories.

The small-budget regime

The plot above runs out to very large budgets, which may become routine as compute keeps growing. For now, though, the practical case is the other end, with only a handful of epochs to spend. From the allocation, these small budgets are best spent on a single base, since one trajectory beats splitting when epochs are scarce. But a handful of epochs is too little to spread the usual cyclic schedule across the whole run. So we keep the single trajectory and pack the cycles into the very end instead. We train normally for most of the budget, then run a short burst of cyclic LR cycles back-to-back, one snapshot per cycle.

We found that at a single epoch, four short cycles squeezed into the final fraction of training beat keeping one checkpoint. At four epochs, training three normally and then running the four short cycles through the last one does better still. The crossover we see lands around eight epochs. Below it, packing the cycles at the end wins. Above it, full cycles spread across the budget pay off and the staircase kicks in.

Results

Everything above runs on a 1.8B-parameter decoder-only transformer trained on 100M FineWeb [5] tokens, with a disjoint 10M-token validation set. This is the fixed-data, infinite-compute setting of the Slowrun challenge [6], where the goal is to get the best performance out of a limited dataset. The baseline is a strong one, with eight models trained independently for 32 epochs each (256 in total) and their EMA weights averaged uniformly in probability space.

method	epochs	val loss
baseline (naive ensemble)	256	3.0476
q0	256	3.0034
q0 (scaled)	960	2.9870

q0 reaches the 256-epoch baseline's loss at around 56 epochs, about 4.6× fewer epochs, or roughly 3.8× fewer when matched to the baseline's ensemble size of eight, and it continues to improve past it. Under the Slowrun data-efficiency accounting, that comes to a data efficiency of about 12.9×.

Validation loss versus total training epochs: q0 matches the ensemble baseline at far fewer epochs and keeps improving — q0 matches the strong ensemble baseline using far fewer epochs, about 56 at the best k (4.6×) and 67 at k=8 (3.8×), and keeps dropping past it.

The gains aren't confined to validation loss. They transfer to zero-shot downstream benchmarks, where q0 improves over the baseline on every task, and we find a data efficiency of about 14.2× (16.0× for the scaled run) against 12.2× for the baseline:

method	ARC-E	PIQA	SciQ	avg
baseline	0.4781	0.6697	0.7850	0.6443
q0	0.4865	0.6823	0.8030	0.6573
q0 (scaled)	0.4945	0.6774	0.8080	0.6600

We hope q0 encourages further work on multi-epoch pretraining as a way of scaling compute over limited data, especially by efficiently searching for diverse models. Searching over a population is a different problem than training a single model with gradient descent, and likely calls for new primitives; the three we introduce here are a first set that give large gains, and we expect much more lies ahead.

Full paper on arXiv.

References

R. J. Solomonoff. A formal theory of inductive inference. Information and Control 7(1), pp. 1–22, 1964.
G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, K. Q. Weinberger. Snapshot ensembles: train 1, get M for free. ICLR, 2017.
G. Hinton, O. Vinyals, J. Dean. Distilling the knowledge in a neural network. 2015.
T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, A. Anandkumar. Born-again neural networks. ICML, 2018.
G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. Von Werra, T. Wolf. The FineWeb datasets: decanting the web for the finest text data at scale. NeurIPS Datasets and Benchmarks, 2024.
Slowrun: a benchmark for language modeling in the fixed-data, infinite-compute regime. github.com/qlabs-eng/slowrun.