Research · Paper in Prep · HPC Sweep

Grokking Topology

Early-warning signals for emergent generalization.

Grokking is the phenomenon where a neural network, after a long phase of apparent overfitting, suddenly snaps into perfect generalization. It is disconcerting, it is real, and it is poorly understood. We are investigating whether the topology of the loss landscape provides a reliable, early-warning signal for grokking, and whether that signal beats simpler scalar proxies like weight norm.

The research question

For the canonical grokking task (modular addition with a small transformer), the model can sit at near-zero training loss and chance-level test accuracy for tens of thousands of steps before suddenly generalizing. By the time test accuracy moves, the dynamics that drive the transition have already played out. Anyone hoping to predict or accelerate grokking has to find a signal that fires earlier.

We hypothesized that persistent homology computed over slices of the loss landscape near the operating point would carry an earlier and cleaner signal than scalar curvature or weight-norm proxies. The full-scale sweep is what tests that hypothesis honestly.
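To make "persistent homology over a slice" concrete, here is a minimal sketch of H0 sublevel-set persistence for a one-dimensional sampled function, using union-find and the elder rule. It is an illustration under stated assumptions, not the study's pipeline: the function name `h0_sublevel_persistence` and the toy input are hypothetical, and a real slice would come from evaluating the loss along a direction in parameter space around a checkpoint.

```python
def h0_sublevel_persistence(values):
    """H0 (birth, death) pairs for the sublevel-set filtration of a 1D sample.

    Samples enter the filtration in order of increasing value. A new local
    minimum starts a component; when two components meet, the younger one
    (larger birth value) dies. The global-minimum component never dies and
    is reported with death = inf. Zero-persistence pairs are dropped.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n          # None means "not yet in the filtration"
    birth = {}                   # component root -> birth value

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):            # 1D grid neighbours
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at this level.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    root = find(order[0])
    pairs.append((birth[root], float("inf")))
    return pairs


# Toy "loss slice" with three local minima (values 1, 0, 2).
toy_slice = [3, 1, 4, 0, 5, 2, 6]
print(h0_sublevel_persistence(toy_slice))
```

Each finite pair records how deep a basin is before it merges into a deeper neighbour, which is the kind of structure a scalar such as weight norm cannot express directly.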

Experimental setup

Task

Modular addition mod 97 with a one-layer transformer decoder (roughly 302K parameters). This is the canonical grokking testbed: the dynamics are clean and the generalization snap is unmistakable.
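For readers unfamiliar with the task, the dataset is small enough to enumerate in full. The sketch below builds every pair (a, b) with label (a + b) mod 97 and splits it into train and test sets. The train fraction and seed are illustrative assumptions; the source does not state the split used in the study.

```python
import random

P = 97  # modulus from the task description


def make_modular_addition_data(p=P, train_fraction=0.3, seed=0):
    """Enumerate all (a, b) pairs with label (a + b) mod p, then split.

    train_fraction and seed are placeholders, not the study's values.
    """
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = make_modular_addition_data()
print(len(train), len(test))  # the two sizes sum to 97 * 97 = 9409
```

Because the full input space has only 9409 examples, full-batch training on the train split is cheap, and test accuracy is measured on every pair the model has not seen.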

Training

Full-batch AdamW, lr = 1e-3, 100K steps, with weight decay varied across runs. Checkpoints are saved throughout training, including the entire pre-grokking plateau.
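Weight decay is the knob that varies across runs, and it acts directly on the weight norm. As a reminder of how, here is a minimal pure-Python sketch of one AdamW update with decoupled weight decay. The betas and eps are the commonly used defaults and the weight-decay value is a placeholder; none of these are stated in the source.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=1.0,
               betas=(0.9, 0.999), eps=1e-8):
    """One AdamW update on a flat list of parameters (t counts from 1).

    weight_decay, betas and eps are illustrative assumptions.
    """
    b1, b2 = betas
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        # Decoupled decay: shrink the weight directly, independent of the gradient.
        p = p * (1.0 - lr * weight_decay)
        # Standard Adam moment estimates with bias correction.
        mi = b1 * mi + (1 - b1) * g
        vi = b2 * vi + (1 - b2) * g * g
        m_hat = mi / (1 - b1 ** t)
        v_hat = vi / (1 - b2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# With zero gradient, only the decay term acts and the norm shrinks geometrically.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adamw_step(theta, [0.0, 0.0], m, v, t=1)
print(theta)
```

On the plateau, where training loss is already near zero and gradients are small, the decay term is a large part of what moves the weights, which is one reason weight norm is a natural scalar baseline.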

Topological observables

H0 total persistence, H0 effective feature count (inverse participation ratio), H0 median persistence, H0 persistence entropy. Comparator: commutator defect computed full-batch with proper detachment.
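The four H0 observables are all summaries of the same list of bar lifetimes (death minus birth). The sketch below shows the standard definitions, assuming the lifetimes have already been extracted from a persistence diagram and that infinite bars are excluded; the function name and the exact normalization the study uses are assumptions.

```python
import math
import statistics


def h0_summaries(lifetimes):
    """Summary statistics over finite, positive H0 lifetimes.

    With p_i = l_i / sum(l), the effective feature count is the inverse
    participation ratio 1 / sum(p_i^2), and persistence entropy is
    -sum(p_i * log(p_i)).
    """
    ls = [l for l in lifetimes if math.isfinite(l) and l > 0]
    if not ls:
        raise ValueError("need at least one finite, positive lifetime")
    total = sum(ls)
    probs = [l / total for l in ls]
    return {
        "total_persistence": total,
        "effective_feature_count": 1.0 / sum(p * p for p in probs),
        "median_persistence": statistics.median(ls),
        "persistence_entropy": -sum(p * math.log(p) for p in probs),
    }


# Four equally long bars: effective count 4, entropy log(4).
print(h0_summaries([1.0, 1.0, 1.0, 1.0]))
# One dominant bar: effective count close to 1.
print(h0_summaries([10.0, 0.1, 0.1]))
```

The effective feature count and the entropy both measure how evenly persistence is spread across bars, while total and median persistence track overall scale, so some correlation among the four is expected by construction.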

Compute

NMSU Discovery HPC, A100-PCIE-40GB. The full study is a 90-job constrained scaling sweep (30 seeds × 3 weight-decay values) to test reproducibility of any signal we find.
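The sweep arithmetic can be written down as a job grid: each (weight decay, seed) pair becomes one job, indexed so that a job-array index maps to a unique configuration. This is a sketch only. The weight-decay values below are placeholders, since the source does not list the three values used, and the field names are hypothetical.

```python
from itertools import product

SEEDS = range(30)
WEIGHT_DECAYS = (0.1, 0.3, 1.0)  # placeholders; actual values are not given in the source


def job_grid(seeds=SEEDS, weight_decays=WEIGHT_DECAYS):
    """Return one config dict per (weight_decay, seed) combination."""
    return [
        {"job_id": i, "seed": seed, "weight_decay": wd, "lr": 1e-3, "steps": 100_000}
        for i, (wd, seed) in enumerate(product(weight_decays, seeds))
    ]


jobs = job_grid()
print(len(jobs))  # 30 seeds x 3 weight-decay values
print(jobs[0], jobs[-1])
```

Keeping the grid as data makes it easy to check that every seed appears at every weight-decay value before any compute is spent.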

What the pilot showed

The pilot was a four-seed run designed to gate the full HPC sweep. It produced an honest result: on this particular task and at this particular scale, the topological signals we tracked were largely redundant with weight norm. Weight norm is a simpler, cheaper observable, and on the modular-addition task it carries enough of the signal that the topological invariants do not add much.
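One simple way to put a number on "redundant with weight norm" is the fraction of an observable's variance, across checkpoints, that a linear function of weight norm explains (squared Pearson correlation). This is offered as an illustration of the idea, not as the analysis the pilot actually ran; the series below are synthetic, not pilot data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def variance_explained_by(weight_norm, observable):
    """R^2 of a linear fit of the observable on weight norm."""
    return pearson_r(weight_norm, observable) ** 2


# Synthetic per-checkpoint series (not real measurements).
weight_norm = [10.0, 9.0, 8.0, 7.0, 6.0]
redundant_obs = [2.0 * w + 1.0 for w in weight_norm]   # pure function of weight norm
independent_obs = [1.0, -1.0, 0.0, -1.0, 1.0]          # uncorrelated with weight norm

print(variance_explained_by(weight_norm, redundant_obs))
print(variance_explained_by(weight_norm, independent_obs))
```

A value near 1 means the observable adds little beyond weight norm under a linear model; a low value would be the interesting case, although a nonlinear dependence on weight norm would also need to be ruled out before calling the signal independent.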

That is a useful negative result, not a contradiction of the hypothesis. The modular-addition task is small. The full study is designed to test whether the same redundancy holds across seeds and weight-decay regimes, and whether richer or scaled-up tasks expose a regime where topology and weight norm diverge.

Where it stands

Pilot · Complete

Four-seed pilot has been run, analyzed, and documented. Pilot conclusion: topological signals are largely redundant with weight norm at this scale.

Full Sweep · Running

30 seeds, 3 weight-decay values, 90 jobs total on NMSU Discovery. Tests whether the pilot conclusion holds at scale and whether richer architectures or task variations expose a topology-only signal.

Paper · In Preparation

Writeup in preparation regardless of which way the full sweep lands. Honest negative results in a fashionable area are worth publishing.