Topological Signatures of Knowledge Persistence in Continual Learning Systems
Axion Deep Labs · February 2026 · Preliminary Proof-of-Concept (Small-Scale) · Phase I Scale Validation Planned (Supercomputer Required)
Abstract
Preliminary proof-of-concept: We investigate whether the topological structure of neural network loss landscapes predicts resistance to catastrophic forgetting. Across 19 small-to-medium architectures (0.3M-44.7M parameters) and 3 image-classification datasets (CIFAR-100, CUB-200-2011, NWPU-RESISC45), we compute persistent homology on 50x50 loss landscape grids using 5 independent random 2D slices. The most stable signal: H0 persistence predicts EWC mitigation benefit (CIFAR-100 rho = 0.76, RESISC-45 rho = 0.86). These results are preliminary, established on models well below production scale. The critical open question for Phase I is whether the topological signal survives at 100M-7B+ parameters, over long task sequences, and across diverse continual learning methods; answering it requires supercomputer resources and potentially novel distributed persistent homology algorithms.
Results at a Glance
| Metric | Value | Note |
|---|---|---|
| CUB-200 key result | p = 0.037 | Suggestive (does not survive Bonferroni) |
| Params alone (CUB) | rho = -0.92 | Wrong direction without topology |
| +Topology (CUB) | rho = 0.34 | Prediction rescued |
| MAE reduction | 17.5% | 0.186 to 0.154 with topology |
| Configs complete | 57 / 57 | 19 archs x 3 datasets done |
| RESISC-45 topology | p = 0.566 | Topology does not help on satellite scenes |
| Params vs ret (CIFAR) | rho = -0.76 | p = 0.0002, survives Bonferroni |
| Topology on CIFAR-100 | Not sig. | Redundant on easy tasks |
| EWC benefit (RESISC) | rho = 0.86 | H0 predicts EWC benefit, p = 2.4e-6 |
| EWC benefit (CIFAR) | rho = 0.76 | H0 predicts EWC benefit, p = 0.0002 |
| WRN H0 monotonicity | rho = -1.0 | Perfect on all 3 datasets |
| Cubical vs Ripser | rho = 1.0 | H1 agreement on all 3 datasets |
1. Background and Motivation
Catastrophic forgetting, the tendency of neural networks to lose previously learned knowledge when trained on new tasks, remains one of the most fundamental unsolved problems in machine learning (McCloskey and Cohen, 1989). Every major mitigation strategy (replay buffers, elastic weight consolidation, progressive networks) manages the symptom rather than addressing the underlying geometric cause.
Topological Data Analysis (TDA) has emerged as a tool for characterizing loss landscape geometry (Ballester and Araujo, 2020). Persistent homology extracts topological features that persist across the scales of a filtration; here we track H0 (connected components) and H1 (loops/tunnels).
This experiment tests whether H1 persistence (topological loop structure) in the loss landscape predicts catastrophic forgetting resistance, and whether this signal is independent of model scale.
2. Methodology
Experimental Pipeline
Datasets (3 Domains)
- CIFAR-100 (19/19 architectures complete): Split into Task A (classes 0-49) and Task B (classes 50-99). Standard augmentation.
- CUB-200-2011 (19/19 complete): Fine-grained bird classification. 200 species, cross-domain validation.
- NWPU-RESISC45 (19/19 architectures complete): Satellite remote sensing. 45 scene classes, cross-domain validation.
Training Protocol
- SGD with momentum (0.9), weight decay 5x10^-4, cosine annealing with warmup (5-10 epochs), batch size 128
- Task A: 100 epochs to convergence
- Phase 3 variants: naive sequential, EWC (lambda=400), cosine LR schedule
- Retention metrics: ret@100 and ret@10 (Task A accuracy retained after 100 and 10 optimizer steps of Task B training, respectively)
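For reference, the EWC term added to the Task B loss in the Phase 3 variants follows the standard quadratic penalty of Kirkpatrick et al. (2017). A minimal numpy sketch, assuming a diagonal Fisher estimate (variable names are illustrative):

```python
import numpy as np

def ewc_penalty(params, params_star, fisher, lam=400.0):
    """Quadratic EWC penalty (Kirkpatrick et al., 2017):
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2,
    where theta* are the Task A weights and F is a diagonal Fisher
    information estimate computed on Task A."""
    return 0.5 * lam * sum(
        float(np.sum(f * (p - ps) ** 2))
        for p, ps, f in zip(params, params_star, fisher))
```

The penalty is zero at the Task A solution and grows as Task B training pulls parameters away in directions the Fisher estimate marks as important.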
Loss Landscape Sampling
- 50x50 grid (2,500 evaluation points) along 2 filter-normalized random directions (Li et al., 2018)
- Range: [-1.0, 1.0]
- 5 independent random 2D slices per architecture (landscape seed randomized but logged)
- Sublevel set filtration with lower-star construction
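The slice construction above can be sketched as follows (a minimal numpy version of the Li et al. (2018) filter normalization, assuming per-layer weight arrays with filters along axis 0):

```python
import numpy as np

def filter_normalized_direction(weights, rng):
    """Random direction with each filter rescaled to the norm of the
    corresponding filter in `weights` (Li et al., 2018). `weights` is
    a list of per-layer arrays with filters along axis 0."""
    direction = []
    for w in weights:
        d = rng.standard_normal(w.shape)
        if w.ndim > 1:
            for i in range(w.shape[0]):  # per-filter rescaling
                d[i] *= np.linalg.norm(w[i]) / (np.linalg.norm(d[i]) + 1e-10)
        direction.append(d)
    return direction

def slice_coordinates(n=50, lo=-1.0, hi=1.0):
    """(alpha, beta) coordinates of the n x n evaluation grid; the loss
    is evaluated at w + alpha * d1 + beta * d2 for two directions."""
    ticks = np.linspace(lo, hi, n)
    return [(a, b) for a in ticks for b in ticks]
```

With n = 50 and range [-1.0, 1.0] this yields the 2,500 evaluation points per slice described above; two independent directions d1, d2 define each of the 5 slices.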
Persistent Homology
- Primary: Ripser (Vietoris-Rips, sparse mode)
- Validation: GUDHI cubical persistent homology (Phase 2c)
- Dimensions: H0 (connected components), H1 (loops)
- Primary metric: H1 total persistence = sum of (death_i - birth_i) for all H1 features
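The summary statistic itself is simple once a persistence diagram is in hand. A minimal sketch, assuming diagrams as (birth, death) pairs, the form both Ripser and GUDHI provide:

```python
import math

def total_persistence(diagram):
    """Sum of finite bar lengths (death - birth) in one homology
    dimension; applied to the H1 diagram this gives the primary
    metric. Infinite bars (e.g. the essential H0 class) are excluded."""
    return sum(d - b for b, d in diagram if math.isfinite(d))
```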
19 Architectures Under Study
Original Architectures (14)
ResNet-18, ResNet-50, ResNet-18 Wide
WRN-28-10, DenseNet-121
MobileNet-V3-Small, ShuffleNet-V2
EfficientNet-B0, RegNet-Y-400MF
ViT-Tiny, ViT-Small
MLP-Mixer, ConvNeXt-Tiny, VGG-16-BN
WRN Width Ladder (5 additional)
WRN-28-1, WRN-28-2, WRN-28-4, WRN-28-6, WRN-28-8
Same architecture, varying only width multiplier k. Isolates parameter count from architectural inductive bias. All complete on CIFAR-100, CUB-200, and RESISC-45.
3. Statistical Framework
Primary Analysis
- Spearman rank correlation (non-parametric)
- Bonferroni correction across hypothesis tests
- Permutation test: 1,000 shuffles for empirical p-values
- Leave-one-architecture-out Ridge regression with nested alpha selection
- Matched-dimensionality null control (1,000 random feature draws)
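The rank correlation and permutation test above can be sketched as follows (a minimal version assuming tie-free continuous metrics, which holds for the retention and persistence values here):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks
    (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def permutation_p(x, y, n_perm=1000, seed=0):
    """Two-sided empirical p-value: the fraction of label shuffles
    whose |rho| meets or exceeds the observed |rho|, with add-one
    smoothing so p is never exactly zero."""
    rng = np.random.default_rng(seed)
    obs = abs(spearman_rho(x, y))
    hits = sum(abs(spearman_rho(x, rng.permutation(y))) >= obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```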
Confound Controls
- Partial Spearman correlation (H1 | parameter count)
- Cross-dataset replication (CIFAR-100, CUB-200, RESISC-45)
- Within-family analysis (CNN-only) to control architecture type
- WRN width ladder: same architecture, varying only params
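The partial Spearman control in the list above can be sketched by rank-transforming all variables and regressing the confound out of both sides (a minimal version assuming no ties; `z` here plays the role of parameter count):

```python
import numpy as np

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with z partialled out:
    rank-transform all three, regress the z-ranks out of the x- and
    y-ranks by least squares, then correlate the residuals."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    rx, ry, rz = rank(x), rank(y), rank(z)
    A = np.column_stack([np.ones_like(rz), rz])  # intercept + z-ranks
    resid = lambda a: a - A @ np.linalg.lstsq(A, a, rcond=None)[0]
    return float(np.corrcoef(resid(rx), resid(ry))[0, 1])
```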
4. Results
Cross-Dataset Predictive Model (Phase 5)
Leave-one-architecture-out Ridge regression with permutation testing. Compares params-only vs. params+topology models across datasets.
| Dataset | Outcome | Params-only rho | +Topology rho | Perm. p | Verdict |
|---|---|---|---|---|---|
| CIFAR-100 (n=19) | ret@100 | 0.43 | 0.30 | 0.295 | Not significant |
| CUB-200 (n=19) | ret@10 | -0.92 | 0.34 | 0.037 | Suggestive |
| RESISC-45 (n=19) | ret@100 | -- | -- | 0.566 | Not significant |
| RESISC-45 (n=19) | ret@10 | -- | -- | 0.628 | Not significant |
| RESISC-45 (n=19) | early_aurc | -- | -- | 0.743 | Not significant |
On CIFAR-100, parameter count alone explains forgetting and topology adds nothing. On CUB-200, parameter count predicts in the wrong direction and topology rescues the prediction (suggestive at p = 0.037 but does not survive Bonferroni across 3 datasets, adjusted alpha = 0.0167). On RESISC-45, topology does not help predict forgetting on any metric.
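The leave-one-architecture-out protocol with nested alpha selection behind this table can be sketched as follows (a minimal numpy version; the alpha grid is illustrative, not the one used in the experiments):

```python
import numpy as np

def loo_ridge(X, y, alphas=(0.01, 0.1, 1.0, 10.0)):
    """Leave-one-out ridge with nested alpha selection: for each
    held-out row, choose alpha by inner leave-one-out error on the
    remaining rows, then predict the held-out target."""
    n = len(y)
    def fit(Xt, yt, a):
        return np.linalg.solve(Xt.T @ Xt + a * np.eye(Xt.shape[1]), Xt.T @ yt)
    preds = np.empty(n)
    for i in range(n):
        outer = np.delete(np.arange(n), i)
        best_alpha, best_err = alphas[0], np.inf
        for a in alphas:
            err = 0.0
            for j in range(len(outer)):  # inner LOO for alpha selection
                inner = np.delete(outer, j)
                w = fit(X[inner], y[inner], a)
                err += float((X[outer[j]] @ w - y[outer[j]]) ** 2)
            if err < best_err:
                best_alpha, best_err = a, err
        preds[i] = X[i] @ fit(X[outer], y[outer], best_alpha)
    return preds
```

Rank-correlating `preds` against the actual retention values yields the rho columns in the table; `X` holds either the params-only or the params+topology feature set.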
CIFAR-100 Results (n=19, Easy Benchmark)
All 19 architectures sorted by ret@100. On this easy benchmark, bigger models simply retain better.
| Architecture | Params | Task A Acc. | ret@100 | ret@10 | H1 Pers. | Type |
|---|---|---|---|---|---|---|
| ViT-Tiny | 0.3M | 52.7% | 22.5% | 95.9% | 0.01 | Transformer |
| ShuffleNet-V2 | 1.3M | 76.8% | 17.3% | 84.7% | 0.79 | CNN |
| ViT-Small | 2.2M | 62.2% | 9.6% | 94.7% | 0.24 | Transformer |
| MobileNet-V3-S | 1.1M | 68.6% | 7.6% | 75.0% | 1.89 | CNN |
| EfficientNet-B0 | 4.1M | 76.6% | 7.1% | 78.6% | 1.91 | CNN |
| WRN-28-1 | 0.4M | 71.7% | 6.6% | 51.0% | 0.00 | WRN-ladder |
| RegNet-Y-400MF | 4.0M | 72.2% | 2.0% | 54.1% | 0.05 | CNN |
| WRN-28-2 | 1.5M | 78.6% | 1.1% | 22.8% | 0.00 | WRN-ladder |
| VGG-16-BN | 14.8M | 78.4% | 0.8% | 88.0% | 0.00 | CNN |
| WRN-28-8 | 23.4M | 82.9% | 0.7% | 4.4% | 0.01 | WRN-ladder |
| WRN-28-4 | 5.9M | 81.8% | 0.3% | 8.5% | 0.02 | WRN-ladder |
| WRN-28-10 | 36.5M | 84.0% | 0.3% | 5.3% | 0.07 | WRN-ladder |
| ResNet-18 | 11.2M | 82.0% | 0.2% | 46.7% | 0.00 | CNN |
| WRN-28-6 | 13.2M | 82.8% | 0.1% | 4.5% | 0.02 | WRN-ladder |
| ResNet-50 | 23.7M | 83.6% | 0.1% | 56.0% | 0.00 | CNN |
| DenseNet-121 | 7.1M | 84.5% | 0.05% | 25.7% | 0.01 | CNN |
| MLP-Mixer | 2.3M | 61.5% | 0.03% | 0.03% | 0.12 | MLP |
| ConvNeXt-Tiny | 27.9M | 56.7% | 0.0% | 45.0% | 0.00 | CNN |
| ResNet-18 Wide | 44.7M | 83.1% | 0.0% | 29.7% | 0.00 | CNN |
CIFAR-100 Phase 4 Correlation Analysis
- Parameter count vs ret@100: rho = -0.76, p = 0.0002, survives Bonferroni
- H1 persistence vs ret@100: rho = 0.47, p = 0.042, does NOT survive Bonferroni
- Partial H1 | params: rho = 0.33, p = 0.19, not significant
- Conclusion: On this easy task, parameter count dominates
CUB-200 Results (n=19, Hard Fine-Grained)
Top architectures by retention on CUB-200-2011 (200 bird species). Parameter count fails as a predictor on this hard benchmark.
| Architecture | Params | ret@100 | Type |
|---|---|---|---|
| ViT-Tiny | 0.3M | 31.1% | Transformer |
| ViT-Small | 2.2M | 23.4% | Transformer |
| WRN-28-10 | 36.5M | 8.1% | WRN-ladder |
| WRN-28-8 | 23.4M | 5.0% | WRN-ladder |
| EfficientNet-B0 | 4.1M | 3.5% | CNN |
| ShuffleNet-V2 | 1.3M | 2.8% | CNN |
| WRN-28-6 | 13.2M | 2.4% | WRN-ladder |
| DenseNet-121 | 7.1M | 1.9% | CNN |
CUB-200 Phase 4 Correlations
- Parameter count vs ret@100: rho = -0.27, p = 0.27 (NOT significant)
- Parameter count fails on hard tasks
CUB-200 Phase 5 (ret@10 Detail)
- Params alone: rho = -0.92 (wrong direction)
- Params + topology: rho = 0.34 (rescued)
- Topology alone: rho = 0.33, MAE = 0.147
- Permutation test: p = 0.037
- Matched-dimensionality control: exceeds 95th percentile
- MAE reduction: 17.5%
Phase 6: Pooled Interaction Analysis (n=57)
Formal test of dataset moderation via OLS with clustered bootstrap. All 57 configurations pooled across 3 datasets with dataset x topology interaction terms.
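The clustered bootstrap behind these confidence intervals can be sketched as follows (a minimal percentile-CI version; cluster = architecture, so a model's rows across the three datasets stay together):

```python
import numpy as np

def cluster_bootstrap_ci(y, X, clusters, n_boot=1000, seed=0):
    """Percentile CIs for OLS coefficients under a clustered bootstrap:
    whole clusters are resampled with replacement, preserving
    within-cluster dependence."""
    rng = np.random.default_rng(seed)
    ids = np.unique(clusters)
    coefs = []
    for _ in range(n_boot):
        pick = rng.choice(ids, size=len(ids), replace=True)
        rows = np.concatenate([np.flatnonzero(clusters == c) for c in pick])
        b, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        coefs.append(b)
    return np.percentile(np.array(coefs), [2.5, 97.5], axis=0)
```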
EWC Benefit Moderation Test
| Quantity | Estimate | Note |
|---|---|---|
| Block permutation p | 0.046 | Dataset moderates the H0 -> EWC benefit relationship |
| CIFAR-100 H0 effect | +0.016 | CI [+0.005, +0.062] excludes zero |
| CUB-200 H0 effect | +0.002 | CI [-0.008, +0.013] includes zero |
| RESISC-45 H0 effect | +0.007 | CI [+0.004, +0.012] excludes zero |
H0 partial effects on EWC benefit per dataset. CIs from clustered bootstrap. CIFAR-100 and RESISC-45 confidence intervals exclude zero, confirming the per-dataset correlations. CUB-200 CI includes zero.
Forgetting Prediction Moderation Test
| Quantity | Estimate | Note |
|---|---|---|
| ret@10 block permutation p | 0.196 | Not significant overall |
| ret@100 block permutation p | 0.035 | Significant moderation |
| CIFAR-100 H0 on ret@10 | -0.001 | CI [-0.486, +0.073] includes zero |
| CUB-200 H0 on ret@10 | -0.123 | CI [-0.183, -0.046] excludes zero |
| RESISC-45 H0 on ret@10 | -0.021 | CI [-0.264, +0.083] includes zero |
Bottom line: Dataset significantly moderates the topology-EWC benefit relationship (permutation p = 0.046), with H0 predicting EWC benefit on CIFAR-100 and RESISC-45 (CIs excluding zero) but not CUB-200. For forgetting prediction, the ret@100 block test is significant (p = 0.035) and CUB-200 is the only dataset where H0 CI on ret@10 excludes zero, consistent with Phase 5 findings that topology rescues prediction specifically on fine-grained tasks.
5. Key Findings
Finding 1: Topology is a conditional predictor, not a universal one
On CUB-200 (fine-grained birds), topology rescues forgetting prediction where parameter count fails (params rho = -0.92 wrong direction; +topology rho = 0.34; permutation p = 0.037). However, this p-value does not survive Bonferroni correction across 3 datasets (adjusted alpha = 0.0167), making the result suggestive rather than confirmed. On RESISC-45 (satellite scenes), topology does not help at all (perm p = 0.566). Topology appears to matter on fine-grained visual tasks but not on satellite imagery.
Finding 2: On easy tasks, parameter count is all you need
On CIFAR-100 (n=19), parameter count shows rho = -0.76, p = 0.0002, and survives Bonferroni correction. Topology adds nothing beyond what scale already explains. Bigger models simply retain better on easy benchmarks.
Finding 3: Task domain, not just difficulty, determines whether topology predicts forgetting
The picture is more nuanced than "topology helps on hard tasks." CIFAR-100 (easy): scale dominates. CUB-200 (fine-grained): topology is suggestive. RESISC-45 (satellite): topology does not help despite being a non-trivial task. The domain itself matters. Fine-grained visual discrimination may create loss landscape structures that topological features can capture, while satellite scene classification does not.
Finding 4: H0 predicts EWC benefit across datasets (strongest cross-dataset signal)
The most robust finding across all 3 datasets: H0 persistence (connected components) predicts how much a model benefits from Elastic Weight Consolidation. On CIFAR-100: rho = 0.76, p = 0.0002. On RESISC-45: rho = 0.86, p = 2.4e-6. The Phase 6 pooled interaction analysis formally confirms that dataset moderates this relationship (block permutation p = 0.046), with per-dataset H0 partial effects excluding zero on CIFAR-100 (CI [+0.005, +0.062]) and RESISC-45 (CI [+0.004, +0.012]) but not CUB-200 (CI [-0.008, +0.013]). Models with more fragmented loss landscapes (higher H0) benefit more from EWC regularization. This makes topology a mitigation sensitivity marker, telling you not just whether a model will forget, but how much a specific intervention will help.
Finding 5: WRN width ladder confirms universal H0 monotonicity
The WRN-28-k ladder (k=1,2,4,6,8,10) shows H0 perfectly monotonic with width on all 3 datasets (rho = -1.0 on CIFAR-100, CUB-200, and RESISC-45). Wider networks universally produce smoother loss landscapes. Cubical vs Ripser H1 agreement is also perfect (rho = 1.0) on all 3 datasets, confirming methodological robustness.
6. WRN Width Ladder
Design: WRN-28-k, k = 1, 2, 4, 6, 8, 10
The WRN width ladder holds architecture constant (WideResNet-28 with identical depth, skip connections, and training protocol) while varying only the width multiplier k. This scales parameter count from roughly 0.4M (k=1) to 36.5M (k=10) within a single architecture family, isolating scale from inductive bias.
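The roughly quadratic growth of parameter count in k (each conv weight tensor has shape out_width x in_width x 3 x 3) can be checked against the reported counts; the k=1 base value below is an approximation chosen for illustration:

```python
# WRN-28-k conv parameters grow ~quadratically in the width multiplier k.
# The base count at k=1 is approximate (assumption for illustration).
base = 0.37e6
reported = {1: 0.4e6, 2: 1.5e6, 4: 5.9e6, 6: 13.2e6, 8: 23.4e6, 10: 36.5e6}
for k, params in reported.items():
    assert abs(base * k * k - params) / params < 0.15  # within 15%
```

The quadratic fit is why the ladder spans two orders of magnitude in parameter count while changing nothing else about the architecture.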
Key Results
H0 monotonic with width (universal)
H0 persistence is perfectly monotonic with width multiplier (rho = -1.0) on all 3 datasets (CIFAR-100, CUB-200, RESISC-45). Wider networks universally produce smoother loss landscapes with fewer connected components.
Direction flip across datasets
CIFAR-100: H0 vs retention rho = 0.71 (suggestive). CUB-200: H0 vs retention rho = -0.83, p = 0.04 (opposite direction). The relationship between topology and forgetting varies by domain, even though H0 monotonicity with scale is universal.
| Config | Width k | Params | Status |
|---|---|---|---|
| WRN-28-1 | 1 | 0.4M | Complete |
| WRN-28-2 | 2 | 1.5M | Complete |
| WRN-28-4 | 4 | 5.9M | Complete |
| WRN-28-6 | 6 | 13.2M | Complete |
| WRN-28-8 | 8 | 23.4M | Complete |
| WRN-28-10 | 10 | 36.5M | Complete |
7. Cross-Domain Validation
CIFAR-100
19 / 19 architectures complete
Standard object recognition. 50 classes per task. Parameter count dominates (rho = -0.76, p = 0.0002). Topology redundant on this easy benchmark.
CUB-200-2011
19 / 19 architectures complete
Fine-grained bird classification. 200 species. Topology rescues prediction (permutation p = 0.037, suggestive but does not survive Bonferroni across 3 datasets) where parameter count fails (rho = -0.27, not significant).
NWPU-RESISC45
19 / 19 architectures complete
Satellite remote sensing scenes. 45 classes. Topology does NOT help predict forgetting (perm p = 0.566 ret@100, p = 0.628 ret@10, p = 0.743 early_aurc). However, H0 strongly predicts EWC benefit (rho = 0.86, p = 2.4e-6).
8. Discussion and Next Steps
With all 57 configurations complete across 3 datasets, the picture is clear: topology is not a universal predictor of forgetting. The CUB-200 result (p = 0.037) is suggestive but does not survive Bonferroni correction across 3 datasets (adjusted alpha = 0.0167). On RESISC-45, topology provides no forgetting prediction at all (p = 0.566). Topology's predictive value for forgetting is conditional on the visual domain.
However, the strongest cross-dataset signal is not about predicting forgetting directly. H0 persistence (connected components) predicts how much a model benefits from EWC regularization, and this holds on both CIFAR-100 (rho = 0.76, p = 0.0002) and RESISC-45 (rho = 0.86, p = 2.4e-6). This reframes topology's role: it is a mitigation sensitivity marker. It tells you not just that a model might forget, but how much a specific intervention (EWC) will help.
Revised narrative:
- Topology as conditional forgetting predictor: works on fine-grained CUB-200, not on satellite RESISC-45 or easy CIFAR-100
- Topology as mitigation sensitivity marker: H0 predicts EWC benefit across datasets (the most robust finding)
- WRN H0 monotonicity (rho = -1.0) and cubical/Ripser agreement (rho = 1.0) are universal across all 3 datasets
Next steps:
- Multi-seed runs for confidence intervals on the CUB-200 finding
- Scale to 30+ architectures for more statistical power (target: CUB-200 p < 0.0167 after Bonferroni)
- EWC benefit prediction API as the commercially viable product angle
- arXiv preprint and NeurIPS/ICML submission
9. Proposed Mechanism: Basin Fragmentation
H0 in persistent homology counts connected components in the sublevel set filtration of the loss landscape. A high H0 count indicates a fragmented landscape with many disconnected basins at low loss values.
We propose the basin fragmentation hypothesis: H0 measures the degree of loss landscape fragmentation, which determines how much curvature-based regularization (EWC) can help by preventing inter-basin drift during sequential training.
High H0 (fragmented landscape)
- Many disconnected basins at low loss
- Naive training drifts across basin boundaries
- EWC prevents inter-basin drift via Fisher penalty
- Large EWC benefit
Low H0 (smooth landscape)
- One broad basin; few disconnected regions
- Naive training perturbs within the same basin
- EWC penalty addresses a problem that does not exist
- Small EWC benefit
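The H0 count driving this contrast can be illustrated with a toy sublevel-set computation (a minimal union-find sketch on a small grid, not the Ripser/GUDHI pipeline used in the experiments):

```python
import numpy as np

def sublevel_h0_count(grid, threshold):
    """Count connected components of {loss <= threshold} on a 2D grid
    (4-connectivity). Many components at low thresholds = fragmented
    basins (high H0); one component = a single broad basin (low H0)."""
    mask = grid <= threshold
    parent = {(i, j): (i, j)
              for i in range(grid.shape[0])
              for j in range(grid.shape[1]) if mask[i, j]}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for (i, j) in list(parent):
        for nb in ((i + 1, j), (i, j + 1)):  # right and down neighbors
            if nb in parent:
                ra, rb = find((i, j)), find(nb)
                if ra != rb:
                    parent[ra] = rb
    return len({find(p) for p in parent})
```

On a toy landscape with two separated low-loss valleys, the count is 2 at a low threshold and drops to 1 once the threshold clears the barrier between them; full persistent homology tracks exactly these merge events across all thresholds.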
The WRN width ladder provides supporting evidence: H0 decreases perfectly with width (rho = -1.0 vs params) across all three datasets, consistent with wider networks having smoother, less fragmented landscapes. The CUB-200 null for EWC benefit (rho = 0.31, p = 0.19) may indicate that fine-grained discrimination creates forgetting through feature-level interference rather than parameter-level basin drift.
This mechanism is tentative. A causal test would require intervening on landscape topology (e.g., via landscape-aware regularization) and measuring the effect on EWC benefit.
10. Limitations and External Validity
What we claim
- Dataset significantly moderates the H0-EWC benefit relationship (Phase 6, p = 0.046)
- H0 partial effects on EWC benefit exclude zero on CIFAR-100 and RESISC-45 but not CUB-200
- On CUB-200, topology provides the only predictive signal for early forgetting (ret@10 CI excludes zero)
What we do not claim
- That topology universally predicts forgetting (RESISC-45 null)
- That the EWC moderation finding is confirmatory (it emerged from exploratory analysis)
- That basin fragmentation is an established causal mechanism
Scope limitations
- 19 architectures: moderate statistical power; WRN width ladder controls for family but has limited within-ladder degrees of freedom
- One mitigation method: only EWC tested; if H0 does not predict benefit under Synaptic Intelligence or PackNet, the finding is EWC-specific
- 2D projections: topology computed on 2D landscape cross-sections, not the full high-dimensional landscape; 5 slices mitigate but do not eliminate sampling variance
- Borderline p-values: EWC moderation p = 0.046, forgetting ret@100 p = 0.035; CUB-200 ret@10 p = 0.037 does not survive Bonferroni
Falsification targets
- Synaptic Intelligence benefit shows no H0 correlation on CIFAR-100 or RESISC-45 (mechanism is EWC-specific)
- Adding 10+ architectures eliminates the CUB-200 ret@10 signal (forgetting prediction claim fails)
- Landscape intervention (e.g., SAM) changes H0 without changing EWC benefit (causal link is broken)
- Cubical persistence disagrees with Ripser-based H0 on the moderation result (measurement is method-dependent)
11. Analysis Path Transparency
The original hypothesis targeted topology as a direct predictor of forgetting. Retention at step 10 was pre-specified as the primary outcome, with ret@100 and early AURC as robustness checks.
CIFAR-100 was run first and showed parameter count dominates (topology null, p = 0.295). CUB-200 was run second and showed topology rescues prediction (p = 0.037). RESISC-45 was run third and returned a null for topology (p = 0.566), falsifying the simpler "topology helps on hard tasks" framing.
The EWC benefit analysis was computed as part of Phase 4 diagnostics, not as the original target hypothesis. The shift from "topology predicts forgetting" to "topology predicts mitigation benefit" emerged from the data after the RESISC-45 null. The Phase 6 pooled interaction model was designed post hoc to formalize the cross-dataset moderation pattern.
The EWC moderation finding (p = 0.046) should be interpreted as a data-driven discovery requiring pre-registered replication, not as a confirmatory result.
12. Reproducibility and Infrastructure
Compute
- Local GPU cluster (NVIDIA RTX, CUDA)
- PyTorch 2.x with mixed precision
- Training seed = 42 (deterministic)
- Landscape seed randomized, logged per run
Topology
- Ripser (Vietoris-Rips, sparse mode)
- GUDHI (cubical PH validation)
- scikit-tda ecosystem
- 5 random slices per architecture
Tracking and Versioning
- Version-controlled YAML configs (57 total)
- Full dependency pinning (pyproject.toml)
- Structured JSON output with all metrics
- Flask dashboard for experiment management
References
- Ballester, R. and Araujo, X. (2020). On the interplay between topological data analysis and deep learning. NeurIPS Workshop on TDA.
- Boissonnat, J.-D. et al. (2018). Geometric and Topological Inference. Cambridge University Press.
- Draxler, F. et al. (2018). Essentially no barriers in neural network energy landscape. ICML.
- Kirkpatrick, J. et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
- Li, H. et al. (2018). Visualizing the loss landscape of neural nets. NeurIPS.
- McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks. Psychology of Learning and Motivation, 24, 109-165.
- Otter, N. et al. (2017). A roadmap for the computation of persistent homology. EPJ Data Science, 6(1), 1-38.
- Rieck, B. et al. (2019). Neural persistence: a complexity measure for deep neural networks using algebraic topology. ICLR.
- Tononi, G. and Cirelli, C. (2014). Sleep and the price of plasticity. Neuron, 81(1), 12-34.
