RESEARCH OUTPUT · EXP-01 · PERSIST · Preliminary Complete · 57 of 57 Configs · Phase I Scale Validation Planned

Topological Signatures of Knowledge Persistence in Continual Learning Systems

Axion Deep Labs · February 2026 · Preliminary Proof-of-Concept (Small-Scale) · Phase I Scale Validation Planned (Supercomputer Required)

Abstract

Preliminary proof-of-concept: We investigate whether the topological structure of neural network loss landscapes predicts resistance to catastrophic forgetting. Across 19 small-to-medium architectures (0.3M-44.7M parameters) and 3 small-image datasets (CIFAR-100, CUB-200-2011, NWPU-RESISC45), we compute persistent homology on 50x50 loss landscape grids using 5 independent random 2D slices. The most stable signal: H0 persistence predicts EWC mitigation benefit (CIFAR-100 rho = 0.76, RESISC-45 rho = 0.86). These results are preliminary, established on models well below production scale. The critical open question for Phase I is whether the topological signal survives on 100M-7B+ parameter models, long task sequences, and diverse continual learning methods, which requires supercomputer resources and potentially novel distributed persistent homology algorithms.

Results at a Glance

Metric | Value | Note
CUB-200 Key Result | p = 0.037 | Suggestive (does not survive Bonferroni)
Params Alone (CUB) | rho = -0.92 | Wrong direction without topology
+Topology (CUB) | rho = 0.34 | Prediction rescued
MAE Reduction | 17.5% | 0.186 to 0.154 with topology
Configs Complete | 57 / 57 | 19 archs x 3 datasets done
RESISC-45 Topology | p = 0.566 | Topology does not help on satellite
Params vs ret (CIFAR) | rho = -0.76 | p = 0.0002, survives Bonferroni
Topo on CIFAR-100 | Not sig. | Redundant on easy tasks
EWC Benefit (RESISC) | rho = 0.86 | H0 predicts EWC benefit, p = 2.4e-6
EWC Benefit (CIFAR) | rho = 0.76 | H0 predicts EWC benefit, p = 0.0002
WRN H0 Monotonicity | rho = -1.0 | Perfect on all 3 datasets
Cubical vs Ripser | rho = 1.0 | H1 agreement on all 3 datasets

1. Background and Motivation

Catastrophic forgetting, the tendency of neural networks to lose previously learned knowledge when trained on new tasks, remains one of the most fundamental unsolved problems in machine learning (McCloskey and Cohen, 1989). Every major mitigation strategy (replay buffers, elastic weight consolidation, progressive networks) manages the symptom rather than addressing the underlying geometric cause.

Topological Data Analysis (TDA) has emerged as a tool for characterizing loss landscape geometry (Ballester and Araujo, 2020). Persistent homology tracks topological features across the scales of a filtration, recording the birth and death of each feature: H0 (connected components) and H1 (loops/tunnels).

This experiment tests whether H1 persistence (topological loop structure) in the loss landscape predicts catastrophic forgetting resistance, and whether this signal is independent of model scale.

2. Methodology

Experimental Pipeline

Datasets (3 Domains)

  • CIFAR-100 (19/19 architectures complete): Split into Task A (classes 0-49) and Task B (classes 50-99). Standard augmentation.
  • CUB-200-2011 (19/19 complete): Fine-grained bird classification. 200 species, cross-domain validation.
  • NWPU-RESISC45 (19/19 architectures complete): Satellite remote sensing. 45 scene classes, cross-domain validation.

Training Protocol

  • SGD with momentum (0.9), weight decay 5x10^-4, cosine annealing with warmup (5-10 epochs), batch size 128
  • Task A: 100 epochs to convergence
  • Phase 3 variants: naive sequential, EWC (lambda=400), cosine LR schedule
  • Retention metrics: ret@100 and ret@10 (Task A accuracy after 100 and 10 optimizer steps of Task B training)
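The EWC variant above adds a quadratic anchor penalty to the Task B loss. A minimal numpy sketch of that penalty (Kirkpatrick et al., 2017), assuming a diagonal Fisher estimate; the function name and toy values are illustrative, not the project's training code:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=400.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2,
    added to the Task B loss. fisher is a diagonal Fisher information
    estimate computed on Task A; theta_star is the Task A optimum."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

# Toy check: zero at the Task A anchor, quadratic growth away from it
theta_star = np.zeros(4)
fisher = np.array([1.0, 0.5, 0.0, 2.0])    # one parameter EWC ignores (F=0)
print(round(ewc_penalty(theta_star, theta_star, fisher), 9))         # 0.0
print(round(ewc_penalty(theta_star + 0.1, theta_star, fisher), 9))   # 7.0
```

Parameters with zero Fisher mass are free to move, which is exactly why EWC can only help when drift along high-Fisher directions is what causes forgetting.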

Loss Landscape Sampling

  • 50x50 grid (2,500 evaluation points) along 2 filter-normalized random directions (Li et al., 2018)
  • Range: [-1.0, 1.0]
  • 5 independent random 2D slices per architecture (landscape seed randomized but logged)
  • Sublevel set filtration with lower-star construction
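The sampling recipe above can be sketched as follows, on a toy quadratic "loss". `filter_normalize` and `landscape_grid` are illustrative names, not the project's pipeline; the key step is rescaling each filter of the random direction to the norm of the corresponding weight filter (Li et al., 2018):

```python
import numpy as np

def filter_normalize(direction, weights):
    """Rescale each filter (row) of a random direction so its norm matches
    the corresponding weight filter (filter normalization, Li et al., 2018)."""
    d = direction.reshape(direction.shape[0], -1)
    w = weights.reshape(weights.shape[0], -1)
    scale = np.linalg.norm(w, axis=1) / (np.linalg.norm(d, axis=1) + 1e-10)
    return (d * scale[:, None]).reshape(direction.shape)

def landscape_grid(loss_fn, theta, d1, d2, n=50, lo=-1.0, hi=1.0):
    """Evaluate the loss on an n x n grid spanning [lo, hi]^2 along d1, d2."""
    alphas = np.linspace(lo, hi, n)
    return np.array([[loss_fn(theta + a * d1 + b * d2) for b in alphas]
                     for a in alphas])

# Toy 2-filter "model" with a quadratic loss, just to exercise the shapes
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))
d1 = filter_normalize(rng.normal(size=theta.shape), theta)
d2 = filter_normalize(rng.normal(size=theta.shape), theta)
grid = landscape_grid(lambda w: float(np.sum(w ** 2)), theta, d1, d2)
print(grid.shape)   # (50, 50)
```

In the real pipeline each of the 2,500 grid points is a full forward pass over a held-out batch, which is why the grid resolution and slice count are the main compute knobs.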

Persistent Homology

  • Primary: Ripser (Vietoris-Rips, sparse mode)
  • Validation: GUDHI cubical persistent homology (Phase 2c)
  • Dimensions: H0 (connected components), H1 (loops)
  • Primary metric: H1 total persistence = sum of (death_i - birth_i) for all H1 features
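As a self-contained illustration of the sublevel-set construction, here is an H0-only union-find sketch on a toy grid. The actual pipeline uses Ripser and GUDHI (and H1 requires a full complex, not this shortcut); this only shows how finite H0 bars and total persistence arise from a loss grid:

```python
import numpy as np

def h0_persistence(grid):
    """H0 persistence pairs for the sublevel-set filtration of a 2D loss grid
    (4-connectivity, elder rule). The component of the global minimum never
    dies and is excluded from the returned finite pairs."""
    h, w = grid.shape
    parent, birth, pairs = {}, {}, []

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path compression
            c = parent[c]
        return c

    for idx in np.argsort(grid, axis=None):  # add cells by increasing loss
        i, j = divmod(int(idx), w)
        parent[(i, j)] = (i, j)
        birth[(i, j)] = float(grid[i, j])
        for nb in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if nb in parent:
                ra, rb = find((i, j)), find(nb)
                if ra != rb:
                    # elder rule: the younger (higher-birth) component dies
                    young, old = (ra, rb) if birth[ra] > birth[rb] else (rb, ra)
                    death = float(grid[i, j])
                    if death > birth[young]:
                        pairs.append((birth[young], death))
                    parent[young] = old
    return pairs

# Toy landscape: a side basin (0.2) separated from the global minimum (0.0)
# by a ridge at 1.0 -> one finite H0 bar with persistence 0.8
toy = np.array([[0.0, 1.0, 0.2],
                [1.0, 1.0, 1.0],
                [1.0, 1.0, 1.0]])
pairs = h0_persistence(toy)
print(pairs)                            # [(0.2, 1.0)]
print(sum(d - b for b, d in pairs))     # 0.8
```

Total H1 persistence is computed the same way from the dimension-1 pairs of the library's output: sum of (death_i - birth_i) over finite bars.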

19 Architectures Under Study

Original Architectures (14)

ResNet-18, ResNet-50, ResNet-18 Wide

WRN-28-10, DenseNet-121

MobileNet-V3-Small, ShuffleNet-V2

EfficientNet-B0, RegNet-Y-400MF

ViT-Tiny, ViT-Small

MLP-Mixer, ConvNeXt-Tiny, VGG-16-BN

WRN Width Ladder (5 additional)

WRN-28-1, WRN-28-2, WRN-28-4, WRN-28-6, WRN-28-8

Same architecture, varying only width multiplier k. Isolates parameter count from architectural inductive bias. All complete on CIFAR-100, CUB-200, and RESISC-45.

3. Statistical Framework

Primary Analysis

  • Spearman rank correlation (non-parametric)
  • Bonferroni correction across hypothesis tests
  • Permutation test: 1,000 shuffles for empirical p-values
  • Leave-one-architecture-out Ridge regression with nested alpha selection
  • Matched-dimensionality null control (1,000 random feature draws)
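The Spearman-plus-permutation machinery can be sketched in plain numpy (the analysis itself presumably uses scipy); 1,000 shuffles with add-one smoothing so the empirical p-value is never exactly zero:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free samples (Pearson on ranks)."""
    rx = np.argsort(np.argsort(x)) - (len(x) - 1) / 2.0
    ry = np.argsort(np.argsort(y)) - (len(y) - 1) / 2.0
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def perm_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided empirical p-value: shuffle y, count |rho| at least as
    extreme as observed; add-one smoothing keeps p strictly positive."""
    rng = np.random.default_rng(seed)
    obs = abs(spearman_rho(x, y))
    hits = sum(abs(spearman_rho(x, rng.permutation(y))) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)

# Monotone toy with n=19 (matching the per-dataset sample size)
x = np.arange(19, dtype=float)
print(round(spearman_rho(x, x ** 3), 3))   # 1.0
print(perm_pvalue(x, x ** 3) < 0.05)       # True
```

With n = 19 the permutation null is coarse, which is one reason borderline p-values (0.035-0.046) in this report are treated as suggestive.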

Confound Controls

  • Partial Spearman correlation (H1 | parameter count)
  • Cross-dataset replication (CIFAR-100, CUB-200, RESISC-45)
  • Within-family analysis (CNN-only) to control architecture type
  • WRN width ladder: same architecture, varying only params
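The partial Spearman control (H1 | parameter count) can be sketched by rank-transforming all three variables, regressing the confound out of the two ranks of interest, and correlating the residuals; the variable names below are illustrative stand-ins, not the study's data:

```python
import numpy as np

def _ranks(v):
    return np.argsort(np.argsort(v)).astype(float)

def partial_spearman(x, y, z):
    """Spearman correlation of x and y controlling for z: rank all three,
    regress the z-ranks out of the x- and y-ranks (OLS with intercept),
    then correlate the residuals."""
    rx, ry, rz = _ranks(x), _ranks(y), _ranks(z)
    Z = np.column_stack([np.ones_like(rz), rz])
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return float(res_x @ res_y / np.sqrt((res_x @ res_x) * (res_y @ res_y)))

# Illustrative stand-ins for H1 persistence, retention, and log(param count)
rng = np.random.default_rng(2)
h1, ret, log_params = (rng.normal(size=19) for _ in range(3))
r = partial_spearman(h1, ret, log_params)
print(-1.0 <= r <= 1.0)   # True
```

This is what drives the reported "Partial H1 | params: rho = 0.33" figure: whatever correlation survives after parameter count is ranked out.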

4. Results

Cross-Dataset Predictive Model (Phase 5)

Leave-one-architecture-out Ridge regression with permutation testing. Compares params-only vs. params+topology models across datasets.

Dataset | Outcome | Params-only rho | +Topology rho | Perm. p | Verdict
CIFAR-100 (n=19) | ret@100 | 0.43 | 0.30 | 0.295 | Not significant
CUB-200 (n=19) | ret@10 | -0.92 | 0.34 | 0.037 | Suggestive
RESISC-45 (n=19) | ret@100 | -- | -- | 0.566 | Not significant
RESISC-45 (n=19) | ret@10 | -- | -- | 0.628 | Not significant
RESISC-45 (n=19) | early_aurc | -- | -- | 0.743 | Not significant

On CIFAR-100, parameter count alone explains forgetting and topology adds nothing. On CUB-200, parameter count predicts in the wrong direction and topology rescues the prediction (suggestive at p = 0.037 but does not survive Bonferroni across 3 datasets, adjusted alpha = 0.0167). On RESISC-45, topology does not help predict forgetting on any metric.
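A minimal sketch of the leave-one-out Ridge protocol behind the table above, with a fixed alpha rather than the nested alpha selection used in the analysis; feature names are illustrative:

```python
import numpy as np

def loo_ridge_predictions(X, y, alpha=1e-3):
    """Leave-one-out Ridge: for each held-out row, standardize on the
    training fold, fit closed-form ridge, predict the held-out target."""
    n, d = X.shape
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0) + 1e-12
        Xs = (Xtr - mu) / sd
        w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(d),
                            Xs.T @ (ytr - ytr.mean()))
        preds[i] = ytr.mean() + ((X[i] - mu) / sd) @ w
    return preds

# Synthetic check: a clean linear target is recovered almost exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(19, 2))            # e.g. [log params, H1 persistence]
y = 2.0 * X[:, 0] - X[:, 1]
preds = loo_ridge_predictions(X, y)
print(np.corrcoef(preds, y)[0, 1] > 0.99)   # True
```

The reported "+Topology rho" values are Spearman correlations between such held-out predictions and the true retention values, with and without the topology columns in X.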

CIFAR-100 Results (n=19, Easy Benchmark)

All 19 architectures sorted by ret@100. On this easy benchmark, bigger models simply retain better.

Architecture | Params | Task A Acc. | ret@100 | ret@10 | H1 Pers. | Type
ViT-Tiny | 0.3M | 52.7% | 22.5% | 95.9% | 0.01 | Transformer
ShuffleNet-V2 | 1.3M | 76.8% | 17.3% | 84.7% | 0.79 | CNN
ViT-Small | 2.2M | 62.2% | 9.6% | 94.7% | 0.24 | Transformer
MobileNet-V3-S | 1.1M | 68.6% | 7.6% | 75.0% | 1.89 | CNN
EfficientNet-B0 | 4.1M | 76.6% | 7.1% | 78.6% | 1.91 | CNN
WRN-28-1 | 0.4M | 71.7% | 6.6% | 51.0% | 0.00 | WRN-ladder
RegNet-Y-400MF | 4.0M | 72.2% | 2.0% | 54.1% | 0.05 | CNN
WRN-28-2 | 1.5M | 78.6% | 1.1% | 22.8% | 0.00 | WRN-ladder
VGG-16-BN | 14.8M | 78.4% | 0.8% | 88.0% | 0.00 | CNN
WRN-28-8 | 23.4M | 82.9% | 0.7% | 4.4% | 0.01 | WRN-ladder
WRN-28-4 | 5.9M | 81.8% | 0.3% | 8.5% | 0.02 | WRN-ladder
WRN-28-10 | 36.5M | 84.0% | 0.3% | 5.3% | 0.07 | WRN-ladder
ResNet-18 | 11.2M | 82.0% | 0.2% | 46.7% | 0.00 | CNN
WRN-28-6 | 13.2M | 82.8% | 0.1% | 4.5% | 0.02 | WRN-ladder
ResNet-50 | 23.7M | 83.6% | 0.1% | 56.0% | 0.00 | CNN
DenseNet-121 | 7.1M | 84.5% | 0.05% | 25.7% | 0.01 | CNN
MLP-Mixer | 2.3M | 61.5% | 0.03% | 0.03% | 0.12 | MLP
ConvNeXt-Tiny | 27.9M | 56.7% | 0.0% | 45.0% | 0.00 | CNN
ResNet-18 Wide | 44.7M | 83.1% | 0.0% | 29.7% | 0.00 | CNN

CIFAR-100 Phase 4 Correlation Analysis

  • Parameter count vs ret@100: rho = -0.76, p = 0.0002, survives Bonferroni
  • H1 persistence vs ret@100: rho = 0.47, p = 0.042, does NOT survive Bonferroni
  • Partial H1 | params: rho = 0.33, p = 0.19, not significant
  • Conclusion: On this easy task, parameter count dominates

CUB-200 Results (n=19, Hard Fine-Grained)

Top architectures by retention on CUB-200-2011 (200 bird species). Parameter count fails as a predictor on this hard benchmark.

Architecture | Params | ret@100 | Type
ViT-Tiny | 0.3M | 31.1% | Transformer
ViT-Small | 2.2M | 23.4% | Transformer
WRN-28-10 | 36.5M | 8.1% | WRN-ladder
WRN-28-8 | 23.4M | 5.0% | WRN-ladder
EfficientNet-B0 | 4.1M | 3.5% | CNN
ShuffleNet-V2 | 1.3M | 2.8% | CNN
WRN-28-6 | 13.2M | 2.4% | WRN-ladder
DenseNet-121 | 7.1M | 1.9% | CNN

CUB-200 Phase 4 Correlations

  • Parameter count vs ret@100: rho = -0.27, p = 0.27 (NOT significant)
  • Parameter count fails on hard tasks

CUB-200 Phase 5 (ret@10 Detail)

  • Params alone: rho = -0.92 (wrong direction)
  • Params + topology: rho = 0.34 (rescued)
  • Topology alone: rho = 0.33, MAE = 0.147
  • Permutation test: p = 0.037
  • Matched-dimensionality control: exceeds 95th percentile
  • MAE reduction: 17.5%

Phase 6: Pooled Interaction Analysis (n=57)

Formal test of dataset moderation via OLS with clustered bootstrap. All 57 configurations pooled across 3 datasets with dataset x topology interaction terms.

EWC Benefit Moderation Test

  • Block permutation p = 0.046: dataset moderates the H0 -> EWC benefit relationship
  • CIFAR-100 H0 effect: +0.016, CI [+0.005, +0.062] (excludes zero)
  • CUB-200 H0 effect: +0.002, CI [-0.008, +0.013] (includes zero)
  • RESISC-45 H0 effect: +0.007, CI [+0.004, +0.012] (excludes zero)

H0 partial effects on EWC benefit per dataset. CIs from clustered bootstrap. CIFAR-100 and RESISC-45 confidence intervals exclude zero, confirming the per-dataset correlations. CUB-200 CI includes zero.

Forgetting Prediction Moderation Test

  • ret@10 block permutation p = 0.196 (not significant overall)
  • ret@100 block permutation p = 0.035 (significant moderation)
  • CIFAR-100 H0 on ret@10: -0.001, CI [-0.486, +0.073] (includes zero)
  • CUB-200 H0 on ret@10: -0.123, CI [-0.183, -0.046] (excludes zero)
  • RESISC-45 H0 on ret@10: -0.021, CI [-0.264, +0.083] (includes zero)

Bottom line: Dataset significantly moderates the topology-EWC benefit relationship (permutation p = 0.046), with H0 predicting EWC benefit on CIFAR-100 and RESISC-45 (CIs excluding zero) but not CUB-200. For forgetting prediction, the ret@100 block test is significant (p = 0.035) and CUB-200 is the only dataset where H0 CI on ret@10 excludes zero, consistent with Phase 5 findings that topology rescues prediction specifically on fine-grained tasks.

5. Key Findings

Finding 1: Topology is a conditional predictor, not a universal one

On CUB-200 (fine-grained birds), topology rescues forgetting prediction where parameter count fails (params rho = -0.92 wrong direction; +topology rho = 0.34; permutation p = 0.037). However, this p-value does not survive Bonferroni correction across 3 datasets (adjusted alpha = 0.0167), making the result suggestive rather than confirmed. On RESISC-45 (satellite scenes), topology does not help at all (perm p = 0.566). Topology appears to matter on fine-grained visual tasks but not on satellite imagery.

Finding 2: On easy tasks, parameter count is all you need

On CIFAR-100 (n=19), parameter count shows rho = -0.76, p = 0.0002, and survives Bonferroni correction. Topology adds nothing beyond what scale already explains. Bigger models simply retain better on easy benchmarks.

Finding 3: Task domain, not just difficulty, determines whether topology predicts forgetting

The picture is more nuanced than "topology helps on hard tasks." CIFAR-100 (easy): scale dominates. CUB-200 (fine-grained): topology is suggestive. RESISC-45 (satellite): topology does not help despite being a non-trivial task. The domain itself matters. Fine-grained visual discrimination may create loss landscape structures that topological features can capture, while satellite scene classification does not.

Finding 4: H0 predicts EWC benefit across datasets (strongest cross-dataset signal)

The most robust finding across all 3 datasets: H0 persistence (connected components) predicts how much a model benefits from Elastic Weight Consolidation. On CIFAR-100: rho = 0.76, p = 0.0002. On RESISC-45: rho = 0.86, p = 2.4e-6. The Phase 6 pooled interaction analysis formally confirms that dataset moderates this relationship (block permutation p = 0.046), with per-dataset H0 partial effects excluding zero on CIFAR-100 (CI [+0.005, +0.062]) and RESISC-45 (CI [+0.004, +0.012]) but not CUB-200 (CI [-0.008, +0.013]). Models with more fragmented loss landscapes (higher H0) benefit more from EWC regularization. This makes topology a mitigation sensitivity marker, telling you not just whether a model will forget, but how much a specific intervention will help.

Finding 5: WRN width ladder confirms universal H0 monotonicity

The WRN-28-k ladder (k=1,2,4,6,8,10) shows H0 perfectly monotonic with width on all 3 datasets (rho = -1.0 on CIFAR-100, CUB-200, and RESISC-45). Wider networks universally produce smoother loss landscapes. Cubical vs Ripser H1 agreement is also perfect (rho = 1.0) on all 3 datasets, confirming methodological robustness.

6. WRN Width Ladder

Status: Complete · Design: WRN-28-k, k = 1, 2, 4, 6, 8, 10

The WRN width ladder holds architecture constant (WideResNet-28 with identical depth, skip connections, and training protocol) while varying only the width multiplier k. This scales parameter count from roughly 0.4M (k=1) to 36.5M (k=10) within a single architecture family, isolating scale from inductive bias.

Key Results

H0 monotonic with width (universal)

H0 persistence is perfectly monotonic with width multiplier (rho = -1.0) on all 3 datasets (CIFAR-100, CUB-200, RESISC-45). Wider networks universally produce smoother loss landscapes with fewer connected components.

Direction flip across datasets

CIFAR-100: H0 vs retention rho = 0.71 (suggestive). CUB-200: H0 vs retention rho = -0.83, p = 0.04 (opposite direction). The relationship between topology and forgetting varies by domain, even though H0 monotonicity with scale is universal.

Config | Width k | Params | Status
WRN-28-1 | 1 | 0.4M | Complete
WRN-28-2 | 2 | 1.5M | Complete
WRN-28-4 | 4 | 5.9M | Complete
WRN-28-6 | 6 | 13.2M | Complete
WRN-28-8 | 8 | 23.4M | Complete
WRN-28-10 | 10 | 36.5M | Complete

7. Cross-Domain Validation

CIFAR-100

19 / 19 architectures complete

Standard object recognition. 50 classes per task. Parameter count dominates (rho = -0.76, p = 0.0002). Topology redundant on this easy benchmark.

CUB-200-2011

19 / 19 architectures complete

Fine-grained bird classification. 200 species. Topology rescues prediction (permutation p = 0.037, suggestive but does not survive Bonferroni across 3 datasets) where parameter count fails (rho = -0.27, not significant).

NWPU-RESISC45

19 / 19 architectures complete

Satellite remote sensing scenes. 45 classes. Topology does NOT help predict forgetting (perm p = 0.566 ret@100, p = 0.628 ret@10, p = 0.743 early_aurc). However, H0 strongly predicts EWC benefit (rho = 0.86, p = 2.4e-6).

8. Discussion and Next Steps

With all 57 configurations complete across 3 datasets, the picture is clear: topology is not a universal predictor of forgetting. The CUB-200 result (p = 0.037) is suggestive but does not survive Bonferroni correction across 3 datasets (adjusted alpha = 0.0167). On RESISC-45, topology provides no forgetting prediction at all (p = 0.566). Topology's predictive value for forgetting is conditional on the visual domain.

However, the strongest cross-dataset signal is not about predicting forgetting directly. H0 persistence (connected components) predicts how much a model benefits from EWC regularization, and this holds on both CIFAR-100 (rho = 0.76, p = 0.0002) and RESISC-45 (rho = 0.86, p = 2.4e-6). This reframes topology's role: it is a mitigation sensitivity marker. It tells you not just that a model might forget, but how much a specific intervention (EWC) will help.

Revised narrative:

  • Topology as conditional forgetting predictor: works on fine-grained CUB-200, not on satellite RESISC-45 or easy CIFAR-100
  • Topology as mitigation sensitivity marker: H0 predicts EWC benefit across datasets (the most robust finding)
  • WRN H0 monotonicity (rho = -1.0) and cubical/Ripser agreement (rho = 1.0) are universal across all 3 datasets

Next steps:

  1. Multi-seed runs for confidence intervals on the CUB-200 finding
  2. Scale to 30+ architectures for more statistical power (target: CUB-200 p < 0.0167 after Bonferroni)
  3. EWC benefit prediction API as the commercially viable product angle
  4. ArXiv publication and NeurIPS/ICML submission

9. Proposed Mechanism: Basin Fragmentation

H0 in persistent homology counts connected components in the sublevel set filtration of the loss landscape. A high H0 count indicates a fragmented landscape with many disconnected basins at low loss values.

We propose the basin fragmentation hypothesis: H0 measures the degree of loss landscape fragmentation, which determines how much curvature-based regularization (EWC) can help by preventing inter-basin drift during sequential training.

High H0 (fragmented landscape)

  • Many disconnected basins at low loss
  • Naive training drifts across basin boundaries
  • EWC prevents inter-basin drift via Fisher penalty
  • Large EWC benefit

Low H0 (smooth landscape)

  • One broad basin; few disconnected regions
  • Naive training perturbs within the same basin
  • EWC penalty addresses a problem that does not exist
  • Small EWC benefit
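The fragmented-vs-smooth contrast above can be made concrete by counting connected components of a single sublevel set on a toy grid, which is the H0 count at one filtration value (a flood-fill sketch, not the persistence pipeline itself):

```python
import numpy as np
from collections import deque

def basin_count(grid, threshold):
    """Number of 4-connected components of the sublevel set
    {loss <= threshold}: the H0 count at one filtration value."""
    h, w = grid.shape
    inside = grid <= threshold
    seen = np.zeros((h, w), dtype=bool)
    count = 0
    for si in range(h):
        for sj in range(w):
            if inside[si, sj] and not seen[si, sj]:
                count += 1                      # new basin found
                queue = deque([(si, sj)])
                seen[si, sj] = True
                while queue:                    # flood-fill the component
                    i, j = queue.popleft()
                    for ni, nj in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
                        if (0 <= ni < h and 0 <= nj < w
                                and inside[ni, nj] and not seen[ni, nj]):
                            seen[ni, nj] = True
                            queue.append((ni, nj))
    return count

# Fragmented toy landscape: three low-loss pockets separated by ridges
frag = np.array([[0.1, 0.9, 0.1],
                 [0.9, 0.9, 0.9],
                 [0.1, 0.9, 0.9]])
print(basin_count(frag, 0.5))   # 3  (fragmented: high H0)
print(basin_count(frag, 1.0))   # 1  (ridges submerged: one component)
```

High H0 persistence corresponds to basins like these staying disconnected over a wide range of thresholds, which is where the hypothesis predicts EWC's anchor penalty pays off.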

The WRN width ladder provides supporting evidence: H0 decreases perfectly with width (rho = -1.0 vs params) across all three datasets, consistent with wider networks having smoother, less fragmented landscapes. The CUB-200 null for EWC benefit (rho = 0.31, p = 0.19) may indicate that fine-grained discrimination creates forgetting through feature-level interference rather than parameter-level basin drift.

This mechanism is tentative. A causal test would require intervening on landscape topology (e.g., via landscape-aware regularization) and measuring the effect on EWC benefit.

10. Limitations and External Validity

What we claim

  • Dataset significantly moderates the H0-EWC benefit relationship (Phase 6, p = 0.046)
  • H0 partial effects on EWC benefit exclude zero on CIFAR-100 and RESISC-45 but not CUB-200
  • On CUB-200, topology provides the only predictive signal for early forgetting (ret@10 CI excludes zero)

What we do not claim

  • That topology universally predicts forgetting (RESISC-45 null)
  • That the EWC moderation finding is confirmatory (it emerged from exploratory analysis)
  • That basin fragmentation is an established causal mechanism

Scope limitations

  • 19 architectures: moderate statistical power; WRN width ladder controls for family but has limited within-ladder degrees of freedom
  • One mitigation method: only EWC tested; if H0 does not predict benefit under Synaptic Intelligence or PackNet, the finding is EWC-specific
  • 2D projections: topology computed on 2D landscape cross-sections, not the full high-dimensional landscape; 5 slices mitigate but do not eliminate sampling variance
  • Borderline p-values: EWC moderation p = 0.046, forgetting ret@100 p = 0.035; CUB-200 ret@10 p = 0.037 does not survive Bonferroni

Falsification targets

  1. Synaptic Intelligence benefit shows no H0 correlation on CIFAR-100 or RESISC-45 (mechanism is EWC-specific)
  2. Adding 10+ architectures eliminates the CUB-200 ret@10 signal (forgetting prediction claim fails)
  3. Landscape intervention (e.g., SAM) changes H0 without changing EWC benefit (causal link is broken)
  4. Cubical persistence disagrees with Ripser-based H0 on the moderation result (measurement is method-dependent)

11. Analysis Path Transparency

The original hypothesis targeted topology as a direct predictor of forgetting. Retention at step 10 was pre-specified as the primary outcome, with ret@100 and early AURC as robustness checks.

CIFAR-100 was run first and showed parameter count dominates (topology null, p = 0.295). CUB-200 was run second and showed topology rescues prediction (p = 0.037). RESISC-45 was run third and returned a null for topology (p = 0.566), falsifying the simpler “topology helps on hard tasks” framing.

The EWC benefit analysis was computed as part of Phase 4 diagnostics, not as the original target hypothesis. The shift from “topology predicts forgetting” to “topology predicts mitigation benefit” emerged from the data after the RESISC-45 null. The Phase 6 pooled interaction model was designed post hoc to formalize the cross-dataset moderation pattern.

The EWC moderation finding (p = 0.046) should be interpreted as a data-driven discovery requiring pre-registered replication, not as a confirmatory result.

12. Reproducibility and Infrastructure

Compute

  • Local GPU cluster (NVIDIA RTX, CUDA)
  • PyTorch 2.x with mixed precision
  • Training seed = 42 (deterministic)
  • Landscape seed randomized, logged per run

Topology

  • Ripser (Vietoris-Rips, sparse mode)
  • GUDHI (cubical PH validation)
  • scikit-tda ecosystem
  • 5 random slices per architecture

Tracking and Versioning

  • Version-controlled YAML configs (57 total)
  • Full dependency pinning (pyproject.toml)
  • Structured JSON output with all metrics
  • Flask dashboard for experiment management

References

  • Ballester, R. and Araujo, X. (2020). On the interplay between topological data analysis and deep learning. NeurIPS Workshop on TDA.
  • Boissonnat, J.-D. et al. (2018). Geometric and Topological Inference. Cambridge University Press.
  • Draxler, F. et al. (2018). Essentially no barriers in neural network energy landscape. ICML.
  • Kirkpatrick, J. et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
  • Li, H. et al. (2018). Visualizing the loss landscape of neural nets. NeurIPS.
  • McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks. Psychology of Learning and Motivation, 24, 109-165.
  • Otter, N. et al. (2017). A roadmap for the computation of persistent homology. EPJ Data Science, 6(1), 1-38.
  • Rieck, B. et al. (2019). Neural persistence: a complexity measure for deep neural networks using algebraic topology. ICLR.
  • Tononi, G. and Cirelli, C. (2014). Sleep and the price of plasticity. Neuron, 81(1), 12-34.