Topological Signatures of Knowledge Persistence in Continual Learning Systems
Axion Deep Labs · February 2026 · Preliminary Proof-of-Concept (Small-Scale) · Phase I Scale Validation Planned (Supercomputer Required)
Abstract
Preliminary proof-of-concept: We investigate whether the topological structure of neural network loss landscapes predicts resistance to catastrophic forgetting. Across 19 small-to-medium architectures (0.3M-44.7M parameters) and 3 image-classification datasets (CIFAR-100, CUB-200-2011, NWPU-RESISC45), we compute persistent homology on 50x50 loss landscape grids using 5 independent random 2D slices. The most stable signal: H0 persistence predicts EWC mitigation benefit (CIFAR-100 rho = 0.76, RESISC-45 rho = 0.86). These results are preliminary, established on models well below production scale. The critical open question for Phase I is whether the topological signal survives at 100M-7B+ parameters, over long task sequences, and across diverse continual learning methods; answering it requires supercomputer resources and potentially novel distributed persistent homology algorithms.
Results at a Glance
| Metric | Value | Note |
|---|---|---|
| CUB-200 key result | p = 0.037 | Suggestive (does not survive Bonferroni) |
| Params alone (CUB) | rho = -0.92 | Wrong direction without topology |
| +Topology (CUB) | rho = 0.34 | Prediction rescued |
| MAE reduction | 17.5% | 0.186 to 0.154 with topology |
| Configs complete | 57 / 57 | 19 archs x 3 datasets done |
| RESISC-45 topology | p = 0.566 | Topology does not help on satellite scenes |
| Params vs ret (CIFAR) | rho = -0.76 | p = 0.0002, survives Bonferroni |
| Topology on CIFAR-100 | Not sig. | Redundant on easy tasks |
| EWC benefit (RESISC) | rho = 0.86 | H0 predicts EWC benefit, p = 2.4e-6 |
| EWC benefit (CIFAR) | rho = 0.76 | H0 predicts EWC benefit, p = 0.0002 |
| WRN H0 monotonicity | rho = -1.0 | Perfect on all 3 datasets |
| Cubical vs Ripser | rho = 1.0 | H1 agreement on all 3 datasets |
1. Background and Motivation
Catastrophic forgetting, the tendency of neural networks to lose previously learned knowledge when trained on new tasks, remains one of the most fundamental unsolved problems in machine learning (McCloskey and Cohen, 1989). Every major mitigation strategy (replay buffers, elastic weight consolidation, progressive networks) manages the symptom rather than addressing the underlying geometric cause.
Topological Data Analysis (TDA) has emerged as a tool for characterizing loss landscape geometry (Ballester and Araujo, 2020). Persistent homology extracts topological features that persist across the scales of a filtration; here we track H0 (connected components) and H1 (loops/tunnels).
This experiment tests whether H1 persistence (topological loop structure) in the loss landscape predicts catastrophic forgetting resistance, and whether this signal is independent of model scale.
2. Methodology
Experimental Pipeline
Datasets (3 Domains)
- CIFAR-100 (19/19 architectures complete): Split into Task A (classes 0-49) and Task B (classes 50-99). Standard augmentation.
- CUB-200-2011 (19/19 complete): Fine-grained bird classification. 200 species, cross-domain validation.
- NWPU-RESISC45 (19/19 architectures complete): Satellite remote sensing. 45 scene classes, cross-domain validation.
Training Protocol
- SGD with momentum (0.9), weight decay 5x10^-4, cosine annealing with warmup (5-10 epochs), batch size 128
- Task A: 100 epochs to convergence
- Phase 3 variants: naive sequential, EWC (lambda=400), cosine LR schedule
- Retention metrics: ret@100 and ret@10 (Task A accuracy retained after 100 and 10 optimizer steps of Task B training, respectively)
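For reference, the EWC term added to the Task B loss in the Phase 3 variants follows the standard quadratic penalty of Kirkpatrick et al. (2017). A minimal numpy sketch, assuming a diagonal Fisher estimate (variable names are illustrative):

```python
import numpy as np

def ewc_penalty(params, params_star, fisher, lam=400.0):
    """Quadratic EWC penalty (Kirkpatrick et al., 2017):
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2,
    where theta* are the Task A weights and F is a diagonal Fisher
    information estimate computed on Task A."""
    return 0.5 * lam * sum(
        float(np.sum(f * (p - ps) ** 2))
        for p, ps, f in zip(params, params_star, fisher))
```

The penalty is zero at the Task A solution and grows as Task B training pulls parameters away in directions the Fisher estimate marks as important.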
Loss Landscape Sampling
- 50x50 grid (2,500 evaluation points) along 2 filter-normalized random directions (Li et al., 2018)
- Range: [-1.0, 1.0]
- 5 independent random 2D slices per architecture (landscape seed randomized but logged)
- Sublevel set filtration with lower-star construction
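The slice construction above can be sketched as follows (a minimal numpy version of the Li et al. (2018) filter normalization, assuming per-layer weight arrays with filters along axis 0):

```python
import numpy as np

def filter_normalized_direction(weights, rng):
    """Random direction with each filter rescaled to the norm of the
    corresponding filter in `weights` (Li et al., 2018). `weights` is
    a list of per-layer arrays with filters along axis 0."""
    direction = []
    for w in weights:
        d = rng.standard_normal(w.shape)
        if w.ndim > 1:
            for i in range(w.shape[0]):  # per-filter rescaling
                d[i] *= np.linalg.norm(w[i]) / (np.linalg.norm(d[i]) + 1e-10)
        direction.append(d)
    return direction

def slice_coordinates(n=50, lo=-1.0, hi=1.0):
    """(alpha, beta) coordinates of the n x n evaluation grid; the loss
    is evaluated at w + alpha * d1 + beta * d2 for two directions."""
    ticks = np.linspace(lo, hi, n)
    return [(a, b) for a in ticks for b in ticks]
```

With n = 50 and range [-1.0, 1.0] this yields the 2,500 evaluation points per slice described above; two independent directions d1, d2 define each of the 5 slices.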
Persistent Homology
- Primary: Ripser (Vietoris-Rips, sparse mode)
- Validation: GUDHI cubical persistent homology (Phase 2c)
- Dimensions: H0 (connected components), H1 (loops)
- Primary metric: H1 total persistence = sum of (death_i - birth_i) for all H1 features
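The summary statistic itself is simple once a persistence diagram is in hand. A minimal sketch, assuming diagrams as (birth, death) pairs, the form both Ripser and GUDHI provide:

```python
import math

def total_persistence(diagram):
    """Sum of finite bar lengths (death - birth) in one homology
    dimension; applied to the H1 diagram this gives the primary
    metric. Infinite bars (e.g. the essential H0 class) are excluded."""
    return sum(d - b for b, d in diagram if math.isfinite(d))
```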
19 Architectures Under Study
Original Architectures (14)
ResNet-18, ResNet-50, ResNet-18 Wide
WRN-28-10, DenseNet-121
MobileNet-V3-Small, ShuffleNet-V2
EfficientNet-B0, RegNet-Y-400MF
ViT-Tiny, ViT-Small
MLP-Mixer, ConvNeXt-Tiny, VGG-16-BN
WRN Width Ladder (5 additional)
WRN-28-1, WRN-28-2, WRN-28-4, WRN-28-6, WRN-28-8
Same architecture, varying only width multiplier k. Isolates parameter count from architectural inductive bias. All complete on CIFAR-100, CUB-200, and RESISC-45.
3. Statistical Framework
Primary Analysis
- Spearman rank correlation (non-parametric)
- Bonferroni correction across hypothesis tests
- Permutation test: 1,000 shuffles for empirical p-values
- Leave-one-architecture-out Ridge regression with nested alpha selection
- Matched-dimensionality null control (1,000 random feature draws)
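The rank correlation and permutation test above can be sketched as follows (a minimal version assuming tie-free continuous metrics, which holds for the retention and persistence values here):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks
    (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def permutation_p(x, y, n_perm=1000, seed=0):
    """Two-sided empirical p-value: the fraction of label shuffles
    whose |rho| meets or exceeds the observed |rho|, with add-one
    smoothing so p is never exactly zero."""
    rng = np.random.default_rng(seed)
    obs = abs(spearman_rho(x, y))
    hits = sum(abs(spearman_rho(x, rng.permutation(y))) >= obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```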
Confound Controls
- Partial Spearman correlation (H1 | parameter count)
- Cross-dataset replication (CIFAR-100, CUB-200, RESISC-45)
- Within-family analysis (CNN-only) to control architecture type
- WRN width ladder: same architecture, varying only params
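The partial Spearman control in the list above can be sketched by rank-transforming all variables and regressing the confound out of both sides (a minimal version assuming no ties; `z` here plays the role of parameter count):

```python
import numpy as np

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with z partialled out:
    rank-transform all three, regress the z-ranks out of the x- and
    y-ranks by least squares, then correlate the residuals."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    rx, ry, rz = rank(x), rank(y), rank(z)
    A = np.column_stack([np.ones_like(rz), rz])  # intercept + z-ranks
    resid = lambda a: a - A @ np.linalg.lstsq(A, a, rcond=None)[0]
    return float(np.corrcoef(resid(rx), resid(ry))[0, 1])
```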
4. Results
Cross-Dataset Predictive Model (Phase 5)
Leave-one-architecture-out Ridge regression with permutation testing. Compares params-only vs. params+topology models across datasets.
| Dataset | Outcome | Params-only rho | +Topology rho | Perm. p | Verdict |
|---|---|---|---|---|---|
| CIFAR-100 (n=19) | ret@100 | 0.43 | 0.30 | 0.295 | Not significant |
| CUB-200 (n=19) | ret@10 | -0.92 | 0.34 | 0.037 | Suggestive |
| RESISC-45 (n=19) | ret@100 | -- | -- | 0.566 | Not significant |
| RESISC-45 (n=19) | ret@10 | -- | -- | 0.628 | Not significant |
| RESISC-45 (n=19) | early_aurc | -- | -- | 0.743 | Not significant |
On CIFAR-100, parameter count alone explains forgetting and topology adds nothing. On CUB-200, parameter count predicts in the wrong direction and topology rescues the prediction (suggestive at p = 0.037 but does not survive Bonferroni across 3 datasets, adjusted alpha = 0.0167). On RESISC-45, topology does not help predict forgetting on any metric.
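The leave-one-architecture-out protocol with nested alpha selection behind this table can be sketched as follows (a minimal numpy version; the alpha grid is illustrative, not the one used in the experiments):

```python
import numpy as np

def loo_ridge(X, y, alphas=(0.01, 0.1, 1.0, 10.0)):
    """Leave-one-out ridge with nested alpha selection: for each
    held-out row, choose alpha by inner leave-one-out error on the
    remaining rows, then predict the held-out target."""
    n = len(y)
    def fit(Xt, yt, a):
        return np.linalg.solve(Xt.T @ Xt + a * np.eye(Xt.shape[1]), Xt.T @ yt)
    preds = np.empty(n)
    for i in range(n):
        outer = np.delete(np.arange(n), i)
        best_alpha, best_err = alphas[0], np.inf
        for a in alphas:
            err = 0.0
            for j in range(len(outer)):  # inner LOO for alpha selection
                inner = np.delete(outer, j)
                w = fit(X[inner], y[inner], a)
                err += float((X[outer[j]] @ w - y[outer[j]]) ** 2)
            if err < best_err:
                best_alpha, best_err = a, err
        preds[i] = X[i] @ fit(X[outer], y[outer], best_alpha)
    return preds
```

Rank-correlating `preds` against the actual retention values yields the rho columns in the table; `X` holds either the params-only or the params+topology feature set.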
CIFAR-100 Results (n=19, Easy Benchmark)
All 19 architectures sorted by ret@100. On this easy benchmark, bigger models simply retain better.
| Architecture | Params | Task A Acc. | ret@100 | ret@10 | H1 Pers. | Type |
|---|---|---|---|---|---|---|
| ViT-Tiny | 0.3M | 52.7% | 22.5% | 95.9% | 0.01 | Transformer |
| ShuffleNet-V2 | 1.3M | 76.8% | 17.3% | 84.7% | 0.79 | CNN |
| ViT-Small | 2.2M | 62.2% | 9.6% | 94.7% | 0.24 | Transformer |
| MobileNet-V3-S | 1.1M | 68.6% | 7.6% | 75.0% | 1.89 | CNN |
| EfficientNet-B0 | 4.1M | 76.6% | 7.1% | 78.6% | 1.91 | CNN |
| WRN-28-1 | 0.4M | 71.7% | 6.6% | 51.0% | 0.00 | WRN-ladder |
| RegNet-Y-400MF | 4.0M | 72.2% | 2.0% | 54.1% | 0.05 | CNN |
| WRN-28-2 | 1.5M | 78.6% | 1.1% | 22.8% | 0.00 | WRN-ladder |
| VGG-16-BN | 14.8M | 78.4% | 0.8% | 88.0% | 0.00 | CNN |
| WRN-28-8 | 23.4M | 82.9% | 0.7% | 4.4% | 0.01 | WRN-ladder |
| WRN-28-4 | 5.9M | 81.8% | 0.3% | 8.5% | 0.02 | WRN-ladder |
| WRN-28-10 | 36.5M | 84.0% | 0.3% | 5.3% | 0.07 | WRN-ladder |
| ResNet-18 | 11.2M | 82.0% | 0.2% | 46.7% | 0.00 | CNN |
| WRN-28-6 | 13.2M | 82.8% | 0.1% | 4.5% | 0.02 | WRN-ladder |
| ResNet-50 | 23.7M | 83.6% | 0.1% | 56.0% | 0.00 | CNN |
| DenseNet-121 | 7.1M | 84.5% | 0.05% | 25.7% | 0.01 | CNN |
| MLP-Mixer | 2.3M | 61.5% | 0.03% | 0.03% | 0.12 | MLP |
| ConvNeXt-Tiny | 27.9M | 56.7% | 0.0% | 45.0% | 0.00 | CNN |
| ResNet-18 Wide | 44.7M | 83.1% | 0.0% | 29.7% | 0.00 | CNN |
CIFAR-100 Phase 4 Correlation Analysis
- Parameter count vs ret@100: rho = -0.76, p = 0.0002, survives Bonferroni
- H1 persistence vs ret@100: rho = 0.47, p = 0.042, does NOT survive Bonferroni
- Partial H1 | params: rho = 0.33, p = 0.19, not significant
- Conclusion: On this easy task, parameter count dominates
CUB-200 Results (n=19, Hard Fine-Grained)
Top architectures by retention on CUB-200-2011 (200 bird species). Parameter count fails as a predictor on this hard benchmark.
| Architecture | Params | ret@100 | Type |
|---|---|---|---|
| ViT-Tiny | 0.3M | 31.1% | Transformer |
| ViT-Small | 2.2M | 23.4% | Transformer |
| WRN-28-10 | 36.5M | 8.1% | WRN-ladder |
| WRN-28-8 | 23.4M | 5.0% | WRN-ladder |
| EfficientNet-B0 | 4.1M | 3.5% | CNN |
| ShuffleNet-V2 | 1.3M | 2.8% | CNN |
| WRN-28-6 | 13.2M | 2.4% | WRN-ladder |
| DenseNet-121 | 7.1M | 1.9% | CNN |
CUB-200 Phase 4 Correlations
- Parameter count vs ret@100: rho = -0.27, p = 0.27 (NOT significant)
- Parameter count fails on hard tasks
CUB-200 Phase 5 (ret@10 Detail)
- Params alone: rho = -0.92 (wrong direction)
- Params + topology: rho = 0.34 (rescued)
- Topology alone: rho = 0.33, MAE = 0.147
- Permutation test: p = 0.037
- Matched-dimensionality control: exceeds 95th percentile
- MAE reduction: 17.5%
Phase 6: Pooled Interaction Analysis (n=57)
Formal test of dataset moderation via OLS with clustered bootstrap. All 57 configurations pooled across 3 datasets with dataset x topology interaction terms.
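The clustered bootstrap behind these confidence intervals can be sketched as follows (a minimal percentile-CI version; cluster = architecture, so a model's rows across the three datasets stay together):

```python
import numpy as np

def cluster_bootstrap_ci(y, X, clusters, n_boot=1000, seed=0):
    """Percentile CIs for OLS coefficients under a clustered bootstrap:
    whole clusters are resampled with replacement, preserving
    within-cluster dependence."""
    rng = np.random.default_rng(seed)
    ids = np.unique(clusters)
    coefs = []
    for _ in range(n_boot):
        pick = rng.choice(ids, size=len(ids), replace=True)
        rows = np.concatenate([np.flatnonzero(clusters == c) for c in pick])
        b, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        coefs.append(b)
    return np.percentile(np.array(coefs), [2.5, 97.5], axis=0)
```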
EWC Benefit Moderation Test
| Quantity | Estimate | Note |
|---|---|---|
| Block permutation p | 0.046 | Dataset moderates the H0 -> EWC benefit relationship |
| CIFAR-100 H0 effect | +0.016 | CI [+0.005, +0.062] excludes zero |
| CUB-200 H0 effect | +0.002 | CI [-0.008, +0.013] includes zero |
| RESISC-45 H0 effect | +0.007 | CI [+0.004, +0.012] excludes zero |
H0 partial effects on EWC benefit per dataset. CIs from clustered bootstrap. CIFAR-100 and RESISC-45 confidence intervals exclude zero, confirming the per-dataset correlations. CUB-200 CI includes zero.
Forgetting Prediction Moderation Test
| Quantity | Estimate | Note |
|---|---|---|
| ret@10 block permutation p | 0.196 | Not significant overall |
| ret@100 block permutation p | 0.035 | Significant moderation |
| CIFAR-100 H0 on ret@10 | -0.001 | CI [-0.486, +0.073] includes zero |
| CUB-200 H0 on ret@10 | -0.123 | CI [-0.183, -0.046] excludes zero |
| RESISC-45 H0 on ret@10 | -0.021 | CI [-0.264, +0.083] includes zero |
Bottom line: Dataset significantly moderates the topology-EWC benefit relationship (permutation p = 0.046), with H0 predicting EWC benefit on CIFAR-100 and RESISC-45 (CIs excluding zero) but not CUB-200. For forgetting prediction, the ret@100 block test is significant (p = 0.035) and CUB-200 is the only dataset where H0 CI on ret@10 excludes zero, consistent with Phase 5 findings that topology rescues prediction specifically on fine-grained tasks.
5. Key Findings
Finding 1: Topology is a conditional predictor, not a universal one
On CUB-200 (fine-grained birds), topology rescues forgetting prediction where parameter count fails (params rho = -0.92 wrong direction; +topology rho = 0.34; permutation p = 0.037). However, this p-value does not survive Bonferroni correction across 3 datasets (adjusted alpha = 0.0167), making the result suggestive rather than confirmed. On RESISC-45 (satellite scenes), topology does not help at all (perm p = 0.566). Topology appears to matter on fine-grained visual tasks but not on satellite imagery.
Finding 2: On easy tasks, parameter count is all you need
On CIFAR-100 (n=19), parameter count shows rho = -0.76, p = 0.0002, and survives Bonferroni correction. Topology adds nothing beyond what scale already explains. Bigger models simply retain better on easy benchmarks.
Finding 3: Task domain, not just difficulty, determines whether topology predicts forgetting
The picture is more nuanced than "topology helps on hard tasks." CIFAR-100 (easy): scale dominates. CUB-200 (fine-grained): topology is suggestive. RESISC-45 (satellite): topology does not help despite being a non-trivial task. The domain itself matters. Fine-grained visual discrimination may create loss landscape structures that topological features can capture, while satellite scene classification does not.
Finding 4: H0 predicts EWC benefit across datasets (strongest cross-dataset signal)
The most robust finding across all 3 datasets: H0 persistence (connected components) predicts how much a model benefits from Elastic Weight Consolidation. On CIFAR-100: rho = 0.76, p = 0.0002. On RESISC-45: rho = 0.86, p = 2.4e-6. The Phase 6 pooled interaction analysis formally confirms that dataset moderates this relationship (block permutation p = 0.046), with per-dataset H0 partial effects excluding zero on CIFAR-100 (CI [+0.005, +0.062]) and RESISC-45 (CI [+0.004, +0.012]) but not CUB-200 (CI [-0.008, +0.013]). Models with more fragmented loss landscapes (higher H0) benefit more from EWC regularization. This makes topology a mitigation sensitivity marker, telling you not just whether a model will forget, but how much a specific intervention will help.
Finding 5: WRN width ladder confirms universal H0 monotonicity
The WRN-28-k ladder (k=1,2,4,6,8,10) shows H0 perfectly monotonic with width on all 3 datasets (rho = -1.0 on CIFAR-100, CUB-200, and RESISC-45). Wider networks universally produce smoother loss landscapes. Cubical vs Ripser H1 agreement is also perfect (rho = 1.0) on all 3 datasets, confirming methodological robustness.
6. WRN Width Ladder
Design: WRN-28-k, k = 1, 2, 4, 6, 8, 10
The WRN width ladder holds architecture constant (WideResNet-28 with identical depth, skip connections, and training protocol) while varying only the width multiplier k. This scales parameter count from roughly 0.4M (k=1) to 36.5M (k=10) within a single architecture family, isolating scale from inductive bias.
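The roughly quadratic growth of parameter count in k (each conv weight tensor has shape out_width x in_width x 3 x 3) can be checked against the reported counts; the k=1 base value below is an approximation chosen for illustration:

```python
# WRN-28-k conv parameters grow ~quadratically in the width multiplier k.
# The base count at k=1 is approximate (assumption for illustration).
base = 0.37e6
reported = {1: 0.4e6, 2: 1.5e6, 4: 5.9e6, 6: 13.2e6, 8: 23.4e6, 10: 36.5e6}
for k, params in reported.items():
    assert abs(base * k * k - params) / params < 0.15  # within 15%
```

The quadratic fit is why the ladder spans two orders of magnitude in parameter count while changing nothing else about the architecture.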
Key Results
H0 monotonic with width (universal)
H0 persistence is perfectly monotonic with width multiplier (rho = -1.0) on all 3 datasets (CIFAR-100, CUB-200, RESISC-45). Wider networks universally produce smoother loss landscapes with fewer connected components.
Direction flip across datasets
CIFAR-100: H0 vs retention rho = 0.71 (suggestive). CUB-200: H0 vs retention rho = -0.83, p = 0.04 (opposite direction). The relationship between topology and forgetting varies by domain, even though H0 monotonicity with scale is universal.
| Config | Width k | Params | Status |
|---|---|---|---|
| WRN-28-1 | 1 | 0.4M | Complete |
| WRN-28-2 | 2 | 1.5M | Complete |
| WRN-28-4 | 4 | 5.9M | Complete |
| WRN-28-6 | 6 | 13.2M | Complete |
| WRN-28-8 | 8 | 23.4M | Complete |
| WRN-28-10 | 10 | 36.5M | Complete |
7. Cross-Domain Validation
CIFAR-100
19 / 19 architectures complete
Standard object recognition. 50 classes per task. Parameter count dominates (rho = -0.76, p = 0.0002). Topology redundant on this easy benchmark.
CUB-200-2011
19 / 19 architectures complete
Fine-grained bird classification. 200 species. Topology rescues prediction (permutation p = 0.037, suggestive but does not survive Bonferroni across 3 datasets) where parameter count fails (rho = -0.27, not significant).
NWPU-RESISC45
19 / 19 architectures complete
Satellite remote sensing scenes. 45 classes. Topology does NOT help predict forgetting (perm p = 0.566 ret@100, p = 0.628 ret@10, p = 0.743 early_aurc). However, H0 strongly predicts EWC benefit (rho = 0.86, p = 2.4e-6).
8. Discussion and Next Steps
With all 57 configurations complete across 3 datasets, the picture is clear: topology is not a universal predictor of forgetting. The CUB-200 result (p = 0.037) is suggestive but does not survive Bonferroni correction across 3 datasets (adjusted alpha = 0.0167). On RESISC-45, topology provides no forgetting prediction at all (p = 0.566). Topology's predictive value for forgetting is conditional on the visual domain.
However, the strongest cross-dataset signal is not about predicting forgetting directly. H0 persistence (connected components) predicts how much a model benefits from EWC regularization, and this holds on both CIFAR-100 (rho = 0.76, p = 0.0002) and RESISC-45 (rho = 0.86, p = 2.4e-6). This reframes topology's role: it is a mitigation sensitivity marker. It tells you not just that a model might forget, but how much a specific intervention (EWC) will help.
Revised narrative:
- Topology as conditional forgetting predictor: works on fine-grained CUB-200, not on satellite RESISC-45 or easy CIFAR-100
- Topology as mitigation sensitivity marker: H0 predicts EWC benefit across datasets (the most robust finding)
- WRN H0 monotonicity (rho = -1.0) and cubical/Ripser agreement (rho = 1.0) are universal across all 3 datasets
Next steps:
- Multi-seed runs for confidence intervals on the CUB-200 finding
- Scale to 30+ architectures for more statistical power (target: CUB-200 p < 0.0167 after Bonferroni)
- EWC benefit prediction API as the commercially viable product angle
- arXiv preprint and NeurIPS/ICML submission
9. Proposed Mechanism: Basin Fragmentation
H0 in persistent homology counts connected components in the sublevel set filtration of the loss landscape. A high H0 count indicates a fragmented landscape with many disconnected basins at low loss values.
We propose the basin fragmentation hypothesis: H0 measures the degree of loss landscape fragmentation, which determines how much curvature-based regularization (EWC) can help by preventing inter-basin drift during sequential training.
High H0 (fragmented landscape)
- Many disconnected basins at low loss
- Naive training drifts across basin boundaries
- EWC prevents inter-basin drift via Fisher penalty
- Large EWC benefit
Low H0 (smooth landscape)
- One broad basin; few disconnected regions
- Naive training perturbs within the same basin
- EWC penalty addresses a problem that does not exist
- Small EWC benefit
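The H0 count driving this contrast can be illustrated with a toy sublevel-set computation (a minimal union-find sketch on a small grid, not the Ripser/GUDHI pipeline used in the experiments):

```python
import numpy as np

def sublevel_h0_count(grid, threshold):
    """Count connected components of {loss <= threshold} on a 2D grid
    (4-connectivity). Many components at low thresholds = fragmented
    basins (high H0); one component = a single broad basin (low H0)."""
    mask = grid <= threshold
    parent = {(i, j): (i, j)
              for i in range(grid.shape[0])
              for j in range(grid.shape[1]) if mask[i, j]}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for (i, j) in list(parent):
        for nb in ((i + 1, j), (i, j + 1)):  # right and down neighbors
            if nb in parent:
                ra, rb = find((i, j)), find(nb)
                if ra != rb:
                    parent[ra] = rb
    return len({find(p) for p in parent})
```

On a toy landscape with two separated low-loss valleys, the count is 2 at a low threshold and drops to 1 once the threshold clears the barrier between them; full persistent homology tracks exactly these merge events across all thresholds.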
The WRN width ladder provides supporting evidence: H0 decreases perfectly with width (rho = -1.0 vs params) across all three datasets, consistent with wider networks having smoother, less fragmented landscapes. The CUB-200 null for EWC benefit (rho = 0.31, p = 0.19) may indicate that fine-grained discrimination creates forgetting through feature-level interference rather than parameter-level basin drift.
This mechanism is tentative. A causal test would require intervening on landscape topology (e.g., via landscape-aware regularization) and measuring the effect on EWC benefit.
10. Limitations and External Validity
What we claim
- Dataset significantly moderates the H0-EWC benefit relationship (Phase 6, p = 0.046)
- H0 partial effects on EWC benefit exclude zero on CIFAR-100 and RESISC-45 but not CUB-200
- On CUB-200, topology provides the only predictive signal for early forgetting (ret@10 CI excludes zero)
What we do not claim
- That topology universally predicts forgetting (RESISC-45 null)
- That the EWC moderation finding is confirmatory (it emerged from exploratory analysis)
- That basin fragmentation is an established causal mechanism
Scope limitations
- 19 architectures: moderate statistical power; WRN width ladder controls for family but has limited within-ladder degrees of freedom
- One mitigation method: only EWC tested; if H0 does not predict benefit under Synaptic Intelligence or PackNet, the finding is EWC-specific
- 2D projections: topology computed on 2D landscape cross-sections, not the full high-dimensional landscape; 5 slices mitigate but do not eliminate sampling variance
- Borderline p-values: EWC moderation p = 0.046, forgetting ret@100 p = 0.035; CUB-200 ret@10 p = 0.037 does not survive Bonferroni
Falsification targets
- Synaptic Intelligence benefit shows no H0 correlation on CIFAR-100 or RESISC-45 (mechanism is EWC-specific)
- Adding 10+ architectures eliminates the CUB-200 ret@10 signal (forgetting prediction claim fails)
- Landscape intervention (e.g., SAM) changes H0 without changing EWC benefit (causal link is broken)
- Cubical persistence disagrees with Ripser-based H0 on the moderation result (measurement is method-dependent)
11. Analysis Path Transparency
The original hypothesis targeted topology as a direct predictor of forgetting. Retention at step 10 was pre-specified as the primary outcome, with ret@100 and early AURC as robustness checks.
CIFAR-100 was run first and showed parameter count dominates (topology null, p = 0.295). CUB-200 was run second and showed topology rescues prediction (p = 0.037). RESISC-45 was run third and returned a null for topology (p = 0.566), falsifying the simpler "topology helps on hard tasks" framing.
The EWC benefit analysis was computed as part of Phase 4 diagnostics, not as the original target hypothesis. The shift from "topology predicts forgetting" to "topology predicts mitigation benefit" emerged from the data after the RESISC-45 null. The Phase 6 pooled interaction model was designed post hoc to formalize the cross-dataset moderation pattern.
The EWC moderation finding (p = 0.046) should be interpreted as a data-driven discovery requiring pre-registered replication, not as a confirmatory result.
12. Reproducibility and Infrastructure
Compute
- Local GPU cluster (NVIDIA RTX, CUDA)
- PyTorch 2.x with mixed precision
- Training seed = 42 (deterministic)
- Landscape seed randomized, logged per run
Topology
- Ripser (Vietoris-Rips, sparse mode)
- GUDHI (cubical PH validation)
- scikit-tda ecosystem
- 5 random slices per architecture
Tracking and Versioning
- Version-controlled YAML configs (57 total)
- Full dependency pinning (pyproject.toml)
- Structured JSON output with all metrics
- Flask dashboard for experiment management
References
- Ballester, R. and Araujo, X. (2020). On the interplay between topological data analysis and deep learning. NeurIPS Workshop on TDA.
- Boissonnat, J.-D. et al. (2018). Geometric and Topological Inference. Cambridge University Press.
- Draxler, F. et al. (2018). Essentially no barriers in neural network energy landscape. ICML.
- Kirkpatrick, J. et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
- Li, H. et al. (2018). Visualizing the loss landscape of neural nets. NeurIPS.
- McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks. Psychology of Learning and Motivation, 24, 109-165.
- Otter, N. et al. (2017). A roadmap for the computation of persistent homology. EPJ Data Science, 6(1), 1-38.
- Rieck, B. et al. (2019). Neural persistence: a complexity measure for deep neural networks using algebraic topology. ICLR.
- Tononi, G. and Cirelli, C. (2014). Sleep and the price of plasticity. Neuron, 81(1), 12-34.
