EXPERIMENTAL PROTOCOLS
Priority Experiments
Detailed scope, methodology, and results for three priority experiments. EXP-01 (PERSIST) has completed preliminary proof-of-concept across 57 small-scale configurations. Phase I scale validation to production-size models (100M-7B+ parameters) is planned, requiring supercomputer resources.
Topological Signatures of Knowledge Persistence in Continual Learning Systems
Investigating whether the topological structure of a neural network's loss landscape predicts its resistance to catastrophic forgetting during sequential task training.
Preliminary Results (19 Architectures, 3 Datasets, Small-Scale Proof-of-Concept)
Preliminary proof-of-concept (all models under 45M params): Cross-dataset analysis (57/57 configs, 3 datasets) reveals topology is a conditional predictor at small scale. On CIFAR-100 (easy), params dominate (ρ = -0.76). On CUB-200, topology rescues prediction (perm. p = 0.037, suggestive). On RESISC-45, topology does not help (p = 0.566). Most stable signal: H₀ predicts EWC benefit (CIFAR-100 ρ = 0.76, RESISC-45 ρ = 0.86). Whether these patterns survive at production scale (100M-7B+ params) is the critical open research question for Phase I, requiring supercomputer resources.
| Architecture | Params | Acc. | H₁ | Ret@100 | Type |
|---|---|---|---|---|---|
| ViT-Tiny | 0.3M | 52.7% | 0.01 | 22.5% | Transformer |
| ShuffleNet-V2 | 1.3M | 76.8% | 0.79 | 17.3% | CNN |
| ViT-Small | 2.2M | 62.2% | 0.24 | 9.6% | Transformer |
| EfficientNet-B0 | 4.1M | 76.6% | 1.91 | 7.1% | CNN |
| WRN-28-10 | 36.5M | 84.0% | 0.07 | 0.3% | WRN-ladder |
| ResNet-18 | 11.2M | 82.0% | 0.00 | 0.2% | CNN |
| ResNet-18 Wide | 44.7M | 83.1% | 0.00 | 0.0% | CNN |
Key finding: Topology's value depends on task difficulty. On CIFAR-100, params dominate (ρ = -0.76, survives Bonferroni) and topology is redundant. On CUB-200 (hard, fine-grained), topology rescues prediction (perm. p = 0.037, suggestive). On RESISC-45, topology does not help. H₀ predicts EWC benefit across datasets.
CUB-200 detail: Params-only ρ = -0.92 (wrong direction). Adding topology: ρ = 0.34, 17.5% MAE reduction. WRN width ladder complete, confirming H₀ monotonicity and direction flip between easy and hard tasks. All 3 datasets complete.
Principal Hypothesis
The persistence of learned knowledge under sequential task training is predictable from the topological features of the loss landscape around learned weight configurations. Tasks that induce deeper topological features — longer-lived persistent homology classes — in the loss landscape are more resistant to catastrophic forgetting during subsequent training.
Background & Gap
Catastrophic forgetting (McCloskey & Cohen, 1989) remains one of the most fundamental unsolved problems in machine learning. Every major mitigation strategy — replay buffers, elastic weight consolidation, progressive networks — manages the symptom rather than addressing the underlying geometric cause.
Separately, Topological Data Analysis (TDA) has emerged as a powerful tool for characterizing loss landscape geometry (Ballester & Araujo, 2020). Persistent homology extracts scale-invariant features — connected components, loops, voids — that survive across multiple scales of analysis.
No published work has connected these two fields. This experiment tests whether the topological depth of learned representations predicts their survivability during continual learning.
Methodology
Phase 1 — Train Task A
Train 19 architectures (14 diverse + 6-point WRN-28-k width ladder) to convergence on Task A across 3 datasets: CIFAR-100, CUB-200-2011, and NWPU-RESISC45. 100 epochs, SGD with cosine annealing. Save best checkpoints.
Phase 2 — Landscape Topology (Ripser + Cubical)
Sample 50x50 loss landscape along filter-normalized random directions (Li et al., 2018). 5 independent random 2D slices per architecture. Compute persistent homology via Ripser (graph-based H₀, H₁) and GUDHI cubical complexes (validation baseline). Compute baseline metrics: Hessian trace, Fisher information, max eigenvalue, loss barrier.
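A minimal numpy stand-in for the graph-based H₀ step (the production run is assumed to use Ripser itself): sublevel-set persistence of the sampled loss grid via union-find, with the elder rule deciding which component survives each merge.

```python
import numpy as np

def h0_persistence(grid):
    """H0 persistence pairs of the sublevel-set filtration of a 2D loss grid.

    Vertices enter at their loss value; edges connect 4-neighbours and enter
    when both endpoints are present (lower-star filtration on the grid graph).
    """
    h, w = grid.shape
    order = np.argsort(grid, axis=None)        # process vertices by loss value
    parent = -np.ones(h * w, dtype=int)        # -1 = not yet born
    birth = np.full(h * w, np.inf)
    pairs = []

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path compression
            a = parent[a]
        return a

    for flat in order:
        r, c = divmod(flat, w)
        parent[flat] = flat
        birth[flat] = grid[r, c]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nb = nr * w + nc
                if parent[nb] == -1:
                    continue                   # neighbour not born yet
                ra, rb = find(flat), find(nb)
                if ra == rb:
                    continue
                if birth[ra] > birth[rb]:      # elder rule: younger dies
                    ra, rb = rb, ra
                pairs.append((birth[rb], grid[r, c]))
                parent[rb] = ra
    return pairs                               # the oldest bar never dies

def total_persistence(pairs):
    return sum(d - b for b, d in pairs)
```

Here `total_persistence` matches the Σ(death − birth) metric listed below; slices are processed independently and summed.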
Phase 3 — Sequential Forgetting (Naive + EWC + Cosine)
Train sequentially on Task B with 3 variants: naive, EWC regularization (Fisher-based penalty), and cosine LR decay. Measure Task A accuracy at steps [10, 25, 50, 100, 250, 500, 1000, 5000]. Compute early AURC, ret@10, ret@100.
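The retention metrics can be computed directly from that measurement schedule; a sketch (assuming accuracies are fractions in [0, 1] and `acc0` is Task-A accuracy before Task B begins):

```python
import numpy as np

def retention_metrics(steps, task_a_acc, acc0, horizon=500):
    """Early AURC and point retention from Task-A accuracy during Task-B training."""
    steps = np.asarray((0,) + tuple(steps), dtype=float)
    acc = np.asarray((acc0,) + tuple(task_a_acc), dtype=float)
    m = steps <= horizon
    s, a = steps[m], acc[m]
    # trapezoidal area, normalized so perfect retention gives AURC = 1.0
    aurc = float(np.sum((a[1:] + a[:-1]) / 2 * np.diff(s)) / (s[-1] * acc0))
    out = {"early_aurc": aurc}
    for q in (10, 100):
        out[f"ret@{q}"] = float(acc[steps == q][0] / acc0)
    return out
```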
Phase 4 — Correlation & Diagnostics
Spearman + Kendall correlation with Bonferroni correction (12 metrics). Partial correlations controlling for parameter count. Slice robustness diagnostics: Kruskal-Wallis, per-slice Spearman, pairwise ordering probability, Cohen's d. Cubical vs Ripser agreement. EWC benefit analysis. WRN width ladder: within-ladder correlations isolating scale from topology.
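One way to compute the partial correlations controlling for parameter count (a rank-then-residualize sketch under the assumption of no tied values; an illustration, not the analysis code itself):

```python
import numpy as np

def _ranks(v):
    return np.argsort(np.argsort(v)).astype(float)   # assumes no ties

def spearman(a, b):
    return float(np.corrcoef(_ranks(a), _ranks(b))[0, 1])

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with the rank-linear effect of the
    confounder z (here: parameter count) regressed out of both."""
    rx, ry, rz = _ranks(x), _ranks(y), _ranks(z)

    def resid(a):
        slope, intercept = np.polyfit(rz, a, 1)
        return a - (slope * rz + intercept)

    return float(np.corrcoef(resid(rx), resid(ry))[0, 1])
```

If a topology metric correlates with retention only through model size, the partial correlation collapses toward zero, which is exactly the confound the WRN width ladder is designed to expose.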
Phase 5 — Predictive Model (LOAO CV)
Leave-one-architecture-out Ridge regression with nested alpha selection. 5 models: A (params only), A2 (params + random noise, matched-dimensionality null), B (params + Ripser topology), C (params + cubical topology), D (topology alone). 1,000-permutation test shuffling topology features to test incremental value. If Model B does not beat A2, topology features are no better than noise.
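A compact sketch of this design, with a closed-form numpy ridge standing in for the scikit-learn pipeline; `topology_permutation_p` is a hypothetical name for the permutation test described above:

```python
import numpy as np

def ridge_predict(Xtr, ytr, Xte, alpha):
    """Closed-form ridge with per-fold standardization (sklearn stand-in)."""
    mu, sd = Xtr.mean(0), Xtr.std(0) + 1e-12
    Xtr, Xte = (Xtr - mu) / sd, (Xte - mu) / sd
    w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1]),
                        Xtr.T @ (ytr - ytr.mean()))
    return Xte @ w + ytr.mean()

def loao_mae(X, y, alphas=(0.01, 0.1, 1.0, 10.0)):
    """Leave-one-architecture-out MAE with nested (inner LOO) alpha selection."""
    n, errs = len(y), []
    for i in range(n):
        tr = np.arange(n) != i
        Xt, yt = X[tr], y[tr]
        inner = lambda a: np.mean([
            abs(ridge_predict(np.delete(Xt, j, 0), np.delete(yt, j),
                              Xt[j:j + 1], a)[0] - yt[j])
            for j in range(n - 1)])
        best = min(alphas, key=inner)
        errs.append(abs(ridge_predict(Xt, yt, X[i:i + 1], best)[0] - y[i]))
    return float(np.mean(errs))

def topology_permutation_p(X_params, X_topo, y, n_perm=1000, seed=0):
    """How often does a row-shuffled copy of the topology features reduce
    LOAO MAE at least as much as the real features do?"""
    rng = np.random.default_rng(seed)
    base = loao_mae(X_params, y)
    gain = base - loao_mae(np.hstack([X_params, X_topo]), y)
    null = [base - loao_mae(np.hstack([X_params, rng.permutation(X_topo)]), y)
            for _ in range(n_perm)]
    return (1 + sum(g >= gain for g in null)) / (n_perm + 1)
```

The matched-dimensionality null (Model A2) falls out of the same machinery: replace `X_topo` with Gaussian noise of the same shape.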
PERSIST Primary Metrics
- H₀, H₁ persistent homology via Ripser (graph-based) and GUDHI (cubical complexes)
- Total persistence: Σ(death_i - birth_i) across 5 independent landscape slices
- Task A retention metrics: early AURC (0-500), ret@10, ret@100
- LOAO cross-validated prediction error (MAE) for 5 regression models
- Permutation test p-value for incremental topology value (1,000 permutations)
PERSIST Secondary Metrics
- Spearman ρ + Kendall τ with Bonferroni correction (12 metrics)
- Partial correlations controlling for parameter count
- WRN width ladder: within-ladder Spearman + partial H₁ | params
- Slice robustness: Kruskal-Wallis, pairwise ordering, Cohen's d
- EWC benefit vs topology correlation (does topology predict regularization response?)
PERSIST Tools & Infrastructure
Training
PyTorch 2.x, CUDA (RTX 4090), Flask dashboard with live monitoring
Topology
Ripser (graph-based PH), GUDHI (cubical complexes), scikit-learn (Ridge, StandardScaler)
Datasets
CIFAR-100, CUB-200-2011 (fine-grained birds), NWPU-RESISC45 (satellite scenes)
PERSIST Expected Outputs
- Correlation analysis: Ripser + cubical PH vs retention across 19 architectures and 3 datasets
- WRN width ladder verdict: does topology carry independent signal beyond model scale?
- LOAO predictive model: does topology improve prediction of forgetting over params alone?
- Cross-domain validation: do topological signatures generalize from natural images to fine-grained and satellite data?
- Publication target: NeurIPS / ICML (Continual Learning track)
PERSIST Risks & Mitigations
Risk: No correlation found between topology and forgetting
Negative result is still publishable ("topological features are insufficient to predict forgetting"). Pivot to information-geometric approaches (Fisher information metric).
Risk: Persistent homology computation intractable for large networks
Use subspace sampling (random 2D slices through weight space). Compute topology on activations rather than weights if needed. Ripser++ scales to millions of simplices.
Risk: Topological regularizer destabilizes Task B training
Anneal λ during training. Use soft constraint (penalty) rather than hard projection onto topological manifold.
PERSIST References
- McCloskey, M. & Cohen, N. J. (1989). Catastrophic interference in connectionist networks. Psychology of Learning and Motivation, 24, 109-165.
- Kirkpatrick, J. et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
- Ballester, R. & Araujo, X. (2020). On the interplay between topological data analysis and deep learning. NeurIPS Workshop on TDA.
- Li, H. et al. (2018). Visualizing the loss landscape of neural nets. NeurIPS.
- Otter, N. et al. (2017). A roadmap for the computation of persistent homology. EPJ Data Science, 6(1), 1-38.
- Tononi, G. & Cirelli, C. (2014). Sleep and the price of plasticity. Neuron, 81(1), 12-34.
- Kumaran, D. et al. (2016). What learning systems do intelligent agents need? Trends in Cognitive Sciences, 20(7), 512-534.
Systematic Survey of Integrated Information in Modern Neural Network Architectures
The first comprehensive measurement of integrated information (Φ) across the major families of deep learning architectures, testing whether Φ correlates with generalization, transferability, and robustness.
Principal Hypothesis
Integrated information (Φ), as formalized by Integrated Information Theory, varies systematically across neural network architectures and correlates with the network's capacity for generalization and transfer learning. Networks with higher Φ process information in a more integrated manner, producing richer internal representations that resist overfitting and transfer more effectively to novel domains.
Background & Gap
Integrated Information Theory (Tononi, 2004; Tononi et al., 2016) proposes Φ as a scalar measure of how much a system is “more than the sum of its parts” — quantifying the degree to which information is integrated across a system rather than reducible to independent modules.
Φ has been computed for very small systems (logic gates, simple recurrent networks of <20 nodes) but never systematically for modern deep learning architectures. This is partly because exact Φ computation is NP-hard (requires finding the minimum information partition), but tractable approximations exist: Φ* (Oizumi et al., 2014), geometric integrated information (Barrett & Seth, 2011), and stochastic interaction (Ay, 2015).
The gap is profound: we have a rigorous mathematical framework for measuring information integration, and an entire field (deep learning) built on architectures that integrate information at massive scale — yet nobody has connected them.
Methodology
Phase 1 — Phi* Implementation for Neural Networks
Implement Φ* computation adapted for neural networks. Partition each network into functional modules: individual layers, attention heads (transformers), feature map groups (CNNs), temporal steps (RNNs). For each partition scheme, compute mutual information between all module pairs using the KSG estimator (Kraskov et al., 2004) on activation vectors from a held-out probe dataset. Find the Minimum Information Partition (MIP) via greedy bipartition search. Φ* = total mutual information minus information across the MIP.
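The greedy MIP search can be sketched as follows, given a precomputed module-pairwise MI matrix; `phi_star` follows this protocol's working definition (total pairwise MI minus MI across the MIP):

```python
import numpy as np

def cross_mi(M, side):
    """Total pairwise MI crossing a bipartition (side: boolean mask)."""
    return float(M[np.ix_(side, ~side)].sum())

def greedy_mip(M):
    """Greedy bipartition search for the Minimum Information Partition of a
    symmetric, zero-diagonal module-pairwise MI matrix M."""
    n = len(M)
    best_side, best_val = None, np.inf
    for seed in range(n):                        # restart from every singleton
        side = np.zeros(n, dtype=bool)
        side[seed] = True
        val = cross_mi(M, side)
        improved = True
        while improved:
            improved = False
            for m in range(n):                   # try flipping each module
                cand = side.copy()
                cand[m] = not cand[m]
                if 0 < cand.sum() < n and cross_mi(M, cand) < val:
                    side, val, improved = cand, cross_mi(M, cand), True
        if val < best_val:
            best_side, best_val = side.copy(), val
    return best_side, best_val

def phi_star(M):
    """Working definition: total pairwise MI minus MI across the MIP."""
    total = float(M[np.triu_indices(len(M), 1)].sum())
    return total - greedy_mip(M)[1]
```

The greedy restarts keep the search O(n²)-ish per restart rather than the exponential exact MIP; for the module counts discussed under Known Challenges this is the tractability compromise.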
Phase 2 — Architecture Survey
Compute Φ* at 5 training checkpoints (random initialization, 25%, 50%, 75%, full convergence) for: Feedforward MLPs (2, 4, 8 layers), Convolutional (ResNet-18, ResNet-50), Recurrent (LSTM, GRU — 2 and 4 layers), Transformer (GPT-2-small, ViT-Small), Graph (GCN, GAT on Cora/CiteSeer). All trained on comparable tasks (CIFAR-10/100 for vision, WikiText for language, Cora for graph). Record Φ* trajectory during training.
Phase 3 — Correlation with Generalization
For each architecture at convergence, measure: test accuracy (generalization gap = train - test), transfer learning performance (fine-tune on CIFAR-100 after CIFAR-10 pretraining, or SST-2 after WikiText), adversarial robustness (PGD attack, ε = 8/255 for vision). Compute Spearman correlation between Φ* and each performance metric across all architectures.
Phase 4 — Perturbational Complexity Index
Independently validate Φ* results using PCI (Casali et al., 2013), adapted from neuroscience. For each trained network: inject calibrated Gaussian noise at a single layer, record the propagation pattern across all subsequent layers, compute Lempel-Ziv complexity of the binarized activation response. Compare PCI ranking with Φ* ranking across architectures. Agreement between two independent measures strengthens the result.
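The Lempel-Ziv step is small enough to show in full; `pci_sketch` is a simplified stand-in for the PCI pipeline (the 2σ binarization threshold and the shuffled-null normalization are assumptions of this sketch, not protocol parameters):

```python
import numpy as np

def lz76(bits):
    """Lempel-Ziv (1976) complexity: number of phrases in an exhaustive
    left-to-right parse of the binary sequence."""
    s = "".join("1" if b else "0" for b in bits)
    i, c = 0, 0
    while i < len(s):
        l = 1
        # grow the phrase while it already occurs in the preceding string
        while i + l <= len(s) and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c, i = c + 1, i + l
    return c

def pci_sketch(response, baseline_std, n_shuffles=200, seed=0):
    """Binarize a (layers x units) perturbation response at 2 sigma of the
    unperturbed baseline, then normalize LZ complexity by a shuffled null."""
    rng = np.random.default_rng(seed)
    binary = (np.abs(response) > 2.0 * baseline_std).ravel()
    null = np.mean([lz76(rng.permutation(binary)) for _ in range(n_shuffles)])
    return lz76(binary) / null
```

Structured propagation (the perturbation echoing through specific layers) compresses well and scores below the shuffled null; diffuse noise-like propagation scores near 1.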
Phase 5 — Phi Dynamics During Training
Analyze the Φ* trajectory. Key questions: Does Φ* increase monotonically during training, or does it peak and decline (overfitting as integration collapse)? Does Φ* correlate with the information bottleneck phase transitions identified by Shwartz-Ziv & Tishby (2017)? Is there a critical Φ* threshold below which transfer learning fails?
PHI Primary Metrics
- Φ* (minimum information partition) at each training checkpoint
- PCI (Lempel-Ziv complexity of perturbation response)
- Generalization gap (train accuracy - test accuracy)
- Transfer learning Δ accuracy (target - baseline)
- Spearman ρ(Φ*, generalization) and ρ(Φ*, transfer)
PHI Secondary Metrics
- Φ* trajectory shape classification (monotonic, peaked, oscillating)
- PCI-Φ* rank correlation (do independent measures agree?)
- Layer-wise Φ contribution (which layers integrate most?)
- Adversarial robustness (PGD success rate)
- Partition sensitivity analysis (how much does module definition matter?)
Known Challenges
Scalability. Exact Φ is NP-hard. Φ* with greedy bipartition is O(n²) in the number of modules. For a 12-layer transformer with 12 attention heads, this is 144 modules — feasible with greedy search but requires careful implementation. Networks with >1000 effective modules require subsampling.
Partition dependence. Φ* values depend on how the network is partitioned into modules. We address this by testing multiple partition schemes (by layer, by head, by feature group) and reporting the range. If rankings are consistent across schemes, the result is robust.
Mutual information estimation. High-dimensional MI estimation is noisy. We use the KSG estimator (k=5 neighbors) with dimensionality reduction (PCA to 64 dimensions per module) on activation vectors from 10,000 probe inputs. Bootstrap confidence intervals on all MI estimates.
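For reference, the KSG estimator itself (algorithm 1 of Kraskov et al., 2004) is compact; this sketch uses SciPy's k-d trees and omits the PCA and bootstrap steps described above:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=5):
    """KSG algorithm-1 estimate of I(X; Y) in nats."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # Chebyshev distance to the k-th nearest neighbour in the joint space
    eps = cKDTree(joint).query(joint, k + 1, p=np.inf)[0][:, -1]
    # count strictly-closer neighbours in each marginal space
    nx = np.array([len(p) - 1 for p in
                   cKDTree(x).query_ball_point(x, eps - 1e-10, p=np.inf)])
    ny = np.array([len(p) - 1 for p in
                   cKDTree(y).query_ball_point(y, eps - 1e-10, p=np.inf)])
    return float(digamma(k) + digamma(n)
                 - np.mean(digamma(nx + 1) + digamma(ny + 1)))
```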
PHI Tools & Infrastructure
Computation
PyTorch, NumPy, SciPy, Weights & Biases, CUDA-capable GPU cluster
Information Theory
KSG estimator (custom), MINE (Belghazi et al., 2018) for validation, pyphi (adapted)
Models
torchvision (ResNet), HuggingFace (GPT-2, ViT), PyG (GCN, GAT)
PHI Expected Outputs
- First Φ* landscape map across modern deep learning architectures — the 'periodic table' of information integration
- Φ trajectory analysis: how integration evolves during training (potential connection to information bottleneck theory)
- PCI-Φ* cross-validation: do two independent measures of integration agree in artificial systems?
- If correlation holds: Φ* as a practical architecture selection and early-stopping metric
- Publication target: Nature Machine Intelligence, ICLR, or Neuroscience of Consciousness (cross-disciplinary)
PHI Risks & Mitigations
Risk: Φ* approximation too noisy to produce meaningful rankings
Use multiple MI estimators (KSG + MINE) and require agreement. Increase probe dataset size. Report confidence intervals on all Φ* values.
Risk: No correlation between Φ* and generalization
Negative result is highly publishable — 'integrated information does not predict generalization' constrains IIT's applicability to artificial systems. Check if correlation exists with different Φ variants (geometric, stochastic interaction).
Risk: Partition dependence makes results non-comparable across architectures
Develop a canonical partition scheme based on computational graph structure. Alternatively, report Φ* under the partition that maximizes it (most charitable interpretation) — if even maximum Φ* doesn't correlate, the result is stronger.
PHI References
- Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5(1), 42.
- Tononi, G. et al. (2016). Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience, 17(7), 450-461.
- Oizumi, M. et al. (2014). From the phenomenology to the mechanisms of consciousness: Integrated Information Theory 3.0. PLoS Computational Biology, 10(5).
- Barrett, A. B. & Seth, A. K. (2011). Practical measures of integrated information for time-series data. PLoS Computational Biology, 7(1).
- Casali, A. G. et al. (2013). A theoretically based index of consciousness independent of sensory processing and behavior. Science Translational Medicine, 5(198).
- Kraskov, A. et al. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
- Belghazi, M. I. et al. (2018). Mutual Information Neural Estimation. ICML.
- Shwartz-Ziv, R. & Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv:1703.00810.
Information Capacity Scaling Laws in Neural Networks: Testing for Holographic Analogs
Testing whether neural network information capacity follows an area law (proportional to boundary parameters) rather than a volume law (proportional to total parameters) — a potential analog of the Bekenstein bound from black hole thermodynamics.
Principal Hypothesis
The maximum information a neural network can encode about its training data follows an area law — proportional to the number of boundary/interface parameters — rather than a volume law proportional to total parameter count. This would constitute a computational analog of the Bekenstein bound, the fundamental limit from black hole physics stating that the maximum entropy of a region is proportional to its surface area, not its volume.
Background & Gap
The Bekenstein bound (1973) establishes that the maximum entropy — and therefore information — containable within a physical region is proportional to its surface area, not its volume. This counter-intuitive result, formalized as the holographic principle by 't Hooft and Susskind, suggests that the universe fundamentally encodes information on boundaries rather than in bulk.
Neural network information capacity is poorly understood. Phenomena like double descent (Nakkiran et al., 2019), lottery tickets (Frankle & Carbin, 2018), and neural scaling laws (Kaplan et al., 2020) all suggest that effective capacity is not simply proportional to parameter count. Something more subtle governs how much information a network can actually encode.
If neural networks obey an area law, it would suggest that information storage in computational systems mirrors information storage in physical systems at a deep structural level — supporting the “it from bit” thesis that computation is not merely a metaphor for physics but shares its fundamental constraints.
Formal Definitions
Volume (V)
Total parameter count of the network. For a network with L layers of width w: V = O(Lw²).
Boundary (A) — Definition 1: Input/Output Interface
Parameters that directly interact with input or output: first layer weights + last layer weights. A = O(w · d_in + w · d_out), where d_in and d_out are input/output dimensionalities.
Boundary (A) — Definition 2: Cross-Layer Interface
Parameters participating in inter-layer connections. For a fully-connected network: A = O((L-1) · w²), which equals the volume minus intra-layer biases. For this to be interesting, we need architectures where A ≠ V — networks with substantial intra-layer computation (wide residual blocks, attention within layers).
Information Capacity (C)
Maximum number of random labels the network can memorize to 100% training accuracy (Zhang et al., 2017 methodology). Measured in bits: memorizing N uniformly random labels drawn from K classes stores C = N · log₂(K) bits.
Methodology
Phase 1 — Capacity Measurement Protocol
For each architecture configuration, generate datasets with random labels (uniform random assignment of K classes to N samples from CIFAR-10 or synthetic Gaussian data). Binary search for the maximum N where the network reaches 100% training accuracy within a fixed compute budget (50 epochs, SGD with momentum). This N is the effective memorization capacity. Repeat 5 times with different random seeds, report median.
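The search reduces to bracket-doubling plus bisection around a (hypothetical) `fits(N)` predicate that wraps one full training run and is assumed monotone in N:

```python
def max_memorizable(fits, lo=1_000, hi=16_000):
    """Largest N with fits(N) True, assuming fits is monotone non-increasing.

    `fits` stands in for a full training run: sample N random-label examples,
    train for the fixed 50-epoch budget, return train_acc == 1.0.
    """
    if not fits(lo):
        return 0
    while fits(hi):                  # grow the bracket until memorization fails
        lo, hi = hi, hi * 2
    while hi - lo > 1:               # then bisect inside [lo, hi)
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe is a full training run, so the doubling phase matters: it keeps the number of runs logarithmic in the final capacity rather than requiring a good a-priori upper bound.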
Phase 2 — Systematic Architecture Sweep
Measure capacity for 15+ architecture configurations spanning different depth/width ratios. MLPs: [2×512, 4×256, 8×128, 16×64, 32×32] (constant volume ~260K params, varying depth). Wide ResNets: WRN-d-k for d ∈ {16, 22, 28, 40} and k ∈ {1, 2, 4, 8}. Transformers: {2, 4, 8, 12} layers × {64, 128, 256} model dim. For each, compute V (volume) and A (boundary under both definitions).
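Computing V and A (Definition 1) for the MLP sweep is mechanical; a sketch with illustrative default dimensions (CIFAR-style 3072-dim input, 10 classes; `mlp_volume_area` is an assumed helper name):

```python
def mlp_volume_area(hidden, d_in=3072, d_out=10):
    """Volume V (all parameters) and boundary A under Definition 1
    (first-layer + last-layer weights) for an MLP with given hidden widths."""
    dims = [d_in] + list(hidden) + [d_out]
    V = sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))   # weights + biases
    A = d_in * dims[1] + dims[-2] * d_out                     # interface weights
    return V, A
```

Note that under Definition 1 the depth sweep changes V while leaving A fixed whenever the first and last hidden widths are held constant, which is what makes the constant-width depth sweep the decisive test.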
Phase 3 — Scaling Law Extraction
Plot C vs V and C vs A on log-log axes. Fit power laws: C ~ V^α and C ~ A^β. If β ≈ 1.0 and α < 1.0, the area law holds — capacity scales with boundary, not volume. Compute R² for both fits. Use Bayesian model comparison (BIC) to determine which scaling relationship is statistically preferred. Critical test: vary depth at constant width (changes V but not A under Definition 1). If capacity stays constant, area law is strongly supported.
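The log-log fit and BIC comparison fit in one helper (a sketch; `power_law_fit` is an assumed name, not existing analysis code):

```python
import numpy as np

def power_law_fit(x, y):
    """Least-squares fit of y ~ a * x^k in log-log space.

    Returns (k, r2, bic); the lower BIC wins the area-law vs volume-law race.
    """
    lx, ly = np.log(x), np.log(y)
    k, log_a = np.polyfit(lx, ly, 1)
    resid = ly - (k * lx + log_a)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((ly - ly.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(lx), 2                    # two fitted parameters: k, log a
    bic = n * np.log(ss_res / n) + p * np.log(n)
    return float(k), r2, float(bic)
```

Run once with x = V and once with x = A; since both models have the same parameter count, the BIC comparison reduces to comparing residual fit quality.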
Phase 4 — Skip Connections as 'Wormholes'
Skip connections create direct information pathways between non-adjacent layers — topological shortcuts analogous to wormholes in spacetime. Compare capacity of ResNets (with skip connections) vs equivalent-depth plain networks (without). Under the holographic analogy, skip connections should increase the effective boundary area, predicting higher capacity. If confirmed, skip connections are computational wormholes that expand the information boundary.
Phase 5 — Attention as Non-Local Information Coupling
Self-attention allows every position to directly access every other position — effectively making the entire network a 'boundary.' Prediction: transformers should deviate from the area law (or equivalently, their effective boundary ≈ volume due to attention). If transformers obey a volume law while non-attention architectures obey an area law, attention is the mechanism that breaks the holographic constraint. This has implications for why transformers scale so well.
GENESIS Primary Metrics
- Maximum memorizable dataset size (bits) per architecture
- Volume scaling exponent α in C ~ V^α
- Area scaling exponent β in C ~ A^β
- R² and BIC comparison: area law vs volume law fit
- Constant-width depth sweep: capacity vs depth curve
GENESIS Secondary Metrics
- Mutual information I(W; D_train) at convergence (via MINE)
- Skip connection capacity delta (ResNet vs plain net)
- Transformer vs non-transformer scaling exponent comparison
- Effective boundary expansion from attention (measured vs predicted)
- Double descent location relative to boundary capacity
GENESIS Tools & Infrastructure
Training
PyTorch, Weights & Biases, distributed training (multiple GPU for transformer sweeps)
Analysis
SciPy (curve fitting, BIC), NumPy, MINE estimator, matplotlib/seaborn (scaling plots)
Data
CIFAR-10 (random label memorization), synthetic Gaussian blobs (controlled complexity)
GENESIS Expected Outputs
- Scaling law characterization: area law vs volume law for 15+ architecture configurations
- If area law: first evidence of holographic principle analogs in computational systems — bridging deep learning theory and theoretical physics
- Skip connection analysis: empirical test of 'computational wormhole' hypothesis
- Transformer exceptionalism: why attention-based architectures may break the area law (explaining their empirical superiority)
- Publication target: Nature Physics, Physical Review Letters, or ICML (if framed computationally). Cross-listing on arXiv: cs.LG + hep-th
GENESIS Risks & Mitigations
Risk: Both area and volume law fit equally well (no clear winner)
The constant-width depth sweep is the decisive test. If capacity increases with depth at constant width, volume law wins. If capacity saturates, area law wins. This test has high statistical power because it isolates the variable.
Risk: Memorization capacity is a poor proxy for information capacity
Supplement with mutual information measurement I(W; D_train) using MINE. If MI-based capacity and memorization-based capacity give the same scaling exponent, the proxy is validated.
Risk: The analogy to Bekenstein is superficial — neural networks aren't physical systems
The claim is not that neural networks are literally bounded by the Bekenstein bound. The claim is that information storage in computational systems may be subject to analogous area-law constraints, suggesting shared mathematical structure. Frame as 'computational holographic principle' not 'Bekenstein bound for neural networks.'
Risk: Results are optimizer-dependent (SGD vs Adam may give different capacity)
Run full sweep with both SGD+momentum and Adam. If scaling exponents differ, report both — optimizer dependence is itself an interesting finding.
GENESIS References
- Bekenstein, J. D. (1973). Black holes and entropy. Physical Review D, 7(8), 2333.
- 't Hooft, G. (1993). Dimensional reduction in quantum gravity. arXiv:gr-qc/9310026.
- Susskind, L. (1995). The world as a hologram. Journal of Mathematical Physics, 36(11), 6377-6396.
- Zhang, C. et al. (2017). Understanding deep learning requires rethinking generalization. ICLR.
- Nakkiran, P. et al. (2019). Deep double descent: where bigger models and more data can hurt. ICLR.
- Frankle, J. & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. ICLR.
- Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
- Wheeler, J. A. (1990). Information, physics, quantum: the search for links. Complexity, Entropy, and the Physics of Information.
- Wolfram, S. (2002). A New Kind of Science. Wolfram Media.
Cross-Experiment Connections
These three experiments are not independent. Results from each directly inform and constrain the others.
EXP-01 → EXP-02
If topological depth predicts forgetting resistance, does Φ also predict it? Networks with higher integrated information may naturally create deeper topological features because integration requires complex, multi-scale structure in the loss landscape.
EXP-02 → EXP-03
If Φ correlates with generalization, and information capacity follows an area law, then Φ may be the mechanism that determines how efficiently a network uses its boundary parameters. High Φ = better boundary utilization.
EXP-03 → EXP-01
If capacity is boundary-limited, catastrophic forgetting may occur when new task information competes for limited boundary capacity. Topological protection may work by ensuring old knowledge is encoded in “interior” parameters that new learning cannot overwrite.
Together, these experiments test a unified thesis: that the geometry of knowledge, the integration of information, and the fundamental limits of computational capacity are manifestations of the same underlying mathematical structure — one shared by both physical and computational systems.
Execution Workflow
How experiments progress from hypothesis to publication. All experiments are tracked via ClearML (self-hosted, open source) for full reproducibility.
Configure
Define hypothesis, architecture, hyperparameters, and benchmarks in versioned YAML config. All experimental parameters are declarative — nothing hardcoded.
Train Baseline
Train target architecture to convergence on Task A. Checkpoints saved at intervals for downstream analysis. Loss curves, accuracy, and learning rate tracked in real time.
Measure
Run experiment-specific measurements: loss landscape sampling + persistent homology (EXP-01), Phi* computation across partitions (EXP-02), or memorization capacity binary search (EXP-03). Results logged automatically.
Perturb & Observe
Apply the experimental intervention: sequential task training (EXP-01), architecture survey across families (EXP-02), or depth/width sweep at controlled ratios (EXP-03). Measure target variables at defined intervals.
Correlate
Statistical analysis: Spearman rank correlation, Bayesian model comparison (BIC), power-law fitting. Determine whether the hypothesis is supported, refuted, or inconclusive.
Iterate or Publish
Positive result: extend to additional architectures, write paper. Negative result: analyze why, pivot methodology, document findings. All results — positive or negative — are publishable.
Compute
- Local GPU cluster (NVIDIA RTX, CUDA)
- PyTorch 2.x with mixed precision
- Distributed training for architecture sweeps
Tracking
- ClearML (self-hosted, Apache 2.0)
- Full experiment versioning and comparison
- Automated artifact and model storage
Reproducibility
- Deterministic seeding across all runs
- Version-controlled configs (YAML)
- Full dependency pinning (pyproject.toml)
