EXPERIMENTAL PROTOCOLS
Priority Experiments
Detailed scope, methodology, and results for three priority experiments. EXP-01 (PERSIST) has completed preliminary proof-of-concept across 57 small-scale configurations. Phase I scale validation to production-size models (100M-7B+ parameters) is planned, requiring supercomputer resources.
Topological Signatures of Knowledge Persistence in Continual Learning Systems
Investigating whether the topological structure of a neural network's loss landscape predicts its resistance to catastrophic forgetting during sequential task training.
Preliminary Results (19 Architectures, 3 Datasets, Small-Scale Proof-of-Concept)
Preliminary proof-of-concept (all models under 45M params): Cross-dataset analysis (57/57 configs, 3 datasets) reveals topology is a conditional predictor at small scale. On CIFAR-100 (easy), params dominate (ρ = -0.76). On CUB-200, topology rescues prediction (perm. p = 0.037, suggestive). On RESISC-45, topology does not help (p = 0.566). Most stable signal: H₀ predicts EWC benefit (CIFAR-100 ρ = 0.76, RESISC-45 ρ = 0.86). Whether these patterns survive at production scale (100M-7B+ params) is the critical open research question for Phase I, requiring supercomputer resources.
| Architecture | Params | Acc. | H₁ | Ret@100 | Type |
|---|---|---|---|---|---|
| ViT-Tiny | 0.3M | 52.7% | 0.01 | 22.5% | Transformer |
| ShuffleNet-V2 | 1.3M | 76.8% | 0.79 | 17.3% | CNN |
| ViT-Small | 2.2M | 62.2% | 0.24 | 9.6% | Transformer |
| EfficientNet-B0 | 4.1M | 76.6% | 1.91 | 7.1% | CNN |
| WRN-28-10 | 36.5M | 84.0% | 0.07 | 0.3% | WRN-ladder |
| ResNet-18 | 11.2M | 82.0% | 0.00 | 0.2% | CNN |
| ResNet-18 Wide | 44.7M | 83.1% | 0.00 | 0.0% | CNN |
Key finding: Topology's value depends on task difficulty. On CIFAR-100, params dominate (ρ = -0.76, survives Bonferroni) and topology is redundant. On CUB-200 (hard, fine-grained), topology rescues prediction (perm. p = 0.037, suggestive). On RESISC-45, topology does not help. H₀ predicts EWC benefit across datasets.
CUB-200 detail: Params-only ρ = -0.92 (wrong direction). Adding topology: ρ = 0.34, 17.5% MAE reduction. WRN width ladder complete, confirming H₀ monotonicity and direction flip between easy and hard tasks. All 3 datasets complete.
Principal Hypothesis
The persistence of learned knowledge under sequential task training is predictable from the topological features of the loss landscape around learned weight configurations. Tasks that induce deeper topological features — longer-lived persistent homology classes — in the loss landscape are more resistant to catastrophic forgetting during subsequent training.
Background & Gap
Catastrophic forgetting (McCloskey & Cohen, 1989) remains one of the most fundamental unsolved problems in machine learning. Every major mitigation strategy — replay buffers, elastic weight consolidation, progressive networks — manages the symptom rather than addressing the underlying geometric cause.
Separately, Topological Data Analysis (TDA) has emerged as a powerful tool for characterizing loss landscape geometry (Ballester & Araujo, 2020). Persistent homology extracts scale-invariant features — connected components, loops, voids — that survive across multiple scales of analysis.
No published work has connected these two fields. This experiment tests whether the topological depth of learned representations predicts their survivability during continual learning.
Methodology
Phase 1 — Train Task A
Train 19 architectures (14 diverse + 6-point WRN-28-k width ladder) to convergence on Task A across 3 datasets: CIFAR-100, CUB-200-2011, and NWPU-RESISC45. 100 epochs, SGD with cosine annealing. Save best checkpoints.
Phase 2 — Landscape Topology (Ripser + Cubical)
Sample 50x50 loss landscape along filter-normalized random directions (Li et al., 2018). 5 independent random 2D slices per architecture. Compute persistent homology via Ripser (graph-based H₀, H₁) and GUDHI cubical complexes (validation baseline). Compute baseline metrics: Hessian trace, Fisher information, max eigenvalue, loss barrier.
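A minimal numpy stand-in for the graph-based H₀ step (the production run is assumed to use Ripser itself): sublevel-set persistence of the sampled loss grid via union-find, with the elder rule deciding which component survives each merge.

```python
import numpy as np

def h0_persistence(grid):
    """H0 persistence pairs of the sublevel-set filtration of a 2D loss grid.

    Vertices enter at their loss value; edges connect 4-neighbours and enter
    when both endpoints are present (lower-star filtration on the grid graph).
    """
    h, w = grid.shape
    order = np.argsort(grid, axis=None)        # process vertices by loss value
    parent = -np.ones(h * w, dtype=int)        # -1 = not yet born
    birth = np.full(h * w, np.inf)
    pairs = []

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path compression
            a = parent[a]
        return a

    for flat in order:
        r, c = divmod(flat, w)
        parent[flat] = flat
        birth[flat] = grid[r, c]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nb = nr * w + nc
                if parent[nb] == -1:
                    continue                   # neighbour not born yet
                ra, rb = find(flat), find(nb)
                if ra == rb:
                    continue
                if birth[ra] > birth[rb]:      # elder rule: younger dies
                    ra, rb = rb, ra
                pairs.append((birth[rb], grid[r, c]))
                parent[rb] = ra
    return pairs                               # the oldest bar never dies

def total_persistence(pairs):
    return sum(d - b for b, d in pairs)
```

Here `total_persistence` matches the Σ(death − birth) metric listed below; slices are processed independently and summed.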
Phase 3 — Sequential Forgetting (Naive + EWC + Cosine)
Train sequentially on Task B with 3 variants: naive, EWC regularization (Fisher-based penalty), and cosine LR decay. Measure Task A accuracy at steps [10, 25, 50, 100, 250, 500, 1000, 5000]. Compute early AURC, ret@10, ret@100.
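The retention metrics can be computed directly from that measurement schedule; a sketch (assuming accuracies are fractions in [0, 1] and `acc0` is Task-A accuracy before Task B begins):

```python
import numpy as np

def retention_metrics(steps, task_a_acc, acc0, horizon=500):
    """Early AURC and point retention from Task-A accuracy during Task-B training."""
    steps = np.asarray((0,) + tuple(steps), dtype=float)
    acc = np.asarray((acc0,) + tuple(task_a_acc), dtype=float)
    m = steps <= horizon
    s, a = steps[m], acc[m]
    # trapezoidal area, normalized so perfect retention gives AURC = 1.0
    aurc = float(np.sum((a[1:] + a[:-1]) / 2 * np.diff(s)) / (s[-1] * acc0))
    out = {"early_aurc": aurc}
    for q in (10, 100):
        out[f"ret@{q}"] = float(acc[steps == q][0] / acc0)
    return out
```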
Phase 4 — Correlation & Diagnostics
Spearman + Kendall correlation with Bonferroni correction (12 metrics). Partial correlations controlling for parameter count. Slice robustness diagnostics: Kruskal-Wallis, per-slice Spearman, pairwise ordering probability, Cohen's d. Cubical vs Ripser agreement. EWC benefit analysis. WRN width ladder: within-ladder correlations isolating scale from topology.
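One way to compute the partial correlations controlling for parameter count (a rank-then-residualize sketch under the assumption of no tied values; an illustration, not the analysis code itself):

```python
import numpy as np

def _ranks(v):
    return np.argsort(np.argsort(v)).astype(float)   # assumes no ties

def spearman(a, b):
    return float(np.corrcoef(_ranks(a), _ranks(b))[0, 1])

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with the rank-linear effect of the
    confounder z (here: parameter count) regressed out of both."""
    rx, ry, rz = _ranks(x), _ranks(y), _ranks(z)

    def resid(a):
        slope, intercept = np.polyfit(rz, a, 1)
        return a - (slope * rz + intercept)

    return float(np.corrcoef(resid(rx), resid(ry))[0, 1])
```

If a topology metric correlates with retention only through model size, the partial correlation collapses toward zero, which is exactly the confound the WRN width ladder is designed to expose.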
Phase 5 — Predictive Model (LOAO CV)
Leave-one-architecture-out Ridge regression with nested alpha selection. 5 models: A (params only), A2 (params + random noise, matched-dimensionality null), B (params + Ripser topology), C (params + cubical topology), D (topology alone). 1,000-permutation test shuffling topology features to test incremental value. If Model B does not beat A2, topology features are no better than noise.
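A compact sketch of this design, with a closed-form numpy ridge standing in for the scikit-learn pipeline; `topology_permutation_p` is a hypothetical name for the permutation test described above:

```python
import numpy as np

def ridge_predict(Xtr, ytr, Xte, alpha):
    """Closed-form ridge with per-fold standardization (sklearn stand-in)."""
    mu, sd = Xtr.mean(0), Xtr.std(0) + 1e-12
    Xtr, Xte = (Xtr - mu) / sd, (Xte - mu) / sd
    w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1]),
                        Xtr.T @ (ytr - ytr.mean()))
    return Xte @ w + ytr.mean()

def loao_mae(X, y, alphas=(0.01, 0.1, 1.0, 10.0)):
    """Leave-one-architecture-out MAE with nested (inner LOO) alpha selection."""
    n, errs = len(y), []
    for i in range(n):
        tr = np.arange(n) != i
        Xt, yt = X[tr], y[tr]
        inner = lambda a: np.mean([
            abs(ridge_predict(np.delete(Xt, j, 0), np.delete(yt, j),
                              Xt[j:j + 1], a)[0] - yt[j])
            for j in range(n - 1)])
        best = min(alphas, key=inner)
        errs.append(abs(ridge_predict(Xt, yt, X[i:i + 1], best)[0] - y[i]))
    return float(np.mean(errs))

def topology_permutation_p(X_params, X_topo, y, n_perm=1000, seed=0):
    """How often does a row-shuffled copy of the topology features reduce
    LOAO MAE at least as much as the real features do?"""
    rng = np.random.default_rng(seed)
    base = loao_mae(X_params, y)
    gain = base - loao_mae(np.hstack([X_params, X_topo]), y)
    null = [base - loao_mae(np.hstack([X_params, rng.permutation(X_topo)]), y)
            for _ in range(n_perm)]
    return (1 + sum(g >= gain for g in null)) / (n_perm + 1)
```

The matched-dimensionality null (Model A2) falls out of the same machinery: replace `X_topo` with Gaussian noise of the same shape.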
PERSIST Primary Metrics
- H₀, H₁ persistent homology via Ripser (graph-based) and GUDHI (cubical complexes)
- Total persistence: Σ(death_i - birth_i) across 5 independent landscape slices
- Task A retention metrics: early AURC (0-500), ret@10, ret@100
- LOAO cross-validated prediction error (MAE) for 5 regression models
- Permutation test p-value for incremental topology value (1,000 permutations)
PERSIST Secondary Metrics
- Spearman ρ + Kendall τ with Bonferroni correction (12 metrics)
- Partial correlations controlling for parameter count
- WRN width ladder: within-ladder Spearman + partial H₁ | params
- Slice robustness: Kruskal-Wallis, pairwise ordering, Cohen's d
- EWC benefit vs topology correlation (does topology predict regularization response?)
PERSIST Tools & Infrastructure
Training
PyTorch 2.x, CUDA (RTX 4090), Flask dashboard with live monitoring
Topology
Ripser (graph-based PH), GUDHI (cubical complexes), scikit-learn (Ridge, StandardScaler)
Datasets
CIFAR-100, CUB-200-2011 (fine-grained birds), NWPU-RESISC45 (satellite scenes)
PERSIST Expected Outputs
- Correlation analysis: Ripser + cubical PH vs retention across 19 architectures and 3 datasets
- WRN width ladder verdict: does topology carry independent signal beyond model scale?
- LOAO predictive model: does topology improve prediction of forgetting over params alone?
- Cross-domain validation: do topological signatures generalize from natural images to fine-grained and satellite data?
- Publication target: NeurIPS / ICML (Continual Learning track)
PERSIST Risks & Mitigations
Risk: No correlation found between topology and forgetting
Negative result is still publishable ("topological features are insufficient to predict forgetting"). Pivot to information-geometric approaches (Fisher information metric).
Risk: Persistent homology computation intractable for large networks
Use subspace sampling (random 2D slices through weight space). Compute topology on activations rather than weights if needed. Ripser++ scales to millions of simplices.
Risk: Topological regularizer destabilizes Task B training
Anneal λ during training. Use soft constraint (penalty) rather than hard projection onto topological manifold.
PERSIST References
- McCloskey, M. & Cohen, N. J. (1989). Catastrophic interference in connectionist networks. Psychology of Learning and Motivation, 24, 109-165.
- Kirkpatrick, J. et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13), 3521-3526.
- Ballester, R. & Araujo, X. (2020). On the interplay between topological data analysis and deep learning. NeurIPS Workshop on TDA.
- Li, H. et al. (2018). Visualizing the loss landscape of neural nets. NeurIPS.
- Otter, N. et al. (2017). A roadmap for the computation of persistent homology. EPJ Data Science, 6(1), 1-38.
- Tononi, G. & Cirelli, C. (2014). Sleep and the price of plasticity. Neuron, 81(1), 12-34.
- Kumaran, D. et al. (2016). What learning systems do intelligent agents need? Trends in Cognitive Sciences, 20(7), 512-534.
Systematic Survey of Integrated Information in Modern Neural Network Architectures
The first comprehensive measurement of integrated information (Φ) across the major families of deep learning architectures, testing whether Φ correlates with generalization, transferability, and robustness.
Principal Hypothesis
Integrated information (Φ), as formalized by Integrated Information Theory, varies systematically across neural network architectures and correlates with the network's capacity for generalization and transfer learning. Networks with higher Φ process information in a more integrated manner, producing richer internal representations that resist overfitting and transfer more effectively to novel domains.
Background & Gap
Integrated Information Theory (Tononi, 2004; Tononi et al., 2016) proposes Φ as a scalar measure of how much a system is “more than the sum of its parts” — quantifying the degree to which information is integrated across a system rather than reducible to independent modules.
Φ has been computed for very small systems (logic gates, simple recurrent networks of <20 nodes) but never systematically for modern deep learning architectures. This is partly because exact Φ computation is NP-hard (requires finding the minimum information partition), but tractable approximations exist: Φ* (Oizumi et al., 2014), geometric integrated information (Barrett & Seth, 2011), and stochastic interaction (Ay, 2015).
The gap is profound: we have a rigorous mathematical framework for measuring information integration, and an entire field (deep learning) built on architectures that integrate information at massive scale — yet nobody has connected them.
Methodology
Phase 1 — Phi* Implementation for Neural Networks
Implement Φ* computation adapted for neural networks. Partition each network into functional modules: individual layers, attention heads (transformers), feature map groups (CNNs), temporal steps (RNNs). For each partition scheme, compute mutual information between all module pairs using the KSG estimator (Kraskov et al., 2004) on activation vectors from a held-out probe dataset. Find the Minimum Information Partition (MIP) via greedy bipartition search. Φ* = total mutual information minus information across the MIP.
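The greedy MIP search can be sketched as follows, given a precomputed module-pairwise MI matrix; `phi_star` follows this protocol's working definition (total pairwise MI minus MI across the MIP):

```python
import numpy as np

def cross_mi(M, side):
    """Total pairwise MI crossing a bipartition (side: boolean mask)."""
    return float(M[np.ix_(side, ~side)].sum())

def greedy_mip(M):
    """Greedy bipartition search for the Minimum Information Partition of a
    symmetric, zero-diagonal module-pairwise MI matrix M."""
    n = len(M)
    best_side, best_val = None, np.inf
    for seed in range(n):                        # restart from every singleton
        side = np.zeros(n, dtype=bool)
        side[seed] = True
        val = cross_mi(M, side)
        improved = True
        while improved:
            improved = False
            for m in range(n):                   # try flipping each module
                cand = side.copy()
                cand[m] = not cand[m]
                if 0 < cand.sum() < n and cross_mi(M, cand) < val:
                    side, val, improved = cand, cross_mi(M, cand), True
        if val < best_val:
            best_side, best_val = side.copy(), val
    return best_side, best_val

def phi_star(M):
    """Working definition: total pairwise MI minus MI across the MIP."""
    total = float(M[np.triu_indices(len(M), 1)].sum())
    return total - greedy_mip(M)[1]
```

The greedy restarts keep the search O(n²)-ish per restart rather than the exponential exact MIP; for the module counts discussed under Known Challenges this is the tractability compromise.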
Phase 2 — Architecture Survey
Compute Φ* at 5 training checkpoints (random initialization, 25%, 50%, 75%, full convergence) for: Feedforward MLPs (2, 4, 8 layers), Convolutional (ResNet-18, ResNet-50), Recurrent (LSTM, GRU — 2 and 4 layers), Transformer (GPT-2-small, ViT-Small), Graph (GCN, GAT on Cora/CiteSeer). All trained on comparable tasks (CIFAR-10/100 for vision, WikiText for language, Cora for graph). Record Φ* trajectory during training.
Phase 3 — Correlation with Generalization
For each architecture at convergence, measure: test accuracy (generalization gap = train - test), transfer learning performance (fine-tune on CIFAR-100 after CIFAR-10 pretraining, or SST-2 after WikiText), adversarial robustness (PGD attack, ε = 8/255 for vision). Compute Spearman correlation between Φ* and each performance metric across all architectures.
Phase 4 — Perturbational Complexity Index
Independently validate Φ* results using PCI (Casali et al., 2013), adapted from neuroscience. For each trained network: inject calibrated Gaussian noise at a single layer, record the propagation pattern across all subsequent layers, compute Lempel-Ziv complexity of the binarized activation response. Compare PCI ranking with Φ* ranking across architectures. Agreement between two independent measures strengthens the result.
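The Lempel-Ziv step is small enough to show in full; `pci_sketch` is a simplified stand-in for the PCI pipeline (the 2σ binarization threshold and the shuffled-null normalization are assumptions of this sketch, not protocol parameters):

```python
import numpy as np

def lz76(bits):
    """Lempel-Ziv (1976) complexity: number of phrases in an exhaustive
    left-to-right parse of the binary sequence."""
    s = "".join("1" if b else "0" for b in bits)
    i, c = 0, 0
    while i < len(s):
        l = 1
        # grow the phrase while it already occurs in the preceding string
        while i + l <= len(s) and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c, i = c + 1, i + l
    return c

def pci_sketch(response, baseline_std, n_shuffles=200, seed=0):
    """Binarize a (layers x units) perturbation response at 2 sigma of the
    unperturbed baseline, then normalize LZ complexity by a shuffled null."""
    rng = np.random.default_rng(seed)
    binary = (np.abs(response) > 2.0 * baseline_std).ravel()
    null = np.mean([lz76(rng.permutation(binary)) for _ in range(n_shuffles)])
    return lz76(binary) / null
```

Structured propagation (the perturbation echoing through specific layers) compresses well and scores below the shuffled null; diffuse noise-like propagation scores near 1.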
Phase 5 — Phi Dynamics During Training
Analyze the Φ* trajectory. Key questions: Does Φ* increase monotonically during training, or does it peak and decline (overfitting as integration collapse)? Does Φ* correlate with the information bottleneck phase transitions identified by Shwartz-Ziv & Tishby (2017)? Is there a critical Φ* threshold below which transfer learning fails?
PHI Primary Metrics
- Φ* (minimum information partition) at each training checkpoint
- PCI (Lempel-Ziv complexity of perturbation response)
- Generalization gap (train accuracy - test accuracy)
- Transfer learning Δ accuracy (target - baseline)
- Spearman ρ(Φ*, generalization) and ρ(Φ*, transfer)
PHI Secondary Metrics
- Φ* trajectory shape classification (monotonic, peaked, oscillating)
- PCI-Φ* rank correlation (do independent measures agree?)
- Layer-wise Φ contribution (which layers integrate most?)
- Adversarial robustness (PGD success rate)
- Partition sensitivity analysis (how much does module definition matter?)
Known Challenges
Scalability. Exact Φ is NP-hard. Φ* with greedy bipartition is O(n²) in the number of modules. For a 12-layer transformer with 12 attention heads, this is 144 modules — feasible with greedy search but requires careful implementation. Networks with >1000 effective modules require subsampling.
Partition dependence. Φ* values depend on how the network is partitioned into modules. We address this by testing multiple partition schemes (by layer, by head, by feature group) and reporting the range. If rankings are consistent across schemes, the result is robust.
Mutual information estimation. High-dimensional MI estimation is noisy. We use the KSG estimator (k=5 neighbors) with dimensionality reduction (PCA to 64 dimensions per module) on activation vectors from 10,000 probe inputs. Bootstrap confidence intervals on all MI estimates.
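For reference, the KSG estimator itself (algorithm 1 of Kraskov et al., 2004) is compact; this sketch uses SciPy's k-d trees and omits the PCA and bootstrap steps described above:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=5):
    """KSG algorithm-1 estimate of I(X; Y) in nats."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # Chebyshev distance to the k-th nearest neighbour in the joint space
    eps = cKDTree(joint).query(joint, k + 1, p=np.inf)[0][:, -1]
    # count strictly-closer neighbours in each marginal space
    nx = np.array([len(p) - 1 for p in
                   cKDTree(x).query_ball_point(x, eps - 1e-10, p=np.inf)])
    ny = np.array([len(p) - 1 for p in
                   cKDTree(y).query_ball_point(y, eps - 1e-10, p=np.inf)])
    return float(digamma(k) + digamma(n)
                 - np.mean(digamma(nx + 1) + digamma(ny + 1)))
```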
PHI Tools & Infrastructure
Computation
PyTorch, NumPy, SciPy, Weights & Biases, CUDA-capable GPU cluster
Information Theory
KSG estimator (custom), MINE (Belghazi et al., 2018) for validation, pyphi (adapted)
Models
torchvision (ResNet), HuggingFace (GPT-2, ViT), PyG (GCN, GAT)
PHI Expected Outputs
- First Φ* landscape map across modern deep learning architectures — the 'periodic table' of information integration
- Φ trajectory analysis: how integration evolves during training (potential connection to information bottleneck theory)
- PCI-Φ* cross-validation: do two independent measures of integration agree in artificial systems?
- If correlation holds: Φ* as a practical architecture selection and early-stopping metric
- Publication target: Nature Machine Intelligence, ICLR, or Neuroscience of Consciousness (cross-disciplinary)
PHI Risks & Mitigations
Risk: Φ* approximation too noisy to produce meaningful rankings
Use multiple MI estimators (KSG + MINE) and require agreement. Increase probe dataset size. Report confidence intervals on all Φ* values.
Risk: No correlation between Φ* and generalization
Negative result is highly publishable — 'integrated information does not predict generalization' constrains IIT's applicability to artificial systems. Check if correlation exists with different Φ variants (geometric, stochastic interaction).
Risk: Partition dependence makes results non-comparable across architectures
Develop a canonical partition scheme based on computational graph structure. Alternatively, report Φ* under the partition that maximizes it (most charitable interpretation) — if even maximum Φ* doesn't correlate, the result is stronger.
PHI References
- Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5(1), 42.
- Tononi, G. et al. (2016). Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience, 17(7), 450-461.
- Oizumi, M. et al. (2014). From the phenomenology to the mechanisms of consciousness: Integrated Information Theory 3.0. PLoS Computational Biology, 10(5).
- Barrett, A. B. & Seth, A. K. (2011). Practical measures of integrated information for time-series data. PLoS Computational Biology, 7(1).
- Casali, A. G. et al. (2013). A theoretically based index of consciousness independent of sensory processing and behavior. Science Translational Medicine, 5(198).
- Kraskov, A. et al. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
- Belghazi, M. I. et al. (2018). Mutual Information Neural Estimation. ICML.
- Shwartz-Ziv, R. & Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv:1703.00810.
Information Capacity Scaling Laws in Neural Networks: Testing for Holographic Analogs
Testing whether neural network information capacity follows an area law (proportional to boundary parameters) rather than a volume law (proportional to total parameters) — a potential analog of the Bekenstein bound from black hole thermodynamics.
Principal Hypothesis
The maximum information a neural network can encode about its training data follows an area law — proportional to the number of boundary/interface parameters — rather than a volume law proportional to total parameter count. This would constitute a computational analog of the Bekenstein bound, the fundamental limit from black hole physics stating that the maximum entropy of a region is proportional to its surface area, not its volume.
Background & Gap
The Bekenstein bound (1973) establishes that the maximum entropy — and therefore information — containable within a physical region is proportional to its surface area, not its volume. This counter-intuitive result, formalized as the holographic principle by 't Hooft and Susskind, suggests that the universe fundamentally encodes information on boundaries rather than in bulk.
Neural network information capacity is poorly understood. Phenomena like double descent (Nakkiran et al., 2019), lottery tickets (Frankle & Carbin, 2018), and neural scaling laws (Kaplan et al., 2020) all suggest that effective capacity is not simply proportional to parameter count. Something more subtle governs how much information a network can actually encode.
If neural networks obey an area law, it would suggest that information storage in computational systems mirrors information storage in physical systems at a deep structural level — supporting the “it from bit” thesis that computation is not merely a metaphor for physics but shares its fundamental constraints.
Formal Definitions
Volume (V)
Total parameter count of the network. For a network with L layers of width w: V = O(Lw²).
Boundary (A) — Definition 1: Input/Output Interface
Parameters that directly interact with input or output: first layer weights + last layer weights. A = O(w · d_in + w · d_out), where d_in and d_out are input/output dimensionalities.
Boundary (A) — Definition 2: Cross-Layer Interface
Parameters participating in inter-layer connections. For a fully-connected network: A = O((L-1) · w²), which equals the volume minus intra-layer biases. For this to be interesting, we need architectures where A ≠ V — networks with substantial intra-layer computation (wide residual blocks, attention within layers).
Information Capacity (C)
Maximum number of random labels the network can memorize to 100% training accuracy (Zhang et al., 2017 methodology). Measured in bits: memorizing N uniformly random labels drawn from K classes stores C = N · log₂(K) bits.
Methodology
Phase 1 — Capacity Measurement Protocol
For each architecture configuration, generate datasets with random labels (uniform random assignment of K classes to N samples from CIFAR-10 or synthetic Gaussian data). Binary search for the maximum N where the network reaches 100% training accuracy within a fixed compute budget (50 epochs, SGD with momentum). This N is the effective memorization capacity. Repeat 5 times with different random seeds, report median.
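The search reduces to bracket-doubling plus bisection around a (hypothetical) `fits(N)` predicate that wraps one full training run and is assumed monotone in N:

```python
def max_memorizable(fits, lo=1_000, hi=16_000):
    """Largest N with fits(N) True, assuming fits is monotone non-increasing.

    `fits` stands in for a full training run: sample N random-label examples,
    train for the fixed 50-epoch budget, return train_acc == 1.0.
    """
    if not fits(lo):
        return 0
    while fits(hi):                  # grow the bracket until memorization fails
        lo, hi = hi, hi * 2
    while hi - lo > 1:               # then bisect inside [lo, hi)
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe is a full training run, so the doubling phase matters: it keeps the number of runs logarithmic in the final capacity rather than requiring a good a-priori upper bound.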
Phase 2 — Systematic Architecture Sweep
Measure capacity for 15+ architecture configurations spanning different depth/width ratios. MLPs: [2×512, 4×256, 8×128, 16×64, 32×32] (constant volume ~260K params, varying depth). Wide ResNets: WRN-d-k for d ∈ {16, 22, 28, 40} and k ∈ {1, 2, 4, 8}. Transformers: {2, 4, 8, 12} layers × {64, 128, 256} model dim. For each, compute V (volume) and A (boundary under both definitions).
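Computing V and A (Definition 1) for the MLP sweep is mechanical; a sketch with illustrative default dimensions (CIFAR-style 3072-dim input, 10 classes; `mlp_volume_area` is an assumed helper name):

```python
def mlp_volume_area(hidden, d_in=3072, d_out=10):
    """Volume V (all parameters) and boundary A under Definition 1
    (first-layer + last-layer weights) for an MLP with given hidden widths."""
    dims = [d_in] + list(hidden) + [d_out]
    V = sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))   # weights + biases
    A = d_in * dims[1] + dims[-2] * d_out                     # interface weights
    return V, A
```

Note that under Definition 1 the depth sweep changes V while leaving A fixed whenever the first and last hidden widths are held constant, which is what makes the constant-width depth sweep the decisive test.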
Phase 3 — Scaling Law Extraction
Plot C vs V and C vs A on log-log axes. Fit power laws: C ~ V^α and C ~ A^β. If β ≈ 1.0 and α < 1.0, the area law holds — capacity scales with boundary, not volume. Compute R² for both fits. Use Bayesian model comparison (BIC) to determine which scaling relationship is statistically preferred. Critical test: vary depth at constant width (changes V but not A under Definition 1). If capacity stays constant, area law is strongly supported.
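The log-log fit and BIC comparison fit in one helper (a sketch; `power_law_fit` is an assumed name, not existing analysis code):

```python
import numpy as np

def power_law_fit(x, y):
    """Least-squares fit of y ~ a * x^k in log-log space.

    Returns (k, r2, bic); the lower BIC wins the area-law vs volume-law race.
    """
    lx, ly = np.log(x), np.log(y)
    k, log_a = np.polyfit(lx, ly, 1)
    resid = ly - (k * lx + log_a)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((ly - ly.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(lx), 2                    # two fitted parameters: k, log a
    bic = n * np.log(ss_res / n) + p * np.log(n)
    return float(k), r2, float(bic)
```

Run once with x = V and once with x = A; since both models have the same parameter count, the BIC comparison reduces to comparing residual fit quality.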
Phase 4 — Skip Connections as 'Wormholes'
Skip connections create direct information pathways between non-adjacent layers — topological shortcuts analogous to wormholes in spacetime. Compare capacity of ResNets (with skip connections) vs equivalent-depth plain networks (without). Under the holographic analogy, skip connections should increase the effective boundary area, predicting higher capacity. If confirmed, skip connections are computational wormholes that expand the information boundary.
Phase 5 — Attention as Non-Local Information Coupling
Self-attention allows every position to directly access every other position — effectively making the entire network a 'boundary.' Prediction: transformers should deviate from the area law (or equivalently, their effective boundary ≈ volume due to attention). If transformers obey a volume law while non-attention architectures obey an area law, attention is the mechanism that breaks the holographic constraint. This has implications for why transformers scale so well.
GENESIS Primary Metrics
- Maximum memorizable dataset size (bits) per architecture
- Volume scaling exponent α in C ~ V^α
- Area scaling exponent β in C ~ A^β
- R² and BIC comparison: area law vs volume law fit
- Constant-width depth sweep: capacity vs depth curve
GENESIS Secondary Metrics
- Mutual information I(W; D_train) at convergence (via MINE)
- Skip connection capacity delta (ResNet vs plain net)
- Transformer vs non-transformer scaling exponent comparison
- Effective boundary expansion from attention (measured vs predicted)
- Double descent location relative to boundary capacity
GENESIS Tools & Infrastructure
Training
PyTorch, Weights & Biases, distributed training (multiple GPU for transformer sweeps)
Analysis
SciPy (curve fitting, BIC), NumPy, MINE estimator, matplotlib/seaborn (scaling plots)
Data
CIFAR-10 (random label memorization), synthetic Gaussian blobs (controlled complexity)
GENESIS Expected Outputs
- Scaling law characterization: area law vs volume law for 15+ architecture configurations
- If area law: first evidence of holographic principle analogs in computational systems — bridging deep learning theory and theoretical physics
- Skip connection analysis: empirical test of 'computational wormhole' hypothesis
- Transformer exceptionalism: why attention-based architectures may break the area law (explaining their empirical superiority)
- Publication target: Nature Physics, Physical Review Letters, or ICML (if framed computationally). Cross-listing on arXiv: cs.LG + hep-th
GENESIS Risks & Mitigations
Risk: Both area and volume law fit equally well (no clear winner)
The constant-width depth sweep is the decisive test. If capacity increases with depth at constant width, volume law wins. If capacity saturates, area law wins. This test has high statistical power because it isolates the variable.
Risk: Memorization capacity is a poor proxy for information capacity
Supplement with mutual information measurement I(W; D_train) using MINE. If MI-based capacity and memorization-based capacity give the same scaling exponent, the proxy is validated.
Risk: The analogy to Bekenstein is superficial — neural networks aren't physical systems
The claim is not that neural networks are literally bounded by the Bekenstein bound. The claim is that information storage in computational systems may be subject to analogous area-law constraints, suggesting shared mathematical structure. Frame as 'computational holographic principle' not 'Bekenstein bound for neural networks.'
Risk: Results are optimizer-dependent (SGD vs Adam may give different capacity)
Run full sweep with both SGD+momentum and Adam. If scaling exponents differ, report both — optimizer dependence is itself an interesting finding.
GENESIS References
- Bekenstein, J. D. (1973). Black holes and entropy. Physical Review D, 7(8), 2333.
- 't Hooft, G. (1993). Dimensional reduction in quantum gravity. arXiv:gr-qc/9310026.
- Susskind, L. (1995). The world as a hologram. Journal of Mathematical Physics, 36(11), 6377-6396.
- Zhang, C. et al. (2017). Understanding deep learning requires rethinking generalization. ICLR.
- Nakkiran, P. et al. (2019). Deep double descent: where bigger models and more data can hurt. ICLR.
- Frankle, J. & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. ICLR.
- Kaplan, J. et al. (2020). Scaling laws for neural language models. arXiv:2001.08361.
- Wheeler, J. A. (1990). Information, physics, quantum: the search for links. Complexity, Entropy, and the Physics of Information.
- Wolfram, S. (2002). A New Kind of Science. Wolfram Media.
Cross-Experiment Connections
These three experiments are not independent. Results from each directly inform and constrain the others.
EXP-01 → EXP-02
If topological depth predicts forgetting resistance, does Φ also predict it? Networks with higher integrated information may naturally create deeper topological features because integration requires complex, multi-scale structure in the loss landscape.
EXP-02 → EXP-03
If Φ correlates with generalization, and information capacity follows an area law, then Φ may be the mechanism that determines how efficiently a network uses its boundary parameters. High Φ = better boundary utilization.
EXP-03 → EXP-01
If capacity is boundary-limited, catastrophic forgetting may occur when new task information competes for limited boundary capacity. Topological protection may work by ensuring old knowledge is encoded in “interior” parameters that new learning cannot overwrite.
Together, these experiments test a unified thesis: that the geometry of knowledge, the integration of information, and the fundamental limits of computational capacity are manifestations of the same underlying mathematical structure — one shared by both physical and computational systems.
Execution Workflow
How experiments progress from hypothesis to publication. All experiments are tracked via ClearML (self-hosted, open source) for full reproducibility.
Configure
Define hypothesis, architecture, hyperparameters, and benchmarks in versioned YAML config. All experimental parameters are declarative — nothing hardcoded.
Train Baseline
Train target architecture to convergence on Task A. Checkpoints saved at intervals for downstream analysis. Loss curves, accuracy, and learning rate tracked in real time.
Measure
Run experiment-specific measurements: loss landscape sampling + persistent homology (EXP-01), Phi* computation across partitions (EXP-02), or memorization capacity binary search (EXP-03). Results logged automatically.
Perturb & Observe
Apply the experimental intervention: sequential task training (EXP-01), architecture survey across families (EXP-02), or depth/width sweep at controlled ratios (EXP-03). Measure target variables at defined intervals.
Correlate
Statistical analysis: Spearman rank correlation, Bayesian model comparison (BIC), power-law fitting. Determine whether the hypothesis is supported, refuted, or inconclusive.
Iterate or Publish
Positive result: extend to additional architectures, write paper. Negative result: analyze why, pivot methodology, document findings. All results — positive or negative — are publishable.
Compute
- Local GPU cluster (NVIDIA RTX, CUDA)
- PyTorch 2.x with mixed precision
- Distributed training for architecture sweeps
Tracking
- ClearML (self-hosted, Apache 2.0)
- Full experiment versioning and comparison
- Automated artifact and model storage
Reproducibility
- Deterministic seeding across all runs
- Version-controlled configs (YAML)
- Full dependency pinning (pyproject.toml)
