@artpar · Created February 19, 2026
The Anatomy of Small Neural Networks: Parameter Space Structure, Architecture Selection, and the Depth-Width Tradeoff



Abstract

We present a systematic empirical investigation of feedforward ReLU networks at minimal scale (1-16 hidden neurons), revealing structural properties of parameter space, training dynamics, and architecture selection that are obscured at practical scale. Through 11 controlled experiments totaling over 10,000 training runs, we establish several novel findings: (1) the Hessian eigenspectrum of a trained network decomposes into exactly four tiers — steep, moderate, weak, and zero — whose counts are predictable from architecture alone, with the steep tier recovering the target function's degrees of freedom; (2) this structure emerges during training via a sharp phase transition (27 steep eigenvalues collapse to 4 between steps 50-100 for a 16-neuron network), after which 97% of training time is spent exploring the solution manifold rather than improving the function; (3) exhaustive enumeration of all 3,662 architectures distributing 16 neurons across 1-16 layers reveals a 575,000x performance range, governed by a two-factor mechanism (basis count x basis diversity); (4) this mechanism extends to a three-factor model on multi-dimensional input (input capture x basis diversity x basis count), explaining why depth advantage reverses from 2.1x better to 5x worse between 1D and 2D at fixed neuron budget; (5) iterative targeted construction — a training-free method using Jacobian projections — converges to any target function in 3-7 iterations with <0.002 error. These results unify minimum-width theorems, depth separation results, and the wide residual network phenomenon under a single mechanistic framework grounded in the piecewise-linear geometry of ReLU networks.


1. Introduction

The practical success of deep neural networks has outpaced our understanding of their internal structure. Questions as basic as "why does this architecture work better than that one?" remain answered primarily by empirical benchmarking at scale, where the interaction of optimizer, regularization, data augmentation, and architecture makes causal isolation difficult.

We take the opposite approach. By studying the smallest networks that exhibit nontrivial behavior — single hidden layers with 1 to 16 ReLU neurons, and multi-layer networks distributing 16 neurons across 1 to 16 layers — we can:

  1. Compute the full Hessian eigendecomposition (not stochastic approximations)
  2. Enumerate all possible architectures (not sample from a search space)
  3. Train thousands of models to convergence in minutes (not days)
  4. Track every neuron's alive/dead status across training
  5. Isolate single variables (optimizer, activation, architecture) with controlled experiments

This paper reports the findings of 11 experiments conducted in sequence, each building on the previous. We organize the results thematically rather than chronologically, focusing on what is novel relative to existing literature.

1.1 Scope and Contributions

Our contributions fall into five categories:

I. Parameter space anatomy. We construct the complete eigenspectrum of the loss Hessian for networks with k=0 to k=16 hidden neurons, showing how each added neuron inflates the null-space-adjacent region. At k=0 (no hidden layer), parameter space is function space — both Hessian eigenvalues equal 2.0 for standard normal data, giving a perfectly circular loss landscape with zero redundancy. At k=16, only 8% of parameter dimensions are functionally relevant; the remaining 92% encode symmetries (permutation, scaling), dead neuron parameters, and kink redistributions.

II. Training dynamics through the eigenspectrum. We track the Hessian at 8 checkpoints during training and discover a sharp phase transition: a 16-neuron network initializes with 27/49 steep eigenvalues (>0.1) which collapse to 4 by step 100. This transition marks the boundary between Phase 1 (function alignment, 3% of training) and Phase 2 (manifold exploration, 97% of training). Function error converges 100-1000x faster than weight distance to the final solution.

III. Exhaustive architecture enumeration. We train all 3,662 ways to distribute 16 neurons across 1-16 layers (all integer compositions of 16, depths 1-5 complete + sampling at higher depths). Performance spans 575,000x on y=2x+1 — a trivial problem — governed by a two-factor mechanism: performance = basis count (last layer width) x basis diversity (bottleneck width). This predicts the optimal shape from first principles without training.

IV. Depth advantage reversal on multi-dimensional input. The two-factor mechanism extends to three factors on multi-D input: input capture (first width >= d) x basis diversity x basis count. On 1D, the first factor is trivially satisfied, so depth helps by increasing diversity. On 2D with 16 neurons, the first factor dominates — every neuron spent on depth is a neuron stolen from input capture. Depth-5 goes from 2.1x better (1D) to 5x worse (2D). This explains why depth requires scale (GPT-2's 768-wide layers, ResNet's 64 channels on 3-channel input).

V. Training-free function construction. The naive Jacobian pseudoinverse has condition number 25 million, but projecting onto the target's 2-DOF subspace yields condition number 2.4. Iterative application (relinearize + solve) converges to any target in 3-7 steps with no loss function and no optimizer — purely linear algebra, with convergence rate matching Newton's method.


2. Experimental Setup

All experiments share a common framework.

2.1 Architecture

The base architecture is a fully-connected feedforward network with ReLU activations:

$$f(x) = W_L \cdot \sigma(W_{L-1} \cdot \sigma(\cdots \sigma(W_1 x + b_1) \cdots) + b_{L-1}) + b_L$$

where $\sigma$ = ReLU. For single-hidden-layer experiments, this reduces to $f(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2$.

We parameterize architectures by their hidden layer widths. A "shape" $[w_1, w_2, \ldots, w_L]$ denotes $L$ hidden layers with widths $w_i$, for a total neuron budget $N = \sum w_i$. The input and output dimensions are always 1 (1D experiments) or $d$ (multi-D experiments) and 1 respectively.

2.2 Training

  • Optimizer: Adam (lr=0.01, default betas)
  • Batch size: 64, sampled from $\mathcal{N}(0, 1)^d$ (fresh each step)
  • Loss: Mean squared error
  • Steps: 2000 (experiments 001-010) or 3000 (experiment 011)
  • Precision: float32 (Apple Silicon / MPS)
  • Initialization: PyTorch defaults (Kaiming uniform)
  • No regularization: No dropout, weight decay, batch norm, or learning rate schedule

2.3 Evaluation

  • Dead neuron detection: A neuron is "dead" if its pre-activation is non-positive for all samples in a batch of 1000. Batch-level detection (n=64) overestimates by ~40%.
  • Breakpoint counting: Number of distinct linear regions in the piecewise-linear output function, computed on a fine grid.
  • Hessian: Full eigendecomposition of the $p \times p$ loss Hessian via torch.autograd.functional.hessian(). Feasible for $p \leq 100$.
  • Functional fraction: Proportion of Hessian eigenvalues in the steep tier (> 0.1), measuring what fraction of parameters control the output function.
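The two cheapest diagnostics can be sketched in a few lines of numpy (a minimal reimplementation of the definitions above on a random, untrained 1->16->1 net; the paper's own code uses PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 1 -> 16 -> 1 ReLU net (stand-in for a trained model).
W1, b1 = rng.normal(size=(16, 1)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def forward(x):                       # x: (n,) inputs
    pre = x[:, None] * W1.T + b1      # (n, 16) pre-activations
    return np.maximum(pre, 0) @ W2.T + b2, pre

def dead_neurons(x):
    """A neuron is dead if its pre-activation is <= 0 on every sample."""
    _, pre = forward(x)
    return np.where((pre <= 0).all(axis=0))[0]

def count_breakpoints(grid):
    """Breakpoints = grid steps where some neuron's on/off pattern flips."""
    _, pre = forward(grid)
    pattern = pre > 0                 # (n, 16) activation pattern per grid point
    return int((pattern[1:] != pattern[:-1]).any(axis=1).sum())

x = rng.standard_normal(1000)         # detection batch of 1000, as in the paper
grid = np.linspace(-4, 4, 4001)
print(len(dead_neurons(x)), count_breakpoints(grid))
```

Since each hidden neuron's pre-activation is monotone in $x$, its pattern flips at most once, so the breakpoint count is bounded by the number of alive neurons.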

2.4 Targets

| Target | Formula | Class | Properties |
|---|---|---|---|
| Linear | $y = 2x + 1$ | In-class | Exactly representable by a ReLU net |
| Absolute | $y = \lvert x \rvert$ | In-class | Exactly representable as $\mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$ |
| Quadratic | $y = x^2$ | Out-of-class | Smooth, requires approximation |
| 2D quadratic | $y = (x_1^2 + x_2^2)/2$ | Out-of-class | Radially symmetric |
| 2D interaction | $y = x_1 \cdot x_2$ | Out-of-class | Requires feature combination |

3. Parameter Space Anatomy

3.1 The Hessian Eigenspectrum Has Four Tiers

For a trained 1->16->1 ReLU network on $y = 2x + 1$ (49 parameters), the loss Hessian decomposes into four sharply separated tiers:

| Tier | Eigenvalue range | Count | Interpretation |
|---|---|---|---|
| Steep | $> 0.1$ | 3-4 | Functional DOF (slope, intercept, kink structure) |
| Moderate | $10^{-3}$ to $0.1$ | 5-8 | ReLU kink position adjustments |
| Weak | $10^{-6}$ to $10^{-3}$ | 23-24 | Scaling symmetry, kink redistribution |
| Zero | $< 10^{-6}$ | 13-16 | Dead neuron parameters |

The top eigenvalue ($\lambda_1 = 17.4$) corresponds to intercept change — moving all neurons' biases collectively. The second ($\lambda_2 = 6.1$) corresponds to slope change. The ratio $\lambda_1 / \lambda_2 = 2.84$ reflects that intercept requires all neurons to agree (collective mode), while slope can be distributed across neurons (many partial solutions, hence more available directions, hence less curvature per direction).

This is verified by computing the Jacobian $\partial f / \partial \theta$ and projecting Hessian eigenvectors onto the slope and intercept directions. The top two eigenvectors align with intercept and slope respectively, confirming that the Hessian recovers the target function's degrees of freedom.

3.2 Building Up: k=0 to k=16

We constructed the eigenspectrum for every architecture from k=0 (no hidden layer, 2 parameters) to k=16 (49 parameters). This reveals how redundancy enters the parameter space neuron by neuron.

k=0: Parameter space IS function space. The 2x2 Hessian has eigenvalues [2.012, 2.000] — essentially $2I$ for standard normal data. The loss landscape is a perfect circular bowl. Functional fraction = 1.00. This is the unique architecture with zero redundancy: 2 parameters for a 2-DOF function.
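The k=0 claim is easy to verify: the MSE loss of $f(x) = wx + b$ is exactly quadratic in $(w, b)$, with constant Hessian $2\begin{bmatrix}\mathbb{E}[x^2] & \mathbb{E}[x]\\ \mathbb{E}[x] & 1\end{bmatrix}$. A Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

# MSE loss for f(x) = w*x + b on y = 2x + 1 is quadratic in (w, b),
# so its Hessian is constant: H = 2 * [[E[x^2], E[x]], [E[x], 1]].
H = 2 * np.array([[np.mean(x**2), np.mean(x)],
                  [np.mean(x),    1.0       ]])
eig = np.sort(np.linalg.eigvalsh(H))[::-1]
print(eig)   # both eigenvalues ~ 2.0 for standard normal inputs
```

For standard normal data $\mathbb{E}[x^2] = 1$ and $\mathbb{E}[x] = 0$, so $H \to 2I$: the circular bowl.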

k=1: The first symmetry. The 4-dimensional space decomposes as: 2 steep (function DOF) + 1 moderate + 1 near-zero. The near-zero direction has cosine similarity 1.0000 with the analytical scaling symmetry direction $(\alpha w_1, \alpha b_1, w_2/\alpha, b_2)$. This is the first redundancy — you can rescale the single neuron's weights without changing the function.

k=2 to k=16: Systematic inflation. Each neuron adds approximately:

  • 0 steep directions (the function's DOF is independent of network size)
  • ~1 kink-position direction (moderate if enough neurons contribute, weak otherwise)
  • ~1 scaling symmetry direction (weak)
  • ~1 kink redistribution direction (weak)

The result: functional fraction drops as approximately $1/k$, from 1.00 (k=0) to 0.50 (k=1) to 0.08 (k=16). At k=16, 92% of the 49 parameters are functionally redundant.

3.3 The Solution Manifold Is Globally Flat

Walking along Hessian eigenvectors from a converged solution:

  • Steep direction: Loss rises to 1-10 within $\pm 1$ unit
  • Moderate direction: Loss rises to $\sim 0.01$ within $\pm 1$ unit
  • Weak direction: Loss stays below $10^{-4}$ for $\pm 3$ units
  • Zero direction (dead neurons): Loss stays below $10^{-5}$ for arbitrary distance

The weak-direction flatness extends over distances exceeding 60% of the model's total weight norm. The solution manifold is not just locally tangent — it is a globally flat subspace of dimension approximately $3k$ embedded in the $3k+2$-dimensional parameter space.

3.4 Permutation Symmetry Is a Volume Effect

With 16 hidden neurons, $16! \approx 2 \times 10^{13}$ permutations of neuron indices preserve the network function exactly. We quantify this via 100 independently trained models:

  • Raw PCA: 28 components for 95% variance. Top 2 PCs explain only 7.5% and 7.1% — near-isotropic.
  • After neuron alignment (greedy matching by activation correlation): 27 components for 95% variance, but total variance drops 52.9%.

Permutation symmetry inflates the solution cloud uniformly in all directions (volume effect) rather than adding specific directions of variation (dimensionality effect). This is why PCA of neural network weight matrices is near-isotropic despite the underlying solution manifold having rich structure.
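Greedy alignment by activation correlation can be illustrated on a synthetic case where the ground-truth permutation is known (a toy reconstruction, not the paper's code; weights are chosen so every kink lies inside the data range and no neuron is dead):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500)

# One 1 -> 16 hidden layer; |W1| >= 1 and |b1| <= 1 put every kink in [-1, 1],
# so each neuron is alive and genuinely nonlinear on the sample.
W1 = rng.choice([-1.0, 1.0], 16) * rng.uniform(1.0, 2.0, 16)
b1 = rng.uniform(-1.0, 1.0, 16)
acts_a = np.maximum(x[:, None] * W1 + b1, 0)      # model A activations (500, 16)

perm = rng.permutation(16)                        # "model B" = A with relabeled neurons
acts_b = acts_a[:, perm]

# Greedy alignment: each A-neuron claims the most correlated unused B-neuron.
C = np.corrcoef(acts_a.T, acts_b.T)[:16, 16:]
match, used = {}, set()
for i in range(16):
    j = max((j for j in range(16) if j not in used), key=lambda j: C[i, j])
    match[i] = j
    used.add(j)

# The matching inverts the relabeling: B-neuron match[i] is A-neuron i.
print(all(perm[match[i]] == i for i in range(16)))   # True
```

Between two independently trained models the correlations are no longer exactly 1, which is why alignment recovers only part of the weight distance.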

Mode connectivity. Before alignment, linear interpolation between two independently trained models encounters a loss barrier of 2.12. After neuron alignment, the barrier drops to 0.088 — a 95.8% reduction. Models that appear to live in different basins are actually in the same basin, modulo neuron relabeling. This confirms the results of Entezari et al. (2022) at minimal scale where the claim can be verified exhaustively.

On nonlinear targets ($y = x^2$), the picture changes: alignment reduces the barrier by only 49.6%, and permutation explains only 27% of weight distance (vs 42% for linear). Different runs find genuinely different piecewise approximations (function distance 1.21) — the solution landscape has real multiple local optima for out-of-class targets.

3.5 Scaling Symmetry

Beyond permutation, a second symmetry preserves the function: multiplying a neuron's input weights by $\alpha$ and its output weight by $1/\alpha$. This contributes ~10% of inter-model weight distance and accounts for approximately one flat Hessian direction per neuron. At k=1, this is the only symmetry, and its direction matches the flattest Hessian eigenvector with cosine similarity 1.0000.
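The symmetry is exact for any $\alpha > 0$ because $\mathrm{ReLU}(\alpha z) = \alpha\,\mathrm{ReLU}(z)$; a one-neuron numerical check:

```python
import numpy as np

x = np.linspace(-3, 3, 601)

# 1 -> 1 -> 1 ReLU "network": f(x) = w2 * relu(w1*x + b1) + b2
def f(w1, b1, w2, b2):
    return w2 * np.maximum(w1 * x + b1, 0) + b2

w1, b1, w2, b2 = 1.3, 0.4, -0.7, 0.2
alpha = 5.0

# Rescale the neuron: input weights times alpha, output weight divided by alpha.
f_orig = f(w1, b1, w2, b2)
f_scaled = f(alpha * w1, alpha * b1, w2 / alpha, b2)
print(np.max(np.abs(f_orig - f_scaled)))   # zero up to float rounding
```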


4. Training Dynamics

4.1 The Eigenspectrum Phase Transition

The four-tier Hessian structure from Section 3 does not exist at initialization. Tracking the eigenspectrum during training reveals a sharp phase transition:

| Step | Loss | Steep ($>0.1$) | Moderate | Weak | Zero |
|---|---|---|---|---|---|
| 0 | 22.4 | 27 | 6 | 8 | 8 |
| 25 | 2.21 | 28 | 5 | 6 | 10 |
| 50 | 0.578 | 23 | 9 | 6 | 11 |
| 100 | 0.250 | 4 | 27 | 7 | 11 |
| 200 | 0.021 | 4 | 13 | 23 | 9 |
| 500 | 0.011 | 4 | 27 | 10 | 8 |
| 3000 | 5.7e-5 | 4 | 10 | 27 | 8 |

Between steps 50 and 100, the steep count collapses from 23 to 4. This is the moment the network "finds" its solution — the loss landscape transforms from a generic high-dimensional bowl (many steep directions) to a narrow functional subspace embedded in a vast flat manifold.

The transition is remarkably sharp: over approximately 50 gradient steps (less than 2% of total training time), 19 directions transition from steep to moderate/weak. By step 200, the manifold structure is fully established and does not change qualitatively for the remaining 2800 steps.

4.2 Two Phases of Training

This transition divides training into two mechanistically distinct phases:

Phase 1: Function alignment (steps 0-100, ~3% of training). The network's output function converges to within $10^{-2}$ of the target. Steep Hessian directions settle with time constants of 50-100 steps. The loss drops 400x (from 22.4 to 0.058). This is the productive phase — essentially all functional learning happens here.

Phase 2: Manifold exploration (steps 100-3000, ~97% of training). The function is already learned; weights continue moving in weak and moderate directions without changing the output. Movement budget analysis confirms: after step 200, almost all parameter movement is in the weak tier (null-space-adjacent directions).

The movement along zero-eigenvalue directions (dead neuron parameters) follows a pure random walk, reaching normalized displacements of $\pm 100$. Adam's momentum carries these parameters with zero gradient correction — there is no force either attracting or repelling them.

4.3 Function Converges 100-1000x Faster Than Weights

For k=16, function error (measured as RMSE on a held-out grid) reaches $10^{-4}$ by step ~500. Weight distance to the final solution (measured as L2 norm) is still ~1.0 at step 3000 and shows no sign of converging. The discrepancy arises because most of the weight movement is along flat directions that do not change the function.

This observation has implications for convergence diagnostics: monitoring loss is sufficient for detecting functional convergence, but weight-space metrics (gradient norms, weight changes) can be highly misleading because they conflate functional progress with manifold exploration.


5. Architecture Selection at Fixed Neuron Budget

5.1 The Enumeration

We enumerated all ways to distribute $N = 16$ hidden neurons across $L = 1$ to $L = 16$ layers. The number of such architectures equals the number of ordered integer compositions of 16, which is $2^{15} = 32{,}768$. We trained all 1,941 shapes for depths 1-5 and sampled 1,721 additional shapes for depths 6-16, for a total of 3,662 architectures $\times$ 2 targets = 7,324 training runs.
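Both counts follow from the standard stars-and-bars argument (a composition of 16 into $L$ parts chooses $L-1$ of the 15 gaps as cut points) and can be checked by enumeration:

```python
from itertools import combinations

def compositions(n):
    """Ordered integer compositions of n: choose cut points among the n-1 gaps."""
    for k in range(n):                          # k = number of cuts
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

shapes = list(compositions(16))
print(len(shapes))                              # 2**15 = 32768
print(sum(1 for s in shapes if len(s) <= 5))    # 1941 shapes of depth 1-5
```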

5.2 The Performance Landscape

Performance range. On $y = 2x + 1$, best loss = $5.80 \times 10^{-5}$ (shape 4->3->9), worst = $33.4$ (any depth-11+ shape). Ratio: 575,629x. Even restricting to shapes with similar parameter counts (50-70), the range exceeds 1,000x. Architecture choice matters enormously, even on the simplest possible regression problem.

Depth frontier. Depth is not monotonically bad. The best shape at each depth:

| Depth | Best shape | Params | Loss ($y=x^2$) | vs depth-1 |
|---|---|---|---|---|
| 1 | [16] | 49 | 1.51 | baseline |
| 2 | [6, 10] | 99 | 0.97 | 1.6x better |
| 3 | [5, 7, 4] | 94 | 0.72 | 2.1x better |
| 4 | [4, 4, 4, 4] | 77 | 1.08 | 1.4x better |
| 5 | [3, 4, 2, 2, 5] | 59 | 0.72 | 2.1x better |
| 10 | | | 62.6 | 41x worse |
| 11+ | | | 108.0 | 72x worse |

Depth-3 and depth-5 beat depth-1 on $y = x^2$. But the best deep shapes are highly asymmetric: 5->7->4, not 5->5->5. Our prior experiment (009) concluded "depth hurts on 1D" because it tested only uniform shapes (8->8, 5->5->5) — an artifact of shape selection, not a fundamental property.

Caveat on parameter count. The depth-3 winner (5->7->4) has 94 parameters vs 49 for depth-1 (16). The inter-layer connections contribute $w_i \times w_{i+1}$ parameters per layer boundary. Whether depth helps at fixed parameter count (rather than fixed neuron count) remains an open question. At this scale the confound is significant.

5.3 The Depth Cliff

At depth 11 with 16 total neurons, at least 10 layers must have width 1. A chain of width-1 ReLU layers implements:

$$x \mapsto \text{ReLU}(\text{ReLU}(\cdots \text{ReLU}(ax + b) \cdots))$$

Each ReLU clips negative values. After enough applications, the output is either $f(x) = 0$ (constant) or $f(x) = cx + d$ for $x > x_0$ (half-space linear). This gives exactly 1 breakpoint — the same as a single ReLU neuron. All depths 11-16 saturate at the same loss ($\sim 33.4$ for linear, $\sim 108$ for quadratic), regardless of where the remaining 5 "wide" neurons are placed.
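A minimal numerical check (identity weights for the chain; a sign flip anywhere mid-chain is fatal):

```python
import numpy as np

grid = np.linspace(-3, 3, 6001)
relu = lambda z: np.maximum(z, 0)

def breakpoints(y):
    """Count slope changes of a piecewise-linear function sampled on the grid."""
    slopes = np.diff(y) / np.diff(grid)
    return int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

# Ten chained width-1 ReLU layers: still just relu(x).
h = grid.copy()
for _ in range(10):
    h = relu(h)
print(breakpoints(h), breakpoints(relu(grid)))   # 1 1 -- one breakpoint, like a single neuron

# A sign flip anywhere in the chain kills the function outright:
# relu(-relu(x)) is identically zero.
print(np.abs(relu(-relu(grid))).max())           # 0.0
```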

5.4 The Two-Factor Mechanism

The optimal architecture is governed by two competing factors:

  1. Basis count (last hidden layer width $w_L$): The final layer before the output produces $w_L$ ReLU basis functions. Each is a hinge at a different position on the input axis. More basis functions = finer piecewise-linear approximation.

  2. Basis diversity (minimum interior width, i.e., bottleneck): If any intermediate layer has width $w_b$, then the $w_L$ last-layer neurons receive at most $w_b$ distinct input features. If $w_b < w_L$, the basis functions are constrained to be linear combinations of $w_b$ templates — many are redundant copies.

Performance = basis count x basis diversity. This product model explains why:

  • 4->3->9 (9 last neurons, bottleneck 3) has loss 57 on $x^2$, while 5->7->4 (4 last neurons, bottleneck 5) has loss 0.72. The 3-neuron bottleneck makes 6 of the 9 last-layer neurons functionally redundant.
  • Wide-last shapes beat wide-first: 6->10 (loss 0.97) beats 10->6 (loss 2.98). The last layer directly constructs the output approximation; the first layer merely feeds it.
  • Depth-1 [16] is strong because it has bottleneck = 16 and basis count = 16 — no compression anywhere.

Cascading narrow layers are lethal. The shape 10->2->2->2 produces $f(x) \approx 0$ (loss 108.0, the constant-output baseline). Each width-2 ReLU layer clips negative pre-activations; after three such layers, all neurons are dead. This is not a gradient vanishing problem — it is a representation collapse caused by cumulative ReLU truncation.

5.5 Breakpoint Count as Proximate Cause

We verified the causal chain:

$$\text{Shape} \rightarrow \text{Alive neurons} \rightarrow \text{Breakpoint count} \rightarrow \text{Loss}$$

The scatter of breakpoints vs. loss on $y = x^2$ follows the theoretical curve $\text{MSE} \propto 1/n^2$ (Yarotsky, 2017, for uniform breakpoint approximation of smooth functions). With 20 breakpoints, MSE $\approx 0.7$; with 5, MSE $\approx 10$; with 0, MSE = 108 (constant output).
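For intuition, the $1/n^2$ rate appears already for plain piecewise-linear interpolation of $x^2$, independent of any training (a sanity check with uniform knots, error measured in sup norm; the trained networks above place breakpoints adaptively):

```python
import numpy as np

def pwl_sup_error(n, lo=-3.0, hi=3.0):
    """Sup-norm error of piecewise-linear interpolation of x^2 with n uniform segments."""
    knots = np.linspace(lo, hi, n + 1)
    x = np.linspace(lo, hi, 20001)
    fit = np.interp(x, knots, knots**2)   # linear interpolation through the knot values
    return np.max(np.abs(fit - x**2))

e5, e10, e20 = pwl_sup_error(5), pwl_sup_error(10), pwl_sup_error(20)
print(e5 / e10, e10 / e20)   # both ratios ~ 4: doubling n quarters the error
```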


6. The Depth-Width Tradeoff on Multi-Dimensional Input

6.1 The Prediction and Its Failure

The depth separation literature (Montufar et al., 2014; Eldan & Shamir, 2016) shows that deep networks can represent exponentially more linear regions than shallow networks, with the exponent proportional to input dimension $d$. At $d = 1$, the advantage is polynomial, not exponential. This led to the natural prediction:

Hypothesis: Depth advantage should grow with input dimension.

We tested this with 16 neurons on 2D input (targets: $(x_1^2 + x_2^2)/2$ and $x_1 \cdot x_2$). The prediction was wrong.

6.2 Depth Advantage Reverses

| Depth | 1D ($x^2$) relative perf. | 2D ($(x_1^2 + x_2^2)/2$) relative perf. |
|---|---|---|
| 1 | 1.0x (baseline) | 1.0x (baseline) |
| 2 | 1.6x better | 1.1x better |
| 3 | 2.0x better | 0.75x (worse) |
| 4 | 2.0x better | 0.33x (worse) |
| 5 | 2.1x better | 0.19x (5x worse) |

On 1D, depth-5 was 2.1x better than depth-1. On 2D, depth-5 is 5x worse. The advantage doesn't merely fail to grow — it inverts.

The interaction target $x_1 \cdot x_2$ was predicted to specifically require depth (it is a nonlinear function of products of inputs). It does not: depth-1 [16] achieves loss 2.92e-3, beating every deeper shape including 8->8 (loss 4.41e-3).

6.3 Feature Importance Flips

We computed Pearson correlations between architectural features and $\log_{10}(\text{loss})$ across 736 shapes on 2D input:

| Feature | 1D ($r$ with log loss) | 2D quadratic | 2D interaction |
|---|---|---|---|
| Last width | -0.431 | -0.175 | -0.081 |
| First width | (irrelevant on 1D) | -0.314 | -0.445 |
| Bottleneck | -0.483 | -0.520 | -0.491 |

The importance flips. On 1D, last layer width is the strongest architectural predictor ($r = -0.431$). On 2D interaction, first layer width dominates ($r = -0.445$) and last width is nearly irrelevant ($r = -0.081$). The feature that predicted performance on 1D barely matters on 2D.

6.4 The Three-Factor Mechanism

On $d$-dimensional input with neuron budget $N$:

$$\text{Performance} = \underbrace{\text{Input capture}}_{\text{first width} \geq d} \times \underbrace{\text{Basis diversity}}_{\text{bottleneck}} \times \underbrace{\text{Basis count}}_{\text{last width}}$$

On 1D input ($d = 1$): The first factor is trivially satisfied — a single-neuron first layer captures the entire 1D input. The remaining two factors (from Section 5.4) determine performance, and distributing neurons across layers can improve diversity.

On multi-D input ($d \geq 2$): The first factor dominates. A first layer with width $w_1$ projects $d$-dimensional input into $\mathbb{R}^{w_1}$. If $w_1 < d$, dimensions are irreversibly lost — no subsequent layer can recover them. If $w_1 = d$, the representation is barely adequate. Only $w_1 \gg d$ gives the first layer room to create useful features.

With $N = 16$ and $d = 2$, a depth-5 shape like 3->4->2->2->5 starts with a $2 \rightarrow 3$ projection — already compressing the input. Then $3 \rightarrow 4$ barely recovers. Meanwhile, depth-1 [16] does $2 \rightarrow 16$ — an 8x expansion that creates 16 diverse features from just 2 inputs.
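The irreversibility is a rank argument: any two inputs that differ by a null-space vector of the first layer's weight matrix are indistinguishable to every later layer. A sketch with a hypothetical 2->1->4->6->1 net (arbitrary random weights, chosen only to illustrate the width-1 first layer):

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0)

# 2 -> 1 -> 4 -> 6 -> 1 net: the width-1 first layer projects R^2 onto a line.
w1 = rng.normal(size=(1, 2)); b1 = rng.normal(size=1)
W2 = rng.normal(size=(4, 1)); b2 = rng.normal(size=4)
W3 = rng.normal(size=(6, 4)); b3 = rng.normal(size=6)
w4 = rng.normal(size=(1, 6)); b4 = rng.normal(size=1)

def f(x):
    h = relu(w1 @ x + b1)
    h = relu(W2 @ h + b2)
    h = relu(W3 @ h + b3)
    return (w4 @ h + b4)[0]

# Any two inputs differing by a null-space vector of w1 collapse to the same
# pre-activation -- even though a target like x1*x2 gives them different labels.
null = np.array([-w1[0, 1], w1[0, 0]])   # w1 @ null == 0
x = np.array([1.0, 2.0])
print(f(x) - f(x + 3 * null))            # ~0: the lost dimension never comes back
```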

6.5 The Constraint: When Can Depth Help?

For depth to help on $d$-dimensional input, the first layer needs width $\geq d$ (and ideally $\gg d$), leaving at most $N - d$ neurons for all remaining layers. With $N = 16$ and $d = 10$, this means at most 6 neurons for depth — not enough for any depth advantage.

The dimension sweep confirms: on $y = \frac{1}{d}\sum x_i^2$, the only deep shape that ever beats depth-1 is 8->8 at $d = 5$ (1.15x advantage). At $d = 10$, depth-1 wins by 3.4x. Deeper shapes (depth $\geq 3$) never cross break-even at any input dimension tested.

Depth is a luxury that requires scale. With $N \gg d^2$, you can afford a wide first layer and have neurons remaining for depth. This explains the empirical observation that depth helps at practical scale: GPT-2 has 768 hidden dimensions on a ~50K token vocabulary; ResNet has 64 channels on 3-channel input. At those width-to-input ratios, the input capture bottleneck never binds.

6.6 Dead Neurons Increase on Multi-D

Narrow intermediate layers are even more lethal on multi-D input:

| Shape | Dead on 1D $x^2$ | Dead on 2D $x_1 \cdot x_2$ |
|---|---|---|
| [16] | 2 | 0 |
| 5->7->4 | 3 | 6 |
| 4->4->4->4 | 4 | 7 |
| 5->1->4->6 | 2 | 10 |

5->1->4->6 has 10/16 neurons dead on the 2D interaction target — 63% of the network is non-functional. The width-1 bottleneck after the first layer projects a 2D manifold through a scalar, destroying all directional information.


7. Dead Neurons: Mechanism and Consequences

7.1 The Gradient Access Mechanism

Dead ReLU neurons (pre-activation $\leq 0$ for all inputs) receive zero gradient and cannot recover. We isolated the mechanism by comparing four activations on $y = x^2$:

| Activation | Normalized loss | vs ReLU |
|---|---|---|
| ReLU | 1.19e-3 | baseline |
| LeakyReLU (0.01) | 5.30e-4 | 2.2x better |
| GELU | 5.47e-4 | 2.2x better |
| Tanh | 4.15e-4 | 2.9x better |

LeakyReLU with slope 0.01 is functionally near-identical to ReLU (the negative slope is negligible) but outperforms it 2.2x. The reason: gradient still flows through negative pre-activations. The dead neuron problem is about gradient access, not activation shape.
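A hand-computed backward pass makes the mechanism concrete (a single, deliberately dead neuron; the only change between the two cases is the activation's negative-side slope):

```python
import numpy as np

x = np.linspace(-2, 2, 401)
y = x**2                          # target

# One hidden neuron that is dead: pre-activation negative on the whole input range.
w1, b1, w2 = 0.5, -3.0, 1.0       # pre = 0.5*x - 3 < 0 on [-2, 2]

def loss_grad_w1(neg_slope):
    """d(MSE)/d(w1) via the chain rule, for an activation with the given negative slope."""
    pre = w1 * x + b1
    act = np.where(pre > 0, pre, neg_slope * pre)
    resid = w2 * act - y
    dact = np.where(pre > 0, 1.0, neg_slope)
    return np.mean(2 * resid * w2 * dact * x)

print(loss_grad_w1(0.0))          # ReLU: exactly 0 -- the neuron can never recover
print(loss_grad_w1(0.01))         # LeakyReLU(0.01): nonzero, gradient still flows
```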

7.2 Revival and Re-Death

Manually reviving dead neurons (reinitializing their weights) and continuing training yields 11-20% improvement over a control (continued training without revival). However, the revived neurons re-die within 2000 steps. The loss landscape actively drives neurons to death — it is not merely an initialization problem.

7.3 Effective Capacity

Dead neurons contribute exactly zero to the output function. LeakyReLU-16 (16 effective neurons) matches ReLU-20 (15 effective, 5 dead) in performance. This establishes that effective neurons = nominal minus dead is the relevant capacity measure.

Counterintuitively, ReLU-16 underperforms ReLU-12 on $y = x^2$ despite having more effective neurons (11 vs 10). We attribute this to the dying process damaging solution quality for surviving neurons — a form of training instability that increases with network size.

7.4 Dead Neurons in Deep Networks

In 1->8->8->1, Layer 1 has 0/8 dead neurons while Layer 2 has 3/8 dead. This is not cascading failure — Layer 2 is intrinsically more death-prone because it receives non-negative ReLU outputs as input, making pre-activations more likely to be entirely negative.

Each dead L2 neuron wastes 10 parameters (8 input weights + 1 bias + 1 output weight) compared to 3 per dead neuron in a shallow network (1 input weight + 1 bias + 1 output weight). Depth amplifies the dead neuron tax.


8. Training-Free Function Construction

8.1 The Jacobian and Its Ill-Conditioning

For a converged network $f_\theta$, the Jacobian $J = \partial f / \partial \theta \in \mathbb{R}^{n \times p}$ (n = evaluation points, p = parameters) maps weight perturbations to output changes. For a 1->16->1 network evaluated on 1000 grid points, $J$ is $1000 \times 49$.

The SVD of $J$ reveals 4+ orders of magnitude in singular values (198 to 0.00001), with 26 of 49 values being significant. The top singular vectors are not global properties (slope, intercept) but individual-neuron kink patterns — each affects a localized region of the input space. Slope and intercept are emergent collective properties that no single SVD direction captures.

8.2 Naive Pseudoinverse Failure

To change the function by $\Delta f$ (e.g., from $2x + 1$ to $2.1x + 1.05$), the naive approach solves $J \Delta\theta = \Delta f$ via pseudoinverse: $\Delta\theta = J^\dagger \Delta f$.

This fails catastrophically:

  • Condition number: 25,000,000
  • Maximum amplification: 130,000x
  • For the tiny change $2x+1 \rightarrow 2.1x + 1.05$: $|\Delta\theta| = 59$ vs $|\theta| = 4.2$ (14x the model size)

The ill-conditioning comes from the near-null singular vectors — kink patterns that barely affect the output but receive enormous weight changes through the pseudoinverse.
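The contrast can be reproduced on a random single-hidden-layer net with an analytic Jacobian (a sketch, not the paper's code; the exact per-neuron scaling symmetries from Section 3.5 guarantee null directions in the full system, so its condition number blows up regardless of training):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 1000, 16
x = np.linspace(-3, 3, n)

# Random 1 -> 16 -> 1 ReLU net and its analytic Jacobian df/dtheta (n x 49).
w1, b1, w2 = rng.normal(size=k), rng.normal(size=k), rng.normal(size=k)
pre = np.outer(x, w1) + b1
on = (pre > 0).astype(float)
J = np.hstack([w2 * on * x[:, None],   # df/dw1
               w2 * on,                # df/db1
               np.maximum(pre, 0),     # df/dw2
               np.ones((n, 1))])       # df/db2

# Full-grid system: each neuron's scaling symmetry is an exact null direction.
cond_full = np.linalg.cond(J)

# Targeted system: project onto the target's 2 DOF -- the slope and intercept
# of the least-squares line through f on the grid -- giving a 2 x 49 system.
X = np.column_stack([x, np.ones(n)])
M = np.linalg.inv(X.T @ X) @ X.T       # (2, n): f |-> (slope, intercept)
cond_targeted = np.linalg.cond(M @ J)

print(f"{cond_full:.1e}  {cond_targeted:.1f}")   # many orders of magnitude apart
```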

8.3 Targeted 2-DOF Construction

The key insight: the target function $y = ax + b$ has only 2 degrees of freedom (slope and intercept), not 1000 (the evaluation grid). We project the desired change onto these 2 properties and solve only the 2x49 system:

$$\begin{bmatrix} \partial\,\text{slope}/\partial\theta \\ \partial\,\text{intercept}/\partial\theta \end{bmatrix} \Delta\theta = \begin{bmatrix} \Delta\,\text{slope} \\ \Delta\,\text{intercept} \end{bmatrix}$$

This system has condition number 2.4 — ten million times better conditioned than the full system. Single-step construction works accurately within approximately $\pm 3$ units in (slope, intercept) space.

8.4 Iterative Convergence

Beyond the linearization radius (~3 units), ReLU neurons change their on/off patterns, invalidating the Jacobian. We apply iterative refinement: compute the Jacobian at the current weights, solve the targeted 2-DOF system, update weights, repeat.

| Target | Iterations | Final slope error | Final intercept error |
|---|---|---|---|
| $3x + 2$ | 3 | 0.0002 | 0.0002 |
| $5x + 3$ | 4 | <0.0001 | <0.0001 |
| $-x + 4$ | 5 | 0.0009 | 0.0020 |
| $10x - 5$ | 5 | <0.0001 | <0.0001 |
| $-5x + 10$ | 7 | 0.0001 | 0.0001 |

Convergence is exponential (1-2 orders of magnitude per iteration, matching Newton's method rate). Total cost: 3-7 autograd calls + 3-7 matrix multiplies. No loss function, no optimizer, no learning rate — purely linear algebra.

This construction does not optimize the function approximation error (which is limited by the ReLU piecewise-linear structure); it reaches the closest representable linear function to the target, moving directly in function space without the overhead of gradient descent's manifold exploration.
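A compact numpy sketch of the whole loop (assumptions: a random 1->16->1 carrier net, slope and intercept read off via least squares on a grid, minimum-norm Newton steps via `lstsq`; this mirrors the method's structure, not the paper's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 16
x = np.linspace(-3, 3, n)
X = np.column_stack([x, np.ones(n)])
M = np.linalg.inv(X.T @ X) @ X.T          # f |-> (slope, intercept) of its LS line

w1, b1 = rng.normal(size=k), rng.normal(size=k)
w2, b2 = rng.normal(size=k), rng.normal(size=1)

def net_and_jac():
    """Current output on the grid and analytic Jacobian df/dtheta (n x 49)."""
    pre = np.outer(x, w1) + b1
    on = (pre > 0).astype(float)
    f = np.maximum(pre, 0) @ w2 + b2
    J = np.hstack([w2 * on * x[:, None], w2 * on,
                   np.maximum(pre, 0), np.ones((n, 1))])
    return f, J

target = np.array([3.0, 2.0])             # aim for y = 3x + 2
for it in range(20):
    f, J = net_and_jac()
    resid = target - M @ f                # remaining (slope, intercept) change
    if np.abs(resid).max() < 1e-6:
        break
    A = M @ J                             # 2 x 49 targeted system
    step, *_ = np.linalg.lstsq(A, resid, rcond=None)   # minimum-norm step
    w1 += step[:k]; b1 += step[k:2*k]
    w2 += step[2*k:3*k]; b2 += step[3*k:]

print(it, np.abs(resid).max())            # converges in a handful of relinearizations
```

Each pass relinearizes at the new weights, exactly as described above; no loss function or optimizer appears anywhere.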


9. Discussion

9.1 Relationship to Existing Literature

Depth separation theory. Montufar et al. (2014) showed that deep ReLU networks can represent $O\left(\left(\frac{n}{d}\right)^{(L-1)d} \cdot n^d\right)$ linear regions, growing exponentially in depth $L$ for fixed width $n$ and input dimension $d$. At $d = 1$, this collapses to polynomial growth — consistent with our finding that depth advantage is modest on 1D. Our contribution is showing that at fixed neuron budget (not fixed width), even the polynomial advantage vanishes on multi-D because the first layer cannot be kept wide while also allocating neurons to depth.

Minimum width theorems. Park et al. (2021) proved that universal approximation requires width $\geq \max(d+1, d_y)$. Hanin & Sellke (2017) proved the minimum width for UAP is exactly $d+1$. Our experiments confirm this operationally: shapes with any layer of width $< d$ on $d$-dimensional input fail catastrophically (10/16 dead neurons for width-1 bottleneck on 2D input).

Wide Residual Networks. Zagoruyko & Komodakis (2016) showed that 16-layer-wide WRN outperforms 1000-layer-deep ResNet at matched parameters on CIFAR-10. This is the large-scale analog of our finding: when parameters are matched, width often beats depth.

Eigenspectrum structure. Sagun et al. (2017) identified bulk + outlier structure in the Hessian of practical networks. Our complete eigendecomposition at minimal scale shows this is a fundamental property: it emerges already at 16 neurons, with exact correspondence between outlier eigenvalues and the target's DOF.

Loss of plasticity. Lyle et al. (2023) identified loss of plasticity in continual learning, partly attributable to dead neurons. Our finding that the loss landscape actively kills neurons (not just initialization) and that revival is temporary provides a mechanistic basis for this phenomenon.

9.2 What Is Novel

The individual pieces — minimum width, depth separation, permutation symmetry, dead neurons — are known. Our contributions are:

  1. The complete parameter space anatomy from k=0 to k=16. No prior work constructs the eigenspectrum neuron-by-neuron and identifies all four tiers with their physical interpretations. The k=0 result (parameter space = function space with Hessian = 2I) is to our knowledge the first explicit demonstration of a zero-redundancy neural architecture.

  2. The eigenspectrum phase transition during training. The collapse from 27 to 4 steep eigenvalues between steps 50-100 has not, to our knowledge, been previously documented. It provides a precise, measurable criterion for "when a network has learned" — distinct from and earlier than loss convergence.

  3. Exhaustive architecture enumeration at fixed neuron budget. NAS-Bench (Ying et al., 2019) does something analogous for convolutional networks on CIFAR, but not for the simplest possible case (fully-connected, 16 neurons, 1D regression). The 575,000x performance range and the two-factor mechanism are new.

  4. The three-factor mechanism unifying width-depth tradeoffs across dimensions. The literature identifies the individual factors (minimum width theorems for input capture, bottleneck effects for diversity, output layer effects for basis count) but does not compose them into a single product model. This model makes testable predictions: depth should help when $N \gg d^2$, and the transition point should scale as $N \sim 4d^2$.

  5. Clean demonstration of depth advantage reversal between d=1 and d=2. No prior work shows this as a controlled experiment at minimal scale. The reversal from 2.1x better to 5x worse is a striking quantitative result that illustrates the constraint $N \gg d^2$ in the sharpest possible way.

  6. Iterative targeted construction. The 2-DOF reformulation reducing condition number from 25 million to 2.4, and its iterative extension achieving any target in 3-7 steps, is a new training-free construction method. While related to the NTK framework's linearization, it differs in projecting onto the target's DOF rather than working in the full output space.
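
As a toy illustration of contribution 4, the product model can be written as a scoring function. The concrete proxies below (first-layer width over $2d$ for input capture, minimum width over $d$ for basis diversity, last-layer width for basis count) are our own illustrative choices, not the paper's fitted model:

```python
# Hypothetical scoring function for the three-factor product model:
# score = input_capture * basis_diversity * basis_count.
# The proxies and clipping below are illustrative assumptions.

def three_factor_score(widths, d):
    w1, wL = widths[0], widths[-1]
    input_capture = min(w1 / (2 * d), 1.0)         # first layer must see the input
    interior = widths[:-1] if len(widths) > 1 else widths
    basis_diversity = min(min(interior) / d, 1.0)  # bottlenecks kill diversity
    basis_count = wL                               # bases summed by the output layer
    return input_capture * basis_diversity * basis_count

d = 2
for name, shape in {"wide": [16], "pyramid": [8, 4, 4], "tower": [1] * 16}.items():
    print(name, round(three_factor_score(shape, d), 3))
```

With these proxies the wide shape dominates the deep tower at $d = 2$, matching the reversal described in contribution 5.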

9.3 Limitations

Scale. Our results are established for networks with 2-97 parameters. Whether the four-tier Hessian structure, the phase transition, and the three-factor mechanism hold at practical scale (millions of parameters) requires verification. The full Hessian is intractable beyond ~100 parameters; stochastic approximations would be needed.
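
Such a stochastic approximation can be sketched matrix-free: power iteration on Hessian-vector products needs gradients only, never the full Hessian. Everything below — the one-hidden-neuron toy net, the quadratic target, finite-difference gradients — is an illustrative assumption for this sketch, not the paper's experimental setup:

```python
import numpy as np

# Matrix-free estimate of the top Hessian eigenvalue of a tiny MSE loss,
# via power iteration on Hessian-vector products (central differences of
# the gradient). The toy net and data are illustrative assumptions.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
y = X ** 2

def loss(p):
    w, b, v, c = p
    h = np.maximum(w * X + b, 0.0)          # one ReLU hidden unit
    return np.mean((v * h + c - y) ** 2)

def grad(p, eps=1e-5):
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = eps
        g[i] = (loss(p + e) - loss(p - e)) / (2 * eps)
    return g

def hvp(p, u, eps=1e-4):                    # H @ u without forming H
    return (grad(p + eps * u) - grad(p - eps * u)) / (2 * eps)

p = rng.normal(size=4)
u = rng.normal(size=4); u /= np.linalg.norm(u)
for _ in range(100):                        # power iteration -> dominant eigenpair
    Hu = hvp(p, u)
    lam = u @ Hu
    u = Hu / np.linalg.norm(Hu)
print(f"estimated dominant Hessian eigenvalue: {lam:.4f}")
```

At practical scale the same idea is used with autodiff Hessian-vector products and Lanczos instead of plain power iteration.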

Targets. We study univariate and low-dimensional regression with smooth polynomial targets. Image classification, language modeling, and other practical tasks involve high-dimensional structured data where the relevant "input dimension" may be much lower than the ambient dimension.

Optimizer. All experiments use Adam. SGD, SGD with momentum, and other optimizers may produce different manifold exploration patterns and dead neuron dynamics.

Activation. Our architecture analysis (Sections 5-6) uses only ReLU. The two-factor and three-factor mechanisms rely on piecewise-linear geometry (breakpoints, kinks). They may not directly apply to smooth activations (Tanh, GELU), though the input capture factor (first width $\geq d$) likely remains relevant.

Parameter vs. neuron budget. Our architecture comparisons fix the neuron count at 16, but deeper networks have more parameters due to inter-layer connections. A parameter-matched comparison would be more controlled but would conflate architecture effects with scale effects (different numbers of neurons).

9.4 Implications for Practice

  1. Early stopping by eigenspectrum. The phase transition at steps 50-100 suggests that monitoring the Hessian's steep-eigenvalue count could provide a principled early stopping criterion — stop when the count stabilizes, not when the loss plateaus. The remaining 97% of training time is manifold exploration, not functional improvement.

  2. Architecture design as constraint satisfaction. The three-factor mechanism provides a design rule: ensure first layer width $\geq 2d$ (input capture), minimum interior width $\geq d$ (basis diversity), and maximize last layer width (basis count). This is consistent with practical architectures like ResNet (initial 7x7 conv expanding to 64 channels from 3-channel input, maintaining width throughout, wide final layers before classification).

  3. Dead neuron monitoring. Effective capacity = nominal minus dead. LeakyReLU is a free lunch: slope 0.01 is functionally identical to ReLU but prevents gradient death, yielding 2.2x improvement at zero cost.

  4. Depth requires scale. Do not add depth without width. The constraint $N \gg d^2$ provides a rough guideline: for $d$-dimensional input, ensure at least $4d^2$ neurons before considering depth > 1.
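
The design rules above can be sketched as a checklist. The thresholds ($2d$, $d$, $4d^2$) follow the text; the function name and return format are our own:

```python
# Hedged checklist for the design rules above: first width >= 2d,
# every layer width >= d, and at least 4*d*d neurons before depth > 1.
# Helper name and return format are illustrative, not from the paper.

def check_architecture(widths, d):
    issues = []
    if widths[0] < 2 * d:
        issues.append("first layer < 2d: weak input capture")
    if any(w < d for w in widths):
        issues.append("some layer < d: basis diversity bottleneck")
    if len(widths) > 1 and sum(widths) < 4 * d * d:
        issues.append("N < 4d^2: depth not yet affordable")
    return issues or ["ok"]

print(check_architecture([16], d=2))          # wide: passes
print(check_architecture([4, 4, 4, 4], d=3))  # deep at small budget: flagged
```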


10. Conclusion

By studying neural networks at the smallest scale where they exhibit nontrivial behavior, we have uncovered structural properties that are difficult to observe at practical scale. The parameter space of even a 16-neuron ReLU network has rich geometric structure: four-tiered Hessian eigenspectrum, globally flat solution manifolds, and a sharp phase transition during training that separates functional learning (3% of steps) from manifold exploration (97%).

The exhaustive enumeration of all 3,662 architectures distributing 16 neurons across 1-16 layers reveals that architecture choice spans 575,000x in performance, governed by a simple mechanistic model: performance equals the product of basis count, basis diversity, and (on multi-dimensional input) input capture capacity. This model unifies minimum-width theorems, depth separation results, and the empirical success of wide residual networks under a single framework.

The depth-width tradeoff is not a simple "deeper is better" or "wider is better" dichotomy. It depends on the ratio of neuron budget to input dimension. Depth is a luxury — it requires sufficient scale ($N \gg d^2$) to afford both adequate input capture and meaningful intermediate processing. At the scale of modern networks (GPT-2 embeds tokens in 768 dimensions; image networks expand 3 input channels to 64 in the first layer), this constraint is trivially satisfied, which is why depth helps in practice. At 16 neurons on 2D input, it is binding, which is why depth hurts.

These findings demonstrate the value of first-principles investigation: by controlling the variables that practical-scale experiments conflate, we isolate causal mechanisms that explain large-scale phenomena.


References

Dandi, Y., Pesce, L., et al. (2025). The Computational Advantage of Depth. NeurIPS 2025.

Eldan, R. & Shamir, O. (2016). The Power of Depth for Feedforward Neural Networks. COLT 2016.

Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. (2022). The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. ICLR 2022.

Hanin, B. & Sellke, M. (2017). Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv:1710.11278.

Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The Expressive Power of Neural Networks: A View from the Width. NeurIPS 2017.

Lyle, C., Zheng, Z., Nikishin, E., et al. (2023). Understanding Plasticity in Neural Networks. ICML 2023.

Montufar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. NeurIPS 2014.

Park, S., Yun, C., Lee, J., & Shin, J. (2021). Minimum Width for Universal Approximation. ICLR 2021.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., & Sohl-Dickstein, J. (2017). On the Expressive Power of Deep Neural Networks. ICML 2017.

Safran, I. & Shamir, O. (2017). Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks. ICML 2017.

Sagun, L., Evci, U., Guney, V.U., Dauphin, Y., & Bottou, L. (2017). Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv:1706.04454.

Telgarsky, M. (2016). Benefits of Depth in Neural Networks. COLT 2016.

Vardi, G., Yehudai, G., & Shamir, O. (2022). Width is Less Important than Depth in ReLU Neural Networks. COLT 2022.

Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94, 103-114.

Ying, C., Klein, A., Christiansen, E., Real, E., Murphy, K., & Hutter, F. (2019). NAS-Bench-101: Towards Reproducible Neural Architecture Search. ICML 2019.

Zagoruyko, S. & Komodakis, N. (2016). Wide Residual Networks. BMVC 2016.
