The Anatomy of Small Neural Networks: Parameter Space Structure, Architecture Selection, and the Depth-Width Tradeoff
We present a systematic empirical investigation of feedforward ReLU networks at minimal scale (1-16 hidden neurons), revealing structural properties of parameter space, training dynamics, and architecture selection that are obscured at practical scale. Through 11 controlled experiments totaling over 10,000 training runs, we establish several novel findings: (1) the Hessian eigenspectrum of a trained network decomposes into exactly four tiers — steep, moderate, weak, and zero — whose counts are predictable from architecture alone, with the steep tier recovering the target function's degrees of freedom; (2) this structure emerges during training via a sharp phase transition (27 steep eigenvalues collapse to 4 between steps 50-100 for a 16-neuron network), after which 97% of training time is spent exploring the solution manifold rather than improving the function; (3) exhaustive enumeration of all 3,662 architectures distributing 16 neurons across 1-16 layers reveals a 575,000x performance range, governed by a two-factor mechanism (basis count x basis diversity); (4) this mechanism extends to a three-factor model on multi-dimensional input (input capture x basis diversity x basis count), explaining why depth advantage reverses from 2.1x better to 5x worse between 1D and 2D at fixed neuron budget; (5) iterative targeted construction — a training-free method using Jacobian projections — converges to any target function in 3-7 iterations with <0.002 error. These results unify minimum-width theorems, depth separation results, and the wide residual network phenomenon under a single mechanistic framework grounded in the piecewise-linear geometry of ReLU networks.
The practical success of deep neural networks has outpaced our understanding of their internal structure. Questions as basic as "why does this architecture work better than that one?" remain answered primarily by empirical benchmarking at scale, where the interaction of optimizer, regularization, data augmentation, and architecture makes causal isolation difficult.
We take the opposite approach. By studying the smallest networks that exhibit nontrivial behavior — single hidden layers with 1 to 16 ReLU neurons, and multi-layer networks distributing 16 neurons across 1 to 16 layers — we can:
- Compute the full Hessian eigendecomposition (not stochastic approximations)
- Enumerate all possible architectures (not sample from a search space)
- Train thousands of models to convergence in minutes (not days)
- Track every neuron's alive/dead status across training
- Isolate single variables (optimizer, activation, architecture) with controlled experiments
This paper reports the findings of 11 experiments conducted in sequence, each building on the previous. We organize the results thematically rather than chronologically, focusing on what is novel relative to existing literature.
Our contributions fall into five categories:
I. Parameter space anatomy. We construct the complete eigenspectrum of the loss Hessian for networks with k=0 to k=16 hidden neurons, showing how each added neuron inflates the null-space-adjacent region. At k=0 (no hidden layer), parameter space is function space — both Hessian eigenvalues equal 2.0 for standard normal data, giving a perfectly circular loss landscape with zero redundancy. At k=16, only 8% of parameter dimensions are functionally relevant; the remaining 92% encode symmetries (permutation, scaling), dead neuron parameters, and kink redistributions.
II. Training dynamics through the eigenspectrum. We track the Hessian at 8 checkpoints during training and discover a sharp phase transition: a 16-neuron network initializes with 27/49 steep eigenvalues (>0.1) which collapse to 4 by step 100. This transition marks the boundary between Phase 1 (function alignment, 3% of training) and Phase 2 (manifold exploration, 97% of training). Function error converges 100-1000x faster than weight distance to the final solution.
III. Exhaustive architecture enumeration. We train all 3,662 ways to distribute 16 neurons across 1-16 layers (all integer compositions of 16, depths 1-5 complete + sampling at higher depths). Performance spans 575,000x on y=2x+1 — a trivial problem — governed by a two-factor mechanism: performance = basis count (last layer width) x basis diversity (bottleneck width). This predicts the optimal shape from first principles without training.
IV. Depth advantage reversal on multi-dimensional input. The two-factor mechanism extends to three factors on multi-D input: input capture (first width >= d) x basis diversity x basis count. On 1D, the first factor is trivially satisfied, so depth helps by increasing diversity. On 2D with 16 neurons, the first factor dominates — every neuron spent on depth is a neuron stolen from input capture. Depth-5 goes from 2.1x better (1D) to 5x worse (2D). This explains why depth requires scale (GPT's 768-wide layers, ResNet's 64 channels expanded from 3-channel input).
V. Training-free function construction. The naive Jacobian pseudoinverse has condition number 25 million, but projecting onto the target's 2-DOF subspace yields condition number 2.4. Iterative application (relinearize + solve) converges to any target in 3-7 steps with no loss function and no optimizer — purely linear algebra, with convergence rate matching Newton's method.
All experiments share a common framework.
The base architecture is a fully-connected feedforward network with ReLU activations:

$$f(x) = W_{L+1}\,\sigma(W_L \cdots \sigma(W_1 x + b_1) \cdots + b_L) + b_{L+1},$$

where $\sigma(z) = \max(0, z)$ is applied elementwise and $L$ is the number of hidden layers.

We parameterize architectures by their hidden layer widths. A "shape" $[w_1, \ldots, w_L]$ denotes a network with $L$ hidden layers of widths $w_1, \ldots, w_L$; for example, 5->7->4 is a depth-3 network.
- Optimizer: Adam (lr=0.01, default betas)
- Batch size: 64, sampled fresh each step from $\mathcal{N}(0, 1)^d$
- Loss: Mean squared error
- Steps: 2000 (experiments 001-010) or 3000 (experiment 011)
- Precision: float32 (Apple Silicon / MPS)
- Initialization: PyTorch defaults (Kaiming uniform)
- No regularization: No dropout, weight decay, batch norm, or learning rate schedule
- Dead neuron detection: A neuron is "dead" if its pre-activation is non-positive for all samples in a batch of 1000. Batch-level detection (n=64) overestimates by ~40%.
- Breakpoint counting: Number of distinct linear regions in the piecewise-linear output function, computed on a fine grid.
- Hessian: Full eigendecomposition of the $p \times p$ loss Hessian via `torch.autograd.functional.hessian()`. Feasible for $p \leq 100$.
- Functional fraction: Proportion of Hessian eigenvalues in the steep tier (> 0.1), measuring what fraction of parameters control the output function.
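The two behavioral diagnostics can be sketched framework-agnostically. A minimal numpy version (helper names are ours, not from the paper's code) that mirrors the definitions above:

```python
import numpy as np

def dead_neurons(pre_acts):
    """Indices of dead neurons: pre-activation non-positive on every
    sample. pre_acts: (n_samples, n_neurons). Use a large batch
    (n=1000 above); small batches overestimate deaths."""
    return np.flatnonzero(pre_acts.max(axis=0) <= 0)

def count_linear_regions(f, lo=-3.0, hi=3.0, n=10001, tol=1e-6):
    """Estimate the number of linear pieces of a scalar piecewise-linear
    function by counting slope changes on a fine grid. A kink falling
    strictly between grid points can add one spurious region."""
    xs = np.linspace(lo, hi, n)
    slopes = np.diff(f(xs)) / np.diff(xs)
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > tol))

print(count_linear_regions(np.abs))               # 2: one kink at the origin
print(count_linear_regions(lambda x: 2 * x + 1))  # 1: no kinks
```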
| Target | Formula | Class | Properties |
|---|---|---|---|
| Linear | $y = 2x + 1$ | In-class | Exactly representable by ReLU net |
| Absolute | $y = \lvert x \rvert$ | In-class | Exactly representable (two ReLUs) |
| Quadratic | $y = x^2$ | Out-of-class | Smooth, requires approximation |
| 2D quadratic | $y = x_1^2 + x_2^2$ | Out-of-class | Radially symmetric |
| 2D interaction | $y = x_1 x_2$ | Out-of-class | Requires feature combination |
For a trained 1->16->1 ReLU network on $y = 2x + 1$, the 49 eigenvalues of the loss Hessian fall into four tiers:
| Tier | Eigenvalue range | Count | Interpretation |
|---|---|---|---|
| Steep | > 0.1 | 3-4 | Functional DOF (slope, intercept, kink structure) |
| Moderate | | 5-8 | ReLU kink position adjustments |
| Weak | | 23-24 | Scaling symmetry, kink redistribution |
| Zero | ≈ 0 | 13-16 | Dead neuron parameters |
The top eigenvalues correspond to the target's functional degrees of freedom: for $y = 2x + 1$, the slope and intercept. This is verified by computing the Jacobian of the output function with respect to the parameters and checking that the steep eigenvectors lie in its row space.
We constructed the eigenspectrum for every architecture from k=0 (no hidden layer, 2 parameters) to k=16 (49 parameters). This reveals how redundancy enters the parameter space neuron by neuron.
k=0: Parameter space IS function space. The 2x2 Hessian has eigenvalues [2.012, 2.000] — essentially $2I$, the exact theoretical value for MSE on standard normal inputs: a perfectly circular loss landscape with zero redundancy.
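The $2I$ claim is easy to check by Monte Carlo. For $f(x) = wx + b$ under MSE, the Hessian of the loss is independent of $(w, b)$: $H = 2\begin{pmatrix} \mathbb{E}[x^2] & \mathbb{E}[x] \\ \mathbb{E}[x] & 1 \end{pmatrix}$, which is $2I$ for standard normal inputs. A minimal numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)

# For f(x) = w*x + b under MSE loss, the Hessian does not depend on
# the parameters: H = 2 * [[E[x^2], E[x]], [E[x], 1]].
H = 2 * np.array([[np.mean(x ** 2), np.mean(x)],
                  [np.mean(x),      1.0       ]])
print(np.linalg.eigvalsh(H))  # both eigenvalues near 2.0 for N(0, 1) inputs
```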
k=1: The first symmetry. The 4-dimensional space decomposes as: 2 steep (function DOF) + 1 moderate + 1 near-zero. The near-zero direction has cosine similarity 1.0000 with the analytical scaling symmetry direction $(w, b, -v)$ — the tangent at $\alpha = 1$ of the transformation $(w, b, v) \mapsto (\alpha w, \alpha b, v/\alpha)$, which leaves the function unchanged.
k=2 to k=16: Systematic inflation. Each neuron adds approximately:
- 0 steep directions (the function's DOF is independent of network size)
- ~1 kink-position direction (moderate if enough neurons contribute, weak otherwise)
- ~1 scaling symmetry direction (weak)
- ~1 kink redistribution direction (weak)
The result: functional fraction drops as approximately $4/p$, where $p = 3k + 1$ is the parameter count — about 8% at k=16.
Walking along Hessian eigenvectors from a converged solution:

- Steep direction: Loss rises to 1-10 within $\pm 1$ unit
- Moderate direction: Loss rises to $\sim 0.01$ within $\pm 1$ unit
- Weak direction: Loss stays below $10^{-4}$ for $\pm 3$ units
- Zero direction (dead neurons): Loss stays below $10^{-5}$ for arbitrary distance
The weak-direction flatness extends over distances exceeding 60% of the model's total weight norm. The solution manifold is not just locally tangent — it is a globally flat subspace whose dimension equals the combined weak and zero tier counts, roughly 37-40 of the 49 parameter dimensions.
With 16 hidden neurons, every solution has $16! \approx 2 \times 10^{13}$ permutation-equivalent copies. PCA of the weight vectors of independently trained runs shows:
- Raw PCA: 28 components for 95% variance. Top 2 PCs explain only 7.5% and 7.1% — near-isotropic.
- After neuron alignment (greedy matching by activation correlation): 27 components for 95% variance, but total variance drops 52.9%.
Permutation symmetry inflates the solution cloud uniformly in all directions (volume effect) rather than adding specific directions of variation (dimensionality effect). This is why PCA of neural network weight matrices is near-isotropic despite the underlying solution manifold having rich structure.
Mode connectivity. Before alignment, linear interpolation between two independently trained models encounters a loss barrier of 2.12. After neuron alignment, the barrier drops to 0.088 — a 95.8% reduction. Models that appear to live in different basins are actually in the same basin, modulo neuron relabeling. This confirms the results of Entezari et al. (2022) at minimal scale where the claim can be verified exhaustively.
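A minimal numpy illustration of the alignment effect, using an exactly permuted copy of one network rather than two independent runs (so the aligned barrier drops to zero exactly), and greedy nearest-activation matching in place of the correlation matching described above:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 16
W1, b1, v = rng.normal(size=k), rng.normal(size=k), rng.normal(size=k)
xs = np.linspace(-3, 3, 401)

def net(W, b, vout):
    return np.maximum(0, np.outer(xs, W) + b) @ vout

y_ref = net(W1, b1, v)                 # reference function (model A)

perm = rng.permutation(k)              # model B: A with relabeled neurons
W1b, b1b, vb = W1[perm], b1[perm], v[perm]

def barrier(Wb, bb, vob):
    """Max MSE (vs. y_ref) along linear interpolation from A to B.
    Both endpoints compute the identical function, so endpoint loss is 0."""
    return max(np.mean((net((1-t)*W1 + t*Wb, (1-t)*b1 + t*bb,
                            (1-t)*v + t*vob) - y_ref) ** 2)
               for t in np.linspace(0, 1, 21))

before = barrier(W1b, b1b, vb)

# Align B to A: match each A-neuron to its nearest B-neuron by
# hidden-activation distance (exact copies give distance 0).
acts_a = np.maximum(0, np.outer(xs, W1) + b1)
acts_b = np.maximum(0, np.outer(xs, W1b) + b1b)
D = ((acts_a[:, :, None] - acts_b[:, None, :]) ** 2).sum(axis=0)
match = np.argmin(D, axis=1)
after = barrier(W1b[match], b1b[match], vb[match])

print(before, after)  # barrier vanishes once neurons are relabeled consistently
```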
On nonlinear targets ($y = \lvert x \rvert$, $y = x^2$), alignment produces a comparable barrier reduction.
Beyond permutation, a second symmetry preserves the function: multiplying a neuron's input weight and bias by $\alpha > 0$ while dividing its output weight by $\alpha$ leaves the output unchanged, by the positive homogeneity of ReLU.
The four-tier Hessian structure from Section 3 does not exist at initialization. Tracking the eigenspectrum during training reveals a sharp phase transition:
| Step | Loss | Steep (>0.1) | Moderate | Weak | Zero |
|---|---|---|---|---|---|
| 0 | 22.4 | 27 | 6 | 8 | 8 |
| 25 | 2.21 | 28 | 5 | 6 | 10 |
| 50 | 0.578 | 23 | 9 | 6 | 11 |
| 100 | 0.250 | 4 | 27 | 7 | 11 |
| 200 | 0.021 | 4 | 13 | 23 | 9 |
| 500 | 0.011 | 4 | 27 | 10 | 8 |
| 3000 | 5.7e-5 | 4 | 10 | 27 | 8 |
Between steps 50 and 100, the steep count collapses from 23 to 4. This is the moment the network "finds" its solution — the loss landscape transforms from a generic high-dimensional bowl (many steep directions) to a narrow functional subspace embedded in a vast flat manifold.
The transition is remarkably sharp: over approximately 50 gradient steps (less than 2% of total training time), 19 directions transition from steep to moderate/weak. By step 200, the manifold structure is fully established and does not change qualitatively for the remaining 2800 steps.
This transition divides training into two mechanistically distinct phases:
Phase 1: Function alignment (steps 0-100, ~3% of training). The network's output function converges to near its final form: the loss drops from 22.4 at initialization to 0.25 by step 100, two orders of magnitude.
Phase 2: Manifold exploration (steps 100-3000, ~97% of training). The function is already learned; weights continue moving in weak and moderate directions without changing the output. Movement budget analysis confirms: after step 200, almost all parameter movement is in the weak tier (null-space-adjacent directions).
The movement along zero-eigenvalue directions (dead neuron parameters) follows a pure random walk, accumulating large normalized displacement with no effect on the loss — these parameters are driven purely by gradient noise.
For k=16, function error (measured as RMSE on a held-out grid) converges 100-1000x faster than weight distance to the final solution: the output function is essentially final while the weights are still far from their resting place.
This observation has implications for convergence diagnostics: monitoring loss is sufficient for detecting functional convergence, but weight-space metrics (gradient norms, weight changes) can be highly misleading because they conflate functional progress with manifold exploration.
We trained all ways to distribute 16 neurons across 1-16 layers: 3,662 shapes (every integer composition of 16 at depths 1-5, plus sampled compositions at depths 6-16).
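The exhaustive part of the shape count follows from composition counting: there are $\binom{15}{L-1}$ compositions of 16 into $L$ positive parts, so depths 1-5 contribute 1,941 shapes, leaving 1,721 of the 3,662 to sampling at depths 6-16. A quick check:

```python
from itertools import combinations

def compositions(n, parts):
    """All ordered ways to write n as a sum of `parts` positive integers:
    choose the cut points among the n-1 gaps between units."""
    for cuts in combinations(range(1, n), parts - 1):
        bounds = (0,) + cuts + (n,)
        yield tuple(bounds[i + 1] - bounds[i] for i in range(parts))

counts = {L: sum(1 for _ in compositions(16, L)) for L in range(1, 6)}
print(counts)                 # {1: 1, 2: 15, 3: 105, 4: 455, 5: 1365}
print(sum(counts.values()))   # 1941 exhaustive shapes at depths 1-5
```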
Performance range. On $y = 2x + 1$, final loss spans a 575,000x range between the best and worst shapes.
Depth frontier. Depth is not monotonically bad. The best shape at each depth:
| Depth | Best shape | Params | Loss ($x^2$) | vs depth-1 |
|---|---|---|---|---|
| 1 | [16] | 49 | 1.51 | baseline |
| 2 | [6, 10] | 99 | 0.97 | 1.6x better |
| 3 | [5, 7, 4] | 94 | 0.72 | 2.1x better |
| 4 | [4, 4, 4, 4] | 77 | 1.08 | 1.4x better |
| 5 | [3, 4, 2, 2, 5] | 59 | 0.72 | 2.1x better |
| 10 | — | — | 62.6 | 41x worse |
| 11+ | — | — | 108.0 | 72x worse |
Depth-3 and depth-5 beat depth-1 on $x^2$ by about 2x: moderate depth buys real approximation power, but only while every layer stays wide enough.
Caveat on parameter count. The depth-3 winner (5->7->4) has 94 parameters vs 49 for depth-1 (16). The inter-layer weight matrices contribute the extra parameters, so comparisons at fixed neuron budget are not parameter-matched; deeper shapes can carry up to ~2x the parameters.
At depth 11 with 16 total neurons, at least 6 of the 11 layers must have width 1. A chain of width-1 ReLU layers implements $x \mapsto \max(0,\, w_n \cdot \max(0,\, \cdots \max(0,\, w_1 x + b_1) \cdots) + b_n)$. Each ReLU clips negative values. After enough applications, the output is either constant (the signal was clipped to zero somewhere in the chain) or a single clipped ramp — the network's expressive power is destroyed.
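The clipping argument can be verified deterministically: one sign flip after a ReLU zeroes the signal everywhere, and an all-positive chain of any depth reduces to a single clipped ramp. A numpy sketch:

```python
import numpy as np

def chain(x, layers):
    """Apply a sequence of width-1 ReLU layers: x -> max(0, w*x + b)."""
    for w, b in layers:
        x = np.maximum(0.0, w * x + b)
    return x

xs = np.linspace(-3, 3, 7)

# A single negative weight after a ReLU kills the signal everywhere:
# max(0, x) >= 0, so max(0, -1 * max(0, x)) is identically zero.
dead = chain(xs, [(1.0, 0.0), (-1.0, 0.0)])
print(dead)  # [0. 0. 0. 0. 0. 0. 0.]

# An all-positive chain of any depth is just a single clipped ramp.
ramp = chain(xs, [(1.0, 0.0)] * 10)
print(np.allclose(ramp, np.maximum(0, xs)))  # True
```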
The optimal architecture is governed by two competing factors:
- Basis count (last hidden layer width $w_L$): The final layer before the output produces $w_L$ ReLU basis functions. Each is a hinge at a different position on the input axis. More basis functions = finer piecewise-linear approximation.
- Basis diversity (minimum interior width, i.e., bottleneck): If any intermediate layer has width $w_b$, then the $w_L$ last-layer neurons receive at most $w_b$ distinct input features. If $w_b < w_L$, the basis functions are constrained to be linear combinations of $w_b$ templates — many are redundant copies.
Performance = basis count x basis diversity. This product model explains why:
- 4->3->9 (9 last neurons, bottleneck 3) has loss 57 on $x^2$, while 5->7->4 (4 last neurons, bottleneck 5) has loss 0.72. The 3-neuron bottleneck makes 6 of the 9 last-layer neurons functionally redundant.
- Wide-last shapes beat wide-first: 6->10 (loss 0.97) beats 10->6 (loss 2.98). The last layer directly constructs the output approximation; the first layer merely feeds it.
- Depth-1 [16] is strong because it has bottleneck = 16 and basis count = 16 — no compression anywhere.
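The diversity constraint is a plain rank statement: in a 4->3->9 stack, the 9 last-layer pre-activations are affine in only 3 features, so their sample matrix has rank at most 4 (3 features plus bias) no matter how the weights are set. A numpy check with random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0, z)

x = rng.normal(size=(500, 1))                                   # 1D input
h1 = relu(x @ rng.normal(size=(1, 4)) + rng.normal(size=4))     # width 4
h2 = relu(h1 @ rng.normal(size=(4, 3)) + rng.normal(size=3))    # bottleneck 3
pre3 = h2 @ rng.normal(size=(3, 9)) + rng.normal(size=9)        # 9 pre-activations

# Every column of pre3 lies in span{h2's 3 columns, the constant vector}.
print(np.linalg.matrix_rank(pre3))  # at most 4, despite 9 nominal neurons
```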
Cascading narrow layers are lethal. The shape 10->2->2->2 funnels its 10 first-layer features through repeated width-2 layers: basis diversity is capped at 2, the compression compounds layer by layer, and the resulting output function is nearly trivial.
We verified the causal chain: shape determines bottleneck width, bottleneck width determines the number of distinct breakpoints in the output, and breakpoint count determines loss. The scatter of breakpoints vs. loss on $x^2$ shows the expected monotone relationship: shapes producing more distinct breakpoints achieve lower loss.
The depth separation literature (Montufar et al., 2014; Eldan & Shamir, 2016) shows that deep networks can represent exponentially more linear regions than shallow networks, with the exponent proportional to input dimension $d$.
Hypothesis: Depth advantage should grow with input dimension.
We tested this with 16 neurons on 2D input (targets: $y = x_1^2 + x_2^2$ and $y = x_1 x_2$):
| Depth | 1D ($x^2$) | 2D ($x_1^2 + x_2^2$) |
|---|---|---|
| 1 | 1.0x (baseline) | 1.0x (baseline) |
| 2 | 1.6x better | 1.1x better |
| 3 | 2.0x better | 0.75x (worse) |
| 4 | 2.0x better | 0.33x (worse) |
| 5 | 2.1x better | 0.19x (5x worse) |
On 1D, depth-5 was 2.1x better than depth-1. On 2D, depth-5 is 5x worse. The advantage doesn't merely fail to grow — it inverts.
The interaction target $y = x_1 x_2$ shows the same reversal, and more sharply: it requires combining both inputs in the first layer, so input capture binds even harder.
We computed Pearson correlations between architectural features and final loss across all trained shapes:
| Feature | 1D ($x^2$) | 2D quadratic | 2D interaction |
|---|---|---|---|
| Last width | -0.431 | -0.175 | -0.081 |
| First width | (irrelevant on 1D) | -0.314 | -0.445 |
| Bottleneck | -0.483 | -0.520 | -0.491 |
The importance flips. On 1D, last layer width is the strongest width-position predictor ($r = -0.431$); on the 2D interaction target it becomes nearly irrelevant ($r = -0.081$), while first layer width, meaningless on 1D, becomes the strongest ($r = -0.445$). Bottleneck width is the most stable predictor across all three targets.
On multi-dimensional input, the two-factor model gains a third factor: input capture. The first layer must have width at least $d$, or input directions are projected away before any processing happens.

On 1D input ($d = 1$), any first layer captures the input; the factor is trivially satisfied, and depth spends neurons on basis diversity — depth helps.

On multi-D input ($d \geq 2$), every neuron spent on depth is a neuron stolen from input capture. With 16 neurons at depth 5, layers average ~3 neurons, barely above $d = 2$ and well below a comfortable capture width.

For depth to help on $d$-dimensional input, every layer must stay comfortably wider than $d$, which requires a total budget $N \gg d^2$. The dimension sweep confirms this: at fixed $N = 16$, the depth advantage shrinks as $d$ grows and has fully reversed by $d = 2$.

Depth is a luxury that requires scale. With $N \gg d^2$, depth buys basis diversity; below that threshold, it starves input capture and hurts.
Narrow intermediate layers are even more lethal on multi-D input:
| Shape | Dead on 1D | Dead on 2D |
|---|---|---|
| [16] | 2 | 0 |
| 5->7->4 | 3 | 6 |
| 4->4->4->4 | 4 | 7 |
| 5->1->4->6 | 2 | 10 |
5->1->4->6 has 10/16 neurons dead on the 2D interaction target — 63% of the network is non-functional. The width-1 bottleneck after the first layer projects a 2D manifold through a scalar, destroying all directional information.
Dead ReLU neurons (pre-activation $\leq 0$ on every input) receive no gradient and contribute nothing to the output. Comparing activation functions under otherwise identical training:
| Activation | Normalized Loss | vs ReLU |
|---|---|---|
| ReLU | 1.19e-3 | baseline |
| LeakyReLU (0.01) | 5.30e-4 | 2.2x better |
| GELU | 5.47e-4 | 2.2x better |
| Tanh | 4.15e-4 | 2.9x better |
LeakyReLU with slope 0.01 is functionally near-identical to ReLU (the negative slope is negligible) but outperforms it 2.2x. The reason: gradient still flows through negative pre-activations. The dead neuron problem is about gradient access, not activation shape.
Manually reviving dead neurons (reinitializing their weights) and continuing training yields 11-20% improvement over a control (continued training without revival). However, the revived neurons re-die within 2000 steps. The loss landscape actively drives neurons to death — it is not merely an initialization problem.
Dead neurons contribute exactly zero to the output function. LeakyReLU-16 (16 effective neurons) matches ReLU-20 (15 effective, 5 dead) in performance. This establishes that effective neurons = nominal minus dead is the relevant capacity measure.
Counterintuitively, ReLU-16 can underperform ReLU-12: adding neurons adds deaths, and the extra deaths can more than cancel the nominal capacity gain.
In 1->8->8->1, Layer 1 has 0/8 dead neurons while Layer 2 has 3/8 dead. This is not cascading failure — Layer 2 is intrinsically more death-prone because it receives non-negative ReLU outputs as input, making pre-activations more likely to be entirely negative.
Each dead L2 neuron wastes 10 parameters (8 input weights + 1 bias + 1 output weight) compared to 3 per dead neuron in a shallow network (1 input weight + 1 bias + 1 output weight). Depth amplifies the dead neuron tax.
For a converged network $f_\theta$ with parameters $\theta \in \mathbb{R}^p$, evaluate the Jacobian $J_{ij} = \partial f(x_i)/\partial \theta_j$ on a grid of inputs; it linearizes the parameter-to-function map. The SVD of $J$ exposes how well-conditioned that map is. To change the function by $\Delta f$, the naive update is the pseudoinverse solution $\Delta\theta = J^+ \Delta f$.
This fails catastrophically:
- Condition number: 25,000,000
- Maximum amplification: 130,000x
- For the tiny change $2x+1 \rightarrow 2.1x + 1.05$: $\|\Delta\theta\| = 59$ vs $\|\theta\| = 4.2$ (14x the model size)
The ill-conditioning comes from the near-null singular vectors — kink patterns that barely affect the output but receive enormous weight changes through the pseudoinverse.
The key insight: the target function $y = ax + b$ has only 2 degrees of freedom. Projecting the linearized system onto this 2-DOF subspace (slope and intercept) collapses the solve to a 2x2 system.
This system has condition number 2.4 — ten million times better conditioned than the full system. Single-step construction works accurately within approximately 3 units of the current function.
Beyond the linearization radius (~3 units), ReLU neurons change their on/off patterns, invalidating the Jacobian. We apply iterative refinement: compute the Jacobian at the current weights, solve the 2x2 targeted system, update weights, repeat.
| Target | Iterations | Final slope error | Final intercept error |
|---|---|---|---|
| | 3 | 0.0002 | 0.0002 |
| | 4 | <0.0001 | <0.0001 |
| | 5 | 0.0009 | 0.0020 |
| | 5 | <0.0001 | <0.0001 |
| | 7 | 0.0001 | 0.0001 |
Convergence is exponential (1-2 orders of magnitude per iteration, matching Newton's method rate). Total cost: 3-7 autograd calls + 3-7 matrix multiplies. No loss function, no optimizer, no learning rate — purely linear algebra.
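A numpy sketch of the relinearize-and-solve loop under simplifying assumptions: the target's 2 DOF (slope, intercept) are read off by least squares on a grid, the 2 x p Jacobian is taken by finite differences rather than autograd, and the minimum-norm update is damped while the residual is large. Network size and seed are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8                                    # hidden neurons: shape 1 -> 8 -> 1
theta = rng.normal(size=3 * k + 1)       # packed as [w, b, v, c]
xs = np.linspace(-3, 3, 101)
A = np.column_stack([xs, np.ones_like(xs)])

def f(t):
    w, b, v, c = t[:k], t[k:2*k], t[2*k:3*k], t[3*k]
    return np.maximum(0, np.outer(xs, w) + b) @ v + c

def slope_intercept(t):
    """Project the current output function onto the target's 2 DOF."""
    return np.linalg.lstsq(A, f(t), rcond=None)[0]

target = np.array([2.0, 1.0])            # y = 2x + 1
for _ in range(12):
    r = target - slope_intercept(theta)
    if np.abs(r).max() < 1e-8:
        break
    # 2 x p Jacobian of (slope, intercept) w.r.t. theta by central
    # differences, then a minimum-norm pseudoinverse update.
    J = np.zeros((2, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = 1e-5
        J[:, j] = (slope_intercept(theta + e) - slope_intercept(theta - e)) / 2e-5
    step = np.linalg.pinv(J) @ r
    theta = theta + min(1.0, 2.0 / np.linalg.norm(r)) * step

print(np.abs(target - slope_intercept(theta)).max())  # residual after refinement
```

The update touches all parameters, but because the projected coordinates are exactly linear in the output weights, the damped Gauss-Newton loop settles in a handful of iterations.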
This construction does not optimize the function approximation error (which is limited by the ReLU piecewise-linear structure) but achieves the closest achievable linear function to the target, moving in function space without the overhead of gradient descent's manifold exploration.
Depth separation theory. Montufar et al. (2014) showed that deep ReLU networks can represent $\Omega\!\big((n/d)^{d(L-1)} n^d\big)$ linear regions (width $n$, depth $L$, input dimension $d$) — exponentially more than shallow networks with the same neuron count. Our enumeration shows the flip side: at $N = 16$, trained deep narrow networks realize only a tiny fraction of this representational capacity.
Minimum width theorems. Park et al. (2021) proved that universal approximation requires width at least $\max(d_x + 1, d_y)$ for $d_x$-dimensional input and $d_y$-dimensional output. Our input capture factor (first width $\geq d$) is the empirical signature of this constraint at minimal scale.
Wide Residual Networks. Zagoruyko & Komodakis (2016) showed that 16-layer-wide WRN outperforms 1000-layer-deep ResNet at matched parameters on CIFAR-10. This is the large-scale analog of our finding: when parameters are matched, width often beats depth.
Eigenspectrum structure. Sagun et al. (2017) identified bulk + outlier structure in the Hessian of practical networks. Our complete eigendecomposition at minimal scale shows this is a fundamental property: it emerges already at 16 neurons, with exact correspondence between outlier eigenvalues and the target's DOF.
Loss of plasticity. Lyle et al. (2023) identified loss of plasticity in continual learning, partly attributable to dead neurons. Our finding that the loss landscape actively kills neurons (not just initialization) and that revival is temporary provides a mechanistic basis for this phenomenon.
The individual pieces — minimum width, depth separation, permutation symmetry, dead neurons — are known. Our contributions are:
-
The complete parameter space anatomy from k=0 to k=16. No prior work constructs the eigenspectrum neuron-by-neuron and identifies all four tiers with their physical interpretations. The k=0 result (parameter space = function space with Hessian = 2I) is to our knowledge the first explicit demonstration of a zero-redundancy neural architecture.
-
The eigenspectrum phase transition during training. The collapse from 27 to 4 steep eigenvalues between steps 50-100 has not been documented. It provides a precise, measurable criterion for "when a network has learned" — distinct from and earlier than loss convergence.
-
Exhaustive architecture enumeration at fixed neuron budget. NAS-Bench (Ying et al., 2019) does something analogous for convolutional networks on CIFAR, but not for the simplest possible case (fully-connected, 16 neurons, 1D regression). The 575,000x performance range and the two-factor mechanism are new.
-
The three-factor mechanism unifying width-depth tradeoffs across dimensions. The literature identifies the individual factors (minimum width theorems for input capture, bottleneck effects for diversity, output layer effects for basis count) but does not compose them into a single product model. This model makes testable predictions: depth should help when
$N \gg d^2$ , and the transition point should scale as$N \sim 4d^2$ . -
Clean demonstration of depth advantage reversal between d=1 and d=2. No prior work shows this as a controlled experiment at minimal scale. The reversal from 2.1x better to 5x worse is a striking quantitative result that illustrates the constraint
$N \gg d^2$ in the sharpest possible way. -
Iterative targeted construction. The 2-DOF reformulation reducing condition number from 25 million to 2.4, and its iterative extension achieving any target in 3-7 steps, is a new training-free construction method. While related to the NTK framework's linearization, it differs in projecting onto the target's DOF rather than working in the full output space.
Scale. Our results are established for networks with 2-97 parameters. Whether the four-tier Hessian structure, the phase transition, and the three-factor mechanism hold at practical scale (millions of parameters) requires verification. The full Hessian is intractable beyond ~100 parameters; stochastic approximations would be needed.
Targets. We study univariate and low-dimensional regression with smooth polynomial targets. Image classification, language modeling, and other practical tasks involve high-dimensional structured data where the relevant "input dimension" may be much lower than the ambient dimension.
Optimizer. All experiments use Adam. SGD, SGD with momentum, and other optimizers may produce different manifold exploration patterns and dead neuron dynamics.
Activation. Our architecture analysis (Sections 5-6) uses only ReLU. The two-factor and three-factor mechanisms rely on piecewise-linear geometry (breakpoints, kinks). They may not directly apply to smooth activations (Tanh, GELU), though the input capture factor (first width >= d) likely remains relevant.
Parameter vs. neuron budget. Our architecture comparisons fix the neuron count at 16, but deeper networks have more parameters due to inter-layer connections. A parameter-matched comparison would be more controlled but would conflate architecture effects with scale effects (different numbers of neurons).
- Early stopping by eigenspectrum. The phase transition at steps 50-100 suggests that monitoring the Hessian's steep-eigenvalue count could provide a principled early stopping criterion — stop when the count stabilizes, not when the loss plateaus. The remaining 97% of training time is manifold exploration, not functional improvement.
- Architecture design as constraint satisfaction. The three-factor mechanism provides a design rule: ensure first layer width $\geq 2d$ (input capture), minimum interior width $\geq d$ (basis diversity), and maximize last layer width (basis count). This is consistent with practical architectures like ResNet (initial 7x7 conv expanding to 64 channels from 3-channel input, maintaining width throughout, wide final layers before classification).
- Dead neuron monitoring. Effective capacity = nominal minus dead. LeakyReLU is a free lunch: slope 0.01 is functionally identical to ReLU but prevents gradient death, yielding 2.2x improvement at zero cost.
- Depth requires scale. Do not add depth without width. The constraint $N \gg d^2$ provides a rough guideline: for $d$-dimensional input, ensure at least $4d^2$ neurons before considering depth > 1.
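The guidelines condense into a tiny checker (a hypothetical helper; the thresholds $2d$, $d$, and $4d^2$ are the ones stated above):

```python
def check_shape(shape, d):
    """Check a hidden-layer shape [w1, ..., wL] for d-dimensional input
    against the three-factor guidelines: input capture, basis diversity,
    and total budget before adding depth."""
    issues = []
    if shape[0] < 2 * d:
        issues.append(f"first width {shape[0]} < 2d = {2*d}: weak input capture")
    if min(shape) < d:
        issues.append(f"bottleneck {min(shape)} < d = {d}: diversity collapse")
    if len(shape) > 1 and sum(shape) < 4 * d * d:
        issues.append(f"only {sum(shape)} neurons at depth {len(shape)}: "
                      f"depth wants N >= 4d^2 = {4*d*d}")
    return issues

print(check_shape([16], d=2))          # []: depth-1, no constraint violated
print(check_shape([5, 1, 4, 6], d=2)) # flags the width-1 bottleneck
```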
By studying neural networks at the smallest scale where they exhibit nontrivial behavior, we have uncovered structural properties that are difficult to observe at practical scale. The parameter space of even a 16-neuron ReLU network has rich geometric structure: four-tiered Hessian eigenspectrum, globally flat solution manifolds, and a sharp phase transition during training that separates functional learning (3% of steps) from manifold exploration (97%).
The exhaustive enumeration of all 3,662 architectures distributing 16 neurons across 1-16 layers reveals that architecture choice spans 575,000x in performance, governed by a simple mechanistic model: performance equals the product of basis count, basis diversity, and (on multi-dimensional input) input capture capacity. This model unifies minimum-width theorems, depth separation results, and the empirical success of wide residual networks under a single framework.
The depth-width tradeoff is not a simple "deeper is better" or "wider is better" dichotomy. It depends on the ratio of neuron budget to input dimension. Depth is a luxury — it requires sufficient scale ($N \gg d^2$); below that threshold, width wins.
These findings demonstrate the value of first-principles investigation: by controlling the variables that practical-scale experiments conflate, we isolate causal mechanisms that explain large-scale phenomena.
Dandi, Y., Pesce, L., et al. (2025). The Computational Advantage of Depth. NeurIPS 2025.
Eldan, R. & Shamir, O. (2016). The Power of Depth for Feedforward Neural Networks. COLT 2016.
Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. (2022). The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. ICLR 2022.
Hanin, B. & Sellke, M. (2017). Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv:1710.11278.
Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The Expressive Power of Neural Networks: A View from the Width. NeurIPS 2017.
Lyle, C., Zheng, Z., Nikishin, E., et al. (2023). Understanding Plasticity in Neural Networks. ICML 2023.
Montufar, G., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. NeurIPS 2014.
Park, S., Yun, C., Lee, J., & Shin, J. (2021). Minimum Width for Universal Approximation. ICLR 2021.
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., & Sohl-Dickstein, J. (2017). On the Expressive Power of Deep Neural Networks. ICML 2017.
Safran, I. & Shamir, O. (2017). Depth-Width Tradeoffs in Approximating Natural Functions with Neural Networks. ICML 2017.
Sagun, L., Evci, U., Guney, V.U., Dauphin, Y., & Bottou, L. (2017). Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv:1706.04454.
Telgarsky, M. (2016). Benefits of Depth in Neural Networks. COLT 2016.
Vardi, G., Yehudai, G., & Shamir, O. (2022). Width is Less Important than Depth in ReLU Neural Networks. COLT 2022.
Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94, 103-114.
Ying, C., Klein, A., Christiansen, E., Real, E., Murphy, K., & Hutter, F. (2019). NAS-Bench-101: Towards Reproducible Neural Architecture Search. ICML 2019.
Zagoruyko, S. & Komodakis, N. (2016). Wide Residual Networks. BMVC 2016.