The paper "The Curse of Depth in Large Language Models" introduces an important concept that affects the efficiency and performance of modern language models. The authors identify a phenomenon where nearly half of the layers in LLMs are less effective than expected, calling this the "Curse of Depth" (CoD).
- The Problem: Deeper layers in LLMs such as Llama, Mistral, DeepSeek, and Qwen contribute significantly less to the final output than earlier layers. This is inefficient, since training those layers still requires substantial computational resources.
- Root Cause: The authors identify Pre-Layer Normalization (Pre-LN) as the culprit. While Pre-LN stabilizes training, it lets the output variance grow exponentially with model depth, so deep transformer blocks behave almost like identity mappings and barely transform their inputs.
- The Solution: LayerNorm Scaling, which scales the output of each layer normalization by the inverse square root of its depth, i.e., multiplies it by 1/√layer_depth. This simple modification keeps the variance growth under control (see the brief sketch after this summary).
- Results: Experiments on models from 130M to 1B parameters show that LayerNorm Scaling significantly improves pre-training performance compared to Pre-LN and carries these benefits through to supervised fine-tuning.
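As a rough illustration of where the change lands, the sketch below shows a single Pre-LN residual update with the scaling applied. The `norm` and `sublayer` callables, the `use_lns` flag, and the 1-based `layer_depth` argument are illustrative assumptions, not code from the paper:

```python
import math

def pre_ln_block(x, norm, sublayer, layer_depth, use_lns=True):
    """One Pre-LN residual update; LayerNorm Scaling multiplies the normalized
    activations by 1/sqrt(layer_depth) before they enter the sublayer."""
    h = norm(x)
    if use_lns:
        h = h * (1.0 / math.sqrt(layer_depth))  # layer_depth is 1-based here
    return x + sublayer(h)
```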
Below is a clean, standalone PyTorch implementation of LayerNorm Scaling that can be used as a drop-in replacement for standard LayerNorm.
The code provides two main classes:
- LayerNormScaling: A drop-in replacement for standard LayerNorm with depth scaling
- RMSNormScaling: A version for LLaMA-like models that use RMSNorm instead of LayerNorm
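For orientation, here is a minimal sketch of what the RMSNorm variant could look like. The constructor signature mirrors the usage shown further below, but the `eps` default and the 0-based `layer_idx` convention are assumptions rather than the repository's exact code (LayerNormScaling follows the same pattern with a standard LayerNorm):

```python
import torch
import torch.nn as nn

class RMSNormScaling(nn.Module):
    """RMSNorm whose output is additionally scaled by 1/sqrt(layer depth)."""

    def __init__(self, hidden_size, layer_idx, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps
        # layer_idx is assumed to be 0-based, so depth = layer_idx + 1 and the
        # first layer is scaled by 1/sqrt(1) = 1.
        self.scale = 1.0 / (layer_idx + 1) ** 0.5

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x * self.scale
```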
The implementation can be used in three ways:
1. When designing a new model, you can directly use the LayerNormScaling or RMSNormScaling classes:
```python
from layernorm_scaling import RMSNormScaling

# In your transformer layer
self.input_layernorm = RMSNormScaling(hidden_size, layer_idx=current_layer_index)
```
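A slightly fuller, hypothetical variant of the same idea builds one depth-scaled norm per decoder layer so that each layer receives its own index; the sizes below are arbitrary examples:

```python
import torch.nn as nn
from layernorm_scaling import RMSNormScaling

hidden_size, num_layers = 1024, 24  # arbitrary example sizes
input_layernorms = nn.ModuleList(
    [RMSNormScaling(hidden_size, layer_idx=i) for i in range(num_layers)]
)
```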
2. The apply_cod_mitigation_to_llama function shows how to convert an existing LLaMA model to use the scaling technique:
```python
from transformers import LlamaForCausalLM
from layernorm_scaling import apply_cod_mitigation_to_llama

# Load your model
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Apply the CoD mitigation
model = apply_cod_mitigation_to_llama(model)
```
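Conceptually, such a conversion would replace each decoder layer's RMSNorm modules with depth-scaled ones while reusing the trained gain weights. The sketch below is an assumption about how that could be done against the Hugging Face LLaMA layout, not the repository's actual function:

```python
import torch
from layernorm_scaling import RMSNormScaling

def apply_cod_mitigation_to_llama(model):
    """Hypothetical sketch: swap each decoder layer's norms for depth-scaled RMSNorms."""
    cfg = model.config
    for layer_idx, layer in enumerate(model.model.layers):
        for name in ("input_layernorm", "post_attention_layernorm"):
            old_norm = getattr(layer, name)
            new_norm = RMSNormScaling(cfg.hidden_size, layer_idx=layer_idx,
                                      eps=cfg.rms_norm_eps)
            with torch.no_grad():
                new_norm.weight.copy_(old_norm.weight)  # keep the trained gain
            setattr(layer, name, new_norm)
    return model
```

Copying the existing gains keeps the converted model as close as possible to the original, so in this sketch the 1/√depth factor is the only behavioural change.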
3. The TransformerLayerWithScaling class demonstrates how to integrate the scaling into a transformer layer architecture.
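As an illustration of that integration, a Pre-LN block with depth-scaled norms might be structured as follows; the attention and MLP internals here are simplified stand-ins, not the repository's class:

```python
import torch.nn as nn
from layernorm_scaling import RMSNormScaling

class TransformerLayerWithScaling(nn.Module):
    """Simplified Pre-LN block: depth-scaled norm -> attention, then norm -> MLP."""

    def __init__(self, hidden_size, num_heads, layer_idx):
        super().__init__()
        self.attn_norm = RMSNormScaling(hidden_size, layer_idx=layer_idx)
        self.mlp_norm = RMSNormScaling(hidden_size, layer_idx=layer_idx)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.mlp_norm(x))
        return x
```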
Key benefits:
- Improved Performance: The paper shows consistent improvements in perplexity across model sizes
- Resource Efficiency: Makes better use of all layers, improving training efficiency
- Simple Implementation: Requires minimal code changes and no additional parameters
- Compatible with Existing Models: Can be retrofitted to already trained models
This implementation follows the paper's approach while providing flexibility for different use cases and model architectures.
Based on the original work:

```bibtex
@article{sun2025curse,
  title   = {The Curse of Depth in Large Language Models},
  author  = {Sun, Wenfang and Song, Xinyuan and Li, Pengxiang and Yin, Lu and Zheng, Yefeng and Liu, Shiwei},
  journal = {arXiv preprint arXiv:2502.05795},
  year    = {2025}
}
```
See the original codebase here: https://github.com/lmsdss/LayerNorm-Scaling/tree/main