Here is a summary of what I've found about standard model initialization practices in public codebases and papers.
Shared depth-scaling utility (param_init.py):
depth_scaled_std(base_std, layer_id) = base_std / sqrt(2 * (layer_id + 1)), where layer_id is 0-indexed (the first layer is layer_id = 0).
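
A minimal Python sketch of that formula, written only from the expression above and not copied from param_init.py. The type hints, the negative-index check, and the example base_std of 0.02 are assumptions added for illustration.

```python
import math


def depth_scaled_std(base_std: float, layer_id: int) -> float:
    """Return base_std / sqrt(2 * (layer_id + 1)) for a 0-indexed layer."""
    # Assumption: negative layer ids are rejected; the source does not say
    # how the real utility handles them.
    if layer_id < 0:
        raise ValueError("layer_id must be a non-negative, 0-indexed layer number")
    return base_std / math.sqrt(2 * (layer_id + 1))


# Illustrative only: per-layer std for a 4-layer model with an assumed
# base_std of 0.02. Deeper layers get a smaller std.
stds = [depth_scaled_std(0.02, layer_id) for layer_id in range(4)]
```

With this form, layer 0 is scaled by 1/sqrt(2) and layer 1 by 1/2, so the std shrinks as 1/sqrt(depth) rather than linearly.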