
@lucidrains
Last active December 7, 2022 14:05
import torch

def unitwise_norm(x):
    if len(x.squeeze().shape) <= 1:
        # scalars, vectors, and biases: a single norm over the whole tensor
        dim = None
        keepdim = False
    elif len(x.shape) in (2, 3):
        # linear / embedding weights: one norm per output row
        dim = 1
        keepdim = True
    elif len(x.shape) == 4:
        # convolution kernels: one norm per output channel (pytorch convolution kernel is OIHW)
        dim = (1, 2, 3)
        keepdim = True
    else:
        raise ValueError(f'got a parameter with shape not in (1, 2, 3, 4) {x}')
    return x.norm(dim = dim, keepdim = keepdim, p = 2)

def adaptive_clip_grad_(parameters, clipping = 0.01, eps = 1e-3):
    parameters = [p for p in parameters if p.grad is not None]
    if len(parameters) == 0:
        return

    for p in parameters:
        # unit-wise norms of the parameter and its gradient
        param_norm = unitwise_norm(p).clamp_(min = eps)
        grad_norm = unitwise_norm(p.grad)
        max_norm = param_norm * clipping

        # rescale only the units whose gradient norm exceeds clipping * parameter norm
        trigger = grad_norm > max_norm
        clipped_grad = p.grad * (max_norm / grad_norm.clamp(min = 1e-6))
        new_grads = torch.where(trigger, clipped_grad, p.grad)
        p.grad.detach().copy_(new_grads)
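
A minimal usage sketch, assuming a standard PyTorch training loop: call adaptive_clip_grad_ on the parameters between loss.backward() and optimizer.step(). The model, optimizer, and loader below are placeholders, not part of the gist.

import torch
import torch.nn as nn

model = nn.Linear(128, 10)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr = 1e-2)   # placeholder optimizer

for x, y in loader:                                          # `loader` is assumed to yield (input, target) batches
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    adaptive_clip_grad_(model.parameters(), clipping = 0.01) # clip gradients unit-wise, in place
    optimizer.step()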
@sayakpaul

Hi @lucidrains.

Thank you for providing it!

Could you explain why the additional clamping is required in https://gist.github.com/lucidrains/0d6560077edac419ab5d3aa29e674d5c#file-adaptive_gradient_clip-py-L27?

@lucidrains
Author

@sayakpaul Hello! It was apparently done for extra insurance in the original DeepMind repo (they had a comment explaining it).
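
A small sketch of the failure mode that clamp guards against, using the same variable names as the gist: if a unit's gradient is exactly zero, grad_norm is zero, max_norm / grad_norm becomes inf, and the intermediate clipped_grad fills with NaN (0 * inf). The final result would still be fine, since torch.where falls back to p.grad wherever trigger is False, so the clamp acts as insurance that keeps the intermediate finite.

import torch

g = torch.zeros(3)                     # a unit whose gradient happens to be all zeros
max_norm = torch.tensor(0.01)

print(g * (max_norm / g.norm()))                     # tensor([nan, nan, nan]) -- 0 * inf
print(g * (max_norm / g.norm().clamp(min = 1e-6)))   # tensor([0., 0., 0.])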
