@ngocson2vn
Created November 2, 2024 11:03
How to upgrade NVIDIA driver for H100 GPUs

Uninstall the current NVIDIA driver completely

# Step 1: Run uninstaller
sudo /usr/bin/nvidia-uninstall

# Step 2: Stop fabric manager (H100)
sudo systemctl stop nvidia-fabricmanager.service

# Step 3: Remove kernel modules
lsmod | grep -E 'Module|nvidia'
# Example output:
#   Module                  Size  Used by
#   nvidia              56750080  0

# Remove every nvidia module whose "Used by" count is 0
sudo rmmod nvidia

# Double-check that no nvidia modules remain loaded
lsmod | grep -E 'Module|nvidia'

# Step 4: If "Used by" is greater than 0, find all processes still using /dev/nvidia*, kill them, then repeat Step 3
sudo lsof | grep nvidia
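
When several NVIDIA modules are loaded, Step 3 has to be repeated until none remain. The sketch below is a small helper for that loop and is not part of the original procedure: `unused_nvidia_modules` is a hypothetical function name, and it assumes the usual three-column `lsmod` layout (module, size, use count). It reads `lsmod` text on standard input and prints the NVIDIA modules whose use count is 0.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print NVIDIA kernel modules whose "Used by" count is 0.
# It reads lsmod-formatted text on stdin, so it can be tried without a GPU.
unused_nvidia_modules() {
  # Column 1 is the module name, column 3 is the use count.
  # The "Module  Size  Used by" header never matches /^nvidia/.
  awk '$1 ~ /^nvidia/ && $3 == 0 { print $1 }'
}
```

On a real host, `lsmod | unused_nvidia_modules` lists the candidates for `sudo rmmod`. Rerun it after each removal, because unloading one module can drop another module's count to 0.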

Install the new NVIDIA driver version

# Driver 550.54.15
wget https://us.download.nvidia.com/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run

# The corresponding fabric manager
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb

# Run the driver installer
sudo bash NVIDIA-Linux-x86_64-550.54.15.run

# Install the matching fabric manager package, then start the service
sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo systemctl daemon-reload
sudo systemctl start nvidia-fabricmanager
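
The driver and the fabric manager above share the version string 550.54.15. A minimal sketch for keeping the two downloads in sync is to derive both URLs from one version value. The function names `driver_url` and `fabricmanager_url` are hypothetical, and the URL layout is copied from the two links above; whether it holds for other versions or distributions is an assumption to verify before use.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build both download URLs from a single version string.
# The URL patterns are taken from the 550.54.15 links in this guide and are
# assumed, not guaranteed, to apply to other versions.
driver_url() {
  # $1 is the full driver version, e.g. 550.54.15
  printf 'https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

fabricmanager_url() {
  # ${1%%.*} strips everything from the first dot: 550.54.15 -> 550
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-%s_%s-1_amd64.deb\n' "${1%%.*}" "$1"
}
```

For example, `wget "$(driver_url 550.54.15)"` and `wget "$(fabricmanager_url 550.54.15)"` fetch the same two files as the commands above.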

Reboot

sudo reboot
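
After the reboot, it can be worth confirming that every GPU reports the new driver version. The sketch below is one possible check, not part of the original steps: `all_gpus_report_version` is a hypothetical function name, and it assumes its input has one version per line with no extra whitespace, as produced by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if stdin is non-empty and every line
# equals the expected driver version passed as $1.
all_gpus_report_version() {
  awk -v want="$1" '
    # Appending "" forces a string comparison, so 550.90 never equals 550.9.
    { seen = 1; if (($0 "") != (want "")) bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}
```

A possible invocation is `nvidia-smi --query-gpu=driver_version --format=csv,noheader | all_gpus_report_version 550.54.15`, followed by `systemctl is-active nvidia-fabricmanager` to confirm the service came back up.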