hiwonjoon/Install NVIDIA Driver and CUDA.md

Forked from zhanwenchen/Install NVIDIA Driver and CUDA.md

Created September 7, 2018 19:54


Install NVIDIA CUDA 9.0 on Ubuntu 16.04.4 LTS


In this article, I share some of my experience installing the NVIDIA driver and CUDA. I mainly use Ubuntu as the example, and I add notes for CentOS/Fedora where I can.

Install NVIDIA Graphics Driver via apt-get
Install NVIDIA Graphics Driver via runfile
Install CUDA
Install cuDNN

Table of contents generated with markdown-toc

Install NVIDIA Graphics Driver via apt-get

On Ubuntu systems, drivers for NVIDIA graphics cards are already provided in the official repository. Installation is as simple as one command.

For Ubuntu 14.04.5 LTS, the latest version is 352. To install the driver, execute sudo apt-get install nvidia-352 nvidia-modprobe, and then reboot the machine.

For Ubuntu 16.04.1 LTS, the latest version is 361. To install the driver, execute sudo apt-get install nvidia-361 nvidia-modprobe, and then reboot the machine.

The nvidia-modprobe utility loads the NVIDIA kernel modules and creates the NVIDIA character device files automatically every time your machine boots up.
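As a rough sketch, the two release-to-package pairs above can be captured in a small helper. The function name driver_pkg_for_release is made up for this example, and it only knows the two releases mentioned in this article; other Ubuntu releases ship different driver versions.

```shell
# Map an Ubuntu release string to the driver package named in this article.
# Only 14.04 and 16.04 are covered; anything else is reported as an error.
driver_pkg_for_release() {
    case "$1" in
        14.04*) echo "nvidia-352" ;;
        16.04*) echo "nvidia-361" ;;
        *) echo "unknown release: $1" >&2; return 1 ;;
    esac
}

# Example (not run here): install the matching package on the current system.
# release="$(lsb_release -rs)"
# sudo apt-get install "$(driver_pkg_for_release "$release")" nvidia-modprobe
```

The install command is left commented out so the snippet can be sourced without changing the system.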

New users are advised to install the driver this way because it is simple. However, it has some drawbacks:

The driver included in the official Ubuntu repository is usually not the latest.
Naming conflicts can arise when other repositories (e.g. the ones from CUDA) are added to the system.
One has to reinstall the driver when the Linux kernel is updated.

Install NVIDIA Graphics Driver via runfile

For advanced users who want the latest version of the driver, want to get rid of the reinstallation issue through dkms, or use a Linux distribution that does not provide NVIDIA drivers in its repositories, installing from the runfile is recommended.

Remove Previous Installations (Important)

You might have installed the driver via apt-get. Before reinstalling the driver from the runfile, uninstall any previous installation. Execute the following commands carefully, one by one.

# Quote the pattern so the shell does not expand it against local file names
sudo apt-get purge 'nvidia*'

# Note this might remove your cuda installation as well
sudo apt-get autoremove

# Recommended if .deb files from NVIDIA were installed
# Change 1404 to the exact system version or use tab autocompletion
# After executing this command, /etc/apt/sources.list.d should contain no files related to nvidia or cuda
sudo dpkg -P cuda-repo-ubuntu1404
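The "change 1404 to the exact system version" step can be sketched as a small string transformation. The helper name cuda_repo_pkg is hypothetical, and the naming pattern is inferred only from the single package name shown above; packages installed from a local .deb may carry a longer name, so checking what is actually installed is still advisable.

```shell
# Build a cuda-repo package name from an Ubuntu release string,
# e.g. "16.04" or "16.04.4" -> "cuda-repo-ubuntu1604".
cuda_repo_pkg() {
    major="${1%%.*}"      # part before the first dot, e.g. "16"
    rest="${1#*.}"        # part after the first dot, e.g. "04" or "04.4"
    minor="${rest%%.*}"   # keep only the minor version, e.g. "04"
    echo "cuda-repo-ubuntu${major}${minor}"
}

# Example (not run here): list installed repo packages, then purge the match.
# dpkg -l 'cuda-repo-*'
# sudo dpkg -P "$(cuda_repo_pkg "$(lsb_release -rs)")"
```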

Download the Driver

The latest driver for NVIDIA products can always be fetched from NVIDIA's official website. It is not necessary to select all terms carefully. The driver provided for the same Product Series and Operating System is generally the same. For example, in order to find a driver for a GTX TITAN X graphics card, selecting GeForce 900 Series in Product Series and Linux 64-bit in Operating System is enough.

Scripts for this part

cd ~
wget http://download.nvidia.com/XFree86/Linux-x86_64/367.44/NVIDIA-Linux-x86_64-367.44.run
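To fetch a different driver version, the download command can be parameterized. This is a minimal sketch: the helper name driver_url is made up, and the URL layout is assumed from the one URL above, so a given version may not exist at that path.

```shell
# Build the runfile URL for a driver version, following the layout of the
# URL used in this article (assumed to hold for other versions as well).
driver_url() {
    echo "http://download.nvidia.com/XFree86/Linux-x86_64/$1/NVIDIA-Linux-x86_64-$1.run"
}

# Example (not run here): download version 367.44 into the home directory.
# cd ~ && wget "$(driver_url 367.44)"
```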

Detailed installation instructions can be found on the download page via the README hyperlink. I have also summarized the key steps below.

Install Dependencies

The software required by the runfile is officially listed here, but that page seems stale and is not easy to follow.

For Ubuntu, installing the following dependencies is enough.

build-essential -- For building the driver
gcc-multilib -- For providing 32-bit support
dkms -- For providing dkms support
(Optional) xorg and xorg-dev. On a workstation with a GUI these are required, but they are usually already installed, since the graphical display depends on them. On headless servers without a GUI they are not needed.

In summary, execute sudo apt-get install build-essential gcc-multilib dkms to install all dependencies.

Required packages for CentOS are epel-release dkms libstdc++.i686. Execute yum install epel-release dkms libstdc++.i686.

Required packages for Fedora are dkms libstdc++.i686. Execute dnf install dkms libstdc++.i686.

Create a Blacklist for the Nouveau Driver

Create a file at /etc/modprobe.d/blacklist-nouveau.conf with the following contents:

blacklist nouveau
options nouveau modeset=0

Note: The NVIDIA installation runfile can also create this blacklist file automatically. Execute the runfile and follow the instructions when an error related to Nouveau appears.

Then,

for Ubuntu 14.04 LTS, reboot the computer;
for Ubuntu 16.04 LTS, execute sudo update-initramfs -u and reboot the computer;
for CentOS/Fedora, execute sudo dracut --force and reboot the computer.
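The blacklist step can be scripted along these lines. The function name write_nouveau_blacklist is hypothetical; it takes an optional target path so it can be tried on a scratch file first, and writing to the default location under /etc requires root.

```shell
# Write the two-line Nouveau blacklist to the given file.
# Defaults to the path used in this article; pass another path to test safely.
write_nouveau_blacklist() {
    target="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# Example (not run here), as root on Ubuntu 16.04:
# write_nouveau_blacklist && update-initramfs -u && reboot
```

After the reboot, lsmod | grep nouveau should print nothing if the blacklist took effect.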

Stop lightdm/gdm/kdm

After the computer has rebooted, we need to stop the display manager before executing the runfile to install the driver. lightdm is the default display manager in Ubuntu. If the GNOME or KDE desktop environment is used, the installed display manager will be gdm or kdm instead.

For Ubuntu 14.04 / 16.04, execute sudo service lightdm stop (or use gdm or kdm instead of lightdm)
For Ubuntu 16.04 / Fedora / CentOS, execute sudo systemctl stop lightdm (or use gdm or kdm instead of lightdm)

Executing the Runfile

After all of this preparation, we can finally execute the runfile. This is why I recommended, from the very beginning, that new users install the driver via apt-get.

cd ~
chmod +x NVIDIA-Linux-x86_64-367.44.run
sudo ./NVIDIA-Linux-x86_64-367.44.run --dkms -s

Note:

option --dkms registers the kernel module with dkms so that a kernel update will not require reinstalling the driver. This option should normally be turned on.
option -s performs a silent installation, which should be used for batch installation. For installation on a single computer, leave this option off to get more installation information.
option --no-opengl-files can also be added if non-NVIDIA (AMD or Intel) graphics are used for display while NVIDIA graphics are used for computation.
The installer may print the following warning on a system without X.Org installed. It is safe to ignore.

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.

Check the Installation

After a successful installation, the nvidia-smi command will report all the CUDA-capable devices in the system.
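For a scripted check, one option is to count the devices listed by nvidia-smi -L, which prints one line per GPU starting with "GPU <index>:". The helper name count_gpus is made up for this sketch, and it reads the listing from standard input so it can be tried without a GPU.

```shell
# Count GPUs in `nvidia-smi -L` output supplied on stdin.
# Each device is expected on its own line, starting with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Example (not run here): compare against the number of cards you installed.
# nvidia-smi -L | count_gpus
```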

Common Errors and Solutions

ERROR: Unable to load the 'nvidia-drm' kernel module.

One probable reason is that the system boots via UEFI with the Secure Boot option turned on in the BIOS settings. Turn it off and the problem will be solved.

Additional Notes

nvidia-smi -pm 1 enables persistence mode, which saves some of the time otherwise spent loading the driver. The effect is significant on machines with more than 4 GPUs.

nvidia-smi -e 0 disables ECC on Tesla products, which provides about 1/15 more video memory. A reboot is required for the change to take effect. nvidia-smi -e 1 enables ECC again.

nvidia-smi -pl <some power value> raises or lowers the TDP limit of the GPU. Raising it encourages a higher GPU Boost frequency, but is somewhat DANGEROUS and HARMFUL to the GPU. Lowering it saves some power, which is useful for machines whose power supply is insufficient and which shut down unexpectedly when all GPUs are pushed to their maximum load.

-i <GPUID> can be added after the commands above to target an individual GPU.

These commands can be added to /etc/rc.local to run at system boot.
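To make the 1/15 figure for ECC concrete: if disabling ECC yields 1/15 more memory, the usable amount grows by a factor of 16/15. The helper below (a hypothetical name) just does that arithmetic; the 11520 MiB input is an illustrative value, not a measurement from a specific card.

```shell
# Estimate usable memory in MiB after disabling ECC, using the
# "about 1/15 more" rule of thumb from the text: new = old * 16 / 15.
mem_with_ecc_off() {
    echo $(( $1 * 16 / 15 ))
}

mem_with_ecc_off 11520
```

With 11520 MiB usable while ECC is on, the estimate is 12288 MiB with ECC off.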

Install CUDA

Installing CUDA from the runfile is much simpler and smoother than installing the NVIDIA driver. It just involves copying files to system directories and has nothing to do with the system kernel or online compilation. Removing CUDA simply means removing the installation directory. So I personally do not recommend adding NVIDIA's repositories and installing CUDA via apt-get or other package managers: it does not reduce the complexity of installation or uninstallation, but it does increase the risk of messing up the repository configuration.

The CUDA runfile installer can be downloaded from NVIDIA's website. What you download is a package of the following three components:

an NVIDIA driver installer, but usually a stale version;
the actual CUDA installer;
the CUDA samples installer.

To extract these three components, execute the runfile installer with the --extract option. Then, executing the second one will finish the CUDA installation. Installing the samples is also recommended because they provide useful tools such as deviceQuery and p2pBandwidthLatencyTest.

Scripts for installing CUDA Toolkit are summarized below.

cd ~
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
chmod +x cuda_7.5.18_linux.run
./cuda_7.5.18_linux.run --extract=$HOME
sudo ./cuda-linux64-rel-7.5.18-19867135.run

After the installation finishes, configure the runtime library path.

sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig

Ubuntu users are also advised to append /usr/local/cuda/bin to the PATH entry in the system file /etc/environment so that nvcc is included in $PATH. This takes effect after a reboot.
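Editing the PATH entry by hand is easy to get wrong, so here is a small sketch of the string manipulation involved. The function name append_to_path_line is made up, and it assumes the entry has the form PATH="dir1:dir2:..." on a single line, as it does on a stock Ubuntu install.

```shell
# Given a PATH="..." line ($1) and a directory ($2), print the line with the
# directory appended, or unchanged if the directory is already listed.
append_to_path_line() {
    line="$1"
    dir="$2"
    # Strip the leading PATH=" and the trailing quote to get the bare value.
    value="$(printf '%s\n' "$line" | sed -e 's/^PATH="//' -e 's/"$//')"
    case ":$value:" in
        *":$dir:"*) echo "$line" ;;
        *) echo "PATH=\"${value}:${dir}\"" ;;
    esac
}

# Example (not run here): show what the new entry would look like.
# append_to_path_line "$(grep '^PATH=' /etc/environment)" /usr/local/cuda/bin
```

The function only prints the new line; copying it into /etc/environment (after making a backup) is left as a manual step.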

Install cuDNN

The recommended way to install cuDNN is to first copy the tgz file to /usr/local, extract it there, and then remove the tgz file if necessary. This method preserves symbolic links. Finally, execute sudo ldconfig to update the shared library cache.
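A minimal sketch of the extraction step follows. The function name extract_cudnn is hypothetical, and instead of copying the archive first it uses tar's -C option to extract directly under the prefix, which should give the same result; tar keeps symbolic links as links by default.

```shell
# Extract a cuDNN archive under an install prefix, keeping symbolic links.
# $1: path to the .tgz file; $2: prefix (defaults to /usr/local).
extract_cudnn() {
    archive="$1"
    prefix="${2:-/usr/local}"
    tar -xzf "$archive" -C "$prefix"
}

# Example (not run here), as root; the archive name is a placeholder.
# extract_cudnn ~/cudnn.tgz && ldconfig
```

Writing under /usr/local needs root, so in practice the script would be run with sudo.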

Author

hiwonjoon commented Sep 7, 2018

sudo ubuntu-drivers autoinstall?

Author

hiwonjoon commented Sep 7, 2018

sudo ubuntu-drivers autoinstall works fine.

When you install the NVIDIA drivers with OpenGL, there is a small problem running Qt programs, but only when they are executed inside tmux. As a workaround, add (or write) the following to /etc/X11/xorg.conf

Section "ServerFlags"
Option "AllowIndirectGLX" "on"
Option "IndirectGLX" "on"
EndSection

Author

hiwonjoon commented Oct 19, 2018 (edited)

Well, for a server where the monitor is not directly connected to the GPU, download the newest driver and try installing it without "--no-opengl-files". NVIDIA driver version 410.66 works fine. Also, when installing CUDA, don't forget to use the flag "--no-opengl-libs".

Check this out https://davidsanwald.github.io/2016/11/13/building-tensorflow-with-gpu-support.html
It seems to work fine now.

Author

hiwonjoon commented Oct 19, 2018

$ sudo crontab -e

At the bottom of the file, add the following lines:

@reboot nvidia-smi -pm 1
@reboot nvidia-smi daemon

Author

hiwonjoon commented Oct 19, 2018

Test installation with https://github.com/chainer/chainer/blob/master/examples/mnist/train_mnist.py and https://github.com/openai/gym/blob/master/examples/agents/random_agent.py

Author

hiwonjoon commented Aug 5, 2019 (edited)

Update Aug 4 2019

nVidia Driver Install

Use ubuntu-drivers autoinstall.

With this approach, you get a GUI in which you can select the graphics driver you want to use, so you can switch between nouveau and the official NVIDIA drivers.

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo ubuntu-drivers autoinstall
sudo reboot

If you have a problem with the newest driver, log in again in command-line mode (you can reach the kernel selection menu by holding the Shift key right after POST), then reinstall an appropriate driver version with, for instance, sudo apt-get install nvidia-415 --fix-missing.

Install CUDA & cuDNN

sudo ./cuda_10.0.130_410.48_linux --no-opengl-libs

Edit crontab

$ sudo crontab -e

At the bottom of the file, add the following lines:

@reboot nvidia-smi -pm 1
@reboot nvidia-smi daemon

Troubleshooting

Infinite login-loop

sudo chmod 755 .Xauthority
sudo chown <username> .Xauthority

Author

hiwonjoon commented Aug 5, 2019 (edited)

OpenGL Troubleshooting

Long story

The explanation below could contain inaccurate information.

There are three OpenGL rendering backends: GLFW, EGL, and OSMesa. The main problem is GLFW, since it does not support headless rendering. The most elegant solution is to support EGL, but the most commonly used OpenGL library in Python, namely pyglet, only supports GLFW, so it becomes painful. There has been a discussion, but it is unclear when the update will be made. (Luckily, the mujoco_py and dm_control libraries support EGL or OSMesa, so I have not run into this problem so far.)

One workaround is to use a virtual framebuffer, such as xvfb, as noted in many places like this and this.

However, there are known problems between the NVIDIA driver and xvfb, since xvfb is entirely CPU-based ref1, ref2. It does not know how to interact with the GPU-based OpenGL implementation provided by the driver. Therefore, people suggest installing the driver and the CUDA library without the OpenGL libraries, using the --no-opengl-libs option ref1, ref2. That could be one solution, but it is not satisfactory since it gives up hardware-accelerated rendering.

If you can run an X server, a (possibly more elegant?) solution is simply to run one to generate a virtual monitor. It can be done easily by

(maybe not required) sudo apt-get install -y xserver-xorg mesa-utils
sudo nvidia-xconfig --busid=PCI:0:30:0 --use-display-device=none --virtual=1280x1024
sudo Xorg :1

and specifying the virtual display in front of a command with, for instance, DISPLAY=:1. You can get the bus ID with the command nvidia-xconfig --query-gpu-info.

Tips if you have trouble running nvidia-xconfig

Actually, you don't need to run nvidia-xconfig with the options above. All you need is a properly set up xorg.conf file and the PCI bus ID.

First, you can generate a default xorg.conf file by running the following command (note that there are no options at all).

sudo nvidia-xconfig

Then, change a few sections related to the screen and server flags. Copy /etc/X11/xorg.conf to your home directory or any other directory you like, then change the sections.

Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "UseDisplayDevice" "None"
    SubSection "Display"
        Virtual 1280 1024
        Depth 24
    EndSubSection
EndSection
Section "ServerFlags"
    Option "AllowMouseOpenFail" "true"
    Option "AutoAddGPU" "false"
    Option "ProbeAllGpus" "false"
EndSection

You can find the PCI ID with a different command. (Actually, nvidia-smi includes this information, too.)

lspci -vnn | grep VGA

You will get a result like below.

02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])
81:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1) (prog-if 00 [VGA controller])

Then, your PCI IDs are PCI:2:0:0, PCI:3:0:0, PCI:129:0:0, and PCI:130:0:0. (Note that 81 and 82 are hexadecimal, so 8*16+1=129.)
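The hexadecimal-to-decimal conversion is easy to get wrong by hand, so it can be sketched as a helper. The function name lspci_to_busid is made up, and it assumes the short bus:device.function address form shown in the lspci output above (no leading PCI domain).

```shell
# Convert an lspci address such as "81:00.0" (hexadecimal fields) into the
# decimal "PCI:bus:device:function" form used above.
lspci_to_busid() {
    addr="$1"
    bus="${addr%%:*}"     # "81"
    rest="${addr#*:}"     # "00.0"
    dev="${rest%%.*}"     # "00"
    fn="${rest#*.}"       # "0"
    # printf parses the 0x-prefixed values as hexadecimal.
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

lspci_to_busid 81:00.0
```

Running the last line prints PCI:129:0:0, matching the manual calculation.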

Finally, you can run X-server, by running a command:

sudo Xorg -noreset -sharevts -novtswitch -isolateDevice "<PCI-ID>" -config <your xorg.conf file> :<display id, such as 0,1> vt1 &

Don't forget the trailing &, because the X server runs as a background process.

You can check your X server by monitoring nvidia-smi while running the glxgears command, or with glxinfo.

DISPLAY=:0 glxinfo | grep OpenGL
glxgears -display :0

Solutions

If you can connect a monitor directly to the machine

Just connect it; then run the app with DISPLAY=:0.

If connecting a monitor is impossible, but you have `sudo` access

Run an X server to generate a virtual monitor Ref.

If you have neither a monitor nor `sudo` access

You can work around the OpenGL problem with a library preloading trick: LD_PRELOAD

First, install the OSMesa OpenGL library. (Alternatively, all you need is the mesa/libGL.so file; you can just copy it from your local computer to the server.)

sudo apt-get install libglu1-mesa libgl1-mesa-dev

It is usually installed under /usr/lib/x86_64-linux-gnu/. Check that mesa/libGL.so, which we will preload, exists.

❯❯❯ ls -all /usr/lib/x86_64-linux-gnu/ | grep GL
lrwxrwxrwx    13 root 14 Jun  2018 libGL.so -> mesa/libGL.so

Note that the installed OSMesa library is not included in the ldconfig cache, so it won't be loaded (not sure about this) unless we specify it with LD_PRELOAD.

❯❯❯ ldconfig -p | grep libGL
        libGLdispatch.so.0 (libc6,x86-64) => /usr/lib/nvidia-415/libGLdispatch.so.0
        libGLdispatch.so.0 (libc6) => /usr/lib32/nvidia-415/libGLdispatch.so.0
        libGLX_nvidia.so.0 (libc6,x86-64) => /usr/lib/nvidia-415/libGLX_nvidia.so.0
        libGLX_nvidia.so.0 (libc6) => /usr/lib32/nvidia-415/libGLX_nvidia.so.0
        libGLX.so.0 (libc6,x86-64) => /usr/lib/nvidia-415/libGLX.so.0
        libGLX.so.0 (libc6) => /usr/lib32/nvidia-415/libGLX.so.0
        libGLX.so (libc6,x86-64) => /usr/lib/nvidia-415/libGLX.so
        libGLX.so (libc6) => /usr/lib32/nvidia-415/libGLX.so
        libGLU.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGLU.so.1
        libGLEWmx.so.1.13 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGLEWmx.so.1.13
        libGLEW.so.1.13 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGLEW.so.1.13
        libGLESv2_nvidia.so.2 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv2_nvidia.so.2
        libGLESv2_nvidia.so.2 (libc6) => /usr/lib32/nvidia-415/libGLESv2_nvidia.so.2
        libGLESv2.so.2 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv2.so.2
        libGLESv2.so.2 (libc6) => /usr/lib32/nvidia-415/libGLESv2.so.2
        libGLESv2.so (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv2.so
        libGLESv2.so (libc6) => /usr/lib32/nvidia-415/libGLESv2.so
        libGLESv1_CM_nvidia.so.1 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv1_CM_nvidia.so.1
        libGLESv1_CM_nvidia.so.1 (libc6) => /usr/lib32/nvidia-415/libGLESv1_CM_nvidia.so.1
        libGLESv1_CM.so.1 (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv1_CM.so.1
        libGLESv1_CM.so.1 (libc6) => /usr/lib32/nvidia-415/libGLESv1_CM.so.1
        libGLESv1_CM.so (libc6,x86-64) => /usr/lib/nvidia-415/libGLESv1_CM.so
        libGLESv1_CM.so (libc6) => /usr/lib32/nvidia-415/libGLESv1_CM.so
        libGL.so.1 (libc6,x86-64) => /usr/lib/nvidia-415/libGL.so.1
        libGL.so.1 (libc6) => /usr/lib32/nvidia-415/libGL.so.1
        libGL.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libGL.so
        libGL.so (libc6,x86-64) => /usr/lib/nvidia-415/libGL.so
        libGL.so (libc6) => /usr/lib32/nvidia-415/libGL.so

Anyway, now you can easily prevent /usr/lib32/nvidia-415/libGL.so from being loaded by specifying the OSMesa OpenGL library.

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libGL.so xvfb-run --auto-servernum -s "-screen 0 640x480x24" glxinfo | grep OpenGL

Here are some illustrative runs. The loaded library for OpenGL is now different.

Some useful commands & links

nvidia-xconfig --query-gpu-info
glxinfo or DISPLAY=:0 glxinfo | grep OpenGL
lsof -p <process_id>; list the shared libraries in use by the process
dpkg-query -L <package-name>; list the files included in the package
ldd <binary>; list the libraries the binary is linked against
ldconfig -p; print the library cache
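When the ldconfig -p listing is long, it can help to print only the paths registered for one library name. This is a small sketch with a made-up helper name, lib_providers; it assumes the "name (flags) => path" line layout shown in the listing earlier.

```shell
# From `ldconfig -p` output on stdin, print the paths registered for the
# exact library name given as $1 (for example libGL.so.1).
lib_providers() {
    awk -v lib="$1" '$1 == lib { print $NF }'
}

# Example (not run here): see which libGL.so.1 files the cache knows about.
# ldconfig -p | lib_providers libGL.so.1
```

If only NVIDIA paths show up, the Mesa library is not in the cache, and loading it would need something like the LD_PRELOAD approach.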

openai/gym#366

https://bitbucket.org/pyglet/pyglet/issues/219/egl-support-headless-rendering

Author

hiwonjoon commented Jan 30, 2020

Version check when upgrading the NVIDIA drivers

apt list --installed | grep cuda
apt list --installed | grep nvidia
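To reduce that output to package names and versions, something like the following may help. The helper name pkg_versions is made up, and it assumes the usual "name/suite,now version arch [installed]" layout of apt list; apt itself warns that this output format is not a stable interface, so treat the parsing as a convenience only.

```shell
# Print "<package> <version>" for each `apt list --installed` line on stdin.
# Lines without at least three fields (such as "Listing...") are skipped.
pkg_versions() {
    awk -F'[/ ]' 'NF >= 3 { print $1, $3 }'
}

# Example (not run here):
# apt list --installed 2>/dev/null | grep -E 'cuda|nvidia' | pkg_versions
```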

Author

hiwonjoon commented Nov 18, 2020

nvidia-smi -q: shows detailed information about GPUs.
