How to use NVIDIA profiler

The profilers are usually located at /usr/local/cuda/bin.
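
If that directory is not already on your PATH, you can add it first (assuming the default install location):

$ export PATH=/usr/local/cuda/bin:$PATH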

Non-Visual Profiler

$ nvprof python train_mnist.py

I prefer to use --print-gpu-trace.

$ nvprof --print-gpu-trace python train_mnist.py
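
If the trace gets too long, you can restrict profiling to a region of interest. A minimal sketch, assuming CuPy (whose cupy.cuda.profiler.start/stop wrap cudaProfilerStart/Stop) together with nvprof's --profile-from-start off option:

$ nvprof --profile-from-start off --print-gpu-trace python train_mnist.py

with the script marking the region itself:

import cupy

cupy.cuda.profiler.start()   # profiling begins here
# ... the code you want to trace ...
cupy.cuda.profiler.stop()    # and ends here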

Visual Profiler

On the GPU machine, run:

$ nvprof -o prof.nvvp python train_mnist.py

Copy prof.nvvp to your local machine:

$ scp your_gpu_machine:/path/to/prof.nvvp .

Then run nvvp (NVIDIA Visual Profiler) on your local machine:

$ nvvp prof.nvvp

This is more comfortable than running the GUI over X11 forwarding or the like.
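
If you want nvvp's guided analysis to have per-kernel metrics available, you can collect them at profile time with the --analysis-metrics flag (note that this replays kernels and slows the run down considerably):

$ nvprof --analysis-metrics -o prof.nvvp python train_mnist.py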

@sonots commented Jul 11, 2017

For macOS: install the CUDA toolkit for macOS; nvvp is included in it. Run the nvvp command from a console. You don't need a GPU on your Mac.

https://developer.nvidia.com/cuda-downloads

$ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp

@sonots commented Oct 19, 2017

An example of nvprof result:

$ nvprof python examples/stream/cusolver.py
==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
==27986== Profiling application: python examples/stream/cusolver.py
==27986== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 41.70%  125.73us         4  31.431us  30.336us  33.312us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
 21.94%  66.144us        36  1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
 13.77%  41.536us        48     865ns     800ns  1.4400us  [CUDA memcpy HtoD]
  3.02%  9.1200us         2  4.5600us  3.8720us  5.2480us  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>)
  2.65%  8.0000us         2  4.0000us  3.8720us  4.1280us  void gemv2T_kernel_val<double, double, double, int=128, int=16, int=2, int=2, bool=0>(int, int, double, double const *, int, double const *, int, double, double*, int)
  2.63%  7.9360us         2  3.9680us  3.8720us  4.0640us  cupy_copy
  2.44%  7.3600us         2  3.6800us  3.1680us  4.1920us  void syr2_kernel<double, int=128, int=5, bool=1>(cublasSyher2Params<double>, int, double const *, double)
  2.23%  6.7200us         2  3.3600us  3.2960us  3.4240us  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>)
  1.88%  5.6640us         2  2.8320us  2.7840us  2.8800us  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*)
  1.74%  5.2480us         2  2.6240us  2.5600us  2.6880us  void ger_kernel<double, double, int=256, int=5, bool=0>(cublasGerParams<double, double>)
  1.57%  4.7360us         2  2.3680us  2.1760us  2.5600us  void axpy_kernel_val<double, double, int=0>(cublasAxpyParamsVal<double, double, double>)
  1.28%  3.8720us         2  1.9360us  1.7920us  2.0800us  void lacpy_kernel<double, int=5, int=3>(int, int, double const *, int, double*, int, int, int)
  1.19%  3.5840us         2  1.7920us  1.6960us  1.8880us  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>)
  0.98%  2.9440us         2  1.4720us  1.2160us  1.7280us  void reset_diagonal_real<double, int=8>(int, double*, int)
  0.98%  2.9440us         4     736ns     736ns     736ns  [CUDA memset]

==27986== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 60.34%  408.55ms         9  45.395ms  4.8480us  407.94ms  cudaMalloc
 37.60%  254.60ms         2  127.30ms     556ns  254.60ms  cudaFree
  0.94%  6.3542ms       712  8.9240us     119ns  428.32us  cuDeviceGetAttribute
  0.72%  4.8747ms         8  609.33us  320.37us  885.26us  cuDeviceTotalMem
  0.10%  693.60us        82  8.4580us  2.8370us  72.004us  cudaMemcpyAsync
  0.08%  511.79us         1  511.79us  511.79us  511.79us  cudaHostAlloc
  0.08%  511.75us         8  63.969us  41.317us  99.232us  cuDeviceGetName
  0.05%  310.04us         1  310.04us  310.04us  310.04us  cuModuleLoadData
  0.03%  234.87us        24  9.7860us  5.7190us  50.465us  cudaLaunch
  0.01%  50.874us         2  25.437us  16.898us  33.976us  cuLaunchKernel
  0.01%  49.923us         2  24.961us  15.602us  34.321us  cudaMemcpy
  0.01%  47.622us         4  11.905us  8.6190us  19.889us  cudaMemsetAsync
  0.01%  44.811us         2  22.405us  9.5590us  35.252us  cudaStreamDestroy
  0.01%  35.136us        27  1.3010us     289ns  5.8480us  cudaGetDevice
  0.00%  31.113us        24  1.2960us     972ns  3.2380us  cudaStreamSynchronize
  0.00%  30.736us         2  15.368us  4.4580us  26.278us  cudaStreamCreate
  0.00%  13.932us        17     819ns     414ns  3.7090us  cudaEventCreateWithFlags
  0.00%  13.678us        70     195ns     130ns     801ns  cudaSetupArgument
  0.00%  12.050us         4  3.0120us  2.1290us  4.5130us  cudaFuncGetAttributes
  0.00%  10.407us        22     473ns     268ns  1.9540us  cudaDeviceGetAttribute
  0.00%  10.370us        40     259ns     126ns  1.4100us  cudaGetLastError
  0.00%  9.9680us        16     623ns     185ns  2.9600us  cuDeviceGet
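
For reference, output like the above comes from a small CuPy script that runs a cuSOLVER routine on a CUDA stream. A minimal sketch of that kind of script (not the actual examples/stream/cusolver.py):

import cupy as cp

# Run a symmetric eigendecomposition (backed by cuSOLVER) on a non-default stream.
stream = cp.cuda.Stream()
with stream:
    a = cp.random.rand(4, 4)
    a = a + a.T               # symmetrize so eigh applies
    w, v = cp.linalg.eigh(a)  # dispatches to cuSOLVER under the hood
stream.synchronize()
print(w)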

@sonots commented Oct 19, 2017

An example of nvprof --print-gpu-trace result:

$ nvprof --print-gpu-trace python examples/stream/cusolver.py
==28079== NVPROF is profiling process 28079, command: python examples/stream/cusolver.py
==28079== Profiling application: python examples/stream/cusolver.py
==28079== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
652.12ms  1.5360us                    -               -         -         -         -       72B  44.703MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
885.35ms  3.5520us              (1 1 1)         (9 1 1)        35        0B        0B         -           -  GeForce GTX TIT         1        13  cupy_copy [412]
1.17031s  1.2160us                    -               -         -         -         -      112B  87.838MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
1.17104s  1.2800us                    -               -         -         -         -        4B  2.9802MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17117s  2.2400us                    -               -         -         -         -       72B  30.654MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17119s     864ns                    -               -         -         -         -        4B  4.4152MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17123s  1.3760us              (1 1 1)       (256 1 1)         8        0B        0B         -           -  GeForce GTX TIT         1        13  void reset_diagonal_real<double, int=8>(int, double*, int) [840]
1.17125s     768ns                    -               -         -         -         -       16B  19.868MB/s  GeForce GTX TIT         1        13  [CUDA memset]
1.17127s  32.928us              (1 1 1)       (128 1 1)        30  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) [848]
1.17130s  30.016us              (1 1 1)       (128 1 1)        30  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) [853]
1.17134s  2.0160us                    -               -         -         -         -        8B  3.7844MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17135s  1.7920us                    -               -         -         -         -        8B  4.2575MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17137s  1.8560us              (1 1 1)       (384 1 1)        10        0B        0B         -           -  GeForce GTX TIT         1        13  void scal_kernel_val<double, double, int=0>(cublasScalParamsVal<double, double>) [863]
1.17138s     832ns                    -               -         -         -         -        8B  9.1699MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17138s     864ns                    -               -         -         -         -        8B  8.8303MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17139s  1.8240us                    -               -         -         -         -        8B  4.1828MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17140s  1.8880us                    -               -         -         -         -        8B  4.0410MB/s  GeForce GTX TIT         1        13  [CUDA memcpy DtoH]
1.17141s     864ns                    -               -         -         -         -        8B  8.8303MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17142s     832ns                    -               -         -         -         -        8B  9.1699MB/s  GeForce GTX TIT         1        13  [CUDA memcpy HtoD]
1.17143s  5.6320us             (64 1 1)       (128 1 1)        48  5.5000KB        0B         -           -  GeForce GTX TIT         1        13  void syhemv_kernel<double, int=64, int=128, int=4, int=5, bool=1, bool=0>(cublasSyhemvParams<double>) [875]
1.17145s  3.9360us              (1 1 1)       (128 1 1)        14  1.0000KB        0B         -           -  GeForce GTX TIT         1        13  void dot_kernel<double, double, double, int=128, int=0, int=0>(cublasDotParams<double, double>) [882]
1.17146s  3.0400us              (1 1 1)       (128 1 1)        16  1.5000KB        0B         -           -  GeForce GTX TIT         1        13  void reduce_1Block_kernel<double, double, double, int=128, int=7>(double*, int, double*) [888]

[omitted]

@sonots commented Oct 19, 2017

An example of nvvp result:

[nvvp timeline screenshots]

@shizukanaskytree

Thank you for your post!

@rohith14 commented Nov 1, 2018

Hi,
I get "No kernels were profiled" as shown below. Any idea?

$ nvprof --print-gpu-trace python train_mnist.py --network mlp --num-epochs 1
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=1, num_examples=60000, num_layers=None, optimizer='sgd', test_io=0, top_k=0, wd=0.0001)
==27259== NVPROF is profiling process 27259, command: python train_mnist.py --network mlp --num-epochs 1
INFO:root:Epoch[0] Batch [100] Speed: 39195.15 samples/sec accuracy=0.779548
INFO:root:Epoch[0] Batch [200] Speed: 54730.25 samples/sec accuracy=0.915781
INFO:root:Epoch[0] Batch [300] Speed: 52417.13 samples/sec accuracy=0.923281
INFO:root:Epoch[0] Batch [400] Speed: 52111.75 samples/sec accuracy=0.935781
INFO:root:Epoch[0] Batch [500] Speed: 52394.11 samples/sec accuracy=0.946250
INFO:root:Epoch[0] Batch [600] Speed: 52185.51 samples/sec accuracy=0.947656
INFO:root:Epoch[0] Batch [700] Speed: 52354.36 samples/sec accuracy=0.953125
INFO:root:Epoch[0] Batch [800] Speed: 52180.64 samples/sec accuracy=0.958750
INFO:root:Epoch[0] Batch [900] Speed: 52312.02 samples/sec accuracy=0.953281
INFO:root:Epoch[0] Train-accuracy=0.955236
INFO:root:Epoch[0] Time cost=3.287
INFO:root:Epoch[0] Validation-accuracy=0.963774
==27259== Profiling application: python train_mnist.py --network mlp --num-epochs 1
==27259== Profiling result:
No kernels were profiled.

@Double1996

Thank you very much!

@dorraghzela

Hello,

Thank you for sharing this. I have a silly question actually: does the profiled time for, e.g., a [CUDA memcpy] entry include the time of the corresponding cudaMemcpy API call?

Thank you

@jrevels commented Nov 1, 2019

Thanks!

To help out other macOS users in case they run into the same problem I did:

Make sure to point nvvp to a supported version of the JRE:

/Developer/NVIDIA/CUDA-10.1/bin/nvvp -vm /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/jre/bin/java

Also make sure that the argument to -vm is an absolute path; it doesn't seem to understand relative paths.

@dboyda commented Nov 13, 2019

@jrevels, Thank you!

There are also problems with the installation of the CUDA toolkit on Catalina. They can be solved by creating a /Developer link to any writable place. To do it on Catalina, add the line
Developer	/Users/userName
(the two fields separated by a tab) to the file /etc/synthetic.conf, as described here.

@sumit-byte

Hello, I'm having a problem viewing results in the Visual Profiler. Every time I open it and select a file, it returns an error saying "The application being profiled returned a non-zero code".

@HenryYihengXu

Hello, do you know how to see GPU utilization with nvvp? I only see the duration of each function, but I also want to see the percentage of GPU utilization during a function.

@keke8273 commented Mar 4, 2020

I had to use the --profile-child-processes option to get profiling to work on Windows.
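
For reference, a sketch of that kind of invocation; when writing output files for child processes, nvprof expands a %p placeholder in the file name to each child's PID:

$ nvprof --profile-child-processes -o prof_%p.nvvp python train_mnist.py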

@hudannag

I get no output from nvprof, no matter whether I pass arguments or not. It doesn't work with an .exe built with VS2019, nor with nvprof python program.py.

Can you please help?

@hitvoice

On Mac, nvvp prof.nvvp doesn't work.
An absolute path (instead of a relative path like "prof.nvvp") has to be provided.

@danpovey

My recommendation to anyone who wants to install nvvp on an up-to-date Mac (Catalina) is not to even try.
First you have to disable Gatekeeper:
sudo spctl --master-disable
but then, after you download nvvp, it won't run because it wants an outdated Java runtime (JRE) that's hard to get or install. I still haven't figured it out; it looks very, very painful.

@1chimaruGin

Thanks man! Great help

@ccjjs commented Jun 12, 2021

PDD XBJ GOD, YYDS ("the greatest of all time")
