Background: I have tested NVIDIA’s WaveGlow, following the steps under “Generate audio with our pre-existing model”.
I have tested it successfully on its own:
python inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_old.pt -o . --is_fp16 -s 0.6
and have run it with the Nsight Systems profiler:
nsys profile python inference.py -f mel_spectrograms/LJ001-0015.wav.pt -w waveglow_256channels.pt -o . --is_fp16 -s 0.6
Both runs completed successfully, and the Nsight Systems results looked fine.
However, when I run with Nsight Compute:
nv-nsight-cu-cli -f path/to/python inference.py -f <(ls mel_spectrograms/*.pt) -w waveglow_256channels.pt -o . --is_fp16 -s 0.6
I get:
==PROF== Profiling - 1: 0%....50%....100%
Traceback (most recent call last):
  File "inference.py", line 84, in <module>
    args.sampling_rate, args.is_fp16, args.denoiser_strength)
  File "inference.py", line 38, in main
    waveglow = waveglow.remove_weightnorm(waveglow)
  File "/home/msl/isaac/waveglow/glow.py", line 299, in remove_weightnorm
    WN.in_layers = remove(WN.in_layers)
  File "/home/msl/isaac/waveglow/glow.py", line 308, in remove
    old_conv = torch.nn.utils.remove_weight_norm(old_conv)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/utils/weight_norm.py", line 113, in remove_weight_norm
    hook.remove(module)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/utils/weight_norm.py", line 48, in remove
    weight = self.compute_weight(module)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/utils/weight_norm.py", line 18, in compute_weight
    return _weight_norm(v, g, self.dim)
RuntimeError: CUDA error: an illegal memory access was encountered
==PROF== Report: profile.nsight-cuprof-report
weight_norm_fwd_first_dim_kernel, 2019-Apr-10 17:47:31
Section: GPU Speed Of Light
------------------------------------------ --------------- ------------
Memory Frequency                           Ghz             6.47
SOL FB                                     %               0.65
Elapsed Cycles                             cycle           11,676.75
SM Frequency                               Ghz             1.79
Memory [%]                                 %               8.84
Duration                                   usecond         6.53
SOL L2                                     %               1.07
SOL TEX                                    %               1.97
SM [%]                                     %               18.25
------------------------------------------ --------------- ------------
Section: Compute Workload Analysis
------------------------------------------ --------------- ------------
Executed Ipc Active                        inst/cycle      1.23
Executed Ipc Elapsed                       inst/cycle      0.78
Issued Ipc Active                          inst/cycle      1.25
Issue Slots Busy                           %               20.87
SM Busy                                    %               18.25
------------------------------------------ --------------- ------------
Section: Memory Workload Analysis
------------------------------------------ --------------- ------------
Memory Throughput                          Gbyte/second    1.26
Mem Busy                                   %               8.84
Max Bandwidth                              %               6.41
L2 Hit Rate                                %               86.20
Mem Pipes Busy                             %               22.18
L1 Hit Rate                                %               71.99
------------------------------------------ --------------- ------------
Section: Scheduler Statistics
------------------------------------------ --------------- ------------
Active Warps Per Scheduler                 warp/cycle      12.07
Eligible Warps Per Scheduler               warp/cycle      0.64
No Eligible                                %               60.10
Instructions Per Active Issue Slot         inst/issue      1.09
Issued Warp Per Scheduler                  issue/cycle     0.43
One or More Eligible                       %               42.74
------------------------------------------ --------------- ------------
Section: Warp State Statistics
------------------------------------------ --------------- ------------
Avg. Not Predicated Off Threads Per Warp   thread/inst     25.94
Avg. Active Threads Per Warp               thread/inst     30.93
Warp Cycles Per Executed Instruction       cycle/inst      26.11
Warp Cycles Per Issued Instruction         cycle/inst      25.54
Warp Cycles Per Issue Active               cycle/issue     27.72
------------------------------------------ --------------- ------------
Section: Instruction Statistics
------------------------------------------ --------------- ------------
Avg. Executed Instructions Per Scheduler   inst            2,262.40
Executed Instructions                      inst            180,992
Avg. Issued Instructions Per Scheduler     inst            2,312.15
Issued Instructions                        inst            184,972
------------------------------------------ --------------- ------------
Section: Launch Statistics
------------------------------------------ --------------- ------------
Block Size                                                 256
Grid Size                                                  256
Registers Per Thread                       register/thread 13
Shared Memory Configuration Size           Kbyte           48
Dynamic Shared Memory Per Block            Kbyte/block     1
Static Shared Memory Per Block             byte/block      0
Threads                                    thread          65,536
Waves Per SM                                               1.60
------------------------------------------ --------------- ------------
Section: Occupancy
------------------------------------------ --------------- ------------
Block Limit SM                             block           32
Block Limit Registers                      register        16
Block Limit Local Mem                      byte            96
Block Limit Warps                          warp            8
Achieved Active Warps Per SM               warp/cycle      53.04
Achieved Occupancy                         %               82.88
Theoretical Active Warps per SM            warp/cycle      64
Theoretical Occupancy                      %               100
------------------------------------------ --------------- ------------
The program then terminates because of the illegal memory access.
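
To narrow this down, a standalone script along the following lines might help. It is only a sketch, not code from the WaveGlow repository: it exercises the same torch.nn.utils.remove_weight_norm call that appears in the traceback, on a single Conv1d layer. The function name, the file name used below, and the layer sizes are placeholders chosen for illustration, and the layer is assumed to be on the GPU in fp32 when the call is made.

```python
import torch
from torch.nn.utils import weight_norm, remove_weight_norm


def run_weight_norm_roundtrip(channels=256, kernel_size=3):
    """Apply and then remove weight norm on one Conv1d layer.

    The layer sizes are placeholders, not the values used in glow.py.
    """
    # Fall back to the CPU so the script still runs on a machine without CUDA.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    conv = torch.nn.Conv1d(channels, channels, kernel_size)
    conv = weight_norm(conv, name="weight").to(device)
    # Same call that raises in the traceback above.
    conv = remove_weight_norm(conv)
    if device.type == "cuda":
        # CUDA errors are reported asynchronously; synchronizing surfaces them here.
        torch.cuda.synchronize()
    return conv


if __name__ == "__main__":
    layer = run_weight_norm_roundtrip()
    print("ok", tuple(layer.weight.shape), layer.weight.dtype)
```

Saved under a hypothetical name such as repro_weight_norm.py, it can be run first with plain python and then as nv-nsight-cu-cli path/to/python repro_weight_norm.py. If the small script fails only under Nsight Compute, that would point at the profiler's interaction with this kernel rather than at the WaveGlow code.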