cuDNN crashes ever since an error during training

OS: Ubuntu 16.04 LTS (also tested on Windows 10)
CUDA: v9.0 (installed with .deb)
CuDNN: v7.0.5 for CUDA 9.0 (installed with .deb)
NVIDIA drivers: originally 384, now 396.26
GPU: GeForce GTX 1080 Ti

Hi All,

I had been training and testing my models for a few weeks after getting a new GPU, without any problems. Then, suddenly, I got an error mid-training:

E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:208] Unexpected Event status: 1

Ever since, even the mnistCUDNN sample fails randomly. Sometimes it passes, and sometimes it fails with one of two errors:

Loading image data/five_28x28.pgm
Performing forward propagation ...
Cuda failure
Error: an illegal memory access was encountered
mnistCUDNN.cpp:605
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_ALLOC_FAILED
mnistCUDNN.cpp:558

CUDA is installed properly via the package manager and the CUDA samples pass. Other processes (Xorg, compiz, firefox…) have no problems using the GPU. I also tested a game on Win10 on the same machine: The Witcher 3 runs fine.
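
For reference, the call that aborts at mnistCUDNN.cpp:558 is cudnnFindConvolutionForwardAlgorithm. The standalone sketch below reproduces just that call with explicit status checks (the descriptor shapes are my illustrative guesses, not necessarily the sample’s exact ones):

// Hedged standalone sketch of the call that aborts at mnistCUDNN.cpp:558.
// Shapes are illustrative, not the sample's exact ones.
#include <cstdio>
#include <cstdlib>
#include <cudnn.h>

#define CHECK_CUDNN(call)                                            \
  do {                                                               \
    cudnnStatus_t s = (call);                                        \
    if (s != CUDNN_STATUS_SUCCESS) {                                 \
      fprintf(stderr, "cuDNN error %s at %s:%d\n",                   \
              cudnnGetErrorString(s), __FILE__, __LINE__);           \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

int main() {
  cudnnHandle_t handle;
  CHECK_CUDNN(cudnnCreate(&handle));

  // A 1x1x28x28 input with 20 5x5 filters, stride 1, no padding
  // (roughly the first layer of a LeNet-style MNIST network).
  cudnnTensorDescriptor_t x, y;
  cudnnFilterDescriptor_t w;
  cudnnConvolutionDescriptor_t conv;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&y));
  CHECK_CUDNN(cudnnCreateFilterDescriptor(&w));
  CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&conv));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW,
                                         CUDNN_DATA_FLOAT, 1, 1, 28, 28));
  CHECK_CUDNN(cudnnSetFilter4dDescriptor(w, CUDNN_DATA_FLOAT,
                                         CUDNN_TENSOR_NCHW, 20, 1, 5, 5));
  CHECK_CUDNN(cudnnSetConvolution2dDescriptor(
      conv, 0, 0, 1, 1, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW,
                                         CUDNN_DATA_FLOAT, 1, 20, 24, 24));

  // The call that aborts in the sample: benchmark all forward algorithms.
  // This (non-Ex) variant allocates its own scratch memory internally.
  const int requested = 8;
  int returned = 0;
  cudnnConvolutionFwdAlgoPerf_t perf[requested];
  CHECK_CUDNN(cudnnFindConvolutionForwardAlgorithm(
      handle, x, w, conv, y, requested, &returned, perf));

  for (int i = 0; i < returned; ++i)
    printf("algo %d: status=%d, time=%.3f ms, memory=%zu bytes\n",
           (int)perf[i].algo, (int)perf[i].status, perf[i].time,
           perf[i].memory);

  CHECK_CUDNN(cudnnDestroyConvolutionDescriptor(conv));
  CHECK_CUDNN(cudnnDestroyFilterDescriptor(w));
  CHECK_CUDNN(cudnnDestroyTensorDescriptor(y));
  CHECK_CUDNN(cudnnDestroyTensorDescriptor(x));
  CHECK_CUDNN(cudnnDestroy(handle));
  return 0;
}

If even this isolated call fails intermittently, the problem presumably sits below the sample itself (driver, runtime, or hardware).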

Things I’ve already tried:

  1. reinstalling CUDA and cuDNN (with package manager)
  2. reinstalling Ubuntu (and then 1.)
  3. testing on Win10 on the same machine (no cuDNN samples there, but TensorFlow 1.8.0 raises the same errors as on Ubuntu when running the TF samples)
  4. Updating the drivers
  5. Using sudo to run the samples. The persistence daemon works properly, and the problem occurs with persistence both enabled and disabled

Has anyone encountered similar problems before? Could it be a faulty GPU? Am I missing something…?

Update: I contacted general NVIDIA support but was advised to update this thread and wait for the team. I also edited the thread’s title, so hopefully it explains the situation better now (previously: “cuDNN sample fails since a random crash during training”).

Hello Jbrt, cuDNN team here. Can you try three things so we get a better understanding of the issue?

  1. Can you run cuda-memcheck on the mnistCUDNN sample and post the results?

  2. Can you try something like memtest86 on your machine? From what we have seen, these kinds of random “illegal memory access” errors can sometimes be caused by (host) RAM failure. (A quick device-side analogue is sketched after this list.)

  3. Can you try the sample with the latest cuDNN v7.1.4 and see if the issue remains?
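
If you want a quick device-side analogue of such a RAM test, something like the sketch below just writes patterns to VRAM and verifies them on read-back. It is a rough sketch with arbitrary pattern and size choices, not an official diagnostic:

// A crude VRAM write/read-back check - a sketch only, with arbitrary
// pattern and size choices; NOT an official NVIDIA diagnostic tool.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                             \
  do {                                                               \
    cudaError_t e = (call);                                          \
    if (e != cudaSuccess) {                                          \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
              cudaGetErrorString(e), __FILE__, __LINE__);            \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

// Each thread writes a word derived from the pattern and its index.
__global__ void fill(unsigned int *buf, size_t n, unsigned int pattern) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) buf[i] = pattern ^ (unsigned int)i;
}

int main() {
  const size_t n = 256ull << 20;  // 1 GiB worth of 32-bit words
  unsigned int *d = NULL;
  CHECK_CUDA(cudaMalloc(&d, n * sizeof(unsigned int)));
  unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));

  const unsigned int patterns[] = {0x00000000u, 0xFFFFFFFFu, 0xA5A5A5A5u};
  for (size_t k = 0; k < sizeof(patterns) / sizeof(patterns[0]); ++k) {
    const unsigned int p = patterns[k];
    fill<<<(unsigned int)((n + 255) / 256), 256>>>(d, n, p);
    CHECK_CUDA(cudaGetLastError());
    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_CUDA(cudaMemcpy(h, d, n * sizeof(unsigned int),
                          cudaMemcpyDeviceToHost));
    // Verify on the host that every word survived the round trip.
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
      if (h[i] != (p ^ (unsigned int)i)) ++bad;
    printf("pattern 0x%08X: %zu mismatches\n", p, bad);
  }

  free(h);
  CHECK_CUDA(cudaFree(d));
  return 0;
}

Any mismatches here on an otherwise idle GPU would point at the board itself rather than at cuDNN.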

Thanks!

Hello Yanxu, thanks for responding. I’ve just noticed I didn’t mention in my original post that I had actually already run cuda-memcheck.

  1. The output of cuda-memcheck is not fully reproducible - I noticed three types of errors thrown under exactly the same circumstances. The output is pretty long, so I’ll try to shorten it a bit:

1.1:

cudnnGetVersion() : 7104 , CUDNN_VERSION from cudnn.h : 7104 (7.1.4)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28  Capabilities 6.1, SmClock 1620.0 Mhz, MemSize (Mb) 11177, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:558
Aborting...
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

1.2:

cudnnGetVersion() : 7104 , CUDNN_VERSION from cudnn.h : 7104 (7.1.4)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 28  Capabilities 6.1, SmClock 1620.0 Mhz, MemSize (Mb) 11177, MemClock 5505.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading image data/one_28x28.pgm
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 8
=========     at 0x00000098 in void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
=========     by thread (7,5,0) in block (0,0,99)
=========     Address 0x7fe6cf645518 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134d952]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134db47]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x137c8d5]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99abc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99b99]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9acfc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9a6cb]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe7345b]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe6abce]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac2be]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac948]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb210c]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb3921]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x780fa3]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x842c7]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x846e6]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2cc) [0x854ec]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x89368]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x8e993]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnFindConvolutionForwardAlgorithm + 0x248) [0x7fa78]
=========     Host Frame:mnistCUDNN [0x189bb]
=========     Host Frame:mnistCUDNN [0x10d67]
=========     Host Frame:mnistCUDNN [0xe23b]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:mnistCUDNN [0x74d9]
=========

### 6 more errors similar to the above ###

=========
========= Invalid __global__ read of size 8
=========     at 0x00000098 in void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
=========     by thread (1,5,0) in block (0,0,15)
=========     Address 0x7fe6cf645278 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134d952]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x134db47]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x137c8d5]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99abc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe99b99]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9acfc]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe9a6cb]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe7345b]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xe6abce]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac2be]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcac948]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb210c]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xcb3921]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x780fa3]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x842c7]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x846e6]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2cc) [0x854ec]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x89368]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x8e993]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnFindConvolutionForwardAlgorithm + 0x248) [0x7fa78]
=========     Host Frame:mnistCUDNN [0x189bb]
=========     Host Frame:mnistCUDNN [0x10d67]
=========     Host Frame:mnistCUDNN [0xe23b]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:558
Aborting...

=========     Host Frame:mnistCUDNN [0x74d9]
=========
========= ERROR SUMMARY: 8 errors

1.3: as above, but with 4 errors of the same type.

  2. Thanks, I ran memtest86 as advised - it passed with no errors.

  3. I’ve just tested with cuDNN v7.1.4 + CUDA 9.0 and cuDNN v7.1.4 + CUDA 9.2. The problem persists.

If you have any ideas, please let me know. If it’s a hardware issue, I can submit a warranty claim to the distributor. It’s very important for me to solve this as soon as possible. Thanks for the help!

Hello jbrt. I’ve also run into this problem, with a 1080 Ti on CentOS 7, CUDA 9.0, cuDNN 7.1.4. I have reinstalled the NVIDIA driver, cuDNN, and CUDA, but none of that helped. Have you solved this problem?

Hi zzhanhuimei,

Sorry for the late response. I submitted a warranty claim, got my money back, and got a new GPU - it works now.
It must’ve been a hardware problem, unfortunately.

Hi,

I also got CUDA_ERROR_ILLEGAL_ADDRESS (“an illegal memory access was encountered”) when using a GeForce GTX 1080 Ti on Ubuntu. My convolutional neural network application worked well for a long time, and then it suddenly hit this error. I also have a GeForce RTX 2080 Ti Rev. A on which the same application is still running fine (before this, I ran the application in parallel on both GPUs for many weeks). Reading this thread, should I conclude it’s a hardware failure and send the GPU back?

Any advice appreciated, thanks!