Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

charel.van.hoof · November 8, 2018, 4:40pm

How come I get CUDA errors when I run a Colab notebook on my PC? And then you try to find your way through Nvidia's impossible support pages.
I get this error when I run with the GPU accelerator runtime: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize
Everything runs fine without the GPU accelerator.
I tried a lot: I downloaded cudnn-10.0-windows10-x64-v7.3.1.20.zip and followed the manual copy-and-paste installation instructions (really?). It did not help.

Windows 10
Google Colab
Keras Neural Network
CUDA version 10

Any help is most welcome.

charel.van.hoof · November 8, 2018, 4:45pm

The neural network, in case it matters:

import tensorflow as tf

# NUM_FRAMES and NUM_ACTIONS are constants defined elsewhere in the poster's code.
self.model = tf.keras.models.Sequential()
self.model.add(tf.keras.layers.Conv2D(32, (8, 8), input_shape=(84, 84, NUM_FRAMES), strides=(4, 4)))
self.model.add(tf.keras.layers.Activation('relu'))
self.model.add(tf.keras.layers.Conv2D(64, (4, 4), strides=(2, 2)))
self.model.add(tf.keras.layers.Activation('relu'))
self.model.add(tf.keras.layers.Conv2D(64, (3, 3)))
self.model.add(tf.keras.layers.Activation('relu'))
self.model.add(tf.keras.layers.Flatten())
self.model.add(tf.keras.layers.Dense(512))
self.model.add(tf.keras.layers.Activation('relu'))
self.model.add(tf.keras.layers.Dense(NUM_ACTIONS))

charel.van.hoof · November 8, 2018, 5:01pm

A simple check, run in Colab on my PC:

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

shows
Found GPU at: /device:GPU:0

and the result of the benchmark test:
Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
9.33978390694
GPU (s):
0.196625232697
GPU speedup over CPU: 47x
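The 47x figure is simply the ratio of the two timings. A quick check in Python using the numbers from the output above (truncating to an integer is an assumption about how the notebook formats the result):

```python
# Timings copied from the Colab benchmark output above (seconds, sum of ten runs).
cpu_seconds = 9.33978390694
gpu_seconds = 0.196625232697

# Speedup is CPU time divided by GPU time; int() truncates the fractional part.
speedup = cpu_seconds / gpu_seconds
print("GPU speedup over CPU: {}x".format(int(speedup)))
```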

So I do have GPU power. Why does Conv2D crash?

stevedevney · December 11, 2018, 11:37am

Were you ever able to solve this problem? I have the same issue.

iro2017009 · January 2, 2019, 3:07pm

This is a compatibility issue between tensorflow-gpu 1.10.x and later versions and CUDA 9.0 with cuDNN 7.0.5. The easiest fix is to downgrade TensorFlow to version 1.8:

pip install --upgrade tensorflow-gpu==1.8.0

will solve the problem.

dt.parmpal · January 21, 2019, 12:53pm

Thanks iro2017009, it works for me.

shuonan · April 5, 2019, 9:07pm

What if I am using TF 2.0 alpha?

mania_baghda · April 19, 2019, 7:55pm

I have TF 2.0 alpha.
I had to downgrade to CUDA 10.0 and cuDNN 7.4 to make it work.

zouzhipeng.1 · May 10, 2019, 12:26pm

However, my GPUs are RTX 2070s, and the CUDA version NVIDIA recommends is 10.0, so according to the official TensorFlow documentation I can only use TensorFlow v1.13.1. But I get the error above. How can I solve this problem?

dt.parmpal · May 10, 2019, 12:36pm

@zouzhipeng Verify the cuDNN version as well; it should be 7.4.1 for TensorFlow 1.13.1 with CUDA 10.0.

zouzhipeng.1 · May 10, 2019, 1:05pm

I have checked and verified again; the versions are the same as I reported. I also ran NVIDIA's mnistCUDNN sample, and that test passed. But I can't get convolutions to work in TensorFlow v1.13.1. TensorFlow is installed with pip, Python is 3.6.5, CUDA is 10.0, and cuDNN is 7.4.1.

So what is causing this problem?
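Since NVIDIA's mnistCUDNN sample passes, one thing worth ruling out is TensorFlow picking up a different CUDA or cuDNN library than the one the sample used. A minimal sketch, using only Python's standard library, that asks `ctypes.util.find_library` which cudart, cublas and cudnn libraries are visible. `find_library` does not replicate the dynamic loader's search exactly, so treat the result as a hint, not proof of what TensorFlow actually loads; `find_cuda_libs` is a hypothetical helper name:

```python
import ctypes.util

# Library base names without the "lib" prefix: CUDA runtime, cuBLAS and cuDNN.
CUDA_LIBS = ["cudart", "cublas", "cudnn"]


def find_cuda_libs(names=CUDA_LIBS):
    """Return {name: library file name or None} as reported by ctypes.util."""
    return {name: ctypes.util.find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in find_cuda_libs().items():
        print("{:8s} -> {}".format(name, found or "not found"))
```

On Linux the result is typically a soname such as libcudnn.so.7, which shows the major version only; the exact minor version still has to come from the header or the library file itself.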

dt.parmpal · May 10, 2019, 1:12pm

What OS are you using, and what error did you get?

zouzhipeng.1 · May 10, 2019, 1:25pm

We are using CentOS Linux release 7.6.1810 (Core).

The error is:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 115, in <module>
    model.train(input_fn, steps=num_steps)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/python3.6/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node ConvNet/conv2d/Conv2D (defined at test.py:45) ]]

And the code is cloned from https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py

mikechen6688 · May 20, 2019, 9:38am

I have the same problem.

I installed everything on Ubuntu 16.04 LTS (Ubuntu 18.04 LTS has a gcc compatibility problem, so I uninstalled it). I installed two components: the NVIDIA driver (NVIDIA-Linux-x86_64-415.27.run) and Anaconda (Anaconda3-2019.03-Linux-x86_64.sh). Using conda install (TensorFlow v1.13.1), I installed tensorflow-gpu, which includes both cudatoolkit-10.0.130 and cudnn-7.3.1.

I use the following commands to set up the Jupyter environment:

(tf-gpu) mike@mike:~$ conda install jupyter
…
(tf-gpu) mike@mike:~$ python -m ipykernel install --user --name tf-gpu --display-name "TensorFlow-GPU"
(tf-gpu) mike@mike:~$ jupyter notebook

However, I got the error. After I put the MNIST test code into a cell of the TensorFlow-GPU kernel, it failed with "Failed to get convolution algorithm", which the message attributes to cuDNN failing to initialize. I updated everything according to the suggestions in this forum, but the problem persisted after updating tensorflow-gpu in the tf-gpu environment. Please have a look at the following information.

UnknownError: Failed to get convolution algorithm

UnknownError Traceback (most recent call last)
in
33
34 model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
---> 35 validation_data=(X_test,y_test), callbacks=[tensor_board])
36

~/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1037 initial_epoch=initial_epoch,
1038 steps_per_epoch=steps_per_epoch,
-> 1039 validation_steps=validation_steps)
1040

…

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
[[{{node metrics/acc/Mean}}]]

Please help me solve the issue.

Thanks in advance,

Mike

dt.parmpal · May 20, 2019, 9:56am

TensorFlow 1.13.1 requires cuDNN 7.4.1. Update your cuDNN and it will work.
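The version rule in this reply can be written down as a small check. This is a sketch built only from the combinations reported in this thread (TensorFlow 1.13.1 with CUDA 10.0 and cuDNN 7.4.1; TF 2.0 alpha with CUDA 10.0 and cuDNN 7.4), not an official compatibility table. Treating a newer cuDNN 7.x release as acceptable is an assumption, and `REPORTED_COMBOS`, `parse_version` and `check_combo` are hypothetical names:

```python
# Combinations reported as working in this thread (not an official table):
# TensorFlow version -> (CUDA version, minimum cuDNN version as (major, minor, patch))
REPORTED_COMBOS = {
    "1.13.1": ("10.0", (7, 4, 1)),
    "2.0-alpha": ("10.0", (7, 4, 0)),
}


def parse_version(text):
    """Turn '7.3.1' into (7, 3, 1); missing parts are padded with 0 ('7.4' -> (7, 4, 0))."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def check_combo(tf_version, cuda_version, cudnn_version):
    """Return a list of human-readable problems; an empty list means none were found."""
    if tf_version not in REPORTED_COMBOS:
        return ["no reported combination for TensorFlow " + tf_version]
    want_cuda, want_cudnn = REPORTED_COMBOS[tf_version]
    problems = []
    if cuda_version != want_cuda:
        problems.append("CUDA {} installed, {} expected".format(cuda_version, want_cuda))
    have = parse_version(cudnn_version)
    # Assumption: same cuDNN major version, and at least the reported minor/patch level.
    if have[0] != want_cudnn[0] or have < want_cudnn:
        problems.append("cuDNN {} installed, {} or a later {}.x expected".format(
            cudnn_version, ".".join(map(str, want_cudnn)), want_cudnn[0]))
    return problems


# Mike's conda environment from this thread: TF 1.13.1, cudatoolkit 10.0.130, cudnn 7.3.1.
print(check_combo("1.13.1", "10.0", "7.3.1"))
```

Mike's environment (cudnn-7.3.1) is flagged, while zouzhipeng's reported setup (TF 1.13.1, CUDA 10.0, cuDNN 7.4.1) passes, which is consistent with his report that matching versions alone did not fix the error.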

zouzhipeng.1 · May 20, 2019, 10:06am

Actually, I tried many versions from 7.3 to 7.5, but none of them work.

dt.parmpal · May 20, 2019, 11:00am

Check the cuDNN version in /usr/include/cudnn.h; it should look like this:
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
Also try installing TensorFlow with pip:
pip install tensorflow-gpu
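The header check above can be scripted. A minimal sketch that reads the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL defines; the header paths are assumptions (adjust them for your install), and cuDNN 8 and later keep these defines in cudnn_version.h rather than cudnn.h. `cudnn_version_from_header` and `installed_cudnn_version` are hypothetical helper names:

```python
import re

# Header locations to try. /usr/include/cudnn.h is the path mentioned above;
# cuDNN 8 and later moved the version macros into cudnn_version.h.
CANDIDATE_HEADERS = ["/usr/include/cudnn.h", "/usr/include/cudnn_version.h"]


def cudnn_version_from_header(text):
    """Extract (major, minor, patch) from the CUDNN_* #define lines, or None if any is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + name + r"\s+(\d+)", text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(paths=CANDIDATE_HEADERS):
    """Return the version from the first readable header that defines it, else None."""
    for path in paths:
        try:
            with open(path, errors="replace") as header:
                version = cudnn_version_from_header(header.read())
        except OSError:
            continue  # header not present or not readable at this path
        if version is not None:
            return version
    return None


if __name__ == "__main__":
    print(installed_cudnn_version())
```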

hydrioninvoker · August 2, 2019, 6:36am

Hello, were you using TensorFlow when you hit this problem? Maybe it is caused by your TensorFlow version; try updating TF to version 1.14.0.
My problem was quite similar to yours: I also passed the cuDNN test with CUDA 10.0 and cuDNN 7.5.0, but TensorFlow still hit this error. In the end, I solved it by updating TensorFlow.

cdesivo92 · August 11, 2019, 9:36pm

I tried this, but it says no version satisfies my requirements. I'm using a Jetson Nano.

michael.gschwind · September 5, 2019, 7:36pm

So I tried TF 1.8, but that won’t work with CUDA 10.0.
$ pip3 install --upgrade tensorflow-gpu==1.8.0

[…]

~/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py in
72 for some common reasons and solutions. Include the entire stack trace
73 above this error message when asking for help.""" % traceback.format_exc()
---> 74 raise ImportError(msg)
75
76 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long

ImportError: Traceback (most recent call last):
  File "/home/mkg/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/mkg/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/mkg/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
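The last line of this traceback names the CUDA library the TensorFlow binary was linked against: libcublas.so.9.0 is cuBLAS from CUDA 9.0, which is why TF 1.8 fails to import on a CUDA 10.0 machine. A small sketch that extracts the library and version from such a message (`missing_cuda_library` is a hypothetical helper, not part of TensorFlow):

```python
import re


def missing_cuda_library(error_message):
    """Return (library, version) from a 'libXXX.so.N.M: cannot open shared object file' message, or None."""
    match = re.search(r"lib(\w+)\.so\.(\d+(?:\.\d+)*): cannot open shared object file", error_message)
    if match is None:
        return None
    return match.group(1), match.group(2)


msg = "ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory"
print(missing_cuda_library(msg))  # ('cublas', '9.0')
```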