Failing to install 10.1 via .run file on RHEL7 as non-root

tobyykj47 · March 27, 2019, 5:22pm

Hi

% uname -r
3.10.0-693.17.1.el7.x86_64

More than one issue here. Installing as non-root, the --verbose option isn’t recognised:

% sh ./cuda_10.1.105_418.39_linux.run --toolkit --toolkitpath=/public/EM/CUDA/cuda-10.1 --samples --samplespath=/public/EM/CUDA/cuda-10.1 --silent --verbose
Unknown option: --verbose

Removing --verbose (and --silent):

% sh ./cuda_10.1.105_418.39_linux.run --toolkit --toolkitpath=/public/EM/CUDA/cuda-10.1 --samples --samplespath=/public/EM/CUDA/cuda-10.1
/usr/lib64/libcublasLt.so.10.1.0.105 can’t be opened
/usr/lib64/libcublas.so.10.1.0.105 can’t be opened
/usr/lib64/libnvblas.so.10.1.0.105 can’t be opened
Completed with errors. See log at /tmp/cuda-installer.log for details.

The end of (14K lines of) cuda-installer.log reads:

[INFO]: Installed: /public/EM/CUDA/cuda-10.1/targets/x86_64-linux/lib/stubs/libcusolver.so
[WARNING]: Unable to write to directory: /var/log
[WARNING]: Unable to write uninstall manifest to /var/log/nvidia/.uninstallManifests/CUDA_Toolkit_10.1-components/CUDA_Libraries_10.1-components/CUDA_Development_10.1-components/!
[INFO]: libcublas-dev
[INFO]: /usr/bin/lsb_release

[WARNING]: Unable to write to directory: /
[ERROR]: Unable to write to /usr/
[ERROR]: Install of libcublas-dev failed, quitting
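
With a log of roughly 14K lines, it can help to filter out the [INFO] entries and look only at the problems. The following is a minimal sketch, assuming every entry starts with a bracketed level such as "[WARNING]:" or "[ERROR]:" as in the excerpt above; log_problems is a hypothetical helper name, not part of the installer.

```shell
#!/bin/sh
# log_problems (hypothetical helper): print only the WARNING and ERROR
# lines of an installer log. Defaults to the log path the installer
# reported in this thread.
log_problems() {
    log_file="${1:-/tmp/cuda-installer.log}"
    # Assumes the "[LEVEL]: message" layout seen in the excerpt above.
    grep -E '^\[(WARNING|ERROR)\]:' "$log_file"
}
```

For example, log_problems /tmp/cuda-installer.log | tail -n 20 would show the last few warnings and errors.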

Cheers
Toby

tobyykj47 · April 2, 2019, 10:28am

Anyone?

byning · April 5, 2019, 11:36am

Hi, I think you have to use root or ‘sudo’ to install the Nvidia driver, because the driver files are normally placed under /var and /usr, which need root privileges. At least, in the many times I have installed the Nvidia driver on Ubuntu, I used sudo every time.

BTW, the warnings and errors in your post are all about missing root privileges.
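
One way to confirm that reading of the log is to check, as the installing user, which of the directories named in the messages are actually writable. This is only a sketch; check_writable is a hypothetical helper, and a directory that does not exist yet is reported as not writable, so check its parent in that case.

```shell
#!/bin/sh
# check_writable (hypothetical helper): report, for each directory given,
# whether the current user can write to it. Returns non-zero if any
# directory is missing or not writable.
check_writable() {
    rc=0
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "ok: $dir"
        else
            echo "not writable: $dir"
            rc=1
        fi
    done
    return $rc
}
```

For the directories mentioned in this thread, that would be something like check_writable /usr /var/log /public/EM/CUDA.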

Hope this could help.

tobyykj47 · April 5, 2019, 12:17pm

Hi

Thanks for the suggestion, but I don’t want to install the driver, just the toolkit and samples. It’s being installed on an NFS share to be accessed by many other systems, so anything installed to a local /usr isn’t going to be of any use.

Cheers
Toby

byning · April 5, 2019, 12:41pm

Hi Toby,

Sorry for misreading your post; I took it for granted that you were trying to install the driver.

The latest CUDA version I have installed so far is 10.0, on Ubuntu 18.04.2 a couple of days ago. I have always followed the official Linux installation guide, where root privileges are required because the toolkit is installed in /usr/local/ by default. As for your problem, I don’t know why you are not using sudo, but judging from your post I still think root privileges are needed.

Still, that’s just my rookie view, and I hope you find a solution. Good luck!

bernard.at.spark · April 5, 2019, 11:41pm

Hi,

I installed CUDA 10.1 in my $HOME just for testing, while the system still uses CUDA 10.0. I ran the .run file as a local user (without sudo) and skipped the driver, which is already installed.

This seems similar to your use case. I don’t share my install, it is just for my own testing, so you may need extra configuration and probably some fiddling with SELinux.

Here is what worked for me, run from the folder containing the .run file:

./cuda_10.1.105_418.39_linux.run --silent --toolkit --toolkitpath=$HOME/opt/cuda_test/cuda --defaultroot=$HOME/opt/cuda_test/cuda

I am on Ubuntu 16.04; it should be similar on RHEL.

The samples are inside the cuda folder; you can just copy them afterwards or set --samplespath.
Here is my thread:
https://devtalk.nvidia.com/default/topic/1047863/cuda-setup-and-installation/cuda-10-1-install-path/

(In it I said simpleCudaGraphs didn’t compile, which turned out to be my mistake. This second thread may help if you have another CUDA installed system-wide and run into similar problems:
https://devtalk.nvidia.com/default/topic/1048270/cuda-setup-and-installation/cuda-10-1-simplecudagraphs-doesn-t-compile/ )

The “unable to write to /var/log” messages are harmless and are suppressed with --silent. I have tested that my test cuda-10.1 works: all the samples compiled and ran. Many third-party apps such as tensorflow break, but that is because Nvidia moved some of the files around (see AndyDick’s post in my first link) and changed the sonames of the libs; these are not problems with the CUDA installation itself.
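
Since --toolkitpath and --defaultroot point at the same prefix in the command above, it may be convenient to generate the command line from a single prefix argument and review it before running. The sketch below only prints the command; cuda_install_cmd is a hypothetical helper, the flags are the ones used in this thread for the 10.1 runfile, and paths containing spaces are not handled.

```shell
#!/bin/sh
# cuda_install_cmd (hypothetical helper): print, but do not run, the
# non-root toolkit install command for a given runfile and prefix.
# The same prefix is used for --toolkitpath and --defaultroot, as in
# the command that worked in this thread.
cuda_install_cmd() {
    runfile="$1"
    prefix="$2"
    printf 'sh %s --silent --toolkit --toolkitpath=%s --defaultroot=%s\n' \
        "$runfile" "$prefix" "$prefix"
}
```

For example, cuda_install_cmd ./cuda_10.1.105_418.39_linux.run "$HOME/opt/cuda_test/cuda" prints the command for the prefix used above, which can then be pasted into the shell.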

tobyykj47 · April 6, 2019, 9:20am

Hi Bernard

Thank you so much, using --defaultroot=… was the magic sauce!

Cheers
Toby
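
For a toolkit installed under a custom prefix such as the NFS share described earlier, each client shell still has to be pointed at it. A minimal sketch, assuming the usual toolkit layout with bin and lib64 directories under the prefix; use_cuda is a hypothetical helper name.

```shell
#!/bin/sh
# use_cuda (hypothetical helper): put a toolkit installed under a custom
# prefix first on PATH and LD_LIBRARY_PATH for the current shell.
# Assumes the prefix contains bin/ and lib64/.
use_cuda() {
    cuda_home="$1"
    PATH="$cuda_home/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$cuda_home/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}
```

On a client this could be called as use_cuda /public/EM/CUDA/cuda-10.1, after which nvcc should be found on PATH.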

phoenire · July 22, 2020, 7:53am

I know the subject is closed, but I had a similar issue:

/var/log/nvidia/.uninstallManifests/CUDA_Toolkit_11.0-components/CUDA_Libraries_11.0-components/CUDA_Development_11.0-components/ can’t be opened
/var/log/nvidia/.uninstallManifests/CUDA_Toolkit_11.0-components/CUDA_Libraries_11.0-components/CUDA_Development_11.0-components/ can’t be opened
/var/log/nvidia/.uninstallManifests/CUDA_Toolkit_11.0-components/CUDA_Libraries_11.0-components/CUDA_Development_11.0-components/ can’t be opened
/var/log/nvidia/.uninstallManifests/CUDA_Toolkit_11.0-components/CUDA_Libraries_11.0-components/CUDA_Development_11.0-components/ can’t be opened
terminate called after throwing an instance of ‘boost::filesystem::filesystem_error’
what(): boost::filesystem::copy_file: No such file or directory: “./builds/cuda_cudart/targets/x86_64-linux/include/CL/cl.h”, “/usr/local/cuda-11.0/targets/x86_64-linux/include/CL/cl.h”
Aborted (core dumped)

And I fixed it just by removing the leftover directory: sudo rm -rf /usr/local/cuda-11.0

Even after running both:
sudo /usr/local/cuda-10.2/bin/cuda-uninstaller
sudo apt-get remove --auto-remove nvidia-cuda-toolkit

It was still there. Maybe it was created by an earlier run of the .run script that hit an error. Anyway, removing it solved my issue.
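
Before rerunning the installer, it may be worth listing any versioned toolkit directories that survived an uninstall. This sketch only lists them and deletes nothing; leftover_cuda_dirs is a hypothetical helper, and the cuda-* naming under /usr/local is assumed from the paths in this post.

```shell
#!/bin/sh
# leftover_cuda_dirs (hypothetical helper): list cuda-* directories that
# still exist under a root directory (default /usr/local). Regular files
# and unmatched patterns are skipped.
leftover_cuda_dirs() {
    root="${1:-/usr/local}"
    for dir in "$root"/cuda-*; do
        [ -d "$dir" ] && echo "$dir"
    done
    return 0
}
```

Any directory it prints for the version being installed can then be inspected and removed by hand.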

wangxf0001 · May 30, 2021, 8:38am

I have the same problem.