Error installing NVIDIA drivers on x86_64 Amazon EC2 GPU cluster (Tesla T20 GPU)
Hi,

I am trying to install the NVIDIA drivers on the Amazon GPU cluster, but I get an error when installing the drivers. The nvidia-installer.log is attached.


The kernel version is:

3.8.0-19-generic

lshw reports the following:

lshw -C display
WARNING: you should run this program as super-user.
*-display:0 UNCLAIMED
description: VGA compatible controller
product: GD 5446
vendor: Cirrus Logic
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master
configuration: latency=0
resources: memory:d0000000-d1ffffff memory:d7100000-d7100fff
*-display:1 UNCLAIMED
description: 3D controller
product: GF100GL [Tesla T20 Processor]
vendor: NVIDIA Corporation
physical id: 3
bus info: pci@0000:00:03.0
version: a3
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
resources: memory:d2000000-d3ffffff memory:c0000000-c3ffffff memory:c4000000-c7ffffff ioport:c100(size=128) memory:d7000000-d707ffff
*-display:2 UNCLAIMED
description: 3D controller
product: GF100GL [Tesla T20 Processor]
vendor: NVIDIA Corporation
physical id: 4
bus info: pci@0000:00:04.0
version: a3
width: 64 bits
clock: 33MHz
capabilities: bus_master cap_list
configuration: latency=0
resources: memory:d4000000-d5ffffff memory:c8000000-cbffffff memory:cc000000-cfffffff ioport:c180(size=128) memory:d7080000-d70fffff


Also, as per the Amazon EC2 docs, the cg1.4xlarge GPU cluster instance is based on Tesla M2050 GPUs, but lspci / lshw seem to report that they are Tesla T20 GPUs. From what I understand, Tesla M-class GPUs are based on the T20 chip, so hopefully I have selected the right drivers.

The driver versions that I have tried are NVIDIA-Linux-x86_64-319.23.run and NVIDIA-Linux-x86_64-319.17.run, both of which support Tesla M-class GPUs, and both report the same problem.


Thanks & Regards,
Divick
Attachments

nvidia-installer.log

#1
Posted 06/17/2013 11:25 AM   
I think that you don't have the drm kernel modules installed on that system. The last part of the log file indicates that the nvidia module cannot find the drm_* symbols. Maybe you have to install them first, or load them into the running kernel via modprobe.
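For example, a minimal attempt (a sketch, assuming the drm module ships with the running kernel) would be:

sudo modprobe drm        # load the DRM core so the nvidia module can resolve the drm_* symbols
lsmod | grep drm         # confirm it is actually loaded

and then re-run the installer.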

#2
Posted 06/17/2013 11:45 AM   
In addition to that, the missing "drm_gem_prime_export" symbol seems to exist only in the 3.9 kernel. I don't know if this symbol is a hard requirement, but if installing the drm modules doesn't work, maybe you should try an older NVIDIA driver or a newer kernel.

#3
Posted 06/17/2013 11:52 AM   
[quote="karolherbst"]I think that you don't have the drm kernel modules installed on that system. The last part of the log file indicates, that the nvidia module doesn't find drm_* symbols. Maybe you have to install them first or load them into the runtime via modprobe.[/quote] Hi, thanks for the reply. Hmm I see ... that means I would need to build and install the kernel modules for the kernel installed on the amazon ami isn't it? I then tried with an older AMI (i.e. for ubuntu 12.04 instead of ubuntu 13.04) and the driver installed just fine. Nevertheless when I get hold again of ubuntu 13.04 AMI, I will try building and installing the kernel modules for drm.
karolherbst said: I think that you don't have the drm kernel modules installed on that system. The last part of the log file indicates that the nvidia module cannot find the drm_* symbols. Maybe you have to install them first, or load them into the running kernel via modprobe.


Hi, thanks for the reply. Hmm, I see... that means I would need to build and install the drm kernel modules for the kernel installed on the Amazon AMI, right? I then tried an older AMI (i.e. Ubuntu 12.04 instead of Ubuntu 13.04) and the driver installed just fine. Nevertheless, when I get hold of an Ubuntu 13.04 AMI again, I will try building and installing the drm kernel modules.

#4
Posted 06/17/2013 02:45 PM   
You could look into /lib/modules and search there. A file named "modules.symbols" should list all the symbols exported by the modules. You could also try to modprobe the drm module(s), or look at the kernel configuration in /proc/config(.gz) to see whether the kernel is configured with drm.

Also lsmod should be worth a look.
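Roughly, those checks would look like this (a sketch; paths assume a stock Ubuntu kernel, and /proc/config.gz only exists if the kernel was built with CONFIG_IKCONFIG_PROC):

grep drm_gem_prime_export /lib/modules/$(uname -r)/modules.symbols   # is the symbol exported by any module?
ls /lib/modules/$(uname -r)/kernel/drivers/gpu/drm/                   # is drm.ko shipped at all?
grep CONFIG_DRM /boot/config-$(uname -r)                              # or: zgrep CONFIG_DRM /proc/config.gz
lsmod | grep drm                                                      # is it already loaded?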

#5
Posted 06/17/2013 03:55 PM   
I am unable to download your attachment nvidia-installer.log. What error are you seeing?

Thanks,
Sandip.

#6
Posted 06/18/2013 02:31 PM   
The errors are as logged below:

-> Unable to determine if Secure Boot is enabled: No such file or directory
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
-> Kernel module load error: No such file or directory
-> Kernel messages:
[ 1117.323913] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 1117.323918] nvidia: Unknown symbol drm_ioctl (err 0)
[ 1117.323928] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 1117.323942] nvidia: Unknown symbol drm_read (err 0)
[ 1117.323957] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 1117.323962] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 1117.324002] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 1117.324079] nvidia: Unknown symbol drm_release (err 0)
[ 1117.324084] nvidia: Unknown symbol drm_gem_prime_export (err 0)
[ 1863.597421] mtrr: no MTRR for d0000000,100000 found
[ 3341.270419] nvidia: Unknown symbol drm_open (err 0)
[ 3341.270426] nvidia: Unknown symbol drm_fasync (err 0)
[ 3341.270436] nvidia: Unknown symbol drm_poll (err 0)
[ 3341.270449] nvidia: Unknown symbol drm_pci_init (err 0)
[ 3341.270499] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 3341.270517] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 3341.270532] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 3341.270537] nvidia: Unknown symbol drm_ioctl (err 0)
[ 3341.270546] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 3341.270559] nvidia: Unknown symbol drm_read (err 0)
[ 3341.270575] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 3341.270580] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 3341.270619] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 3341.270636] nvidia: Unknown symbol drm_release (err 0)
[ 3341.270639] nvidia: Unknown symbol drm_gem_prime_export (err 0)
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

#7
Posted 06/19/2013 04:11 AM   
This issue is resolved now. In case it is helpful to someone else, I am putting my resolution here.

The issue is seen on Ubuntu 13.04 with kernel 3.8.0-19-generic. The problem was that the installer was unable to find and load the drm.ko module; somehow I did not even find it installed in /lib/modules/3.8.0-19-generic. So I installed the kernel sources from the Ubuntu repository, built the kernel and modules, inserted drm.ko, and then the NVIDIA driver installation succeeded.

1. sudo apt-get source linux-image-3.8.0-19-generic
2. cd linux-3.8.0
3. sudo cp /boot/config-3.8.0-19-generic .config
4. sudo make menuconfig

Select

Device drivers --->
Graphics support --->
<M> Direct Rendering Manager (XFree86 4.1.0 and higher DRI support) --->

5. make -j16
6. sudo insmod ./drivers/gpu/drm/drm.ko
7. sudo sh NVIDIA-Linux-x86_64-319.23.run --opengl-headers

That's all. BTW, I tried installing the modules, but somehow I still don't see drm.ko in /lib/modules/3.8.0-19-generic/, so I am not sure whether the nvidia kernel driver will load on reboot.
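One way to make the freshly built module survive a reboot might be to install it under /lib/modules by hand and refresh the module dependency index (a sketch, run from the kernel source directory; the destination path is an assumption, not something verified on this AMI):

sudo mkdir -p /lib/modules/$(uname -r)/kernel/drivers/gpu/drm
sudo cp drivers/gpu/drm/drm.ko /lib/modules/$(uname -r)/kernel/drivers/gpu/drm/
sudo depmod -a                        # rebuild modules.dep so modprobe can find drm
echo drm | sudo tee -a /etc/modules   # have drm loaded automatically at boot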

#8
Posted 07/05/2013 08:05 AM   
I have found the issue with the modules building but not getting loaded on reboot. Apparently the kernel version that gets built shows up as 3.8.13 instead of 3.8.0-19, so the modules get placed in /lib/modules/3.8.13.2/. So you need to change the kernel version at the top of the Makefile, or by some other mechanism; I don't know of a way to do so apart from this.
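A rough way to check and adjust this (the values below are illustrative; compare against uname -r on the instance):

uname -r              # e.g. 3.8.0-19-generic: the tree the modules must be installed under
make kernelrelease    # the version string the source tree will actually build as
# If the two differ, edit VERSION/PATCHLEVEL/SUBLEVEL/EXTRAVERSION at the top of the kernel
# Makefile (or set CONFIG_LOCALVERSION in .config) until kernelrelease matches, then rebuild
# and run: sudo make modules_install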

#9
Posted 07/05/2013 04:26 PM   
I am getting this error when trying to modprobe nvidia after building the 331 drivers on an AWS GPU cluster machine with Ubuntu 14.04:

modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)


[ 9552.922683] nvidia: Unknown symbol drm_open (err 0)
[ 9552.922696] nvidia: Unknown symbol drm_poll (err 0)
[ 9552.922707] nvidia: Unknown symbol drm_pci_init (err 0)
[ 9552.922750] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 9552.922763] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 9552.922775] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 9552.922779] nvidia: Unknown symbol drm_ioctl (err 0)
[ 9552.922787] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 9552.922798] nvidia: Unknown symbol drm_read (err 0)
[ 9552.922813] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 9552.922819] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 9552.922857] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 9552.922871] nvidia: Unknown symbol drm_release (err 0)
[ 9552.922874] nvidia: Unknown symbol drm_gem_prime_export (err 0)
[ 9836.615496] nvidia: Unknown symbol drm_open (err 0)
[ 9836.615509] nvidia: Unknown symbol drm_poll (err 0)
[ 9836.615520] nvidia: Unknown symbol drm_pci_init (err 0)
[ 9836.615564] nvidia: Unknown symbol drm_gem_prime_handle_to_fd (err 0)
[ 9836.615577] nvidia: Unknown symbol drm_gem_private_object_init (err 0)
[ 9836.615589] nvidia: Unknown symbol drm_gem_mmap (err 0)
[ 9836.615593] nvidia: Unknown symbol drm_ioctl (err 0)
[ 9836.615601] nvidia: Unknown symbol drm_gem_object_free (err 0)
[ 9836.615612] nvidia: Unknown symbol drm_read (err 0)
[ 9836.615626] nvidia: Unknown symbol drm_gem_handle_create (err 0)
[ 9836.615632] nvidia: Unknown symbol drm_prime_pages_to_sg (err 0)
[ 9836.615668] nvidia: Unknown symbol drm_pci_exit (err 0)
[ 9836.615682] nvidia: Unknown symbol drm_release (err 0)
[ 9836.615685] nvidia: Unknown symbol drm_gem_prime_export (err 0)

#10
Posted 07/23/2014 02:59 AM   
Thanks for keeping us posted on your development.

I had the same problem today on Ubuntu Server 14.04. I tried your method of compiling the drm module and inserting it but to no avail. This was on kernel version 3.13 so it looks like the bug is still there.

Oh, and instead of using `make -j16` as you suggest in your post, I used `make drivers/gpu/drm/` so as not to compile the full kernel but just the module. However, a `drm.ko` file was never generated.

So I switched back to 12.04 and installing was not a problem. Works for now!
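In case anyone retries this route: `make drivers/gpu/drm/` may only build the built-in objects rather than loadable modules. A guess at building just drm.ko (assuming CONFIG_DRM=m is set in .config) would be a kbuild single-target build:

make modules_prepare
make drivers/gpu/drm/drm.ko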

#11
Posted 07/25/2014 08:47 PM   
I had the same problem with Ubuntu 14.04.

What worked for me was a simple:

sudo apt-get install linux-image-extra-virtual


Then the NVIDIA driver installed without a hitch.
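Before re-running the NVIDIA .run installer, a quick sanity check that the drm module is now actually available (assuming a stock Ubuntu kernel) would be:

modinfo drm                    # should resolve to a drm.ko under /lib/modules/$(uname -r)
sudo modprobe drm && lsmod | grep '^drm'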

#12
Posted 09/27/2014 08:04 PM   
Thanks for the tutorial.

#13
Posted 11/10/2015 01:06 AM   
I had a similar problem trying to install CUDA on an EC2 g2.2xlarge GPU instance with the Ubuntu Server 14.04 LTS (HVM), SSD Volume Type AMI (ami-d05e75b8).

Some characteristics of the AMI, taken from the first login:

$ lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty

$ lspci

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

$ nvidia-smi

nvidia-smi: command not found


AWS Support gave me a quick answer on how to resolve the issue.

$ sudo apt-get update && sudo apt-get -y upgrade \
# install the package maintainer's version (of /boot/grub/menu.lst)
$ sudo apt-get install -y linux-image-extra-`uname -r`
$ sudo apt-get update
$ wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb

$ sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
$ sudo apt-get update
$ sudo apt-get install -y cuda


To validate the installation using the CUDA Toolkit's deviceQuery utility:

$ export PATH=/usr/local/cuda-7.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
$ cuda-install-samples-7.5.sh ~
$ cd ~/NVIDIA_CUDA-7.5_Samples/1_Utilities/deviceQuery/
$ make
$ ~/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery

/home/ubuntu/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GRID K520
Result = PASS

#14
Posted 01/18/2016 05:16 AM   