Nvidia drivers hang in nv_rdtsc on CentOS 7 with Quadro K4000

Dear Aaron et al,

With a clean, up-to-date CentOS 7 install on an AMD Opteron 6300 platform with an Nvidia Quadro K4000 and the current stable driver (361.45.11) installed, ‘nvidia-smi’ hangs [1] for 5-10 minutes before exiting, producing a kernel backtrace [2]. nvidia-bug-report.sh hangs in the same way. The same happens in applications using the GPU, making it unusable.

I assume the infoROM warning is benign. Are you able to reproduce this on your side?

Thanks!
Daniel

– [1]

nvidia-smi

Fri Jun 10 09:24:32 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.45     Driver Version: 361.45.11      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4000        Off  | 0000:04:00.0     Off |                  N/A |
| 30%   37C    P0    33W /  87W |      9MiB /  3071MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:04:00.0

– [2]

[ 101.993755] BUG: soft lockup - CPU#0 stuck for 22s! [nvidia-smi:2942]
[ 101.993792] Modules linked in: kvm_amd kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_hdmi snd_hda_intel snd_hda_codec nvidia(POE) snd_hda_core snd_hwdep snd_seq snd_seq_device amd64_edac_mod sp5100_tco pcspkr edac_mce_amd snd_pcm sg fam15h_power k10temp i2c_piix4 edac_core snd_timer snd soundcore shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper igb ttm ahci ptp crct10dif_pclmul crct10dif_common libahci pps_core drm crc32c_intel serio_raw dca libata i2c_algo_bit i2c_core
[ 101.993794] CPU: 0 PID: 2942 Comm: nvidia-smi Tainted: P OE ------------ 3.10.0-327.18.2.el7.x86_64 #1
[ 101.993795] Hardware name: Supermicro PIO-2042G-LTRF-OEM/H8QGL, BIOS DS3.5 09/14/2015
[ 101.993797] task: ffff88209a04dc00 ti: ffff882097968000 task.ti: ffff882097968000
[ 101.994054] RIP: 0010:[] [] os_io_read_byte+0xc/0x10 [nvidia]
[ 101.994055] RSP: 0018:ffff88209796bc28 EFLAGS: 00000292
[ 101.994056] RAX: 0000000000000069 RBX: 0000000000000001 RCX: 0000000000000001
[ 101.994057] RDX: 00000000000003d5 RSI: 00000000000a0000 RDI: 00000000000003d5
[ 101.994058] RBP: ffff88209796bc28 R08: 00000000000c34dd R09: 00000000000c34dd
[ 101.994059] R10: 0000000000000001 R11: ffffffffa09e9f10 R12: 0000000000000001
[ 101.994059] R13: ffff883098c4af50 R14: 00000000575a4cba R15: ffff883098c4af50
[ 101.994061] FS: 00007f8e23025740(0000) GS:ffff88089f800000(0000) knlGS:0000000000000000
[ 101.994062] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.994063] CR2: 00007f8e22125b50 CR3: 0000003098dcd000 CR4: 00000000000407f0
[ 101.994064] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 101.994064] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 101.994065] Stack:
[ 101.994079] ffff883098c4af60 ffffffffa09e9dbc 0000000000001a5a ffffffffa09f3821
[ 101.994088] ffff883098c4afa8 ffffffffa09ea625 ffff883098c4afa8 ffffffffa09ea404
[ 101.994097] ffff883096c84008 0000000000004f02 ffff883098c4afac 0000000000000000
[ 101.994097] Call Trace:
[ 101.994176] [] nv_rdtsc+0x1c/0x270 [nvidia]
[ 101.994251] [] ? _nv018126rm+0x84a1/0xbd60 [nvidia]
[ 101.994325] [] ? _nv000860rm+0x85/0xb0 [nvidia]
[ 101.994399] [] ? _nv013638rm+0x164/0x220 [nvidia]
[ 101.994477] [] ? _nv014137rm+0x7c/0x170 [nvidia]
[ 101.994556] [] ? _nv000756rm+0x2d5/0x370 [nvidia]
[ 101.994632] [] ? _nv000680rm+0x223/0x3b0 [nvidia]
[ 101.994709] [] ? _nv000692rm+0x2ba/0x340 [nvidia]
[ 101.994789] [] ? rm_disable_adapter+0x6a/0x130 [nvidia]
[ 101.994852] [] ? nv_close_device+0x115/0x160 [nvidia]
[ 101.994915] [] ? nvidia_close+0xda/0x330 [nvidia]
[ 101.994978] [] ? nvidia_frontend_close+0x2c/0x50 [nvidia]
[ 101.994983] [] ? __fput+0xe9/0x270
[ 101.994987] [] ? ____fput+0xe/0x10
[ 101.994992] [] ? task_work_run+0xa7/0xe0
[ 101.994997] [] ? do_notify_resume+0x92/0xb0
[ 101.995002] [] ? int_signal+0x12/0x17
[ 101.995019] Code: 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 89 f0 89 fa 48 89 e5 ef 5d c3 66 66 66 66 90 55 89 fa 48 89 e5 ec <5d> c3 66 90 66 66 66 66 90 55 89 fa 48 89 e5 66 ed 5d c3 90 66

With the 367.27 driver that was just released, the result is the same:

[ 73.964502] BUG: soft lockup - CPU#0 stuck for 22s! [nvidia-smi:2519]
[ 73.964541] Modules linked in: kvm_amd kvm nvidia_drm(POE) crc32_pclmul ghash_clmulni_intel aesni_intel nvidia_modeset(POE) lrw gf128mul glue_helper ablk_helper snd_hda_codec_hdmi sp5100_tco nvidia(POE) cryptd snd_hda_intel snd_hda_codec snd_hda_core amd64_edac_mod snd_hwdep i2c_piix4 snd_seq edac_mce_amd snd_seq_device k10temp pcspkr fam15h_power edac_core sg snd_pcm snd_timer snd soundcore shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper ttm ahci ptp crct10dif_pclmul crct10dif_common pps_core libahci drm crc32c_intel serio_raw dca libata i2c_algo_bit i2c_core
[ 73.964544] CPU: 0 PID: 2519 Comm: nvidia-smi Tainted: P OE ------------ 3.10.0-327.18.2.el7.x86_64 #1
[ 73.964545] Hardware name: Supermicro PIO-2042G-LTRF-OEM/H8QGL, BIOS DS3.5 09/14/2015
[ 73.964546] task: ffff881099a46780 ti: ffff881095ad8000 task.ti: ffff881095ad8000
[ 73.964840] RIP: 0010:[] [] nv_rdtsc+0x108/0x270 [nvidia]
[ 73.964841] RSP: 0018:ffff881095adbc18 EFLAGS: 00000282
[ 73.964841] RAX: ffff8800000c34df RBX: 0000000000000001 RCX: 0000000000000001
[ 73.964842] RDX: 00000000000000c0 RSI: 00000000000a0000 RDI: 00000000000c34df
[ 73.964843] RBP: ffff88309aa62f60 R08: 00000000000c34df R09: 00000000000c34df
[ 73.964844] R10: 0000000000000001 R11: ffffffffa0a3b560 R12: 0000000000000001
[ 73.964844] R13: ffff88309aa62f60 R14: ffff881095adbba8 R15: ffff88309aa62f60
[ 73.964846] FS: 00007f923d186740(0000) GS:ffff88089f800000(0000) knlGS:0000000000000000
[ 73.964847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 73.964848] CR2: 00007f923c281bc0 CR3: 000000309956a000 CR4: 00000000000407f0
[ 73.964848] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 73.964849] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 73.964850] Stack:
[ 73.964864] 0000000000000018 ffffffffa0a3bce3 ffff88309aa62fa8 000000000000c000
[ 73.964873] 0000000000001a5a ffffffffa0a3d2e6 ffff88309aa62fa8 ffffffffa0a3bc75
[ 73.964882] ffff88309aa62fa8 ffffffffa0a3ba54 ffff883097e74008 ffff883099eb9e08
[ 73.964882] Call Trace:
[ 73.964970] [] ? _nv009392rm+0x33/0x60 [nvidia]
[ 73.965057] [] ? _nv021475rm+0x916/0xbd60 [nvidia]
[ 73.965143] [] ? _nv000922rm+0x85/0xb0 [nvidia]
[ 73.965228] [] ? _nv015901rm+0x164/0x220 [nvidia]
[ 73.965323] [] ? _nv016585rm+0x4d/0x140 [nvidia]
[ 73.965411] [] ? _nv000862rm+0x289/0x390 [nvidia]
[ 73.965503] [] ? _nv000780rm+0x220/0x3c0 [nvidia]
[ 73.965590] [] ? _nv000797rm+0x2ba/0x340 [nvidia]
[ 73.965680] [] ? rm_disable_adapter+0x6a/0x130 [nvidia]
[ 73.965754] [] ? nv_close_device+0x115/0x160 [nvidia]
[ 73.965827] [] ? nvidia_close+0xdb/0x2e0 [nvidia]
[ 73.965900] [] ? nvidia_frontend_close+0x2c/0x50 [nvidia]
[ 73.965905] [] ? __fput+0xe9/0x270
[ 73.965909] [] ? ____fput+0xe/0x10
[ 73.965914] [] ? task_work_run+0xa7/0xe0
[ 73.965920] [] ? do_notify_resume+0x92/0xb0
[ 73.965925] [] ? int_signal+0x12/0x17
[ 73.965942] Code: 08 89 f0 48 29 c2 eb ed 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 08 be 01 00 00 00 e8 92 ff ff ff 31 d2 48 85 c0 74 03 0f b6 10 <89> d0 48 83 c4 08 c3 90 48 83 ec 08 be 02 00 00 00 e8 72 ff ff
[ 101.991834] BUG: soft lockup - CPU#0 stuck for 22s! [nvidia-smi:2519]
[ 101.991855] Modules linked in: kvm_amd kvm nvidia_drm(POE) crc32_pclmul ghash_clmulni_intel aesni_intel nvidia_modeset(POE) lrw gf128mul glue_helper ablk_helper snd_hda_codec_hdmi sp5100_tco nvidia(POE) cryptd snd_hda_intel snd_hda_codec snd_hda_core amd64_edac_mod snd_hwdep i2c_piix4 snd_seq edac_mce_amd snd_seq_device k10temp pcspkr fam15h_power edac_core sg snd_pcm snd_timer snd soundcore shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt igb drm_kms_helper ttm ahci ptp crct10dif_pclmul crct10dif_common pps_core libahci drm crc32c_intel serio_raw dca libata i2c_algo_bit i2c_core
[ 101.991857] CPU: 0 PID: 2519 Comm: nvidia-smi Tainted: P OEL ------------ 3.10.0-327.18.2.el7.x86_64 #1
[ 101.991858] Hardware name: Supermicro PIO-2042G-LTRF-OEM/H8QGL, BIOS DS3.5 09/14/2015
[ 101.991859] task: ffff881099a46780 ti: ffff881095ad8000 task.ti: ffff881095ad8000
[ 101.991937] RIP: 0010:[] [] os_io_read_byte+0xc/0x10 [nvidia]
[ 101.991938] RSP: 0018:ffff881095adbc28 EFLAGS: 00000292
[ 101.991939] RAX: 0000000000000034 RBX: 0000000000000001 RCX: 0000000000000001
[ 101.991940] RDX: 00000000000003d5 RSI: 00000000000a0000 RDI: 00000000000003d5
[ 101.991940] RBP: ffff881095adbc28 R08: 00000000000c34dd R09: 00000000000c34dd
[ 101.991941] R10: 0000000000000001 R11: ffffffffa0a3b560 R12: 0000000000000001
[ 101.991942] R13: ffff88309aa62f60 R14: 0000000000000001 R15: ffff88309aa62f60
[ 101.991943] FS: 00007f923d186740(0000) GS:ffff88089f800000(0000) knlGS:0000000000000000
[ 101.991944] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.991945] CR2: 00007f923c281bc0 CR3: 000000309956a000 CR4: 00000000000407f0
[ 101.991946] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 101.991946] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 101.991947] Stack:
[ 101.991960] ffff88309aa62f70 ffffffffa0a3b40c 0000000000001a5a ffffffffa0a44e71
[ 101.991969] ffff88309aa62fa8 ffffffffa0a3bc75 ffff88309aa62fa8 ffffffffa0a3ba54
[ 101.991978] ffff883097e74008 ffff883099eb9e08 0000000000004f02 ffff88309aa62fa8
[ 101.991979] Call Trace:
[ 101.992065] [] nv_rdtsc+0x1c/0x270 [nvidia]
[ 101.992150] [] ? _nv021475rm+0x84a1/0xbd60 [nvidia]
[ 101.992235] [] ? _nv000922rm+0x85/0xb0 [nvidia]
[ 101.992321] [] ? _nv015901rm+0x164/0x220 [nvidia]
[ 101.992410] [] ? _nv016585rm+0x4d/0x140 [nvidia]
[ 101.992498] [] ? _nv000862rm+0x289/0x390 [nvidia]
[ 101.992586] [] ? _nv000780rm+0x220/0x3c0 [nvidia]
[ 101.992677] [] ? _nv000797rm+0x2ba/0x340 [nvidia]
[ 101.992770] [] ? rm_disable_adapter+0x6a/0x130 [nvidia]
[ 101.992843] [] ? nv_close_device+0x115/0x160 [nvidia]
[ 101.992916] [] ? nvidia_close+0xdb/0x2e0 [nvidia]
[ 101.992988] [] ? nvidia_frontend_close+0x2c/0x50 [nvidia]
[ 101.992992] [] ? __fput+0xe9/0x270
[ 101.992996] [] ? ____fput+0xe/0x10
[ 101.993000] [] ? task_work_run+0xa7/0xe0
[ 101.993004] [] ? do_notify_resume+0x92/0xb0
[ 101.993008] [] ? int_signal+0x12/0x17
[ 101.993025] Code: 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 89 f0 89 fa 48 89 e5 ef 5d c3 66 66 66 66 90 55 89 fa 48 89 e5 ec <5d> c3 66 90 66 66 66 66 90 55 89 fa 48 89 e5 66 ed 5d c3 90 66
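
For anyone comparing traces like the ones above across driver versions, here is a minimal Python sketch that pulls the soft-lockup events out of saved dmesg text and reports which symbol the CPU was stuck in. The names `parse_soft_lockups` and `Lockup` are made up for this example, and the regular expressions only assume the message layout seen in the logs in this post (3.10-era kernel); other kernels may format these lines differently.

```python
import re
from typing import List, NamedTuple, Optional

# Matches e.g. "[ 73.964502] BUG: soft lockup - CPU#0 stuck for 22s! [nvidia-smi:2519]"
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Matches the symbol in e.g. "RIP: 0010:[...] [...] nv_rdtsc+0x108/0x270 [nvidia]"
RIP_RE = re.compile(r"RIP: .*?\s(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


class Lockup(NamedTuple):
    """One soft-lockup report (hypothetical record type for this sketch)."""
    timestamp: float           # seconds since boot, from the log prefix
    cpu: int                   # CPU number reported as stuck
    seconds: int               # "stuck for Ns" duration
    comm: str                  # process name
    pid: int                   # process ID
    rip_symbol: Optional[str]  # function at RIP, if a RIP line followed


def parse_soft_lockups(log_text: str) -> List[Lockup]:
    """Collect soft-lockup reports and the RIP symbol that follows each one."""
    events: List[Lockup] = []
    current = None
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            # A new report starts; flush the previous one first.
            if current is not None:
                events.append(Lockup(**current))
            current = dict(
                timestamp=float(m["ts"]),
                cpu=int(m["cpu"]),
                seconds=int(m["secs"]),
                comm=m["comm"],
                pid=int(m["pid"]),
                rip_symbol=None,
            )
            continue
        r = RIP_RE.search(line)
        # Only the first RIP line after a report belongs to that report.
        if r and current is not None and current["rip_symbol"] is None:
            current["rip_symbol"] = r["sym"]
    if current is not None:
        events.append(Lockup(**current))
    return events
```

Feeding it the text of `dmesg` saved to a file and printing `comm`, `pid` and `rip_symbol` for each event gives a quick per-driver summary. It only summarizes what the kernel already logged and says nothing about the cause.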

This issue occurs when the BIOS is configured with an MMIO window larger than 2GB; in this case it was set to 3GB. With a small (~256MB) MMIO window the hang doesn’t occur, but BAR allocation then fails for other devices.
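
To check what MMIO window the kernel actually ended up with after a BIOS change, something along these lines may help. It is only a sketch: the function name `pci_windows_below_4g` is invented here, and it assumes the usual `/proc/iomem` layout of `start-end : name` lines with nested ranges indented, and that host bridge windows show up as top-level `PCI Bus ...` entries. That is common on x86 but not guaranteed on every platform.

```python
import re
from typing import List, Tuple

# One /proc/iomem line: optional indentation, hex range, then a name.
IOMEM_RE = re.compile(
    r"^(?P<indent>\s*)(?P<start>[0-9a-f]+)-(?P<end>[0-9a-f]+) : (?P<name>.*)$"
)
FOUR_GIB = 1 << 32


def pci_windows_below_4g(iomem_text: str) -> List[Tuple[int, int, str]]:
    """Return (start, size_in_bytes, name) for top-level PCI bus ranges
    that start below 4 GiB. Nested (indented) ranges are skipped because
    they are already contained in their parent window."""
    windows = []
    for line in iomem_text.splitlines():
        m = IOMEM_RE.match(line)
        if not m or m["indent"]:
            continue  # not a range line, or a nested range
        if not m["name"].startswith("PCI Bus"):
            continue  # System RAM, reserved, MMCONFIG, etc.
        start = int(m["start"], 16)
        end = int(m["end"], 16)  # end address is inclusive
        if start < FOUR_GIB:
            windows.append((start, end - start + 1, m["name"]))
    return windows


def total_window_bytes(iomem_text: str) -> int:
    """Sum the sizes of all windows found by pci_windows_below_4g()."""
    return sum(size for _, size, _ in pci_windows_below_4g(iomem_text))
```

Pass it the contents of `/proc/iomem` (on some kernels the addresses read as zeros unless you are root). Comparing the total against the 2GB mark before and after changing the BIOS setting shows whether the setting took effect; it does not by itself explain the hang.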