Titan X GPU locks up during thrust::sort_by_key

Hello, we’ve been experiencing consistently reproducible hardware lock-ups (driver resets after Xid 8 if the GPU is attached to a display, otherwise the GPU just stalls) when running thrust::sort_by_key with a custom comparator function on Titan X GPUs (Maxwell architecture, compute capability 5.2). The problem only manifests when sorting arrays with millions of elements, using multiple keys and a custom comparator. All our Titan X GPUs seem to be affected, and in all cases there is no sign of thermal issues, which (together with the Xid value of 8) indicates that this is either a problem in the driver or a problem in the (thrust) code. I’m not sure where the bug resides, so I’m reporting it on “both sides”.
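
For context, this is a rough sketch of the kind of call involved, assuming a composite key built from two device arrays via a zip_iterator (the names and key layout here are purely illustrative; the actual code is in the repository linked below):

[code]
// Illustrative sketch only: composite key (two arrays) + custom comparator,
// which is the pattern that triggers the hang for us. Not the actual repro code.
#include <thrust/device_ptr.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>
#include <cstddef>

// hypothetical composite key: two unsigned ints sorted lexicographically
typedef thrust::tuple<unsigned int, unsigned int> sort_key;

struct key_less
{
    __host__ __device__
    bool operator()(const sort_key &a, const sort_key &b) const
    {
        // order by the first key, break ties on the second
        if (thrust::get<0>(a) != thrust::get<0>(b))
            return thrust::get<0>(a) < thrust::get<0>(b);
        return thrust::get<1>(a) < thrust::get<1>(b);
    }
};

void sort_elements(unsigned int *d_key1, unsigned int *d_key2,
                   unsigned int *d_values, size_t n)
{
    // wrap the raw device pointers, as the production code does
    thrust::device_ptr<unsigned int> key1 = thrust::device_pointer_cast(d_key1);
    thrust::device_ptr<unsigned int> key2 = thrust::device_pointer_cast(d_key2);
    thrust::device_ptr<unsigned int> vals = thrust::device_pointer_cast(d_values);

    // sort the values by the composite (key1, key2) key with the custom comparator
    thrust::sort_by_key(
        thrust::make_zip_iterator(thrust::make_tuple(key1, key2)),
        thrust::make_zip_iterator(thrust::make_tuple(key1 + n, key2 + n)),
        vals,
        key_less());
}
[/code]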

Sample code to reproduce the issue is non-trivial, so I’ve created the GitHub project [url]https://github.com/Oblomov/titanxstall[/url] to host a sample program (Linux only, but if anyone wishes to adapt it to run on other platforms, please do). Running the program just repeatedly sorts the arrays (and scrambles them again) forever, or until the GPU locks up. With the default settings, the device usually locks up after less than a hundred thousand iterations (less than 10 minutes), but sometimes it locks up as quickly as 2K iterations. Other architectures (Fermi, Kepler) seem to be unaffected. The problem also manifests with as few as 1024×1024 elements, in longer runs.

The test program can optionally be run with a custom caching allocator (based on the one in the Thrust examples), to verify that the problem manifests even without the continuous allocation/deallocation of temporary storage done by thrust::sort_by_key.
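
For reference, the allocator follows the shape of the cached_allocator in Thrust's custom_temporary_allocation example and is hooked in through the execution policy. The following is just a simplified sketch of that shape, not the exact code from the repository:

[code]
// Caching allocator in the style of Thrust's examples/cuda/custom_temporary_allocation.cu
// (simplified sketch: no error checking, single-threaded use assumed).
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

struct cached_allocator
{
    typedef char value_type;

    std::multimap<std::ptrdiff_t, char *> free_blocks;      // size -> reusable block
    std::map<char *, std::ptrdiff_t>      allocated_blocks; // live block -> size

    char *allocate(std::ptrdiff_t num_bytes)
    {
        char *result = 0;
        std::multimap<std::ptrdiff_t, char *>::iterator it = free_blocks.find(num_bytes);
        if (it != free_blocks.end()) {            // reuse a cached block of this size
            result = it->second;
            free_blocks.erase(it);
        } else {                                  // otherwise grab a fresh one
            cudaMalloc((void **)&result, num_bytes);
        }
        allocated_blocks.insert(std::make_pair(result, num_bytes));
        return result;
    }

    void deallocate(char *ptr, size_t)
    {
        // move the block back to the free list instead of releasing it
        std::map<char *, std::ptrdiff_t>::iterator it = allocated_blocks.find(ptr);
        free_blocks.insert(std::make_pair(it->second, ptr));
        allocated_blocks.erase(it);
    }
};
[/code]

The sort then draws its temporary storage from the cache by passing the allocator through the execution policy, e.g. thrust::sort_by_key(thrust::cuda::par(alloc), keys_first, keys_last, values_first, comp), instead of going through a cudaMalloc/cudaFree pair on every iteration.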

Hi @Bilog,

With the sample code you created (GitHub - Oblomov/titanxstall: Tester for stalls in Titan X GPUs), we hit GPU (Titan X) usage locked at 99%–100% after the app runs for several iterations. The session running the app hangs, while a newly created session that tries to launch another sample on the GPU gets no response. Is this the “hang” issue you are experiencing?

$ make test
./titanxstall
Iteration 1000
Iteration 2000
Iteration 3000

==============

###############
System Info:
Ubuntu14.04(x64)+cuda7.5prod(7.5.18)+TitanX
###############

If this is not the issue you hit, could you please detail the system info of the machine where you hit the hang, so we can try to repro it locally?

thanks.

Hello, sorry for the late reply. Yes, this is exactly the issue I’m seeing: when the GPU hangs, it’s completely inaccessible until the hanging program is force-terminated (or the driver is reset by the watchdog, which happens if the GPU is attached to a display). After the driver reset (which on my machine also logs a kernel oops in dmesg) or the force-termination of the hanging program, the GPU becomes available again.

The oops is something like the following:

[ +0.000006] NVRM: Xid (PCI:0000:04:00): 8, Channel 00000018
[ +2.001703] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998308] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998308] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998306] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[Mar17 14:30] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998297] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998290] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +0.999638] INFO: rcu_sched detected stalls on CPUs/tasks:
[ +0.000004] 3: (21 GPs behind) idle=75f/1/0 softirq=28665546/28665546 fqs=5246
[ +0.000001] (detected by 5, t=5252 jiffies, g=127085193, c=127085192, q=1106)
[ +0.000003] Task dump for CPU 3:
[ +0.000001] swapper/3 R running task 0 0 1 0x00000008
[ +0.000002] ffffffff814583df 0000000000000010 0000000000000246 ffff88085bff3eb0
[ +0.000002] 0000000000000018 00000000d26c4935 ffffffff81afc980 ffff88085bff4000
[ +0.000001] ffffe8ffffcca2e0 ffffffff81aac820 ffff88085bff0000 ffff88085bff0000
[ +0.000001] Call Trace:
[ +0.000004] [] ? cpuidle_enter_state+0x11f/0x260
[ +0.000004] [] ? cpu_startup_entry+0x262/0x320
[ +0.000002] [] ? start_secondary+0x150/0x190
[ +0.998627] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998290] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.998290] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ +1.714999] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ +0.000001] Modules linked in: fuse nvidia_uvm(PO) binfmt_misc nls_utf8 nls_cp437 snd_hda_codec_hdmi vfat fat intel_rapl snd_hda_codec_realtek snd_hda_codec_generic iosf_mbi snd_hda_intel x86_pkg_temp_thermal snd_hda_codec intel_powerclamp snd_hda_core coretemp kvm_intel kvm snd_hwdep snd_pcm snd_timer crct10dif_pclmul crc32_pclmul snd sha256_ssse3 sha256_generic hmac drbg ansi_cprng iTCO_wdt iTCO_vendor_support aesni_intel soundcore sb_edac aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr serio_raw edac_core efi_pstore i2c_i801 lpc_ich mfd_core efivars shpchp sg tpm_tis tpm 8250_fintek ipmi_msghandler acpi_power_meter processor evdev nvidia(PO) drm parport_pc ppdev lp parport efivarfs autofs4 ext4 crc16 mbcache jbd2 raid1 md_mod hid_generic usbhid hid sr_mod cdrom sd_mod crc32c_intel
[ +0.000027] ahci libahci igb xhci_pci libata ehci_pci dca xhci_hcd ehci_hcd ptp pps_core i2c_algo_bit usbcore scsi_mod usb_common wmi button
[ +0.000007] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P O L 4.3.0-1-amd64 #1 Debian 4.3.3-7
[ +0.000001] Hardware name: Supermicro SYS-5038A-I/X10SRA, BIOS 1.0b 07/14/2015
[ +0.000001] task: ffff88085bfe6f00 ti: ffff88085bff0000 task.ti: ffff88085bff0000
[ +0.000001] RIP: 0010:[] [] _nv011203rm+0x2/0x130 [nvidia]
[ +0.000081] RSP: 0018:ffff88087fcc3c80 EFLAGS: 00000246
[ +0.000001] RAX: ffff8807cee40008 RBX: ffff8808556a8008 RCX: 0000000000000000
[ +0.000001] RDX: 0000000000000000 RSI: ffff8807cee40008 RDI: ffff8808556a8008
[ +0.000000] RBP: ffff8808395ff5e0 R08: ffffffffa0af8eb0 R09: ffffffffa0b09020
[ +0.000001] R10: 0000000000000000 R11: ffffffffa05ecd20 R12: ffff8807cee40008
[ +0.000001] R13: ffff88085a2e2008 R14: 0000000000000000 R15: ffff8808556a8008
[ +0.000001] FS: 0000000000000000(0000) GS:ffff88087fcc0000(0000) knlGS:0000000000000000
[ +0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000001] CR2: 0000556f5c5c3ef0 CR3: 0000000001c0c000 CR4: 00000000001406e0
[ +0.000001] Stack:
[ +0.000000] ffff8808556a8008 ffffffffa0679027 ffff8808556a8008 ffff8807cee40008
[ +0.000002] ffff88085a2e2008 0000000000000000 ffffffffffffffff ffffffffa06740e7
[ +0.000001] 0000000000000000 0000000000000001 ffff8808556a8008 ffff8807cee40008
[ +0.000001] Call Trace:
[ +0.000001]
[ +0.000069] [] ? _nv011301rm+0xe7/0x320 [nvidia]
[ +0.000068] [] ? _nv011300rm+0x177/0x1a0 [nvidia]
[ +0.000059] [] ? _nv007575rm+0x91/0x170 [nvidia]
[ +0.000060] [] ? _nv014035rm+0xf9/0x2b0 [nvidia]
[ +0.000061] [] ? _nv014016rm+0x221/0x400 [nvidia]
[ +0.000060] [] ? _nv014017rm+0x706/0x840 [nvidia]
[ +0.000060] [] ? _nv014061rm+0xac/0xc0 [nvidia]
[ +0.000059] [] ? _nv014061rm+0x70/0xc0 [nvidia]
[ +0.000059] [] ? _nv014063rm+0x4e3/0x640 [nvidia]
[ +0.000059] [] ? _nv014062rm+0x66/0x1a0 [nvidia]
[ +0.000034] [] ? _nv012794rm+0x1b4/0x1210 [nvidia]
[ +0.000026] [] ? rm_run_rc_callback+0x9b/0xe0 [nvidia]
[ +0.000018] [] ? nvidia_isr_bh+0x60/0x60 [nvidia]
[ +0.000017] [] ? nvidia_rc_timer+0x42/0x70 [nvidia]
[ +0.000002] [] ? call_timer_fn+0x30/0xe0
[ +0.000017] [] ? nvidia_isr_bh+0x60/0x60 [nvidia]
[ +0.000001] [] ? run_timer_softirq+0x21c/0x2d0
[ +0.000002] [] ? __do_softirq+0xf8/0x260
[ +0.000001] [] ? irq_exit+0x9b/0xa0
[ +0.000003] [] ? smp_apic_timer_interrupt+0x3e/0x50
[ +0.000001] [] ? apic_timer_interrupt+0x82/0x90
[ +0.000001]
[ +0.000002] [] ? cpuidle_enter_state+0x11f/0x260
[ +0.000001] [] ? cpu_startup_entry+0x262/0x320
[ +0.000002] [] ? start_secondary+0x150/0x190
[ +0.000000] Code: 00 00 e9 5f fe ff ff 31 ff e8 2b b0 1b 00 be 01 00 00 00 48 89 c2 31 ff e8 0c 7d 17 00 e9 3a fd ff ff 0f 1f 80 00 00 00 00 41 57 01 00 00 00 49 89 f7 be 1b 00 00 00 41 56 41 55 41 54 45 31
[ +0.282420] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

(This is on Debian with kernel 4.3.0-1-amd64)

@Bilog,

It seems like you are making tuples out of pointers via device_pointer_cast. Can you reproduce the same problem if you use device_vector instead?
There is a known issue with thrust::reduce_by_key() on the Maxwell architecture with device_pointer_cast; it does not hang, but it produces a wrong result.
The workaround for the reduce_by_key() problem is to use device_vector instead of device_pointer_cast.
Could it be that what you are experiencing is related?
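
For clarity, the difference between the two patterns would look roughly like this (hypothetical names, just to illustrate the suggested workaround):

[code]
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstddef>

// current pattern: wrap raw device pointers with device_pointer_cast
void sort_via_pointer_cast(unsigned int *d_keys, unsigned int *d_vals, size_t n)
{
    thrust::device_ptr<unsigned int> keys = thrust::device_pointer_cast(d_keys);
    thrust::device_ptr<unsigned int> vals = thrust::device_pointer_cast(d_vals);
    thrust::sort_by_key(keys, keys + n, vals);
}

// suggested workaround: have thrust::device_vector own the data instead
void sort_via_device_vector(thrust::device_vector<unsigned int> &keys,
                            thrust::device_vector<unsigned int> &vals)
{
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
}
[/code]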

@Vectorizer,

thanks for your reply. I’ll test with device_vector instead of device_pointer_cast and check whether I can still reproduce the issue. If it does, this will still be an issue for our production code though, which I’m not sure we can adapt as easily to use device_vectors instead of device_pointer_cast.