NVIDIA driver 384.59 triggers a kernel crash when using Kaldi

Hello,

I've noticed that the latest NVIDIA driver (384.59) triggers a kernel crash (general protection fault) when I use Kaldi.

Kaldi is a speech recognition toolkit written in C++; it uses CUDA 8 for GPU computing:

http://kaldi-asr.org/doc/about.html

The crash occurred on different machines; all run Linux (kernel 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u2 (2017-06-26) x86_64 GNU/Linux) and have 2 GPU cards:

machine 1 : 2 x Nvidia GTX 1080Ti
machine 2 : 2 x Nvidia Tesla K40m

It's a random bug: sometimes my script finishes without problems, sometimes the Linux kernel crashes with a "general protection fault" error.

If I use an older NVIDIA driver (for example a 375.xx or 366.xx version), everything is fine, no crashes.

The Kaldi script I use to trigger the bug is located at kaldi/egs/tedlium/s5_r2/local/nnet3/run_tdnn.sh.

crash-log.txt (5.33 KB)

Please attach an nvidia-bug-report log as soon as the issue hits, and provide crash dump and backtrace details. How long does it take to trigger this issue? Are you sure that reverting to an earlier driver fixes it?

Hello,

Yes, I'm sure that using an older driver solves the issue.

It's not possible to run the nvidia-bug-report script: when the crash occurs the whole machine is frozen (unable to connect to it, no input), and the machine is a remote server (part of a cluster) without an X server.

The issue occurs randomly; it often takes 24 to 48 hours to appear.
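For anyone with root access who wants to capture the oops despite the hard freeze, the kernel's netconsole module can stream kernel messages to another machine over UDP before the box locks up. This is an untested sketch; the IPs, interface name, and MAC address below are placeholders to adapt to your network:

```shell
# On the crashing machine (root required): stream kernel messages over UDP.
# Parameter format: netconsole=src-port@src-ip/dev,dst-port@dst-ip/dst-mac
modprobe netconsole netconsole=6665@10.0.0.2/eth0,6666@10.0.0.1/aa:bb:cc:dd:ee:ff

# On the receiving machine: capture the stream to a file.
nc -u -l 6666 | tee netconsole.log
```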

For the details of the crash, see the attachment "crash-log.txt" I provided with my first message:

Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: general protection fault: 0000 [#1] SMP
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nvidia_uvm(PO) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase 8021q garp mrp rdma_ucm ib_ipoib ib_uverbs ib_umad iw_nes
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: lrw gf128mul snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd ttm drm_kms_helper glue_helper soundcore iTCO_wdt evdev drm mxm_wmi ablk_helper cryptd iTCO_ve
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: CPU: 4 PID: 79563 Comm: nnet3-train Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u2
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.4.3 01/17/2017
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: task: ffff882016f3c250 ti: ffff88201d07c000 task.ti: ffff88201d07c000
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RIP: 0010:[] [] _nv024945rm+0x15/0x80 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RSP: 0018:ffff88201d07fc00 EFLAGS: 00010206
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RAX: ffff881119b1f854 RBX: ffff881119b1f854 RCX: ffff881119b1f854
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RDX: 04000000401f0000 RSI: 000000005c00000a RDI: ffff881119b1f854
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RBP: ffff88201eef7140 R08: ffffffffa0fc1de0 R09: ffff88201eef7208
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: R10: 0000000000000000 R11: ffffffffa0e14d20 R12: ffff88102d492230
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: R13: ffff88201eef7208 R14: ffff880ef70db110 R15: ffff880ef70db010
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: FS: 00007f407f608740(0000) GS:ffff88107f240000(0000) knlGS:0000000000000000
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: CR2: 00007f40790cf090 CR3: 0000000001813000 CR4: 00000000003407e0
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: Stack:
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: ffff88201eef71b0 ffffffffa0e12be9 ffff88102d492230 ffffffffa0e17f72
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: ffff88102d492230 ffff880ef70db010 ffff88201eef7208 ffff880ef70db110
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: ffff8810290810c8 ffffffffa0e17863 0000000000000000 00000000c1d00097
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: Call Trace:
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv024971rm+0x9/0x20 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv010627rm+0x262/0x2d0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv010624rm+0x93/0xc0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv010605rm+0x122/0x200 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv007711rm+0x41/0xd0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv032610rm+0x69/0xb0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv007589rm+0x34/0x60 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv007588rm+0x1f7/0x280 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? _nv001153rm+0x62/0xc0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? rm_free_unused_clients+0xc1/0xf0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? nvidia_close+0x20f/0x3a0 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? nvidia_frontend_close+0x27/0x50 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? __fput+0xca/0x1d0
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? task_work_run+0x8c/0xb0
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? do_exit+0x2b1/0xa70
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? signal_wake_up_state+0x1a/0x30
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? do_group_exit+0x39/0xa0
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? SyS_exit_group+0x10/0x10
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: [] ? system_call_fast_compare_end+0x10/0x15
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: Code: 89 df 48 89 c6 5b e9 8b fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 08 48 85 ff 74 48 48 8b 17 0f 1f 40 00 48 85 d2 74 0e <48> 39 32 76 16 48 8b 52 10 48 85 d2 75 f2 31 c0 48
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RIP [] _nv024945rm+0x15/0x80 [nvidia]
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: RSP
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: ---[ end trace be0a548ed37c7998 ]---
Aug 03 07:54:23 grele-11.nancy.grid5000.fr kernel: Fixing recursive fault but reboot is needed!

Okay. How do I install and use the Kaldi software? Does simply running run_tdnn_1a.sh trigger this issue? What does this script actually do? Have you observed this issue on any other OS? Please reboot the system after a crash, then run nvidia-bug-report and share the generated log file with us. Does this issue hit with only one GPU installed in the server? Are you running a graphical desktop? If yes, which one: GNOME, KDE, Unity, or something else?

Kaldi is not easy to use for beginners (it's a tool for speech recognition researchers). You have to follow this guide to install it:

http://kaldi-asr.org/doc/install.html

and then create a GMM acoustic model with this script:

This script will also run a TDNN script, so it will trigger the same bug. Creating a GMM and a TDNN takes a lot of time (several days) if you have just one PC and not a cluster.

The cluster I use doesn't have a GUI (there is no Xorg server, no KDE, no GNOME). I don't know if the bug can occur when the PC has only one GPU; the machines I use have 2 GPU cards.

The script /usr/bin/nvidia-bug-report.sh requires root privileges and I don't have the root password; it's a cluster located in a research center and we don't have root rights on these machines.
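For what it's worth, the loaded driver version can at least be checked without root, since /proc is world-readable. A small sketch (the fallback message is just illustrative):

```shell
#!/bin/sh
# Print the loaded NVIDIA driver version without root privileges.
# /proc/driver/nvidia/version is world-readable when the module is loaded.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
else
    echo "nvidia kernel module not loaded"
fi
```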

Looking at the Kaldi installation process, we just have to configure, make, and make install. Am I correct? So simply running the run.sh script from the Kaldi GitHub repository will create the GMM acoustic model? Do I need to do any configuration of the application or the OS? It would be good if you could share the steps for installing, configuring, and reproducing this issue. If you have a sudo user, I think you can run /usr/bin/nvidia-bug-report.sh. No rush! Let's do more testing; let us know if the issue reproduces with a single GPU and on a desktop or workstation.

Hello,

The first step is to compile and install Kaldi:

  1. download the git version of kaldi, use a linux PC with cuda installed :

    git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream
    cd kaldi

    cd tools
    extras/check_dependencies.sh ( it will tell you which dependency packages you need to install )
    make -j 4 ( use -j 4 if you have a CPU with 4 cores )

    cd ../src
    ./configure --shared
    make depend -j 4
    make -j 4

    cd ../tools
    extras/install_pocolm.sh
    extras/install_irstlm.sh

  2. go to the directory kaldi/egs/tedlium/s5_r2/
    edit the file "cmd.sh" to set the values of "train_cmd" and "decode_cmd" like this:

export train_cmd="run.pl"
export decode_cmd="run.pl"

Make the scripts path.sh, cmd.sh, and run.sh executable with "chmod +x".

Run the script by typing "./run.sh".

This script creates a GMM and then a TDNN acoustic model. The bug occurs at the TDNN creation step (stage 17 in run.sh), after several hours of GPU training, if the NVIDIA driver version is 384.59.

Be careful: the GMM and TDNN creation steps take a lot of time, several days if you use one PC instead of a cluster:

136 hours for a GMM creation with one PC
125 hours for a TDNN creation with one PC

The entire run.sh script takes approximately 10 days on a single PC.
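Since the box freezes before any log can be collected, one workaround I could imagine (untested sketch; the path, log file, and cron schedule are arbitrary choices) is to snapshot the GPU state to disk every minute during training, so the last entry before the freeze survives a reboot:

```shell
#!/bin/sh
# Hypothetical gpu-watch.sh: append one timestamped GPU snapshot per run.
# Schedule it from cron, e.g.:  * * * * * /path/to/gpu-watch.sh
LOG=/tmp/gpu-watch.log
{
    date
    # nvidia-smi may be absent or fail; record that instead of aborting
    nvidia-smi 2>&1 || echo "nvidia-smi unavailable"
} >> "$LOG"
sync    # flush the snapshot to disk before a possible hard freeze
```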

Attached to this message you can find the log generated by nvidia-bug-report.sh:
nvidia-bug-report.log.gz (345 KB)

Hello,

I tried the new NVIDIA driver 384.66 and it solves the bug:

no problems with the 384.66 driver