Mapped memory access causes BSOD with drivers newer than 378.92

On Windows, the following code causes a blue screen of death (DPC_WATCHDOG_VIOLATION) using drivers newer than 378.92.

The problem might be related to the motherboard or HW in general, as it worked on some configurations with other CPU and motherboard (unfortunately I cannot reproduce that test anymore). The issue was reproduced on multiple devices (both PCs and GPUs) with the specs below.

I am trying to find the root cause of this issue. Please, if you have similar hardware, help me and see if you can reproduce this issue. Any possible workarounds are welcome. Thank you!

Software and hardware details:

  • Intel Core I7 4930K, ASUS P9X79-E WS, 32GB RAM
  • BIOS: Version 1704
  • Windows 10, 1703 and 1709
  • CUDA 8 and CUDA 9.1
  • GPU: GTX 750Ti and GTX 980Ti (980Ti was selected using cudaSetDevice for the tests)
  • Tested with multiple drivers: 378.92 does not produce the issue, drivers after that do (e.g. 391.35)

Error message:

Technical Information:

*** STOP: 0x00000133 (0x0000000000000001, 0x0000000000001e00, 0xfffff8025e806370, 
0x0000000000000000)

*** ntoskrnl.exe - Address 0xfffff8025e57f6e0 base at 0xfffff8025e40a000 DateStamp 
0x5a4a1659

Stack trace:

nt!KeBugCheckEx
nt!KeAccumulateTicks+0xfde11
nt!KeClockInterruptNotify+0xc6
hal!HalpTimerClockIpiRoutine+0x15
nt!KiCallInterruptServiceRoutine+0xa5
nt!KiInterruptSubDispatchNoLockNoEtw+0xea
nt!KiInterruptDispatchNoLockNoEtw+0x37
nvlddmkm+0x1c8b81
nvlddmkm+0x1ea347
nvlddmkm+0x1ea23d
nvlddmkm+0x1e5f7e
nvlddmkm+0x1fa415
nvlddmkm+0x1fa86b
nvlddmkm+0x1faf85
nvlddmkm+0x1c8fc2
nt!KiExecuteAllDpcs+0x1d2
nt!KiRetireDpcList+0xdf
nt!KxRetireDpcList+0x5
nt!KiDispatchInterruptContinue
nt!KiDpcInterruptBypass+0x25
nt!KiChainedDispatch+0xb1
nt!KiSystemServiceUser+0xe1
0x00007ff9`35754844

Code:

#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cstdio>
#include <cstdlib>
#include <malloc.h>

static size_t DEVICE_ELEMENTS = 256*1024*1024;
static size_t LAYER_ELEMENTS = 512*512;
static size_t LAYERS = 16*1024;

static size_t MEMORY_ALIGNMENT = 4096;

#define CUDA_SAFE_CALL( call ) \
{ \
    const cudaError_t error = call; \
    if( error != cudaSuccess ) \
    { \
        printf( "Error: %s: %d, ", __FILE__, __LINE__ ); \
        printf( "code: %d, reason: %s\n", error, cudaGetErrorString( error ) ); \
        exit( 1 ); \
    } \
} 

__global__ void doStuff( float* data, size_t n )
{
    auto idx = blockIdx.x*blockDim.x+threadIdx.x;
    if ( idx < n )
        data[idx] += idx;
}

int main(int argc, char **argv)
{
    int iterations = 1;
    int gpuId = 0;

    if ( argc >= 2 )
        iterations = atoi(argv[1]);

    if ( argc >= 3 )
        gpuId = atoi(argv[2]);

    printf("CUDA 980Ti driver crash test.\nThe test should result in a BSOD with drivers newer than 378.92\n");

    int devices;
    CUDA_SAFE_CALL( cudaGetDeviceCount(&devices) );

    printf("Using device #%d (of %d)\n", gpuId, devices);
    CUDA_SAFE_CALL( cudaSetDevice( gpuId ) );

    cudaDeviceProp prop;
    CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, gpuId ) );

    printf("Device name: %s, total mem: %f GiB\n", prop.name, prop.totalGlobalMem/1024.0f/1024.0f/1024.0f);

    float** hostData = new float*[LAYERS];

    printf("Allocating %f GiB on the host (%lu layers)...\n", LAYERS*LAYER_ELEMENTS*sizeof(float)/1024.0f/1024.0f/1024.0f, LAYERS );
    for (size_t i = 0; i < LAYERS; i++)
        hostData[i] = reinterpret_cast<float*>( malloc( LAYER_ELEMENTS*sizeof(float) ) );
        //hostData[i] = reinterpret_cast<float*>( _aligned_malloc( LAYER_ELEMENTS*sizeof(float), MEMORY_ALIGNMENT ) );

    printf("Initializing data...\n" );
    for (size_t i = 0; i < LAYERS; i++)
        for (size_t j = 0; j < LAYER_ELEMENTS; j++)
            hostData[i][j] = 42.0f;

    for ( int k = 0; k < iterations; k++ )
    {
        printf("Running mapped memory access tests %d/%d...\n", (k+1), iterations );

        for (size_t i = 0; i < LAYERS; i++)
        {
            float* devicePtr;

            CUDA_SAFE_CALL( cudaHostRegister( hostData[i], LAYER_ELEMENTS*sizeof(float), cudaHostRegisterMapped | cudaHostRegisterPortable ) );

            CUDA_SAFE_CALL( cudaHostGetDevicePointer( (void **)&devicePtr, (void *)hostData[i], 0 ) );

            doStuff<<< ceil(LAYER_ELEMENTS/1024.0) , 1024 >>>( devicePtr, LAYER_ELEMENTS );

            CUDA_SAFE_CALL( cudaDeviceSynchronize() );

            CUDA_SAFE_CALL( cudaHostUnregister( hostData[i] ) );
        }

    }

    printf("Finished tests.\n" );

    for (size_t i = 0; i < LAYERS; i++)
        //_aligned_free( hostData[i] );
        free( hostData[i] );

    delete[]( hostData );

    return 0;
}

code.zip (2.95 KB)
nvidia_DPC_Watchdog_Violation.txt (12.2 KB)





tasklist.txt (10.2 KB)
PCI_slot_test2.zip (3.73 KB)

On Ubuntu 16.04 this code works fine with 980 Ti and newer than 378.92 drivers.

(Though LAYERS had to be halved to fit into the memory.)

Added attachment to original post: nvidia_DPC_Watchdog_Violation.txt.

This contains the output of the !analyze –v command from Debugging Tools for Windows.

best thing to do is to file a bug at developer.nvidia.com

Done!

I’ve added some attachments about latency values.

I am also waiting for other user’s comments whether the issue is present or not on their PCs. It seems that a lots of different factors are needed for the bug to be observed and I am curious about the cause.

BIOS tests

Disabling CSM (Compatibility Support Module) in BIOS (Advanced/Boot) seems to solve the issue.

  • Legacy mode, CSM Enabled: BSOD happens
  • UEFI mode, CSM Disabled: BSOD does not happen

Notes:

  • BIOS mode can be checked with msinfo.exe.
  • An UEFI system with MBR can be set to Legacy boot by booting from another boot manager (without reinstalling Windows)

Additional tests:

  • GTX 1080 also produced the issue.
  • Windows 7 did not produced the issue.
  • Keep pressing the Win button at regular interval (or doing other things like moving windows, etc) makes the BSOD happen in an earlier iteration (~5 vs ~30).

I’ve attached some additional files: a task list and two screenshots about BIOS settings.

I could reproduce the BSOD on another computer with a clean Windows installation with only the NVIDIA driver installed, and reproduce it when booting with UEFI. All the cards I have tried caused a BSOD: 750 Ti, 980 Ti and 1080.
The key to the repro is which PCIe slots are used. So using the same motherboard is also important.

  • Intel Core i7 4930K, ASUS P9X79-E WS (BIOS version 1704), 32GB RAM
  • Two GPUs installed, one in slot PCIEX16_1, and one in slot PCIEX16_3. The executable test_bsod_980ti_CUDA8.0.exe has to run on the card in slot PCIEX16_3.
  • Windows 10 version 1709 or 1803
  • Drivers newer than 378.92 (e.g. 382.33, 391.35, 397.64)
  • Running the executable test_bsod_980ti_CUDA8.0.exe alone is not enough, you have to press repeatedly the Windows button or run LatencyMon or Furmark, etc. If only the executable is run, then it can go to 100 iterations without a BSOD. If something else is running, or Windows button is pressed repeatedly, then it usually crashes between iterations 3 to 15.

I keep getting emails about updates on this bug (I’ve submitted a bugreport regarding this earlier).
However I cannot access my bugs on the developer zone webpage. When I click on a bug in the list, I get a ‘Page Not Found’ error.

I’m not sure what the problem is on the bug reporting portal.

Your comments as recently as 6/12/2018 show up in the bug.

The team has been working on this issue as recently as last Friday 7/6/2018. Currently they have gotten a X79 motherboard and are setting it up for further testing. The issue has not been reproduced by our team yet.

The bug number in our system is 2111900

Thanks for the update. I am still having troubles with the webpage, I was able to access some of my bugs today, but by the time I composed a reply, the bug page was unresponsive again. Unfortunately I was unable to sent a reply at the bugreport page, so I must do it here:

[i]The exact type of the board is P9X79-E WS (the issue didn’t happen with the older P9X79 WS).

We tried slots 1+5, and also 2+4+6 (GTX 750 and 2x1080) and the system was stable.
The BSOD only happened when cards were inserted into slots 1 and 3.

In some of our configurations we use 3 GPUs and have to use at least 3 slots.

See https://devtalk.nvidia.com/cmd/default/download-comment-attachment/76196/
[/i]

Edit: slot numbers were incorrect (2+4+8 → 2+4+6)

Hello!

We did the mentioned tests:

Two monitors are connected via two DVI-D cables from 750 Ti, except when testing two 1080s, then one monitor is connected via DVI-D.

2 GPUs
Slot 1: 750 Ti, Slot 3: 1080 → very easy to repro BSOD when running test_bsod_980ti.exe on Slot 3 (1080)*
Slot 1: 1080, Slot 3: 750 Ti → very easy to repro BSOD when running test_bsod_980ti.exe on Slot 1 (1080)*
Slot 1: 1080, Slot 3: 1080 → very hard to repro BSOD when running test_bsod_980ti.exe and Furmark on Slot 3 (1080)

Slot 1: 750 Ti, Slot 5: 1080 → nothing
Slot 1: 1080, Slot 5: 750 Ti → nothing

3 GPUs
Slot 1: 750 Ti, Slot 3: 1080, Slot 5: 1080 → nothing
Slot 1: 1080, Slot 3: 750 Ti, Slot 5: 1080 → easy to repro BSOD when running test_bsod_980ti.exe on Slot 1 (1080) or Slot 3 (750 Ti).

  • One 1080 card is older, having serial number G6YV…, the other is newer with G8YV… If a 2 GPU configuration with 750 Ti and 1080 causes BSOD, then it causes the BSOD also when swapping the older with the newer 1080. So a faulty card can be ruled out.

Which slot is running the .exe is confirmed by GPU Shark, looking at GPU utilization:
Slot 1 is PCI Bus ID 3
Slot 3 is PCI Bus ID 4
Slot 5 is PCI Bus ID 8

We also saw this BSOD about once a week just using Windows without any CUDA app running (configuration was Slot 1: 750 Ti, Slot 3: 1080). Once while booting into Windows and opening Outlook, and once while editing in Visual Studio.

I uploaded the NVIDIA Control panel log files (I only removed the computer name form them):
https://devtalk.nvidia.com/cmd/default/download-comment-attachment/76213/

This appears to still be an issue; I’m on Windows 10 (1809, 17763.437) and this happens multiple times a day. I currently have the GeForce Game Ready Driver (419.67) installed. My card is a GTX 1060 (3GB). Machine is a Intel Core i7-7700 CPU @ 3.60GHz. Here’s my most recent mini dump:

Microsoft (R) Windows Debugger Version 10.0.18317.1001 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Minidump

Microsoft (R) Windows Debugger Version 10.0.18317.1001 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\Windows\Minidump\040919-8109-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is:
Windows 10 Kernel Version 17763 MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 17763.1.amd64fre.rs5_release.180914-1434
Machine Name:
Kernel base = 0xfffff80755015000 PsLoadedModuleList = 0xfffff80755430790
Debug session time: Tue Apr 9 12:31:50.653 2019 (UTC - 5:00)
System Uptime: 0 days 18:52:38.375
Loading Kernel Symbols




Loading User Symbols
Loading unloaded module list

For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff807551c86a0 48894c2408 mov qword ptr [rsp+8],rcx ss:0018:fffff807570d0310=0000000000000133
0: kd> !analyze -v


  •                                                                         *
    
  •                    Bugcheck Analysis                                    *
    
  •                                                                         *
    

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
DISPATCH_LEVEL or above. The offending component can usually be
identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff80755557380, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding the cumulative timeout
Arg4: 0000000000000000

Debugging Details:

KEY_VALUES_STRING: 1

PROCESSES_ANALYSIS: 1

SERVICE_ANALYSIS: 1

STACKHASH_ANALYSIS: 1

TIMELINE_ANALYSIS: 1

DUMP_CLASS: 1

DUMP_QUALIFIER: 400

BUILD_VERSION_STRING: 17763.1.amd64fre.rs5_release.180914-1434

SYSTEM_MANUFACTURER: System manufacturer

SYSTEM_PRODUCT_NAME: System Product Name

SYSTEM_SKU: SKU

SYSTEM_VERSION: System Version

BIOS_VENDOR: American Megatrends Inc.

BIOS_VERSION: 3402

BIOS_DATE: 07/05/2017

BASEBOARD_MANUFACTURER: Asus

BASEBOARD_PRODUCT: H110-PLUS

BASEBOARD_VERSION: Rev X.0x

DUMP_TYPE: 2

BUGCHECK_P1: 1

BUGCHECK_P2: 1e00

BUGCHECK_P3: fffff80755557380

BUGCHECK_P4: 0

DPC_TIMEOUT_TYPE: DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

CPU_COUNT: 8

CPU_MHZ: e10

CPU_VENDOR: GenuineIntel

CPU_FAMILY: 6

CPU_MODEL: 9e

CPU_STEPPING: 9

CPU_MICROCODE: 6,9e,9,0 (F,M,S,R) SIG: 84’00000000 (cache) 84’00000000 (init)

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXPNP: 1 (!blackboxpnp)

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT

BUGCHECK_STR: 0x133

PROCESS_NAME: System

CURRENT_IRQL: d

ANALYSIS_SESSION_HOST: DESKTOP-L93EKRH

ANALYSIS_SESSION_TIME: 04-09-2019 20:10:07.0748

ANALYSIS_VERSION: 10.0.18317.1001 amd64fre

LAST_CONTROL_TRANSFER: from fffff80755241a83 to fffff807551c86a0

STACK_TEXT:
fffff807570d0308 fffff80755241a83 : 0000000000000133 0000000000000001 0000000000001e00 fffff80755557380 : nt!KeBugCheckEx
fffff807570d0310 fffff807550f98bf : 0000be3f046c70f2 fffff807532c2180 0000000000000286 0000000000425d98 : nt!KeAccumulateTicks+0x144b53
fffff807570d0370 fffff80755a8647c : 0000000000000000 fffff80755aeb750 fffff807570d07e0 fffff80755aeb800 : nt!KeClockInterruptNotify+0xcf
fffff807570d0690 fffff80755148195 : fffff80755aeb750 fffff80755148195 fffff80755aeb750 fffff807532c2180 : hal!HalpTimerClockIpiRoutine+0x1c
fffff807570d06c0 fffff807551ca09a : fffff807570d07e0 fffff80755aeb750 0000000000000000 fffff807551c4ab0 : nt!KiCallInterruptServiceRoutine+0xa5
fffff807570d0710 fffff807551ca5e7 : 0000000000000007 fffff8008ef4a734 fffff807570d0b30 fffff807551c4ab0 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
fffff807570d0760 fffff8008ef2a3a5 : fffff8008ef4d9e0 ffffe488a6c6a000 0000000000000000 0000000000000000 : nt!KiInterruptDispatchNoLockNoEtw+0x37
fffff807570d08f8 fffff8008ef4d9e0 : ffffe488a6c6a000 0000000000000000 0000000000000000 fffff8008efc6ea0 : nvlddmkm+0x1ea3a5
fffff807570d0900 ffffe488a6c6a000 : 0000000000000000 0000000000000000 fffff8008efc6ea0 ffffe48800000020 : nvlddmkm+0x20d9e0
fffff807570d0908 0000000000000000 : 0000000000000000 fffff8008efc6ea0 ffffe48800000020 fffff807570d0940 : 0xffffe488`a6c6a000

THREAD_SHA1_HASH_MOD_FUNC: 588f764390cf3a7f40526e2ad7327176945e52e0

THREAD_SHA1_HASH_MOD_FUNC_OFFSET: 5d43bf8a09a8a4779f38975c02632def96effb3a

THREAD_SHA1_HASH_MOD: b65638c546229c8b9c3693c2a82ed471ad9bee32

FOLLOWUP_IP:
nvlddmkm+1ea3a5
fffff800`8ef2a3a5 c3 ret

FAULT_INSTR_CODE: 8accccc3

SYMBOL_STACK_INDEX: 7

SYMBOL_NAME: nvlddmkm+1ea3a5

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nvlddmkm

IMAGE_NAME: nvlddmkm.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 5c8de56a

STACK_COMMAND: .thread ; .cxr ; kb

BUCKET_ID_FUNC_OFFSET: 1ea3a5

FAILURE_BUCKET_ID: 0x133_ISR_nvlddmkm!unknown_function

BUCKET_ID: 0x133_ISR_nvlddmkm!unknown_function

PRIMARY_PROBLEM_CLASS: 0x133_ISR_nvlddmkm!unknown_function

TARGET_TIME: 2019-04-09T17:31:50.000Z

OSBUILD: 17763

OSSERVICEPACK: 404

SERVICEPACK_NUMBER: 0

OS_REVISION: 0

SUITE_MASK: 272

PRODUCT_TYPE: 1

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

OSEDITION: Windows 10 WinNt TerminalServer SingleUserTS

OS_LOCALE:

USER_LCID: 0

OSBUILD_TIMESTAMP: 2022-05-13 04:29:43

BUILDDATESTAMP_STR: 180914-1434

BUILDLAB_STR: rs5_release

BUILDOSVER_STR: 10.0.17763.1.amd64fre.rs5_release.180914-1434

ANALYSIS_SESSION_ELAPSED_TIME: 16d36

ANALYSIS_SOURCE: KM

FAILURE_ID_HASH_STRING: km:0x133_isr_nvlddmkm!unknown_function

FAILURE_ID_HASH: {f97493a5-ea2b-23ca-a808-8602773c2a86}

Followup: MachineOwner

40919-8109-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: srv*
Executable search path is: 
Windows 10 Kernel Version 17763 MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 17763.1.amd64fre.rs5_release.180914-1434
Machine Name:
Kernel base = 0xfffff807`55015000 PsLoadedModuleList = 0xfffff807`55430790
Debug session time: Tue Apr  9 12:31:50.653 2019 (UTC - 5:00)
System Uptime: 0 days 18:52:38.375
Loading Kernel Symbols
...............................................................
................................................................
................................................................
..........
Loading User Symbols
Loading unloaded module list
.............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff807`551c86a0 48894c2408      mov     qword ptr [rsp+8],rcx ss:0018:fffff807`570d0310=0000000000000133
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
	DISPATCH_LEVEL or above. The offending component can usually be
	identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff80755557380, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
	additional information regarding the cumulative timeout
Arg4: 0000000000000000

Debugging Details:
------------------


KEY_VALUES_STRING: 1


PROCESSES_ANALYSIS: 1

SERVICE_ANALYSIS: 1

STACKHASH_ANALYSIS: 1

TIMELINE_ANALYSIS: 1


DUMP_CLASS: 1

DUMP_QUALIFIER: 400

BUILD_VERSION_STRING:  17763.1.amd64fre.rs5_release.180914-1434

SYSTEM_MANUFACTURER:  System manufacturer

SYSTEM_PRODUCT_NAME:  System Product Name

SYSTEM_SKU:  SKU

SYSTEM_VERSION:  System Version

BIOS_VENDOR:  American Megatrends Inc.

BIOS_VERSION:  3402

BIOS_DATE:  07/05/2017

BASEBOARD_MANUFACTURER:  Asus

BASEBOARD_PRODUCT:  H110-PLUS

BASEBOARD_VERSION:  Rev X.0x

DUMP_TYPE:  2

BUGCHECK_P1: 1

BUGCHECK_P2: 1e00

BUGCHECK_P3: fffff80755557380

BUGCHECK_P4: 0

DPC_TIMEOUT_TYPE:  DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

CPU_COUNT: 8

CPU_MHZ: e10

CPU_VENDOR:  GenuineIntel

CPU_FAMILY: 6

CPU_MODEL: 9e

CPU_STEPPING: 9

CPU_MICROCODE: 6,9e,9,0 (F,M,S,R)  SIG: 84'00000000 (cache) 84'00000000 (init)

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXPNP: 1 (!blackboxpnp)


CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

BUGCHECK_STR:  0x133

PROCESS_NAME:  System

CURRENT_IRQL:  d

ANALYSIS_SESSION_HOST:  DESKTOP-L93EKRH

ANALYSIS_SESSION_TIME:  04-09-2019 20:10:07.0748

ANALYSIS_VERSION: 10.0.18317.1001 amd64fre

LAST_CONTROL_TRANSFER:  from fffff80755241a83 to fffff807551c86a0

STACK_TEXT:  
fffff807`570d0308 fffff807`55241a83 : 00000000`00000133 00000000`00000001 00000000`00001e00 fffff807`55557380 : nt!KeBugCheckEx
fffff807`570d0310 fffff807`550f98bf : 0000be3f`046c70f2 fffff807`532c2180 00000000`00000286 00000000`00425d98 : nt!KeAccumulateTicks+0x144b53
fffff807`570d0370 fffff807`55a8647c : 00000000`00000000 fffff807`55aeb750 fffff807`570d07e0 fffff807`55aeb800 : nt!KeClockInterruptNotify+0xcf
fffff807`570d0690 fffff807`55148195 : fffff807`55aeb750 fffff807`55148195 fffff807`55aeb750 fffff807`532c2180 : hal!HalpTimerClockIpiRoutine+0x1c
fffff807`570d06c0 fffff807`551ca09a : fffff807`570d07e0 fffff807`55aeb750 00000000`00000000 fffff807`551c4ab0 : nt!KiCallInterruptServiceRoutine+0xa5
fffff807`570d0710 fffff807`551ca5e7 : 00000000`00000007 fffff800`8ef4a734 fffff807`570d0b30 fffff807`551c4ab0 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
fffff807`570d0760 fffff800`8ef2a3a5 : fffff800`8ef4d9e0 ffffe488`a6c6a000 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatchNoLockNoEtw+0x37
fffff807`570d08f8 fffff800`8ef4d9e0 : ffffe488`a6c6a000 00000000`00000000 00000000`00000000 fffff800`8efc6ea0 : nvlddmkm+0x1ea3a5
fffff807`570d0900 ffffe488`a6c6a000 : 00000000`00000000 00000000`00000000 fffff800`8efc6ea0 ffffe488`00000020 : nvlddmkm+0x20d9e0
fffff807`570d0908 00000000`00000000 : 00000000`00000000 fffff800`8efc6ea0 ffffe488`00000020 fffff807`570d0940 : 0xffffe488`a6c6a000


THREAD_SHA1_HASH_MOD_FUNC:  588f764390cf3a7f40526e2ad7327176945e52e0

THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  5d43bf8a09a8a4779f38975c02632def96effb3a

THREAD_SHA1_HASH_MOD:  b65638c546229c8b9c3693c2a82ed471ad9bee32

FOLLOWUP_IP: 
nvlddmkm+1ea3a5
fffff800`8ef2a3a5 c3              ret

FAULT_INSTR_CODE:  8accccc3

SYMBOL_STACK_INDEX:  7

SYMBOL_NAME:  nvlddmkm+1ea3a5

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: nvlddmkm

IMAGE_NAME:  nvlddmkm.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  5c8de56a

STACK_COMMAND:  .thread ; .cxr ; kb

BUCKET_ID_FUNC_OFFSET:  1ea3a5

FAILURE_BUCKET_ID:  0x133_ISR_nvlddmkm!unknown_function

BUCKET_ID:  0x133_ISR_nvlddmkm!unknown_function

PRIMARY_PROBLEM_CLASS:  0x133_ISR_nvlddmkm!unknown_function

TARGET_TIME:  2019-04-09T17:31:50.000Z

OSBUILD:  17763

OSSERVICEPACK:  404

SERVICEPACK_NUMBER: 0

OS_REVISION: 0

SUITE_MASK:  272

PRODUCT_TYPE:  1

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

OSEDITION:  Windows 10 WinNt TerminalServer SingleUserTS

OS_LOCALE:  

USER_LCID:  0

OSBUILD_TIMESTAMP:  2022-05-13 04:29:43

BUILDDATESTAMP_STR:  180914-1434

BUILDLAB_STR:  rs5_release

BUILDOSVER_STR:  10.0.17763.1.amd64fre.rs5_release.180914-1434

ANALYSIS_SESSION_ELAPSED_TIME:  16d36

ANALYSIS_SOURCE:  KM

FAILURE_ID_HASH_STRING:  km:0x133_isr_nvlddmkm!unknown_function

FAILURE_ID_HASH:  {f97493a5-ea2b-23ca-a808-8602773c2a86}

Followup:     MachineOwner
---------

hi, I meet almost the same dump error, have you solve the problem?

Unsure if this is related, but it’s worth a shot. I experienced similar issues on an X79 platform with newer NVIDIA drivers a while back (not related to running a CUDA app specifically though), and the solution was to switch from IRQ to MSI-based interrupt handling for the newer drivers. More details here on the difference between both: Windows: Line-Based vs. Message Signaled-Based Interrupts. MSI tool. | guru3D Forums

At least on my X79 platform, NVIDIA drivers default to IRQ handling, vs MSI, but it’s possible to change it. Note that any new driver install will override the changes and you’ll have to go back and change them again.

I’ve used the MSI tool linked above to do the change and it eliminated the similar BSOD messages I was getting. Worth a try.