nvprof error code 139 but memcheck OK

I’m new to CUDA programming. I’m trying to use nvprof to profile the program from this tutorial: https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/.

However, when I ran

nvprof ./add_cuda

I got

==12188== NVPROF is profiling process 12188, command: ./add_cuda
Max error: 0
==12188== Profiling application: ./add_cuda
==12188== Profiling result:
No kernels were profiled.

==12188== API calls:
No API activities were profiled.
==12188== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

I don’t think it is nvprof’s fault, because I also tested a sample program from NVIDIA’s examples and it profiled fine.

So I then tried cuda-memcheck to check whether there are any problems in my program, but memcheck didn’t give any useful information either.

cuda-memcheck ./add_cuda
========= CUDA-MEMCHECK
Max error: 0
========= ERROR SUMMARY: 0 errors
zns@zns-gpu:~/Public/test$ cuda-memcheck --leak-check full --error-exitcode ./add_cuda
========= CUDA-MEMCHECK
========= Nothing to check
========= No CUDA-MEMCHECK results found
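
(Side note: --error-exitcode takes a numeric value, so in the second command it presumably consumed ./add_cuda as its argument, which would explain the “Nothing to check” output. A corrected invocation would look something like:

cuda-memcheck --leak-check full --error-exitcode 1 ./add_cuda
)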

So, what can I do to make nvprof work?

Here’s my code

#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  cudaProfilerStop();
  return 0;
}
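
(For reference, the tutorial builds this with plain nvcc, roughly:

nvcc add.cu -o add_cuda

which is presumably how the ./add_cuda binary above was produced.)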

Hi penguinnn,

I built the sample you gave and ran profiling; everything works fine here.
Maybe there is some problem with your build process.
How do you build the sample? Also, which toolkit/driver/GPU are you using?

Sample output is “Max error: 0”

root@DEVTOOLS-QA76:~/cuda-workspace/test/Debug# nvprof ./test
==16896== NVPROF is profiling process 16896, command: ./test
Max error: 0
==16896== Profiling application: ./test
==16896== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 936.96ms 1 936.96ms 936.96ms 936.96ms add(int, float*, float*)
API calls: 84.96% 936.97ms 1 936.97ms 936.97ms 936.97ms cudaDeviceSynchronize
14.84% 163.65ms 2 81.823ms 134.20us 163.51ms cudaMallocManaged
0.12% 1.2799ms 188 6.8070us 110ns 344.52us cuDeviceGetAttribute
0.04% 401.64us 2 200.82us 198.05us 203.59us cuDeviceTotalMem
0.03% 349.80us 2 174.90us 171.89us 177.90us cudaFree
0.01% 85.688us 1 85.688us 85.688us 85.688us cudaLaunch
0.01% 81.508us 2 40.754us 38.324us 43.184us cuDeviceGetName
0.00% 4.4650us 3 1.4880us 165ns 3.7750us cudaSetupArgument
0.00% 4.2570us 1 4.2570us 4.2570us 4.2570us cudaConfigureCall
0.00% 2.1150us 3 705ns 147ns 1.7270us cuDeviceGetCount
0.00% 825ns 4 206ns 127ns 410ns cuDeviceGet

==16896== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 1.329280ms Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 644.0640us Device To Host
24 - - - - 2.522912ms Gpu page fault groups
Total CPU Page faults: 36

I am having this exact issue, too. Did you find the source of the problem?

Hi, I’m new here, just adding a data point (still investigating). This is some code from online that I slightly modified. With N <= 16 nvprof works; with N >= 17, nvprof fails with error 139.

nvprof --unified-memory-profiling per-process-device ./unifiedMemTest

#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<10;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch kernel on N elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  cudaProfilerStop();
  cudaDeviceReset();

  return 0;
}

It’s a bug in unified memory profiling; the flag

--unified-memory-profiling off

resolves all problems for me.
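
For example (using the binary name from the original post; substitute your own executable):

nvprof --unified-memory-profiling off ./add_cuda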

I’m also seeing the same issue. If we have an integer array larger than 2^15 elements that’s allocated via managed memory, nvprof/nvvp craps out.

My setup: CUDA 8.0, Titan Xp (Pascal) with driver version 381.09.

So much for memory over-subscription if such a tiny array causes a segfault like this! :-(

Can someone from NVIDIA confirm if this is being looked at, or is already fixed in 9.0? I’m kinda stuck here :(

I have an application whose working set is larger than this “2^15” limit, and I’d like to analyze its number of GPU page faults.

It’s fixed for me in 9.0.

updating to 9.0 fixed the issue for me as well.

(It seemed that nvprof ran fine if there was only one GPU page fault in the code, but choked when there was more than one.)

This also worked for me on CUDA 8.0. Thanks!

EDIT: Having now read Unified Memory for CUDA Beginners | NVIDIA Technical Blog, I think it’s clearer when to turn this flag on/off.
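
That post suggests prefetching managed memory to the GPU before the kernel launch, which avoids the GPU page faults in the first place. A minimal sketch, assuming the add_cuda example from the top of this thread (N, x, y, blockSize and numBlocks come from that code; the device-ID query is my own addition):

// Prefetch the managed arrays to the GPU before launching the kernel,
// so the kernel does not have to fault the pages in on demand.
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);

add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();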

I got the same error in v9.1. Any idea how to fix that?

updating to 9.0 fixed the issue for me as well.

It may also help to run nvprof as root (on Linux) if you are getting the 139 error.

--unified-memory-profiling off

Adding this flag worked for me too, on CUDA 10.1.

FWIW:

I had this exact same problem with a very simple cuBLAS program. Quite strange, because it was working fine; then I increased the matrix dimension from 1024 to 2048 and the problem started, and it didn’t go away even after reverting back to 1024!

I tried --unified-memory-profiling off and also --concurrent-kernels off; nothing helped.
The problem can sometimes be with the unified memory system.

With the following code it works fine:

checkCUBLAS(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M_SIZE, M_SIZE, M_SIZE, &alpha, A, M_SIZE, B, M_SIZE, &beta, C, M_SIZE));

// Added these two lines after kernel execution
checkCuda(cudaStreamAttachMemAsync(NULL, C, 0, cudaMemAttachHost));
checkCuda(cudaStreamSynchronize(NULL));

// Now we can access C without seg fault!
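
For context, here is a minimal self-contained version of that workaround, built with something like nvcc -o gemm_test gemm_test.cu -lcublas. The matrix size, the initialization, and the omission of the checkCUBLAS/checkCuda error-checking macros are my own simplifications, not the original poster’s code:

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void)
{
  const int M_SIZE = 1024;
  const size_t bytes = (size_t)M_SIZE * M_SIZE * sizeof(float);

  // Managed (unified) memory for the three matrices
  float *A, *B, *C;
  cudaMallocManaged(&A, bytes);
  cudaMallocManaged(&B, bytes);
  cudaMallocManaged(&C, bytes);
  for (int i = 0; i < M_SIZE * M_SIZE; i++) {
    A[i] = 1.0f;
    B[i] = 1.0f;
    C[i] = 0.0f;
  }

  cublasHandle_t handle;
  cublasCreate(&handle);

  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M_SIZE, M_SIZE, M_SIZE,
              &alpha, A, M_SIZE, B, M_SIZE, &beta, C, M_SIZE);

  // Re-attach C to the host and synchronize before the CPU touches it
  cudaStreamAttachMemAsync(NULL, C, 0, cudaMemAttachHost);
  cudaStreamSynchronize(NULL);

  // With A and B all ones, every element of C should equal M_SIZE
  printf("C[0] = %f (expected %d)\n", C[0], M_SIZE);

  cublasDestroy(handle);
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  return 0;
}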