Unified memory and concurrent C++ objects

Hello everyone,

Here’s a very simple piece of code in which concurrent C++ objects each execute a kernel that modifies a private array, then synchronize, then print the result. In each object, the private array is allocated with the cudaMallocManaged function. I’m compiling this code with:

/usr/local/cuda-8.0/bin/nvcc -std=c++11 -gencode arch=compute_62,code=sm_62 example.cu

and running it on a Jetson TX2.

I could not make this run on my Jetson TX1 for a good reason:

“Simultaneous access to managed memory from the CPU and GPUs of compute capability lower than 6.0 is not possible. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault.”

But according to that same quote, this code should work on my TX2, yet I get segmentation faults there as well.

What am I missing?

Thanks in advance!

#include <cstdio>
#include <thread>
#include <vector>

#define SIZE 100
#define N_THREADS 10

__global__ void kernel(int *x) {
  int i = threadIdx.x;
  x[i] = x[i] + i;
}

class Foo
{
 public:
  Foo(int id)
  {
    // One managed allocation per object, initialized from the CPU.
    cudaMallocManaged(&_bar, SIZE * sizeof(int));
    for (int i = 0; i < SIZE; i++)
      _bar[i] = id;
  }

  ~Foo()
  {
    cudaFree(_bar);
  }

  void work()
  {
    // Launch, synchronize, then read the result back on the CPU.
    kernel<<<1, SIZE>>>(_bar);
    cudaDeviceSynchronize();
    printf("%d %d %d\n", _bar[0], _bar[1], _bar[2]);
  }

 private:
  int *_bar;
};

int main()
{
  std::thread t[N_THREADS];
  std::vector<Foo*> f;

  for (int i = 0; i < N_THREADS; i++)
    f.push_back(new Foo(i));

  // Each CPU thread runs work() on its own object.
  for (int i = 0; i < N_THREADS; i++)
    t[i] = std::thread([&](int n) { f[n]->work(); }, i);

  for (int i = 0; i < N_THREADS; i++)
    t[i].join();

  for (int i = 0; i < N_THREADS; i++)
    delete f[i];

  return 0;
}

Hi,

Our guess is that this error is not caused by concurrent access.
We can reproduce the segmentation fault even with no GPU read/write code:

__global__ void kernel() {
  // int i = threadIdx.x;
  // x[i] = x[i] + i;
}

class Foo
{
  ...
  void work()
  {
    kernel<<<1, SIZE>>>();
    cudaDeviceSynchronize();
    printf("%d %d %d\n", _bar[0], _bar[1], _bar[2]);
  }
  ...
};

We are discussing this internally and will update you later.
Thanks.

Hi,

Sorry for the late reply.

After checking, we also see strange behavior when reading the unified memory.
The failure rate is also much lower when the app is run under cuda-memcheck.

We have passed this issue to our internal CUDA team.
We will update this topic once we have further information.

Thanks.

Great!

I hope you guys can work this out without hurting performance, because unified memory is way faster than zero-copy memory (at least in our application).

Hi,

Is this the same issue as in my topic? [url]https://devtalk.nvidia.com/default/topic/1028582/why-can-t-i-use-unified-memory-in-two-cpu-thread-/[/url]

Thanks!

Hi, ClancyLian

Could you share the source code from topic 1028582 so we can check it further?
https://devtalk.nvidia.com/default/topic/1028582/why-can-t-i-use-unified-memory-in-two-cpu-thread-/

Thanks.

Yes ClancyLian, I haven’t seen your code, but it sounds very much like the same issue.

Hi, AastaLLL and mescarra

It is exactly the same question! When I set

N_THREADS 1

the segmentation faults do not happen.

And because the calculation in your kernel is not complex, the error may not occur with 2 threads, or 3 threads, or…

Thanks.

Hi,

Thanks for the confirmation.

This issue has already been passed to our internal team for checking.
We will post an update here if there are any findings.

Update:

The source shared in comment #1 is incorrect.

It doesn’t attach the managed memory to any stream, so all the managed memory in the app is global. All the threads launch, synchronize, and read. If one thread launches a kernel while another is reading on the CPU, the code is bound to fail, because the kernel launch makes all global managed memory inaccessible to the CPU.
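
To make the hazard concrete, here is a condensed, self-contained sketch of the pattern described above (a hypothetical minimal repro; the names kernel, a, and b are illustrative, and whether it actually faults depends on thread timing):

#include <cstdio>
#include <thread>

__global__ void kernel(int *x) { x[threadIdx.x] += 1; }

int main()
{
  int *a, *b;
  cudaMallocManaged(&a, 32 * sizeof(int));
  cudaMallocManaged(&b, 32 * sizeof(int));
  a[0] = b[0] = 0;

  // Thread A launches into the default stream. While its kernel is in
  // flight, ALL unattached managed allocations (including b) become
  // inaccessible to the CPU.
  std::thread tA([&] {
    kernel<<<1, 32>>>(a);
    cudaDeviceSynchronize();
  });

  // Thread B only reads a *different* allocation from the CPU, yet if
  // the read overlaps thread A's kernel it can still fault on devices
  // without concurrent managed access.
  std::thread tB([&] {
    printf("%d\n", b[0]);
  });

  tA.join();
  tB.join();

  cudaFree(a);
  cudaFree(b);
  return 0;
}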

All these usages are well documented in the unified memory section of the programming guide:
[url]https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd[/url]

Most importantly:

“If an allocation is not associated with a specific stream, it is visible to all running kernels regardless of their stream. This is the default visibility for a cudaMallocManaged() allocation or a managed variable; hence, the simple-case rule that the CPU may not touch the data while any kernel is running.”

To use unified memory from multiple CPU threads, you should use multiple streams and attach the memory to a stream with cudaStreamAttachMemAsync. Using the per-thread default streams feature would also simplify the programming.
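
As an illustration, here is a minimal, untested sketch of how the Foo class from comment #1 could be adapted along these lines (same kernel, SIZE, and main as before; error checking omitted):

class Foo
{
 public:
  Foo(int id)
  {
    cudaStreamCreate(&_stream);
    cudaMallocManaged(&_bar, SIZE * sizeof(int));
    // Associate the allocation with this object's stream, so kernels
    // launched in other streams no longer make _bar inaccessible to
    // the CPU.
    cudaStreamAttachMemAsync(_stream, _bar, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(_stream);  // let the attachment take effect
    for (int i = 0; i < SIZE; i++)
      _bar[i] = id;
  }

  ~Foo()
  {
    cudaFree(_bar);
    cudaStreamDestroy(_stream);
  }

  void work()
  {
    // Launch into this object's own stream and synchronize only on it.
    kernel<<<1, SIZE, 0, _stream>>>(_bar);
    cudaStreamSynchronize(_stream);
    printf("%d %d %d\n", _bar[0], _bar[1], _bar[2]);
  }

 private:
  int *_bar;
  cudaStream_t _stream;
};

Each object now owns a stream and its allocation, so work() only synchronizes its own stream, and CPU reads of _bar no longer conflict with kernels launched by other threads.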

Thanks