an illegal memory access was encountered?

So I have something like the following: when copying from the GPU to host memory using cudaMemcpy, I get “an illegal memory access was encountered”.

float test,host_test;
cudaMalloc((void**)&test,12*sizeof(float)) ;
cudaMallocHost((void**)&host_test,12*sizeof(float)) ;

then something like this

void tt(float *test, float *host_test){

//copy1
cudaMemcpy(host_test,test,8*sizeof(float),cudaMemcpyDeviceToHost);

tt2(test,host_test);

}

void tt2(float *test, float *host_test){

//copy2
cudaMemcpy(host_test,test,8*sizeof(float),cudaMemcpyDeviceToHost);

}

copy1 runs fine but copy2 fails with “an illegal memory access was encountered”.

Is this due to the double pointer or something else?

It looks like you might wanna switch up your arguments in the second call to the copy.

I’d also recommend dropping C-style CUDA and just using Thrust. I only say this because this is literally trivial in Thrust.

By “switch up” do you mean putting the 2nd copy at the top level rather than in a function called from another function with a double pointer? I’ll take a look at Thrust.

Is Thrust performance generally the same as CUDA C?

Also, for syntax:

thrust::copy(mystruct->test(1),mystruct->test(6),mystruct->host_test(2));// correct?

Does Thrust automatically free the RAM after it’s done?

Oh, maybe I read your code wrong…

As for Thrust, it might be faster than your code if you back host_vector with pinned host memory instead of the normal malloc’d memory (it isn’t pinned by default, but you can swap in a pinned allocator).
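
If you do want pinned host memory explicitly, here’s a minimal sketch, assuming the experimental pinned allocator that ships with CUDA-8-era Thrust:

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

// host_vector backed by cudaMallocHost-style pinned memory,
// which speeds up transfers to and from the device
typedef thrust::system::cuda::experimental::pinned_allocator<float> pinned_alloc;
thrust::host_vector<float, pinned_alloc> pinned_vals(12);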

I’m not sure about your copy either… I’d need to see what the structure is.

But basically with Thrust you can do:

thrust::host_vector<int> host_vals(1000, -1); // 1000 elements, all set to -1
thrust::device_vector<int> device_vals(host_vals); // copies host values to device
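
Going the other way (device back to host, which is what your cudaMemcpy calls were doing) is just as terse; a minimal sketch:

thrust::host_vector<int> back_on_host = device_vals; // copies device values back to host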

This is good because it opens you up to all the Thrust algorithms as well.

There is thrust::copy but it’s not “better” than native vector copy operations. Instead, copy is a bit cooler when you use it over fancy iterators like zip_iterator.
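
For instance, a minimal sketch of copy over a zip_iterator (hypothetical vectors a and b, copied pairwise into a vector of tuples):

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

thrust::device_vector<int> a(8), b(8);
thrust::device_vector< thrust::tuple<int,int> > pairs(8);

// one copy moves both sequences at once as (a[i], b[i]) tuples
thrust::copy(thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin())),
             thrust::make_zip_iterator(thrust::make_tuple(a.end(), b.end())),
             pairs.begin());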

OK thanks. Since Thrust is C++, does that mean I can use C++ classes/objects in CUDA? I thought CUDA had always been C.

If I’ve already declared a struct with pointers and allocated the memory using the CUDA C method, can I still use Thrust to copy the device pointer’s contents back to the host? For example:

typedef struct
{
float *test;
float *host_test;
} mystruct;

void main(){

mystruct *test_strct = (mystruct*)malloc(sizeof(mystruct));

cudaMalloc((void**)&test_strct->test,16*sizeof(float)) ;
cudaMallocHost((void**)&test_strct->host_test,12*sizeof(float)) ;

//declare some content in device using cudaMemset

}

If I want to copy the device content (test_strct->test) to the host, how would I do that? (I tried thrust::copy but got all 0s.) Or do I have to declare a device_vector first?

Very early versions of CUDA were indeed based on C. Ever since GPUs have provided the necessary hardware features (compute capability >= 2.0), CUDA has been in the C++ family. The switch happened around CUDA 3.0 if I recall correctly, around 2010. If you look at the CUDA documentation you can see that the current version at this time (CUDA 8.0) supports numerous C++11 features.

Trying to figure out the best way to copy raw device pointers to the host using Thrust that doesn’t have any performance hit.
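
For what it’s worth, one pattern that should work for exactly this is wrapping the raw pointer with thrust::device_pointer_cast, so Thrust knows the data lives on the device; a minimal sketch using the struct from above (n is a hypothetical element count):

#include <thrust/device_ptr.h>
#include <thrust/copy.h>

// wrap the raw device pointer so Thrust dispatches to the device backend
thrust::device_ptr<float> d_begin = thrust::device_pointer_cast(test_strct->test);

// copy n elements straight into ordinary host memory
thrust::copy(d_begin, d_begin + n, test_strct->host_test);

Without the cast, thrust::copy treats a raw pointer as a host pointer, which would explain the all-zero results.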

This can’t work:

float test;
cudaMalloc((void**)&test,12*sizeof(float)) ;

cudaMalloc expects to be passed the address of a pointer. If successful, it will deposit a pointer to the allocated memory at that location. Use something like this instead:

#define TEST_ARRAY_LEN (12)
float *testD = 0;
cudaError_t status;
status = cudaMalloc ((void**)&testD, TEST_ARRAY_LEN * sizeof (testD[0]));
if (status != cudaSuccess) ...

Ah, this sounds like the C programmer’s fallacy. C++ is oftentimes home to what’s known as zero-cost abstractions. Classes are types, and more often than not, types are used as thin wrappers for things like overload resolution.

These kinds of abstractions are either zero-cost or are so small in cost that it’s not worth eschewing them when it comes to making your code faster.
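
A minimal C++ sketch of the idea (hypothetical wrapper types that exist only to steer overload resolution and compile away entirely):

// thin wrappers: same size and layout as a bare float*,
// but distinct types the compiler can overload on
struct device_buf { float *p; };
struct host_buf { float *p; };

void fill(device_buf dst); // which overload runs is decided at compile time,
void fill(host_buf dst);   // so there is no runtime cost over passing a float*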

In fact, copying arrays will likely not be your bottleneck. Instead, I can say that it’ll almost always be an unnecessary copy or an algorithmic design flaw that ends up being your bottleneck.

txbob is like a Thrust wizard and when I was starting to learn CUDA, I’d be shocked by what he’d come up with and how it actually sped my code up. Again, the improvements were largely algorithmic because using the framework I was able to better express my problem.

CUDA C is incredibly ugly and full of boilerplate, just like real C. You can always eschew Thrust; you don’t have to use it. There are other libraries as well, like ArrayFire and CUB. You can teach yourself how to transfer memory and such for education purposes, but I wouldn’t structure an entire project around it.

CUDA is a programming language in the C++ family, and like standard C++ itself allows programmers to choose whatever style they prefer, from very high-level code using the latest C++11 features to the low-level C subset of C++.

For performance work, C++ is great if the programmer understands the hidden performance implications of what they are doing. In my experience, that is rarely the case, and in many cases the answer to “why is my code so slow” is “because it is a wild jumble of inappropriately used C++ features and excessive abstractions”. C-subset C++ code at least makes it obvious where work is done; there are usually no hidden gotchas. C interfaces can also be valuable for interfacing with other programming languages.

So the choice of CUDA programming style is yours, and Thrust is an excellent way to harness the collective expertise of its authors in a library that provides both excellent C++ abstraction and high performance. It has allowed many CUDA programmers to get their applications implemented on GPUs within just a week of work, with excellent speedups over the CPU version.