cudaSafeCall() Runtime API error 11
Hi,

almost 24h stucked with this error, which I can't solve out.

here is the situation:

page-locked host memory allocation
[code]
bytesS = numOfStreams * size;
o_bytesS = numOfStreams * otherSize;

T* h_idata = NULL; // h_input
T* h_odata = NULL; // h_output
cutilSafeCall( cudaMallocHost((void**) &h_idata, bytesS) );
cutilSafeCall( cudaMallocHost((void**) &h_odata, o_bytesS) );
[/code]

device memory allocation
[code]
T* d_idata = NULL; // d_input
T* d_odata = NULL; // d_output
cutilSafeCallNoSync( cudaMalloc((void**) &d_idata, bytesS) );
cutilSafeCallNoSync( cudaMalloc((void**) &d_odata, o_bytesS) );
[/code]

streams creation
[code]
cudaStream_t *stream = (cudaStream_t*) malloc( numOfStreams*sizeof(cudaStream_t) );
for(int i=0; i<numOfStreams; i++){
cutilSafeCall( cudaStreamCreate(&(stream[i])) );
}
[/code]

running...
[code]
for(int i=0; i<numOfStreams; i++){
// here I get the error
cutilSafeCall( cudaMemcpyAsync(d_idata + i*bytesS, h_idata + i*bytesS, bytesS, cudaMemcpyHostToDevice, stream[i]) );

.... // launch kernel, etc.
}
[/code]

Since the structure seems to me quite correct, and as well it follows the instructions given in the programming guide concerning stream handling and async copies, I don't really understand what is wrong here.

Device: GeForce GTX 285

could somebody explain what the problem is
Hi,



almost 24h stucked with this error, which I can't solve out.



here is the situation:



page-locked host memory allocation



bytesS = numOfStreams * size;

o_bytesS = numOfStreams * otherSize;



T* h_idata = NULL; // h_input

T* h_odata = NULL; // h_output

cutilSafeCall( cudaMallocHost((void**) &h_idata, bytesS) );

cutilSafeCall( cudaMallocHost((void**) &h_odata, o_bytesS) );




device memory allocation



T* d_idata = NULL; // d_input

T* d_odata = NULL; // d_output

cutilSafeCallNoSync( cudaMalloc((void**) &d_idata, bytesS) );

cutilSafeCallNoSync( cudaMalloc((void**) &d_odata, o_bytesS) );




streams creation



cudaStream_t *stream = (cudaStream_t*) malloc( numOfStreams*sizeof(cudaStream_t) );

for(int i=0; i<numOfStreams; i++){

cutilSafeCall( cudaStreamCreate(&(stream[i])) );

}




running...



for(int i=0; i<numOfStreams; i++){

// here I get the error

cutilSafeCall( cudaMemcpyAsync(d_idata + i*bytesS, h_idata + i*bytesS, bytesS, cudaMemcpyHostToDevice, stream[i]) );



.... // launch kernel, etc.

}




Since the structure seems to me quite correct, and as well it follows the instructions given in the programming guide concerning stream handling and async copies, I don't really understand what is wrong here.



Device: GeForce GTX 285



could somebody explain what the problem is

#1
Posted 03/08/2012 10:43 AM   
Hi,
Just out of curiosity, did you check which value i holds when you encounter the error?
Could it be due to a previously launched kernel inside your loop?
Hi,

Just out of curiosity, did you check which value i holds when you encounter the error?

Could it be due to a previously launched kernel inside your loop?

#2
Posted 03/08/2012 10:54 AM   
[quote name='Gilles_C' date='08 March 2012 - 05:54 AM' timestamp='1331204083' post='1380140']
Hi,
Just out of curiosity, did you check which value i holds when you encounter the error?
Could it be due to a previously launched kernel inside your loop?
[/quote]

uhm, what do you mean with "which value it holds"?

this is how I "initialize" the input data:
[code]
for(int i=0; i<size; i++) {
// Keep the numbers small
h_idata[i] = (rand() & 0xFF) / (T)RAND_MAX;
}
[/code]

'size' is the number of values upon which I have to operate

no other kernel is launched inside the loop. anyway, to avoid problems I already put a cudaDeviceSynchronize() right before the loop.
[quote name='Gilles_C' date='08 March 2012 - 05:54 AM' timestamp='1331204083' post='1380140']

Hi,

Just out of curiosity, did you check which value i holds when you encounter the error?

Could it be due to a previously launched kernel inside your loop?





uhm, what do you mean with "which value it holds"?



this is how I "initialize" the input data:



for(int i=0; i<size; i++) {

// Keep the numbers small

h_idata[i] = (rand() & 0xFF) / (T)RAND_MAX;

}




'size' is the number of values upon which I have to operate



no other kernel is launched inside the loop. anyway, to avoid problems I already put a cudaDeviceSynchronize() right before the loop.

#3
Posted 03/08/2012 11:33 AM   
I was referring to the loop counter "i". What is its value when you get the error?
Basically, what I want to know is if it's at the first attempt to call cudaMemcpyAsync() that the error is detected, or if it's during a subsequent attempt.
In which case, the error detection might not be due to the cudaMemcpyAsync() itself, but rather to a previous and not yet detected error. For example, it could be the case that one of your kernels (as launched inside the loop right after the call to cudaMemcpyAsync()) could have trigger the error.
Does that make sense?
I was referring to the loop counter "i". What is its value when you get the error?

Basically, what I want to know is if it's at the first attempt to call cudaMemcpyAsync() that the error is detected, or if it's during a subsequent attempt.

In which case, the error detection might not be due to the cudaMemcpyAsync() itself, but rather to a previous and not yet detected error. For example, it could be the case that one of your kernels (as launched inside the loop right after the call to cudaMemcpyAsync()) could have trigger the error.

Does that make sense?

#4
Posted 03/08/2012 12:57 PM   
[quote name='Gilles_C' date='08 March 2012 - 01:57 PM' timestamp='1331211428' post='1380162']
I was referring to the loop counter "i". What is its value when you get the error?
Basically, what I want to know is if it's at the first attempt to call cudaMemcpyAsync() that the error is detected, or if it's during a subsequent attempt.
In which case, the error detection might not be due to the cudaMemcpyAsync() itself, but rather to a previous and not yet detected error. For example, it could be the case that one of your kernels (as launched inside the loop right after the call to cudaMemcpyAsync()) could have trigger the error.
Does that make sense?
[/quote]

ok, I totally misunderstood your first reply!

anyway, this is a slightly modified version, which aims to allow each stream to process its portion of input data ( recall [font="Courier New"][size="2"]bytes = size * sizeof(T);[/size][/font] )
[code]
for(int i=0; i<numOfStreams; i++){
cudaStreamSynchronize(stream[i]);
printf("i_i: %d\n;", i);
// here I get the error
cutilSafeCall( cudaMemcpyAsync(d_idata + i*bytes, h_idata + i*bytes, bytes, cudaMemcpyHostToDevice, stream[i]) );

// ARE dimensions right? recall I allocated 'numOfStreams*bytes' space. so copying 'bytes' amount of data for
// each stream should do the work. isn't it?


.... // launch kernel, etc.
}
[/code]

adding some prints, it turns out that the first attempt works fine (prints "i_i: 0"); the error is detected at the second attempt, because it prints "i_i: 1;" and then reports the error.

the sync on the stream doesn't change much!
[quote name='Gilles_C' date='08 March 2012 - 01:57 PM' timestamp='1331211428' post='1380162']

I was referring to the loop counter "i". What is its value when you get the error?

Basically, what I want to know is if it's at the first attempt to call cudaMemcpyAsync() that the error is detected, or if it's during a subsequent attempt.

In which case, the error detection might not be due to the cudaMemcpyAsync() itself, but rather to a previous and not yet detected error. For example, it could be the case that one of your kernels (as launched inside the loop right after the call to cudaMemcpyAsync()) could have trigger the error.

Does that make sense?





ok, I totally misunderstood your first reply!



anyway, this is a slightly modified version, which aims to allow each stream to process its portion of input data ( recall bytes = size * sizeof(T); )



for(int i=0; i<numOfStreams; i++){

cudaStreamSynchronize(stream[i]);

printf("i_i: %d\n;", i);

// here I get the error

cutilSafeCall( cudaMemcpyAsync(d_idata + i*bytes, h_idata + i*bytes, bytes, cudaMemcpyHostToDevice, stream[i]) );



// ARE dimensions right? recall I allocated 'numOfStreams*bytes' space. so copying 'bytes' amount of data for

// each stream should do the work. isn't it?





.... // launch kernel, etc.

}




adding some prints, it turns out that the first attempt works fine (prints "i_i: 0"); the error is detected at the second attempt, because it prints "i_i: 1;" and then reports the error.



the sync on the stream doesn't change much!

#5
Posted 03/08/2012 03:04 PM   
So now that you know that i==1 when the error is detected, you might check if the error is already set prior to call cudaMemcpyAsync().
Just call cudaGetLastError() straight and check its result. This way you'll know if the error actually comes from your call to cudaMemcpyAsync().
Does that make sense to you?
So now that you know that i==1 when the error is detected, you might check if the error is already set prior to call cudaMemcpyAsync().

Just call cudaGetLastError() straight and check its result. This way you'll know if the error actually comes from your call to cudaMemcpyAsync().

Does that make sense to you?

#6
Posted 03/08/2012 03:18 PM   
[quote name='Gilles_C' date='08 March 2012 - 04:18 PM' timestamp='1331219903' post='1380202']
So now that you know that i==1 when the error is detected, you might check if the error is already set prior to call cudaMemcpyAsync().
Just call cudaGetLastError() straight and check its result. This way you'll know if the error actually comes from your call to cudaMemcpyAsync().
Does that make sense to you?
[/quote]

uhm, that's the reason why I use the 'cudaSafeCall' wrapper: it tells me exactly that at row X (the row above) there's an error.

Precisely, it says:
"[i]file.cpp(321) : cudaSafeCall() Runtime API error 11: invalid argument.[/i]"

so I know that the error comes from there.

and actually, error 11 is a "[i]cudaErrorInvalidValue[/i]", which means that "[i]...one or more of the parameters passed to the API call is not within an acceptable range of values.[/i]"

I do understand what that error means, but still cannot figure out where is the mistake.


btw, thanks for your concern.
[quote name='Gilles_C' date='08 March 2012 - 04:18 PM' timestamp='1331219903' post='1380202']

So now that you know that i==1 when the error is detected, you might check if the error is already set prior to call cudaMemcpyAsync().

Just call cudaGetLastError() straight and check its result. This way you'll know if the error actually comes from your call to cudaMemcpyAsync().

Does that make sense to you?





uhm, that's the reason why I use the 'cudaSafeCall' wrapper: it tells me exactly that at row X (the row above) there's an error.



Precisely, it says:

"file.cpp(321) : cudaSafeCall() Runtime API error 11: invalid argument."



so I know that the error comes from there.



and actually, error 11 is a "cudaErrorInvalidValue", which means that "...one or more of the parameters passed to the API call is not within an acceptable range of values."



I do understand what that error means, but still cannot figure out where is the mistake.





btw, thanks for your concern.

#7
Posted 03/08/2012 04:39 PM   
[quote name='faabiioo' date='08 March 2012 - 04:39 PM' timestamp='1331224746' post='1380227']
uhm, that's the reason why I use the 'cudaSafeCall' wrapper: it tells me exactly that at row X (the row above) there's an error.
[/quote]
Not necessarily, as Gilles_C has been trying to point out in the past few posts. If you look at the [url="http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY_g732efed5ab5cb184c920a21eb36e8ce4.html#g732efed5ab5cb184c920a21eb36e8ce4"]documentation for cudaMemcpyAsync()[/url], it says "Note that this function may also return error codes from previous, asynchronous launches."
[quote name='faabiioo' date='08 March 2012 - 04:39 PM' timestamp='1331224746' post='1380227']

uhm, that's the reason why I use the 'cudaSafeCall' wrapper: it tells me exactly that at row X (the row above) there's an error.



Not necessarily, as Gilles_C has been trying to point out in the past few posts. If you look at the documentation for cudaMemcpyAsync(), it says "Note that this function may also return error codes from previous, asynchronous launches."

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#8
Posted 03/08/2012 04:51 PM   
[quote name='tera' date='08 March 2012 - 05:51 PM' timestamp='1331225504' post='1380231']
Not necessarily, as Gilles_C has been trying to point out in the past few posts. If you look at the [url="http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY_g732efed5ab5cb184c920a21eb36e8ce4.html#g732efed5ab5cb184c920a21eb36e8ce4"]documentation for cudaMemcpyAsync()[/url], it says "Note that this function may also return error codes from previous, asynchronous launches."
[/quote]

Ok, you're right.
I tried with cudaGetLastError, indeed, but it did not report anything.
[quote name='tera' date='08 March 2012 - 05:51 PM' timestamp='1331225504' post='1380231']

Not necessarily, as Gilles_C has been trying to point out in the past few posts. If you look at the documentation for cudaMemcpyAsync(), it says "Note that this function may also return error codes from previous, asynchronous launches."





Ok, you're right.

I tried with cudaGetLastError, indeed, but it did not report anything.

#9
Posted 03/08/2012 05:09 PM   
ok, now that's even more annoying because I solved it by trials and errors but I'm not really aware of what I've done. Or better, I know what I've done but I'm not sure I understand what the problem was.

so, to recall something:
[code]
// host side memory size
unsigned int bytes = size * sizeof(T);
unsigned int bytesS = numOfStreams * bytes;

// device memory size
unsigned int o_bytes = numBlocks*sizeof(T);
unsigned int o_bytesS = numOfStreams*(o_bytes);

// allocate page-locked host memory
T* h_idata = NULL;
T* h_odata = NULL;
cutilSafeCall( cudaMallocHost((void**) &h_idata, bytesS) );
cutilSafeCall( cudaMallocHost((void**) &h_odata, o_bytesS) );

// allocate device memory and data
T* d_idata = NULL;
T* d_odata = NULL;
cutilSafeCall( cudaMalloc((void**) &d_idata, bytesS) );
cutilSafeCall( cudaMalloc((void**) &d_odata, o_bytesS) );

// array of streams handles
cudaStream_t *stream = (cudaStream_t*) malloc(numOfStreams*sizeof(cudaStream_t));
for(int i=0; i<numOfStreams; i++){
cutilSafeCall( cudaStreamCreate(&(stream[i])) );
}

// run kernels on streams - NOW WORKING!
for(int i=0; i<numOfStreams; i++){
cudaStreamSynchronize(stream[i]);

cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, size, cudaMemcpyHostToDevice, stream[i]) );
cutilSafeCall( cudaMemcpyAsync(d_odata + i * numBlocks, h_odata + i * numBlocks, numBlocks, cudaMemcpyHostToDevice, stream[i]) );

reduceS<T>(size, numThreads, numBlocks, d_idata, d_odata, stream[i]);

.... // rest of the code
}
[/code]

basically what I changed is the quantity of bytes copied and the offsets, which are now ruled by 'size', that is the number of elements upon which I work. I don't understand why the previous quantities were wrong and caused the problem!

maybe it's a dumb question but...
ok, now that's even more annoying because I solved it by trials and errors but I'm not really aware of what I've done. Or better, I know what I've done but I'm not sure I understand what the problem was.



so, to recall something:



// host side memory size

unsigned int bytes = size * sizeof(T);

unsigned int bytesS = numOfStreams * bytes;



// device memory size

unsigned int o_bytes = numBlocks*sizeof(T);

unsigned int o_bytesS = numOfStreams*(o_bytes);



// allocate page-locked host memory

T* h_idata = NULL;

T* h_odata = NULL;

cutilSafeCall( cudaMallocHost((void**) &h_idata, bytesS) );

cutilSafeCall( cudaMallocHost((void**) &h_odata, o_bytesS) );



// allocate device memory and data

T* d_idata = NULL;

T* d_odata = NULL;

cutilSafeCall( cudaMalloc((void**) &d_idata, bytesS) );

cutilSafeCall( cudaMalloc((void**) &d_odata, o_bytesS) );



// array of streams handles

cudaStream_t *stream = (cudaStream_t*) malloc(numOfStreams*sizeof(cudaStream_t));

for(int i=0; i<numOfStreams; i++){

cutilSafeCall( cudaStreamCreate(&(stream[i])) );

}



// run kernels on streams - NOW WORKING!

for(int i=0; i<numOfStreams; i++){

cudaStreamSynchronize(stream[i]);



cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, size, cudaMemcpyHostToDevice, stream[i]) );

cutilSafeCall( cudaMemcpyAsync(d_odata + i * numBlocks, h_odata + i * numBlocks, numBlocks, cudaMemcpyHostToDevice, stream[i]) );



reduceS<T>(size, numThreads, numBlocks, d_idata, d_odata, stream[i]);



.... // rest of the code

}




basically what I changed is the quantity of bytes copied and the offsets, which are now ruled by 'size', that is the number of elements upon which I work. I don't understand why the previous quantities were wrong and caused the problem!



maybe it's a dumb question but...

#10
Posted 03/09/2012 04:25 PM   
The problem is that you mix-up between the size of data to transfer in bytes, and the number of elements to transfer.
The pointer arithmetic should be expressed using number of elements, and the argument giving the size to transfer should be expressed in bytes.
The line should become (if I read correctly the code, which I'm not so sure since the various names look misleading to me):

[code]
cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, bytes, cudaMemcpyHostToDevice, stream[i]) );
[/code]
The problem is that you mix-up between the size of data to transfer in bytes, and the number of elements to transfer.

The pointer arithmetic should be expressed using number of elements, and the argument giving the size to transfer should be expressed in bytes.

The line should become (if I read correctly the code, which I'm not so sure since the various names look misleading to me):





cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, bytes, cudaMemcpyHostToDevice, stream[i]) );

#11
Posted 03/09/2012 07:44 PM   
[quote name='Gilles_C' date='09 March 2012 - 08:44 PM' timestamp='1331322282' post='1380691']
The problem is that you mix-up between the size of data to transfer in bytes, and the number of elements to transfer.
The pointer arithmetic should be expressed using number of elements, and the argument giving the size to transfer should be expressed in bytes.
The line should become (if I read correctly the code, which I'm not so sure since the various names look misleading to me):

[code]
cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, bytes, cudaMemcpyHostToDevice, stream[i]) );
[/code]
[/quote]

actually it works either ways, but I do understand what you say. and that could also be the solution for another problem I encountered.
I mean it works fine either with size as 'count' parameter or with bytes

thanks a lot
[quote name='Gilles_C' date='09 March 2012 - 08:44 PM' timestamp='1331322282' post='1380691']

The problem is that you mix-up between the size of data to transfer in bytes, and the number of elements to transfer.

The pointer arithmetic should be expressed using number of elements, and the argument giving the size to transfer should be expressed in bytes.

The line should become (if I read correctly the code, which I'm not so sure since the various names look misleading to me):





cutilSafeCall( cudaMemcpyAsync(d_idata + i * size, h_idata +i * size, bytes, cudaMemcpyHostToDevice, stream[i]) );






actually it works either ways, but I do understand what you say. and that could also be the solution for another problem I encountered.

I mean it works fine either with size as 'count' parameter or with bytes



thanks a lot

#12
Posted 03/09/2012 09:28 PM   
[quote name='faabiioo' date='09 March 2012 - 09:28 PM' timestamp='1331328522' post='1380715']
I mean it works fine either with size as 'count' parameter or with bytes
[/quote]
It certainly doesn't, unless you happen to test with a type whose size happens to be one byte ('char' comes to mind).
[quote name='faabiioo' date='09 March 2012 - 09:28 PM' timestamp='1331328522' post='1380715']

I mean it works fine either with size as 'count' parameter or with bytes



It certainly doesn't, unless you happen to test with a type whose size happens to be one byte ('char' comes to mind).

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#13
Posted 03/09/2012 09:39 PM   
Scroll To Top