The purpose of this question: I need to prevent threads from loading shared data out of their caches.
I know that there is no explicit atomicLoad method in CUDA. Two common workarounds are using the existing atomic read-modify-write (RMW) functions and using the volatile keyword. But I don’t want to bear the extra performance overhead of RMW functions, and volatile is not compatible with my current code. As a result, I have decided to use inline PTX assembly.
However, I know nothing about assembly. Could anybody tell me how to use the PTX ISA to implement an atomicLoad function? Thank you very much.
Perhaps the prototype would look like this. In addition, the threads accessing the shared data may be in the same warp or in different warps.
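For context, here is a minimal sketch of the RMW workaround mentioned above (the overhead I want to avoid): `atomicAdd` with an operand of zero leaves the value unchanged but returns the old value atomically. The function name `rmwLoad` is my own.

```cuda
// Sketch: an "atomic load" built from an existing RMW function.
// Adding 0 does not modify the value, but atomicAdd still returns the
// old value atomically -- at the cost of a full read-modify-write.
__device__ unsigned long long int rmwLoad(unsigned long long int* addr) {
    return atomicAdd(addr, 0ULL);
}
```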
```
unsigned long long int atomicLoad(unsigned long long int* addr) {
    asm(/* PTX load goes here */);
}
```
Thanks, txbob. I will try to solve it myself.
Hello, txbob,
finally, I think this code suits my case, but there is a strange compile error I don’t understand.
```
__device__ unsigned int ptxLoad(unsigned int* global) {
    unsigned int local;
    asm("ld.global.cg.u32 %0, %1;"
        : "=r"(local) : "r"(global));
    return local;
}
```
Error: asm operand type size(8) does not match type/size implied by constraint 'r'
A pointer (`unsigned int* global`) is an item that is 8 bytes on a 64-bit architecture.
You’re using an incorrect constraint letter (`r`, which implies a 32-bit operand) for an 8-byte quantity. See the constraint table in the Inline PTX Assembly section of the CUDA Toolkit documentation.
I think the proper solution is like this (`asm volatile` plus a `"memory"` clobber keeps the compiler from caching or reordering the load across iterations):

```
__device__ inline unsigned long long int ptxAtomicLoad(unsigned long long int* global) {
    unsigned long long int local;
    asm volatile("ld.global.cg.u64 %0, [%1];"
                 : "=l"(local) : "l"(global) : "memory");
    return local;
}
```
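A hypothetical usage sketch (the kernel and the `flag`/`out` names are my own assumptions): since `ld.global.cg` bypasses L1 and loads at the L2 level, a polling loop built on this load can observe stores made by threads on other SMs.

```cuda
// Hypothetical consumer kernel: spin until a producer thread publishes
// a nonzero value to *flag (the producer should use __threadfence()
// after its store so the write is visible device-wide).
__global__ void consumer(unsigned long long int* flag,
                         unsigned long long int* out) {
    unsigned long long int v;
    // ld.global.cg loads at L2, bypassing the per-SM L1 cache,
    // so each iteration re-reads the current global value.
    while ((v = ptxAtomicLoad(flag)) == 0ULL) {
        // busy-wait
    }
    *out = v;
}
```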