atomicLoad in CUDA through PTX ISA

The purpose of this question is I need to prevent threads to load the shared data from caches.
I know that there is no explicit method of atomicLoad in CUDA. Two similar methods are using exsited atomic Read-Modify-Write functions and volatile key word. But I don’t want to bear the extra performance overhead through RMW functions. The volatile is not compatible with my current codes. As a result, I decide to use inline PTX ISA codes.
However, I know nothing about assembly codes. Could anybody tell me how to use the PTX ISA to implement the function of atomicLoad? Thank you very much.
Perhaps the prototype is like this. In addtion, the shared data may locate in both the same or different warps.

unsigned long long int atomicLoad(unsigned long long int* addr){
    asm("
    "
    );
}

perhaps now is the time to learn

You can do a global load (or store) while bypassing the L1 cache using the cache modifier .cg on a PTX ld (or st) instruction:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators

To build an inline PTX instruction, I suggest reading the inline PTX manual:

http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#axzz4opBHyGs7

There are examples of inline PTX assembly routines in many places if you care to look:

https://stackoverflow.com/questions/37149662/how-to-write-lop3-based-instructions-for-maxwell-and-up-nvidia-architecture

https://stackoverflow.com/questions/28881491/how-can-i-find-out-which-thread-is-getting-executed-on-which-core-of-the-gpu

as well as many example here on these devtalk forums.

Thanks, txbob. I will try to solve it by myself.

Hello, txbob,
finally, I think these codes is suitable for my occastion, but there is a strange compiling error I don’t know why.

__device__ unsigned int ptxLoad(unsigned int* global) {
	unsigned int local;
	asm("ld.global.cg.u32 %0, %1;"
		:"=r"(local) : "r"(global));
	return local;
}
Severity	Code	Description	Project	File	Line	Suppression State
Error		asm operand type size(8) does not match type/size implied by constraint 'r'

a pointer ( unsigned int *global) is an item that is 8 bytes on a 64-bit architecture.

You’re using an incorrect constraint letter (r) for an 8-byte quantity:

[url]Inline PTX Assembly :: CUDA Toolkit Documentation

I think the proper solution is like this:

__device__ inline unsigned long long int ptxAtomicLoad(unsigned long long int* global) {
	unsigned long long int local;
	asm("ld.global.cg.u64 %0, [%1];"
		:"=l"(local) : "l"(global));
	return local;
}