CUDA 8.0 RC adds three new load intrinsics that save you from writing PTX. Together with the existing __ldg(), the cache-hinted loads are:
- __ldg() : ld.global.nc : load via non-coherent cache — first seen in sm_35
- __ldca() : ld.global.ca : load and cache at all levels
- __ldcg() : ld.global.cg : load and cache at global level (in L2 and below, but not in L1)
- __ldcs() : ld.global.cs : streaming load, cached with an evict-first policy because the data is expected to be accessed only once
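As a rough sketch of how these might be used together: the kernel below reads a large input exactly once with __ldcs() and a small, reused table with __ldg(). The kernel name `scale_stream`, the host helper `run_scale_stream`, and the table length are all made up for illustration, and I'm assuming you compile for a recent enough architecture (e.g. `nvcc -arch=sm_50` or newer). Whether the hints actually help is something to measure, not assume.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = in[i] * scale[i % scale_len].
__global__ void scale_stream(const float* __restrict__ in,
                             const float* __restrict__ scale,
                             float* __restrict__ out,
                             int n, int scale_len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ld.global.cs: each input element is touched once, so mark it evict-first.
        float x = __ldcs(in + i);
        // ld.global.nc: the small table is read-only and reused by many threads.
        float s = __ldg(scale + (i % scale_len));
        out[i] = x * s;
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns the
// number of mismatches against a CPU reference, or -1 on a CUDA error.
int run_scale_stream(int n)
{
    const int scale_len = 4;
    const float h_scale[scale_len] = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1024);

    float *d_in = nullptr, *d_scale = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_scale, scale_len * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;

    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_scale, h_scale, scale_len * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_stream<<<blocks, threads>>>(d_in, d_scale, d_out, n, scale_len);

    // The blocking copy back also surfaces any launch or execution error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_scale);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * h_scale[i % scale_len]) ++mismatches;
    return mismatches;
}
```

Calling `run_scale_stream(1 << 20)` from `main` should return 0 if everything worked. Swapping __ldcs() for __ldcg() or __ldca() changes only the cache hint, not the result, so it is easy to compare the variants in a profiler.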
I’ll guess that these intrinsics and the new ATOM/RED scope modifiers are important for interacting with “distant” GPUs over NVLink or another fabric.