I need to move memory around on the device, but there is no function like cudaMemmove() available to do so. As far as I can tell from the docs, cudaMemcpy() isn’t safe for overlapping ranges, regardless of whether you move upward or downward.
Generally: you don’t. Just copy it to a temporary buffer and back again.
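A minimal sketch of the temporary-buffer approach. The names (d_data, src_off, dst_off, n) are assumptions for illustration; any overlap between source and destination is harmless because the data passes through d_tmp:

```cuda
#include <cuda_runtime.h>

// Move n bytes within a device allocation, memmove-style, via a
// temporary device buffer. Hypothetical helper, not a CUDA API.
cudaError_t devMemmove(char *d_data, size_t dst_off, size_t src_off, size_t n)
{
    char *d_tmp = NULL;
    cudaError_t err = cudaMalloc((void **)&d_tmp, n);
    if (err != cudaSuccess) return err;

    // Stage the source region in the temporary buffer...
    err = cudaMemcpy(d_tmp, d_data + src_off, n, cudaMemcpyDeviceToDevice);
    if (err == cudaSuccess)
        // ...then copy it back to the (possibly overlapping) destination.
        err = cudaMemcpy(d_data + dst_off, d_tmp, n, cudaMemcpyDeviceToDevice);

    cudaFree(d_tmp);
    return err;
}
```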
But if you really have to, you can implement it as a rotate. A rotate can (unless I’m missing something) be implemented in CUDA as multiple “local” rotates (each operating on a chunk of elements that fits into shared memory), which requires multiple kernel calls.
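One standard way to build an in-place rotate (not necessarily the scheme the reply has in mind) is the three-reversal trick: reverse the first k elements, reverse the remaining n-k, then reverse the whole array. A reversal kernel parallelizes trivially, since each thread swaps a disjoint pair:

```cuda
// Reverse d[lo..hi) in place; each thread swaps one disjoint pair.
__global__ void reverse(int *d, size_t lo, size_t hi)
{
    size_t n = hi - lo;
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n / 2) {
        int a = d[lo + i];
        int b = d[hi - 1 - i];
        d[lo + i]     = b;
        d[hi - 1 - i] = a;
    }
}

// Rotating d[0..n) left by k is then three kernel launches:
//   reverse<<<...>>>(d, 0, k);
//   reverse<<<...>>>(d, k, n);
//   reverse<<<...>>>(d, 0, n);
```

This avoids any temporary buffer at the cost of reading and writing every element twice.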
You can of course implement it with lots of cudaMemcpy calls, though that is almost certainly pointless (you need vector length / shift distance memcpy calls).
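A sketch of that chunked approach, for shifting a region of n bytes toward lower addresses (dst_off < src_off; names are assumptions for illustration). Copying front to back in chunks of exactly `shift` bytes guarantees that no individual cudaMemcpy has overlapping source and destination, at the cost of n / shift calls:

```cuda
#include <cuda_runtime.h>

// Shift n bytes within a device allocation from src_off down to dst_off
// using non-overlapping chunked copies. Hypothetical helper.
void devShiftDown(char *d_base, size_t dst_off, size_t src_off, size_t n)
{
    size_t shift = src_off - dst_off;   // requires dst_off < src_off
    for (size_t done = 0; done < n; done += shift) {
        size_t chunk = (n - done < shift) ? (n - done) : shift;
        cudaMemcpy(d_base + dst_off + done,
                   d_base + src_off + done,
                   chunk, cudaMemcpyDeviceToDevice);
    }
}
```

With a small shift distance this degenerates into a huge number of tiny copies, which is why the reply calls it almost certainly pointless.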
Reimar is correct. If you use a temp buffer (and can afford the space), you typically get better performance, as reads and writes to a given bank can be coalesced better.