Is the instruction "mov.b32 {r,g,b,a},%r1;" even supported??

Hey experts!

I am having trouble with splitting an integer into 4 uint8_ts.

I can’t seem to get this to work. The compiler just says “Arguments mismatch for instruction ‘mov’”, but I can’t find a constraint letter for a uint8_t or .b8 operand.

U32 t = 0x00112233;
U32 a, b, c, d;
asm("mov.b32 {%0,%1,%2,%3}, %4; " : "=r"(a), "=r"(b), "=r"(c), "=r"(d) : "r"(t));

Any thoughts?

See: Missing CUDA inline PTX constraint letter for 8 bit variables in order to disable L1 cache for 8 bit variable (bool) - Stack Overflow

If you provided a little context about what you’re trying to do, I could probably suggest a way around the PTX limitation.

For example, you could use bfe (bit field extract) to unpack each byte into an integer value (this is likely what the mov instruction would be converted to anyway if it worked, since you don’t need sign extension).
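A minimal sketch of that (the helper name is mine; bfe.u32 zero-extends the extracted field):

__device__ __forceinline__ unsigned int extract_byte(unsigned int x, unsigned int i)
{
    unsigned int r;
    // bfe.u32 d, a, pos, len: extract 8 bits starting at bit i*8, zero-extended
    asm("bfe.u32 %0, %1, %2, 8;" : "=r"(r) : "r"(x), "r"(i * 8));
    return r;
}

// usage: a = extract_byte(t, 0); b = extract_byte(t, 1); and so on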

Or you could look at the scalar video instructions (vmad, vadd, vabsdiff, etc.) and perhaps do what you need to do in place.

Rather than using the video instructions at the PTX level, I would suggest using CUDA’s SIMD intrinsics; otherwise you could find yourself in a world of hurt when working on a non-Kepler platform.
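For instance, a hedged sketch using __vabsdiffu4, one of the CUDA SIMD intrinsics (hardware-accelerated on Kepler, emulated in software elsewhere):

__device__ unsigned int absdiff_bytes(unsigned int a, unsigned int b)
{
    // per-byte |a - b| across the four packed unsigned bytes
    return __vabsdiffu4(a, b);
}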

I think the scalar video PTX instructions are fine. It’s the SIMD video instructions you want to avoid (vadd2, vadd4, etc.). Maxwell and Pascal have native support for VADD, VMAD, VABSDIFF, VMNMX, and VSET.

And of course sm_61 now has the dp4a and dp2a instructions as well.
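A quick sketch of the corresponding intrinsic (requires compiling for sm_61; the wrapper name is mine):

__device__ int dot4(int a, int b, int acc)
{
    // four-way byte dot product with 32-bit accumulate:
    // acc + a.byte0*b.byte0 + ... + a.byte3*b.byte3
    return __dp4a(a, b, acc);
}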

Since it seemed to me that the OP was interested in packed-byte operations, I assumed the interest would be in the SIMD video instructions. When or where are the scalar video instructions useful, in your experience? I have never had a need for them, nor have I come across a customer use case where they seemed beneficial.

The utility of the SIMD video instructions for various image-processing and bioscience applications is immediately obvious, and I wasn’t happy when I learned that NVIDIA decided to largely rip them out post-Kepler (I think support for byte-wise absolute differences was retained).

For just re-interpreting the data in two different ways, a union of ‘unsigned int’ and ‘uchar4’ should work well enough; I don’t see the need to drop to PTX inline assembly for that.
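A minimal sketch of the union approach (names are mine; on the GPU, .x holds the least significant byte):

union pack32 {
    unsigned int u;
    uchar4 v;
};

__device__ uchar4 split(unsigned int t)
{
    pack32 p;
    p.u = t;   // for t = 0x00112233: p.v.x == 0x33, p.v.w == 0x00
    return p.v;
}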

njuffa’s union suggestion is a good way to avoid having to use explicit bfe instructions; just let the compiler work that out for you. My thinking tends to be biased toward the low-level implementation side of things.

As far as scalar video instructions go… they can be handy for doing things in fixed-point math. And vmad is actually pretty useful for multiplying two integers when you know the operands fit within 16 bits. On Maxwell/Pascal the compiler typically generates at least three XMADs for an integer multiply, regardless of the data type. You can use something like this to get that down to a single instruction and reduce register pressure:

__device__ __forceinline__ int mad16(int a, int b, int c)
{
    int res;
    // the .h0 selectors pick the low 16 bits of each operand, so this is
    // a 16x16 multiply plus 32-bit add in a single instruction
    asm("vmad.s32.u32.u32 %0, %1.h0, %2.h0, %3;" : "=r"(res) : "r"(a), "r"(b), "r"(c));
    return res;
}

It would be nice if the compiler were a little smarter about using the minimal number of XMADs for integer multiplies of various data types.

Thanks for the insight into the utility of scalar video instructions.

The number of integer multiply variants (at the PTX level) that need to be emulated by XMAD sequences on Maxwell and following architectures is quite high, around twenty if I recall correctly. For wider data types, one can usually construct multiple variants with different throughput, latency, and register pressure, making it difficult to pick one that is “optimal”.

Although I was not directly involved (I had grown tired of cooking up emulation sequences), I am under the impression that considerable effort went into constructing fast integer-multiply emulation code for Maxwell, which, however, does not mean better code cannot be found. Optimization is very much an iterative process, with additional improvements possible at each iteration.

If you know of any emulation sequences that appear particularly deficient, or have ideas for improved variants, I would suggest filing RFEs to get those improvements into PTXAS for future versions of CUDA.

I basically just want variants of these two (the first can be achieved with vmad):

signed and unsigned 16bit * 16bit:

XMAD.U16.U16 d, a, b, c;
XMAD.S16.S16 d, a, b, c;

signed and unsigned 16bit (a) * 32bit (b):

XMAD.U16.U16     d, a, b,    c;
XMAD.U16.U16.PSL d, a, b.H1, d;

XMAD.S16.S16     d, a, b,    c;
XMAD.S16.S16.PSL d, a, b.H1, d;

This second one is particularly handy for pointer offset arithmetic. It requires no additional registers and is one instruction shorter than the standard three-instruction sequence.
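For reference, a hedged C sketch of what that second sequence computes (unsigned case, wrapping mod 2^32; the function name is mine):

__device__ unsigned int mad_u16_u32(unsigned int a, unsigned int b, unsigned int c)
{
    // d = (low 16 bits of a) * b + c, i.e. a 16x32-bit multiply-add
    return (a & 0xffffu) * b + c;
}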