Missing CUDA inline PTX constraint letter for 8-bit operands

I’ve seen in the documentation that you can use:

"h" = .u16 reg
"r" = .u32 reg
"l" = .u64 reg
"f" = .f32 reg
"d" = .f64 reg

as constraint letters when inlining PTX in C code. I was wondering if you know why there isn’t a u8 constraint. For example, if I want to split a register into 2 parts, I can do:

asm("mov.b32 {%0,%1}, %2;" : "=h"(u1), "=h"(u2) : "r"(s));

But if I want to split it into 4 parts, how can I do it?

asm("mov.b32 {%0,%1,%2,%3}, %4;" : "=?"(u1), "=?"(u2), "=?"(u3), "=?"(u4) : "r"(s));

I don’t understand why there isn’t a constraint letter for 8 bits.

I am not sure, but I believe this simply reflects hardware capabilities. The sm_1x architecture supported 16-bit register halves that were individually accessible. As far as I know, no NVIDIA GPU architecture has ever offered 8-bit registers (or register portions) as addressable entities (in contrast with x86, where one can access the lower eight bits of registers as AL, BL, etc.).

The thing is I want to split a register into 4 bytes and I currently do it like this:

__device__ __forceinline__ void uint32_to_uchars(const uint32_t s, int *u1, int *u2, int *u3, int *u4) {
    *u1 = s & 0x000000ff;
    *u2 = (s >> 8) & 0x000000ff;
    *u3 = (s >> 16) & 0x000000ff;
    *u4 = (s >> 24) & 0x000000ff;
}

__device__ __forceinline__ uint32_t uchars_to_uint32(int u1, int u2, int u3, int u4) {
    return u1 | (u2 << 8) | (u3 << 16) | (u4 << 24);
}

I wanted to use this mov instruction because I’ve seen it translate to i2i.u32.u16 instructions, and I’d like to use i2i.u32.u8, but instead it is generating more instructions (ANDs, ORs, and shifts).

I see nothing wrong with your current implementations; I am not sure why you want to change them. Do these inline functions show up as bottlenecks when you profile your code?

Given that the hardware has no support for directly addressing individual 8-bit portions of a register, how would you propose that the compiler translate "mov.b32 {%0,%1,%2,%3}, %4;" into machine code? Seems to me it would have to emulate it, by extracting the bytes via SHR/AND, or PRMT, or BFE.
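For illustration only (this sketch is mine, not part of the original reply): if one wanted to hand-write that emulation, a PRMT-based inline-PTX version could look like the code below, using 32-bit "r" operands since no 8-bit constraint exists. The helper name is made up, and it still costs one instruction per extracted byte.

#include <cstdint>

__device__ __forceinline__ void uint32_to_uchars_ptx(uint32_t s, int *u1, int *u2, int *u3, int *u4)
{
    unsigned int b0, b1, b2, b3;
    // Each PRMT copies one byte of 's' into byte 0 of its result; the other
    // result bytes are taken from the zero operand, so the byte is zero-extended.
    asm("prmt.b32 %0, %4, %5, 0x4440;\n\t"
        "prmt.b32 %1, %4, %5, 0x4441;\n\t"
        "prmt.b32 %2, %4, %5, 0x4442;\n\t"
        "prmt.b32 %3, %4, %5, 0x4443;"
        : "=r"(b0), "=r"(b1), "=r"(b2), "=r"(b3)
        : "r"(s), "r"(0u));
    *u1 = (int)b0;
    *u2 = (int)b1;
    *u3 = (int)b2;
    *u4 = (int)b3;
}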

Side remark: For uchars_to_uint32(), it may be slightly faster to use '+' instead of '|', but I have not tried it. Assuming u{1,2,3,4} are in [0,255], the four fields do not overlap, so the two operations are interchangeable; however, the compiler probably cannot prove that the transformation is safe here, and as a consequence cannot change one into the other. You could also try three calls to __byte_perm(), which maps to the PRMT instruction, and likewise use __byte_perm() to extract the individual bytes inside uint32_to_uchars(); again, I am not sure whether that would be faster.
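A rough sketch of those __byte_perm() variants (untested; the selector constants and helper names are my own, and the code assumes u1 through u4 are already in [0,255]):

#include <cstdint>

// __byte_perm(x, y, s) builds a 32-bit result from the byte pool {x, y}:
// selector nibbles 0-3 pick bytes of x, selector nibbles 4-7 pick bytes of y.
__device__ __forceinline__ uint32_t uchars_to_uint32_prmt(int u1, int u2, int u3, int u4)
{
    uint32_t lo = __byte_perm(u1, u2, 0x1140);  // u1 | (u2 << 8)
    uint32_t hi = __byte_perm(u3, u4, 0x1140);  // u3 | (u4 << 8)
    return __byte_perm(lo, hi, 0x5410);         // u1 | (u2<<8) | (u3<<16) | (u4<<24)
}

__device__ __forceinline__ void uint32_to_uchars_prmt(uint32_t s, int *u1, int *u2, int *u3, int *u4)
{
    *u1 = __byte_perm(s, 0, 0x4440);  // zero-extend byte 0 of s
    *u2 = __byte_perm(s, 0, 0x4441);  // zero-extend byte 1 of s
    *u3 = __byte_perm(s, 0, 0x4442);  // zero-extend byte 2 of s
    *u4 = __byte_perm(s, 0, 0x4443);  // zero-extend byte 3 of s
}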

Yet another approach would be to create a union of a uint32_t and a uchar4, then use structure member access for extraction and compositing:

__device__ __forceinline__ void uint32_to_uchars (uint32_t s, int *u1, int *u2, int *u3, int *u4) 
{
   union {
      uint32_t i;
      uchar4   c;
   } xtr;
   xtr.i = s;
   *u1 = xtr.c.x;
   *u2 = xtr.c.y;
   *u3 = xtr.c.z;
   *u4 = xtr.c.w;
}

No guarantees that this would be any faster than the code the compiler generates now.
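For the compositing direction, a hypothetical counterpart along the same lines (again just a sketch, assuming u1 through u4 fit in a byte; uchar4 and make_uchar4 come from the standard CUDA headers):

__device__ __forceinline__ uint32_t uchars_to_uint32 (int u1, int u2, int u3, int u4)
{
   union {
      uint32_t i;
      uchar4   c;
   } cmp;
   cmp.c = make_uchar4 ((unsigned char)u1, (unsigned char)u2,
                        (unsigned char)u3, (unsigned char)u4);
   return cmp.i;
}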

I am a heavy-user of unions in my kernels and am generally pleased with the generated code.

Although it’s tempting to bit-manipulate words, I would suggest creating unions that list the largest native type first (which may be a vector type), followed by any other native vector types you need, followed by one or more structs of subword types and/or bitfields.

CUDA C++ supports anonymous unions and structs, and placing a supported CUDA native type first in your union makes them somewhat easier to work with when loading and storing.

Also note that single-instruction bfi/bfe bit-field ops are often generated for subword operations.
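Putting those suggestions together, here is one possible sketch of such a union (illustrative only; the type and member names are invented, and uint32_t is assumed to come from <cstdint>):

union word32 {
    uint32_t u32;            // largest native scalar type first
    uchar4   u8x4;           // native vector types next
    ushort2  u16x2;
    struct {                 // anonymous struct of subword bitfields
        uint32_t lo8   : 8;
        uint32_t mid16 : 16;
        uint32_t hi8   : 8;
    };
};

__device__ int extract_mid16 (uint32_t s)
{
    word32 w;
    w.u32 = s;
    return w.mid16;          // a middle field like this often maps to a single BFE
}

__device__ uint32_t insert_mid16 (uint32_t s, int v)
{
    word32 w;
    w.u32 = s;
    w.mid16 = v;             // a partial field update often maps to BFI
    return w.u32;
}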

I’m trying to reduce the executed instructions because that’s what is limiting this kernel. I think I misunderstood the i2i instructions. Your suggestions are really helpful; I’ll try them and see what works best. Thank you!

One other point… the PTX pack/unpack operations are usually erased when you actually generate SASS code.

If you’re interested in counting instructions, you should also dump the SASS with cuobjdump or nvdisasm.