Hi,
I am a newbie with using decuda and I tried disassembling a simple CUDA kernel as shown below:
// Kernel that executes on the CUDA device global void twice_array(float a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = 2a[idx];
}
Now, the second last instruction uses an f32 data type. I am wondering if this is an error because both the source operands are u32 type. AFIAK, the value loaded from the global memory should be going into a floating register. Or is it that decuda just uses the same register name for all data type-floating and unsigned? The destination data type is floating though.
Also, I know that the first instruction loads the block dimension blockDim.x. But why is it not represented as a register or a memory location as opposed to some variable name?
Registers are just registers, there are no specialized registers for floating point or integer on the hardware level. Don’t get misled by the fact that decuda tries to follow PTX syntax as closely as possible. Decuda output is not PTX. It is all based on reverse engineering of the hardware.
Decuda is kind of outdated as well. Because of decuda’s existence, Nvidia has made their own disassembler publicly available. However it disassembles to its own syntax which is not documented.
Yes, I am aware that Decuda disassembles to SASS and not PTX. What are your thoughts on the other question with the %ntid.y parameter? Mind you, s[0x00c] represents blockIdx.x. So at the least it should have been %ntid.x or in a better sense, the register or memory location where it is stored rather than some variable.
I am also wondering on one more issue with the assembly generated. The set.le.s32 $p0|$o127, s[0x0018], $r0 instruction is basically doing the comparison of idx and N. According to the instruction, if N<idx, the predicate $p0 is set to 1. However the next predicate instruction @$p0.ne return disables all threads that have the predicate set to 0, which actually correspond to threads with their idx < N!
Shouldn’t it be @$p0.eq return, so that all threads with idx > N are disabled?