Decuda error?
Hi,
I am a newbie with using [i]decuda [/i]and I tried disassembling a simple CUDA kernel as shown below:

// Kernel that executes on the CUDA device
__global__ void twice_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = 2*a[idx];
}

The cubin disassembled output is as follows:

// Disassembling _Z11twice_arrayPfi
000000: 10000205 40004780 mov.u16 $r0.hi, %ntid.y
000008: a0000005 04000780 cvt.rn.u32.u16 $r1, $r0.lo
000010: 60014c01 00204780 mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
000018: 3000cdfd 6c20c7c8 set.le.s32 $p0|$o127, s[0x0018], $r0
000020: 30000003 00000280 @$p0.ne return
000028: 30020001 c4100780 shl.u32 $r0, $r0, 0x00000002
000030: 2000c805 04200780 add.u32 $r1, s[0x0010], $r0
000038: d00e0201 80c00780 mov.u32 $r0, g[$r1]
[b]000040: b0000001 00000780 add.rn.f32 $r0, $r0, $r0[/b]
000048: d00e0201 a0c00781 mov.end.u32 g[$r1], $r0

Now, the second last instruction uses an f32 data type. I am wondering if this is an error because both the source operands are u32 type. AFIAK, the value loaded from the global memory should be going into a floating register. Or is it that decuda just uses the same register name for all data type-floating and unsigned? The destination data type is floating though.

Your help is greatly appreciated.

Thanks.

Murtaza
Hi,

I am a newbie with using decuda and I tried disassembling a simple CUDA kernel as shown below:



// Kernel that executes on the CUDA device

__global__ void twice_array(float *a, int N)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx<N) a[idx] = 2*a[idx];

}



The cubin disassembled output is as follows:



// Disassembling _Z11twice_arrayPfi

000000: 10000205 40004780 mov.u16 $r0.hi, %ntid.y

000008: a0000005 04000780 cvt.rn.u32.u16 $r1, $r0.lo

000010: 60014c01 00204780 mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1

000018: 3000cdfd 6c20c7c8 set.le.s32 $p0|$o127, s[0x0018], $r0

000020: 30000003 00000280 @$p0.ne return

000028: 30020001 c4100780 shl.u32 $r0, $r0, 0x00000002

000030: 2000c805 04200780 add.u32 $r1, s[0x0010], $r0

000038: d00e0201 80c00780 mov.u32 $r0, g[$r1]

000040: b0000001 00000780 add.rn.f32 $r0, $r0, $r0

000048: d00e0201 a0c00781 mov.end.u32 g[$r1], $r0



Now, the second last instruction uses an f32 data type. I am wondering if this is an error because both the source operands are u32 type. AFIAK, the value loaded from the global memory should be going into a floating register. Or is it that decuda just uses the same register name for all data type-floating and unsigned? The destination data type is floating though.



Your help is greatly appreciated.



Thanks.



Murtaza

#1
Posted 03/23/2012 11:39 PM   
Also, I know that the first instruction loads the block dimension [i]blockDim.x[/i]. But why is it not represented as a register or a memory location as opposed to some variable name?

Thanks.
Also, I know that the first instruction loads the block dimension blockDim.x. But why is it not represented as a register or a memory location as opposed to some variable name?



Thanks.

#2
Posted 03/24/2012 12:31 AM   
Hey guys,

Anyone please?

Thanks.
Hey guys,



Anyone please?



Thanks.

#3
Posted 03/24/2012 01:45 PM   
Registers are just registers, there are no specialized registers for floating point or integer on the hardware level. Don't get misled by the fact that decuda tries to follow PTX syntax as closely as possible. Decuda output is not PTX. It is all based on reverse engineering of the hardware.

Decuda is kind of outdated as well. Because of decuda's existence, Nvidia has made their own disassembler publicly available. However it disassembles to its own syntax which is not documented.
Registers are just registers, there are no specialized registers for floating point or integer on the hardware level. Don't get misled by the fact that decuda tries to follow PTX syntax as closely as possible. Decuda output is not PTX. It is all based on reverse engineering of the hardware.



Decuda is kind of outdated as well. Because of decuda's existence, Nvidia has made their own disassembler publicly available. However it disassembles to its own syntax which is not documented.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#4
Posted 03/24/2012 02:48 PM   
Thanks tera for your response.

Yes, I am aware that Decuda disassembles to SASS and not PTX. What are your thoughts on the other question with the %ntid.y parameter? Mind you, s[0x00c] represents blockIdx.x. So at the least it should have been %ntid.x or in a better sense, the register or memory location where it is stored rather than some variable.


Regards.

-Murtaza
Thanks tera for your response.



Yes, I am aware that Decuda disassembles to SASS and not PTX. What are your thoughts on the other question with the %ntid.y parameter? Mind you, s[0x00c] represents blockIdx.x. So at the least it should have been %ntid.x or in a better sense, the register or memory location where it is stored rather than some variable.





Regards.



-Murtaza

#5
Posted 03/24/2012 03:09 PM   
Anyone on the %ntid issue please?

Thanks.
Anyone on the %ntid issue please?



Thanks.

#6
Posted 04/02/2012 09:53 PM   
Seems so. cuobjdump outputs [font="Courier New"]MOV.U16 R0H, g [0x1].U16;[/font] which is ntid.x.
Seems so. cuobjdump outputs MOV.U16 R0H, g [0x1].U16; which is ntid.x.

Always check return codes of CUDA calls for errors. Do not use __syncthreads() in conditional code unless the condition is guaranteed to evaluate identically for all threads of each block. Run your program under cuda-memcheck to detect stray memory accesses. If your kernel dies for larger problem sizes, it might exceed the runtime limit and trigger the watchdog timer.

#7
Posted 04/02/2012 10:51 PM   
[quote name='tera' date='02 April 2012 - 05:51 PM' timestamp='1333407098' post='1391198']
Seems so. cuobjdump outputs [font="Courier New"]MOV.U16 R0H, g [0x1].U16;[/font] which is ntid.x.
[/quote]


Thank you Tera. I appreciate your response.

Regards,
[quote name='tera' date='02 April 2012 - 05:51 PM' timestamp='1333407098' post='1391198']

Seems so. cuobjdump outputs MOV.U16 R0H, g [0x1].U16; which is ntid.x.







Thank you Tera. I appreciate your response.



Regards,

#8
Posted 04/02/2012 10:59 PM   
Hi,

I am also wondering on one more issue with the assembly generated. The [b]set.le.s32 $p0|$o127, s[0x0018], $r0[/b] instruction is basically doing the comparison of idx and N. According to the instruction, if N<idx, the predicate $p0 is set to 1. However the next predicate instruction [b]@$p0.ne return[/b] disables all threads that have the predicate set to 0, which actually correspond to threads with their idx < N!

Shouldn't it be [b]@$p0.eq return[/b], so that all threads with idx > N are disabled?

Thanks.

Murtaza
Hi,



I am also wondering on one more issue with the assembly generated. The set.le.s32 $p0|$o127, s[0x0018], $r0 instruction is basically doing the comparison of idx and N. According to the instruction, if N<idx, the predicate $p0 is set to 1. However the next predicate instruction @$p0.ne return disables all threads that have the predicate set to 0, which actually correspond to threads with their idx < N!



Shouldn't it be @$p0.eq return, so that all threads with idx > N are disabled?



Thanks.



Murtaza

#9
Posted 07/02/2012 08:11 PM   
[quote name='murtazam' date='02 July 2012 - 03:11 PM' timestamp='1341259908' post='1429098']
Hi,

I am also wondering on one more issue with the assembly generated. The [b]set.le.s32 $p0|$o127, s[0x0018], $r0[/b] instruction is basically doing the comparison of idx and N. According to the instruction, if N<idx, the predicate $p0 is set to 1. However the next predicate instruction [b]@$p0.ne return[/b] disables all threads that have the predicate set to 0, which actually correspond to threads with their idx < N!

Shouldn't it be [b]@$p0.eq return[/b], so that all threads with idx > N are disabled?

Thanks.

Murtaza
[/quote]


Any comments please?

Thanks.
[quote name='murtazam' date='02 July 2012 - 03:11 PM' timestamp='1341259908' post='1429098']

Hi,



I am also wondering on one more issue with the assembly generated. The set.le.s32 $p0|$o127, s[0x0018], $r0 instruction is basically doing the comparison of idx and N. According to the instruction, if N<idx, the predicate $p0 is set to 1. However the next predicate instruction @$p0.ne return disables all threads that have the predicate set to 0, which actually correspond to threads with their idx < N!



Shouldn't it be @$p0.eq return, so that all threads with idx > N are disabled?



Thanks.



Murtaza







Any comments please?



Thanks.

#10
Posted 07/02/2012 09:32 PM   
Scroll To Top