Performance goes down when our kernel function runs on Xavier compared to GeForce 940MX

Our kernel function costs approximate 2.90ms on GeForce 940MX.

Howerver,on Xavier time-cost remain between 5.45ms-5.78ms no matter how the grid and block size arrangement change

It’s seems that Xavier is more powerful than GeForce 940MX , so it does’t make sense to get this result. What’s could be wrong?

GeForce 940MX device properties:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 940MX"
  CUDA Driver Version / Runtime Version          9.1 / 9.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2004 MBytes (2101870592 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1189 MHz (1.19 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

Xavier device properties:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Xavier"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    7.2
  Total amount of global memory:                 15819 MBytes (16587653120 bytes)
  ( 8) Multiprocessors, ( 64) CUDA Cores/MP:     512 CUDA Cores
  GPU Max Clock rate:                            1500 MHz (1.50 GHz)
  Memory Clock rate:                             1500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

Hi,

Have you maximized the device performance first?

sudo jetson_clocks

Thanks.

Yes,we did.

We found that it’s may be caused by the low efficiency of the SIMD instruction “__vsadu4” that we used in the code.

We did the the following substitution test:

We replaced the __vsadu4 instruction with __vsub4 instruction and got a reversal.

6.41 ms on GeForce 940MX

1.76 ms on Xavier

The result seems to be inconsistent with the instruction-throughput table [url]Programming Guide :: CUDA Toolkit Documentation

Please check this issue.

Hi,

Another possible issue the compiling option.
Have you specified the Xavier capacity when compiling? Xavier is sm=72.

Thanks.

Yes,we have.

Actually, We digged out more details using cuobjdump tool.

Firstly, Let’s see the SAD dump

__vsadu4 code:

__global__ void usad4Cmbn(unsigned int *A, unsigned int *B, unsigned int *C)
  {
             C[threadIdx.x]=__vsadu4(A[threadIdx.x],B[threadIdx.x]);
  }

__vsadu4 dump:

code for sm_30
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)""
                                                                                 /* 0x22c04282c2804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x2c00000084001c04 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x180000001001dde2 */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x4001400500009c43 */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x208e80051000dc43 */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x4001400520011c43 */
        /*0038*/                   LD.E R2, [R2];                                /* 0x8400000000209c85 */
                                                                                 /* 0x22f04283f2c04287 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x208e800530015c43 */
        /*0050*/                   LD.E R4, [R4];                                /* 0x8400000000411c85 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x4001400540019c43 */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x208e80055001dc43 */
        /*0068*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ;           /* 0x89fec44c10201c04 */
        /*0070*/                   ST.E [R6], R0;                                /* 0x9400000000601c85 */
        /*0078*/                   EXIT;                                         /* 0x8000000000001de7 */
        /*0080*/                   BRA 0x80;                                     /* 0x4003ffffe0001de7 */
        /*0088*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0090*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0098*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00a0*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00a8*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00b0*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00b8*/                   NOP;                                          /* 0x4000000000001de4 */
  .............................



Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_35
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)""
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x74000000021fc01e */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x92101c00289c000e */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x60c40800291c0012 */
        /*0038*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
                                                                                 /* 0x08bc10a0fcb010a0 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x92101c00299c0016 */
        /*0050*/                   LD.E R4, [R4];                                /* 0xc4800000001c1010 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c001a */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x92101c002a9c001e */
        /*0068*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ;           /* 0x81c3fda2021c0802 */
        /*0070*/                   ST.E [R6], R0;                                /* 0xe4800000001c1800 */
        /*0078*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0080*/                   BRA 0x80;                                     /* 0x12007ffffc1c003c */
        /*0088*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0090*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0098*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                                          /* 0x85800000001c3c02 */
  .............................



Fatbin elf code:
================
arch = sm_37
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_37
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM37 EF_CUDA_PTX_SM(EF_CUDA_SM37)""
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x74000000021fc01e */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x92101c00289c000e */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x60c40800291c0012 */
        /*0038*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
                                                                                 /* 0x08bc10a0fcb010a0 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x92101c00299c0016 */
        /*0050*/                   LD.E R4, [R4];                                /* 0xc4800000001c1010 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c001a */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x92101c002a9c001e */
        /*0068*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ;           /* 0x81c3fda2021c0802 */
        /*0070*/                   ST.E [R6], R0;                                /* 0xe4800000001c1800 */
        /*0078*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0080*/                   BRA 0x80;                                     /* 0x12007ffffc1c003c */
        /*0088*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0090*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0098*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                                          /* 0x85800000001c3c02 */
  .............................



Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_50
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)""
                                                                          /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                 /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                     /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                /* 0x3848000000270006 */
                                                                          /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R0, R0, 0x1e ;                 /* 0x3828000001e70000 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;  /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R0.reuse, c[0x0][0x144] ;   /* 0x4c10080005170003 */
                                                                          /* 0x001f8800fec007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;        /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                          /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R0, c[0x0][0x14c] ;         /* 0x4c10080005370005 */
                                                                          /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                       /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;        /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R0, c[0x0][0x154] ;         /* 0x4c10080005570007 */
                                                                          /* 0x001ffc00fe2107f2 */
        /*0088*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ ;   /* 0x50387fb440470200 */
        /*0090*/                   STG.E [R6], R0 ;                       /* 0xeedc200000070600 */
        /*0098*/                   EXIT ;                                 /* 0xe30000000007000f */
                                                                          /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                             /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                   /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                   /* 0x50b0000000070f00 */
  .............................



Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_52
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)""
                                                                          /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                 /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                     /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                /* 0x3848000000270006 */
                                                                          /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R0, R0, 0x1e ;                 /* 0x3828000001e70000 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;  /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R0.reuse, c[0x0][0x144] ;   /* 0x4c10080005170003 */
                                                                          /* 0x001f8800fec007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;        /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                          /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R0, c[0x0][0x14c] ;         /* 0x4c10080005370005 */
                                                                          /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                       /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;        /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R0, c[0x0][0x154] ;         /* 0x4c10080005570007 */
                                                                          /* 0x001ffc00fe2107f2 */
        /*0088*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ ;   /* 0x50387fb440470200 */
        /*0090*/                   STG.E [R6], R0 ;                       /* 0xeedc200000070600 */
        /*0098*/                   EXIT ;                                 /* 0xe30000000007000f */
                                                                          /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                             /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                   /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                   /* 0x50b0000000070f00 */
  .............................



Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_60
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)""
                                                                         /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                    /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;               /* 0x3848000000270006 */
                                                                         /* 0x001f8800fcc007e5 */
        /*0028*/                   SHR.U32 R0, R0, 0x1e ;                /* 0x3828000001e70000 */
        /*0030*/                   IADD R2.CC, R6, c[0x0][0x140] ;       /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R0, c[0x0][0x144] ;        /* 0x4c10080005170003 */
                                                                         /* 0x001f8800fec007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;       /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                         /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R0, c[0x0][0x14c] ;        /* 0x4c10080005370005 */
                                                                         /* 0x001fdc00fcc007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                      /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;       /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R0, c[0x0][0x154] ;        /* 0x4c10080005570007 */
                                                                         /* 0x001ffc00fe2107e2 */
        /*0088*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ ;  /* 0x50387fb440470200 */
        /*0090*/                   STG.E [R6], R0 ;                      /* 0xeedc200000070600 */
        /*0098*/                   EXIT ;                                /* 0xe30000000007000f */
                                                                         /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                            /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                  /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                  /* 0x50b0000000070f00 */
  .............................



Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_61
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)""
                                                                         /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                    /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;               /* 0x3848000000270006 */
                                                                         /* 0x001f8800fcc007e5 */
        /*0028*/                   SHR.U32 R0, R0, 0x1e ;                /* 0x3828000001e70000 */
        /*0030*/                   IADD R2.CC, R6, c[0x0][0x140] ;       /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R0, c[0x0][0x144] ;        /* 0x4c10080005170003 */
                                                                         /* 0x001f8800fec007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;       /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                         /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R0, c[0x0][0x14c] ;        /* 0x4c10080005370005 */
                                                                         /* 0x001fdc00fcc007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                      /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;       /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R0, c[0x0][0x154] ;        /* 0x4c10080005570007 */
                                                                         /* 0x001ffc00fe2107e2 */
        /*0088*/                   VABSDIFF4.U8.U8.ACC R0, R2, R4, RZ ;  /* 0x50387fb440470200 */
        /*0090*/                   STG.E [R6], R0 ;                      /* 0xeedc200000070600 */
        /*0098*/                   EXIT ;                                /* 0xe30000000007000f */
                                                                         /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                            /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                  /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                  /* 0x50b0000000070f00 */
  .............................



Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_70
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)""
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                /* 0x00000a00ff017624 */
                                                                                          /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                          /* 0x000000fffffff389 */
                                                                                          /* 0x000fe200000e00ff */
        /*0020*/                   S2R R2, SR_TID.X ;                                     /* 0x0000000000027919 */
                                                                                          /* 0x000e220000002100 */
        /*0030*/                   IMAD.MOV.U32 R0, RZ, RZ, 0x4 ;                         /* 0x00000004ff007424 */
                                                                                          /* 0x000fc800078e00ff */
        /*0040*/                   IMAD.WIDE.U32 R4, R2.reuse, R0.reuse, c[0x0][0x160] ;  /* 0x0000580002047625 */
                                                                                          /* 0x0c1fe400078e0000 */
        /*0050*/                   IMAD.WIDE.U32 R6, R2, R0, c[0x0][0x168] ;              /* 0x00005a0002067625 */
                                                                                          /* 0x000fd000078e0000 */
        /*0060*/                   LDG.E.SYS R4, [R4] ;                                   /* 0x0000000004047381 */
                                                                                          /* 0x000ea800001ee900 */
        /*0070*/                   LDG.E.SYS R6, [R6] ;                                   /* 0x0000000006067381 */
                                                                                          /* 0x000ea400001ee900 */
        /*0080*/                   PRMT R3, R4.reuse, 0x3210, R6.reuse ;                  /* 0x0000321004037816 */
                                                                                          /* 0x144fe40000000006 */
        /*0090*/                   PRMT R8, R4, 0x7654, R6 ;                              /* 0x0000765404087816 */
                                                                                          /* 0x000fe40000000006 */
        /*00a0*/                   PRMT R9, R3, 0x7650, RZ ;                              /* 0x0000765003097816 */
                                                                                          /* 0x000fc400000000ff */
        /*00b0*/                   PRMT R12, R8.reuse, 0x7650, RZ ;                       /* 0x00007650080c7816 */
                                                                                          /* 0x040fe400000000ff */
        /*00c0*/                   PRMT R10, R3.reuse, 0x7651, RZ ;                       /* 0x00007651030a7816 */
                                                                                          /* 0x040fe400000000ff */
        /*00d0*/                   PRMT R11, R3, 0x7652, RZ ;                             /* 0x00007652030b7816 */
                                                                                          /* 0x000fe200000000ff */
        /*00e0*/                   IMAD.IADD R9, R9, 0x1, -R12 ;                          /* 0x0000000109097824 */
                                                                                          /* 0x000fe200078e0a0c */
        /*00f0*/                   PRMT R13, R8.reuse, 0x7651, RZ ;                       /* 0x00007651080d7816 */
                                                                                          /* 0x040fe400000000ff */
        /*0100*/                   PRMT R14, R8, 0x7652, RZ ;                             /* 0x00007652080e7816 */
                                                                                          /* 0x000fe400000000ff */
        /*0110*/                   PRMT R3, R3, 0x7653, RZ ;                              /* 0x0000765303037816 */
                                                                                          /* 0x000fe200000000ff */
        /*0120*/                   IMAD.IADD R10, R10, 0x1, -R13 ;                        /* 0x000000010a0a7824 */
                                                                                          /* 0x000fe200078e0a0d */
        /*0130*/                   PRMT R8, R8, 0x7653, RZ ;                              /* 0x0000765308087816 */
                                                                                          /* 0x000fe200000000ff */
        /*0140*/                   IMAD.IADD R11, R11, 0x1, -R14 ;                        /* 0x000000010b0b7824 */
                                                                                          /* 0x000fe200078e0a0e */
        /*0150*/                   IABS R9, R9 ;                                          /* 0x0000000900097213 */
                                                                                          /* 0x000fc40000000000 */
        /*0160*/                   IABS R5, R10 ;                                         /* 0x0000000a00057213 */
                                                                                          /* 0x000fe20000000000 */
        /*0170*/                   IMAD.IADD R3, R3, 0x1, -R8 ;                           /* 0x0000000103037824 */
                                                                                          /* 0x000fe200078e0a08 */
        /*0180*/                   IABS R4, R11 ;                                         /* 0x0000000b00047213 */
                                                                                          /* 0x000fc80000000000 */
        /*0190*/                   IABS R6, R3 ;                                          /* 0x0000000300067213 */
                                                                                          /* 0x000fe20000000000 */
        /*01a0*/                   IMAD.MOV.U32 R3, RZ, RZ, R9 ;                          /* 0x000000ffff037224 */
                                                                                          /* 0x000fca00078e0009 */
        /*01b0*/                   IADD3 R4, R4, R3, R5 ;                                 /* 0x0000000304047210 */
                                                                                          /* 0x000fe20007ffe005 */
        /*01c0*/                   IMAD.WIDE.U32 R2, R2, R0, c[0x0][0x170] ;              /* 0x00005c0002027625 */
                                                                                          /* 0x000fc800078e0000 */
        /*01d0*/                   IMAD.IADD R4, R4, 0x1, R6 ;                            /* 0x0000000104047824 */
                                                                                          /* 0x000fd000078e0206 */
        /*01e0*/                   STG.E.SYS [R2], R4 ;                                   /* 0x0000000402007386 */
                                                                                          /* 0x000fe2000010e900 */
        /*01f0*/                   EXIT ;                                                 /* 0x000000000000794d */
                                                                                          /* 0x000fea0003800000 */
        /*0200*/                   BRA 0x200;                                             /* 0xfffffff000007947 */
                                                                                          /* 0x000fc0000383ffff */
        /*0210*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
        /*0220*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
        /*0230*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
        /*0240*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
        /*0250*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
        /*0260*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
        /*0270*/                   NOP;                                                   /* 0x0000000000007918 */
                                                                                          /* 0x000fc00000000000 */
  .............................



Fatbin elf code:
================
arch = sm_72
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

 code for sm_72
  Function : _Z9usad4CmbnPjS_S_
 .headerflags    @""EF_CUDA_SM72 EF_CUDA_PTX_SM(EF_CUDA_SM72)""
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                /* 0x00000a00ff017624 */
                                                                                          /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                          /* 0x000000fffffff389 */
                                                                                          /* 0x000fe200000e00ff */
        /*0020*/                   S2R R2, SR_TID.X ;                                     /* 0x0000000000027919 */
                                                                                          /* 0x000e220000002100 */
        /*0030*/                   IMAD.MOV.U32 R0, RZ, RZ, 0x4 ;                         /* 0x00000004ff007424 */
                                                                                          /* 0x000fc800078e00ff */
        /*0040*/                   IMAD.WIDE.U32 R4, R2.reuse, R0.reuse, c[0x0][0x160] ;  /* 0x0000580002047625 */
                                                                                          /* 0x0c1fe400078e0000 */
        /*0050*/                   IMAD.WIDE.U32 R6, R2, R0, c[0x0][0x168] ;              /* 0x00005a0002067625 */
                                                                                          /* 0x000fd000078e0000 */
        /*0060*/                   LDG.E.SYS R4, [R4] ;                                   /* 0x0000000004047381 */
                                                                                          /* 0x000ea800001ee900 */
        /*0070*/                   LDG.E.SYS R6, [R6] ;                                   /* 0x0000000006067381 */
                                                                                          /* 0x000ea400001ee900 */
        /*0080*/                   PRMT R3, R4.reuse, 0x3210, R6.reuse ;                  /* 0x0000321004037816 */
                                                                                          /* 0x144fe40000000006 */
        /*0090*/                   PRMT R8, R4, 0x7654, R6 ;                              /* 0x0000765404087816 */
                                                                                          /* 0x000fe40000000006 */
        /*00a0*/                   PRMT R9, R3, 0x7650, RZ ;                              /* 0x0000765003097816 */
                                                                                          /* 0x000fc400000000ff */
        /*00b0*/                   PRMT R12, R8.reuse, 0x7650, RZ ;                       /* 0x00007650080c7816 */
                                                                                          /* 0x040fe400000000ff */
        /*00c0*/                   PRMT R10, R3.reuse, 0x7651, RZ ;                       /* 0x00007651030a7816 */
                                                                                          /* 0x040fe400000000ff */
        /*00d0*/                   PRMT R11, R3, 0x7652, RZ ;                             /* 0x00007652030b7816 */
                                                                                          /* 0x000fe200000000ff */
        /*00e0*/                   IMAD.IADD R9, R9, 0x1, -R12 ;                          /* 0x0000000109097824 */
                                                                                          /* 0x000fe200078e0a0c */
        /*00f0*/                   PRMT R13, R8.reuse, 0x7651, RZ ;                       /* 0x00007651080d7816 */
                                                                                          /* 0x040fe400000000ff */
        /*0100*/

Secondly ,I’d like to decompose the SAD operation into three SIMD instructions to share more info.

Subtraction code:

__global__ void usad4Cmbn(unsigned int *A, unsigned int *B, unsigned int *C)
  {
              C[threadIdx.x]=__vsub4(A[threadIdx.x],B[threadIdx.x]);
  }

Subtraction dump:

arch = sm_30
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_30
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                                 /* 0x22c04282c2804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x2c00000084001c04 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x180000001001dde2 */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x4001400500009c43 */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x208e80051000dc43 */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x4001400520011c43 */
        /*0038*/                   LD.E R2, [R2];                                /* 0x8400000000209c85 */
                                                                                 /* 0x22f04283f2c04287 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x208e800530015c43 */
        /*0050*/                   LD.E R4, [R4];                                /* 0x8400000000411c85 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x4001400540019c43 */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x208e80055001dc43 */
        /*0068*/                   VADD4.UD.U8.U8 R0, R2, -R4, RZ;               /* 0x81fe844c10201c84 */
        /*0070*/                   ST.E [R6], R0;                                /* 0x9400000000601c85 */
        /*0078*/                   EXIT;                                         /* 0x8000000000001de7 */
        /*0080*/                   BRA 0x80;                                     /* 0x4003ffffe0001de7 */
        /*0088*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0090*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0098*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00a0*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00a8*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00b0*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00b8*/                   NOP;                                          /* 0x4000000000001de4 */
		.............................

Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_35
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x74000000021fc01e */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x92101c00289c000e */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x60c40800291c0012 */
        /*0038*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
                                                                                 /* 0x08bc10a0fcb010a0 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x92101c00299c0016 */
        /*0050*/                   LD.E R4, [R4];                                /* 0xc4800000001c1010 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c001a */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x92101c002a9c001e */
        /*0068*/                   VADD4.UD.U8.U8 R0, R2, -R4, RZ;               /* 0xd4c3fda2021c0801 */
        /*0070*/                   ST.E [R6], R0;                                /* 0xe4800000001c1800 */
        /*0078*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0080*/                   BRA 0x80;                                     /* 0x12007ffffc1c003c */
        /*0088*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0090*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0098*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                                          /* 0x85800000001c3c02 */
		.............................

Fatbin elf code:
================
arch = sm_37
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_37
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM37 EF_CUDA_PTX_SM(EF_CUDA_SM37)"
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x74000000021fc01e */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x92101c00289c000e */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x60c40800291c0012 */
        /*0038*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
                                                                                 /* 0x08bc10a0fcb010a0 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x92101c00299c0016 */
        /*0050*/                   LD.E R4, [R4];                                /* 0xc4800000001c1010 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c001a */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x92101c002a9c001e */
        /*0068*/                   VADD4.UD.U8.U8 R0, R2, -R4, RZ;               /* 0xd4c3fda2021c0801 */
        /*0070*/                   ST.E [R6], R0;                                /* 0xe4800000001c1800 */
        /*0078*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0080*/                   BRA 0x80;                                     /* 0x12007ffffc1c003c */
        /*0088*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0090*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0098*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                                          /* 0x85800000001c3c02 */
		.............................

Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_50
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9400fe2307f1 */
        /*0088*/                   LOP32I.OR R0, R2.reuse, 0x80808080 ;      /* 0x0428080808070200 */
        /*0090*/                   LOP32I.AND R9, R4, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R2, c[0x2][0x0], R4, 0x84 ;  /* 0x0284020800070208 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, -R9 ;                        /* 0x5c11000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_52
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9400fe2307f1 */
        /*0088*/                   LOP32I.OR R0, R2.reuse, 0x80808080 ;      /* 0x0428080808070200 */
        /*0090*/                   LOP32I.AND R9, R4, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R2, c[0x2][0x0], R4, 0x84 ;  /* 0x0284020800070208 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, -R9 ;                        /* 0x5c11000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_60
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9400fe2307f1 */
        /*0088*/                   LOP32I.OR R0, R2.reuse, 0x80808080 ;      /* 0x0428080808070200 */
        /*0090*/                   LOP32I.AND R9, R4, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R2, c[0x2][0x0], R4, 0x84 ;  /* 0x0284020800070208 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, -R9 ;                        /* 0x5c11000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_61
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9400fe2307f1 */
        /*0088*/                   LOP32I.OR R0, R2.reuse, 0x80808080 ;      /* 0x0428080808070200 */
        /*0090*/                   LOP32I.AND R9, R4, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R2, c[0x2][0x0], R4, 0x84 ;  /* 0x0284020800070208 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, -R9 ;                        /* 0x5c11000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................

Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_70
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                 /* 0x00000a00ff017624 */
                                                                                           /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                           /* 0x000000fffffff389 */
                                                                                           /* 0x000fe200000e00ff */
        /*0020*/                   S2R R7, SR_TID.X ;                                      /* 0x0000000000077919 */
                                                                                           /* 0x000e220000002100 */
        /*0030*/                   MOV R10, 0x4 ;                                          /* 0x00000004000a7802 */
                                                                                           /* 0x000fca0000000f00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R7.reuse, R10.reuse, c[0x0][0x160] ;  /* 0x0000580007027625 */
                                                                                           /* 0x0c1fe400078e000a */
        /*0050*/                   IMAD.WIDE.U32 R4, R7, R10, c[0x0][0x168] ;              /* 0x00005a0007047625 */
                                                                                           /* 0x000fd000078e000a */
        /*0060*/                   LDG.E.SYS R2, [R2] ;                                    /* 0x0000000002027381 */
                                                                                           /* 0x000ea800001ee900 */
        /*0070*/                   LDG.E.SYS R4, [R4] ;                                    /* 0x0000000004047381 */
                                                                                           /* 0x000ee200001ee900 */
        /*0080*/                   LOP3.LUT R0, R2, 0x80808080, RZ, 0xfc, !PT ;            /* 0x8080808002007812 */
                                                                                           /* 0x004fe400078efcff */
        /*0090*/                   LOP3.LUT R6, R4, 0x7f7f7f7f, RZ, 0xc0, !PT ;            /* 0x7f7f7f7f04067812 */
                                                                                           /* 0x008fe400078ec0ff */
        /*00a0*/                   LOP3.LUT R8, R2, 0x80808080, R4, 0x84, !PT ;            /* 0x8080808002087812 */
                                                                                           /* 0x000fc600078e8404 */
        /*00b0*/                   IMAD.IADD R9, R0, 0x1, -R6 ;                            /* 0x0000000100097824 */
                                                                                           /* 0x000fe400078e0a06 */
        /*00c0*/                   IMAD.WIDE.U32 R6, R7, R10, c[0x0][0x170] ;              /* 0x00005c0007067625 */
                                                                                           /* 0x000fc600078e000a */
        /*00d0*/                   LOP3.LUT R8, R9, R8, RZ, 0x3c, !PT ;                    /* 0x0000000809087212 */
                                                                                           /* 0x000fd000078e3cff */
        /*00e0*/                   STG.E.SYS [R6], R8 ;                                    /* 0x0000000806007386 */
                                                                                           /* 0x000fe2000010e900 */
        /*00f0*/                   EXIT ;                                                  /* 0x000000000000794d */
                                                                                           /* 0x000fea0003800000 */
        /*0100*/                   BRA 0x100;                                              /* 0xfffffff000007947 */
                                                                                           /* 0x000fc0000383ffff */
        /*0110*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0120*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0130*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0140*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0150*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0160*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0170*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
		.............................

Fatbin elf code:
================
arch = sm_72
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_72
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM72 EF_CUDA_PTX_SM(EF_CUDA_SM72)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                 /* 0x00000a00ff017624 */
                                                                                           /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                           /* 0x000000fffffff389 */
                                                                                           /* 0x000fe200000e00ff */
        /*0020*/                   S2R R7, SR_TID.X ;                                      /* 0x0000000000077919 */
                                                                                           /* 0x000e220000002100 */
        /*0030*/                   MOV R10, 0x4 ;                                          /* 0x00000004000a7802 */
                                                                                           /* 0x000fca0000000f00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R7.reuse, R10.reuse, c[0x0][0x160] ;  /* 0x0000580007027625 */
                                                                                           /* 0x0c1fe400078e000a */
        /*0050*/                   IMAD.WIDE.U32 R4, R7, R10, c[0x0][0x168] ;              /* 0x00005a0007047625 */
                                                                                           /* 0x000fd000078e000a */
        /*0060*/                   LDG.E.SYS R2, [R2] ;                                    /* 0x0000000002027381 */
                                                                                           /* 0x000ea800001ee900 */
        /*0070*/                   LDG.E.SYS R4, [R4] ;                                    /* 0x0000000004047381 */
                                                                                           /* 0x000ee200001ee900 */
        /*0080*/                   LOP3.LUT R0, R2, 0x80808080, RZ, 0xfc, !PT ;            /* 0x8080808002007812 */
                                                                                           /* 0x004fe400078efcff */
        /*0090*/                   LOP3.LUT R6, R4, 0x7f7f7f7f, RZ, 0xc0, !PT ;            /* 0x7f7f7f7f04067812 */
                                                                                           /* 0x008fe400078ec0ff */
        /*00a0*/                   LOP3.LUT R8, R2, 0x80808080, R4, 0x84, !PT ;            /* 0x8080808002087812 */
                                                                                           /* 0x000fc600078e8404 */
        /*00b0*/                   IMAD.IADD R9, R0, 0x1, -R6 ;                            /* 0x0000000100097824 */
                                                                                           /* 0x000fe400078e0a06 */
        /*00c0*/                   IMAD.WIDE.U32 R6, R7, R10, c[0x0][0x170] ;              /* 0x00005c0007067625 */
                                                                                           /* 0x000fc600078e000a */
        /*00d0*/                   LOP3.LUT R8, R9, R8, RZ, 0x3c, !PT ;                    /* 0x0000000809087212 */
                                                                                           /* 0x000fd000078e3cff */
        /*00e0*/                   STG.E.SYS [R6], R8 ;                                    /* 0x0000000806007386 */
                                                                                           /* 0x000fe2000010e900 */
        /*00f0*/                   EXIT ;                                                  /* 0x000000000000794d */
                                                                                           /* 0x000fea0003800000 */
        /*0100*/                   BRA 0x100;                                              /* 0xfffffff000007947 */
                                                                                           /* 0x000fc0000383ffff */
        /*0110*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0120*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0130*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0140*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0150*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0160*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0170*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
		.............................

Abs. code:

__global__ void usad4Cmbn(unsigned int *A, unsigned int *B, unsigned int *C)
  {
              C[threadIdx.x]=__vabs4(A[threadIdx.x],B[threadIdx.x]);
  }

Abs. dump:

Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_30
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                                 /* 0x22c04282c2804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x2c00000084001c04 */
        /*0018*/                   MOV32I R5, 0x4;                               /* 0x1800000010015de2 */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x4001400500009c43 */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R5, c[0x0][0x144];  /* 0x208a80051000dc43 */
        /*0030*/                   LD.E R2, [R2];                                /* 0x8400000000209c85 */
        /*0038*/                   ISCADD R4.CC, R0, c[0x0][0x150], 0x2;         /* 0x4001400540011c43 */
                                                                                 /* 0x20000002f04283f7 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R5, c[0x0][0x154];  /* 0x208a800550015c43 */
        /*0050*/                   VABSDIFF4.SD R0, R2, RZ, RZ;                  /* 0x8bfe844cfc201c64 */
        /*0058*/                   ST.E [R4], R0;                                /* 0x9400000000401c85 */
        /*0060*/                   EXIT;                                         /* 0x8000000000001de7 */
        /*0068*/                   BRA 0x68;                                     /* 0x4003ffffe0001de7 */
        /*0070*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0078*/                   NOP;                                          /* 0x4000000000001de4 */
		.............................



Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_35
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R5, 0x4;                               /* 0x74000000021fc016 */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R5, c[0x0][0x144];  /* 0x92101400289c000e */
        /*0030*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
        /*0038*/                   ISCADD R4.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c0012 */
                                                                                 /* 0x08000000bc10a0fc */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R5, c[0x0][0x154];  /* 0x921014002a9c0016 */
        /*0050*/                   VABSDIFF4.SD R0, R2, RZ, RZ;                  /* 0x80dbffa27f9c0802 */
        /*0058*/                   ST.E [R4], R0;                                /* 0xe4800000001c1000 */
        /*0060*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0068*/                   BRA 0x68;                                     /* 0x12007ffffc1c003c */
        /*0070*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0078*/                   NOP;                                          /* 0x85800000001c3c02 */
		.............................



Fatbin elf code:
================
arch = sm_37
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_37
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM37 EF_CUDA_PTX_SM(EF_CUDA_SM37)"
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R5, 0x4;                               /* 0x74000000021fc016 */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R5, c[0x0][0x144];  /* 0x92101400289c000e */
        /*0030*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
        /*0038*/                   ISCADD R4.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c0012 */
                                                                                 /* 0x08000000bc10a0fc */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R5, c[0x0][0x154];  /* 0x921014002a9c0016 */
        /*0050*/                   VABSDIFF4.SD R0, R2, RZ, RZ;                  /* 0x80dbffa27f9c0802 */
        /*0058*/                   ST.E [R4], R0;                                /* 0xe4800000001c1000 */
        /*0060*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0068*/                   BRA 0x68;                                     /* 0x12007ffffc1c003c */
        /*0070*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0078*/                   NOP;                                          /* 0x85800000001c3c02 */
		.............................



Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_50
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                          /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                 /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                     /* 0xf0c8000002170000 */
        /*0018*/                   SHL R4, R0.reuse, 0x2 ;                /* 0x3848000000270004 */
                                                                          /* 0x001fc840fec007f5 */
        /*0028*/                   SHR.U32 R5, R0, 0x1e ;                 /* 0x3828000001e70005 */
        /*0030*/                   IADD R2.CC, R4.reuse, c[0x0][0x140] ;  /* 0x4c10800005070402 */
        /*0038*/                   IADD.X R3, R5, c[0x0][0x144] ;         /* 0x4c10080005170503 */
                                                                          /* 0x001fdc00fcc007b1 */
        /*0048*/                   LDG.E R2, [R2] ;                       /* 0xeed4200000070202 */
        /*0050*/                   IADD R4.CC, R4, c[0x0][0x150] ;        /* 0x4c10800005470404 */
        /*0058*/                   IADD.X R5, R5, c[0x0][0x154] ;         /* 0x4c10080005570505 */
                                                                          /* 0x001fd800fe2307f6 */
        /*0068*/                   PRMT R0, R2.reuse, 0xba98, RZ ;        /* 0x36c07f8ba9870200 */
        /*0070*/                   LOP.XOR R6, R2, R0 ;                   /* 0x5c47040000070206 */
        /*0078*/                   LOP32I.AND R0, R0, 0x1010101 ;         /* 0x0400101010170000 */
                                                                          /* 0x001ffc00fe2007f2 */
        /*0088*/                   IADD R6, R6, R0 ;                      /* 0x5c10000000070606 */
        /*0090*/                   STG.E [R4], R6 ;                       /* 0xeedc200000070406 */
        /*0098*/                   EXIT ;                                 /* 0xe30000000007000f */
                                                                          /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                             /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                   /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                   /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_52
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                          /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                 /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                     /* 0xf0c8000002170000 */
        /*0018*/                   SHL R4, R0.reuse, 0x2 ;                /* 0x3848000000270004 */
                                                                          /* 0x001fc840fec007f5 */
        /*0028*/                   SHR.U32 R5, R0, 0x1e ;                 /* 0x3828000001e70005 */
        /*0030*/                   IADD R2.CC, R4.reuse, c[0x0][0x140] ;  /* 0x4c10800005070402 */
        /*0038*/                   IADD.X R3, R5, c[0x0][0x144] ;         /* 0x4c10080005170503 */
                                                                          /* 0x001fdc00fcc007b1 */
        /*0048*/                   LDG.E R2, [R2] ;                       /* 0xeed4200000070202 */
        /*0050*/                   IADD R4.CC, R4, c[0x0][0x150] ;        /* 0x4c10800005470404 */
        /*0058*/                   IADD.X R5, R5, c[0x0][0x154] ;         /* 0x4c10080005570505 */
                                                                          /* 0x001fd800fe2307f6 */
        /*0068*/                   PRMT R0, R2.reuse, 0xba98, RZ ;        /* 0x36c07f8ba9870200 */
        /*0070*/                   LOP.XOR R6, R2, R0 ;                   /* 0x5c47040000070206 */
        /*0078*/                   LOP32I.AND R0, R0, 0x1010101 ;         /* 0x0400101010170000 */
                                                                          /* 0x001ffc00fe2007f2 */
        /*0088*/                   IADD R6, R6, R0 ;                      /* 0x5c10000000070606 */
        /*0090*/                   STG.E [R4], R6 ;                       /* 0xeedc200000070406 */
        /*0098*/                   EXIT ;                                 /* 0xe30000000007000f */
                                                                          /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                             /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                                   /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                   /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_60
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                    /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R4, R0.reuse, 0x2 ;          /* 0x3848000000270004 */
                                                                    /* 0x001f8800fcc007e5 */
        /*0028*/                   SHR.U32 R5, R0, 0x1e ;           /* 0x3828000001e70005 */
        /*0030*/                   IADD R2.CC, R4, c[0x0][0x140] ;  /* 0x4c10800005070402 */
        /*0038*/                   IADD.X R3, R5, c[0x0][0x144] ;   /* 0x4c10080005170503 */
                                                                    /* 0x001fdc00fcc007b1 */
        /*0048*/                   LDG.E R2, [R2] ;                 /* 0xeed4200000070202 */
        /*0050*/                   IADD R4.CC, R4, c[0x0][0x150] ;  /* 0x4c10800005470404 */
        /*0058*/                   IADD.X R5, R5, c[0x0][0x154] ;   /* 0x4c10080005570505 */
                                                                    /* 0x001f9800fc2107e6 */
        /*0068*/                   PRMT R0, R2, 0xba98, RZ ;        /* 0x36c07f8ba9870200 */
        /*0070*/                   LOP.XOR R6, R2, R0 ;             /* 0x5c47040000070206 */
        /*0078*/                   LOP32I.AND R0, R0, 0x1010101 ;   /* 0x0400101010170000 */
                                                                    /* 0x001ffc00fe2007e2 */
        /*0088*/                   IADD R6, R6, R0 ;                /* 0x5c10000000070606 */
        /*0090*/                   STG.E [R4], R6 ;                 /* 0xeedc200000070406 */
        /*0098*/                   EXIT ;                           /* 0xe30000000007000f */
                                                                    /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                       /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                             /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_61
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                    /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R4, R0.reuse, 0x2 ;          /* 0x3848000000270004 */
                                                                    /* 0x001f8800fcc007e5 */
        /*0028*/                   SHR.U32 R5, R0, 0x1e ;           /* 0x3828000001e70005 */
        /*0030*/                   IADD R2.CC, R4, c[0x0][0x140] ;  /* 0x4c10800005070402 */
        /*0038*/                   IADD.X R3, R5, c[0x0][0x144] ;   /* 0x4c10080005170503 */
                                                                    /* 0x001fdc00fcc007b1 */
        /*0048*/                   LDG.E R2, [R2] ;                 /* 0xeed4200000070202 */
        /*0050*/                   IADD R4.CC, R4, c[0x0][0x150] ;  /* 0x4c10800005470404 */
        /*0058*/                   IADD.X R5, R5, c[0x0][0x154] ;   /* 0x4c10080005570505 */
                                                                    /* 0x001f9800fc2107e6 */
        /*0068*/                   PRMT R0, R2, 0xba98, RZ ;        /* 0x36c07f8ba9870200 */
        /*0070*/                   LOP.XOR R6, R2, R0 ;             /* 0x5c47040000070206 */
        /*0078*/                   LOP32I.AND R0, R0, 0x1010101 ;   /* 0x0400101010170000 */
                                                                    /* 0x001ffc00fe2007e2 */
        /*0088*/                   IADD R6, R6, R0 ;                /* 0x5c10000000070606 */
        /*0090*/                   STG.E [R4], R6 ;                 /* 0xeedc200000070406 */
        /*0098*/                   EXIT ;                           /* 0xe30000000007000f */
                                                                    /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                       /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                             /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                             /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_70
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;      /* 0x00000a00ff017624 */
                                                                                /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                /* 0x000000fffffff389 */
                                                                                /* 0x000fe200000e00ff */
        /*0020*/                   S2R R4, SR_TID.X ;                           /* 0x0000000000047919 */
                                                                                /* 0x000e220000002100 */
        /*0030*/                   MOV R5, 0x4 ;                                /* 0x0000000400057802 */
                                                                                /* 0x000fca0000000f00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R4, R5, c[0x0][0x160] ;    /* 0x0000580004027625 */
                                                                                /* 0x001fd400078e0005 */
        /*0050*/                   LDG.E.SYS R2, [R2] ;                         /* 0x0000000002027381 */
                                                                                /* 0x000ea200001ee900 */
        /*0060*/                   IMAD.WIDE.U32 R4, R4, R5, c[0x0][0x170] ;    /* 0x00005c0004047625 */
                                                                                /* 0x000fe200078e0005 */
        /*0070*/                   PRMT R0, R2, 0xba98, RZ ;                    /* 0x0000ba9802007816 */
                                                                                /* 0x004fc800000000ff */
        /*0080*/                   LOP3.LUT R6, R2, R0, RZ, 0x3c, !PT ;         /* 0x0000000002067212 */
                                                                                /* 0x000fe400078e3cff */
        /*0090*/                   LOP3.LUT R0, R0, 0x1010101, RZ, 0xc0, !PT ;  /* 0x0101010100007812 */
                                                                                /* 0x000fca00078ec0ff */
        /*00a0*/                   IMAD.IADD R6, R6, 0x1, R0 ;                  /* 0x0000000106067824 */
                                                                                /* 0x000fd000078e0200 */
        /*00b0*/                   STG.E.SYS [R4], R6 ;                         /* 0x0000000604007386 */
                                                                                /* 0x000fe2000010e900 */
        /*00c0*/                   EXIT ;                                       /* 0x000000000000794d */
                                                                                /* 0x000fea0003800000 */
        /*00d0*/                   BRA 0xd0;                                    /* 0xfffffff000007947 */
                                                                                /* 0x000fc0000383ffff */
        /*00e0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
        /*00f0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
		.............................



Fatbin elf code:
================
arch = sm_72
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_72
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM72 EF_CUDA_PTX_SM(EF_CUDA_SM72)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;      /* 0x00000a00ff017624 */
                                                                                /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                /* 0x000000fffffff389 */
                                                                                /* 0x000fe200000e00ff */
        /*0020*/                   S2R R4, SR_TID.X ;                           /* 0x0000000000047919 */
                                                                                /* 0x000e220000002100 */
        /*0030*/                   MOV R5, 0x4 ;                                /* 0x0000000400057802 */
                                                                                /* 0x000fca0000000f00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R4, R5, c[0x0][0x160] ;    /* 0x0000580004027625 */
                                                                                /* 0x001fd400078e0005 */
        /*0050*/                   LDG.E.SYS R2, [R2] ;                         /* 0x0000000002027381 */
                                                                                /* 0x000ea200001ee900 */
        /*0060*/                   IMAD.WIDE.U32 R4, R4, R5, c[0x0][0x170] ;    /* 0x00005c0004047625 */
                                                                                /* 0x000fe200078e0005 */
        /*0070*/                   PRMT R0, R2, 0xba98, RZ ;                    /* 0x0000ba9802007816 */
                                                                                /* 0x004fc800000000ff */
        /*0080*/                   LOP3.LUT R6, R2, R0, RZ, 0x3c, !PT ;         /* 0x0000000002067212 */
                                                                                /* 0x000fe400078e3cff */
        /*0090*/                   LOP3.LUT R0, R0, 0x1010101, RZ, 0xc0, !PT ;  /* 0x0101010100007812 */
                                                                                /* 0x000fca00078ec0ff */
        /*00a0*/                   IMAD.IADD R6, R6, 0x1, R0 ;                  /* 0x0000000106067824 */
                                                                                /* 0x000fd000078e0200 */
        /*00b0*/                   STG.E.SYS [R4], R6 ;                         /* 0x0000000604007386 */
                                                                                /* 0x000fe2000010e900 */
        /*00c0*/                   EXIT ;                                       /* 0x000000000000794d */
                                                                                /* 0x000fea0003800000 */
        /*00d0*/                   BRA 0xd0;                                    /* 0xfffffff000007947 */
                                                                                /* 0x000fc0000383ffff */
        /*00e0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
        /*00f0*/                   NOP;                                         /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
		.............................

Add. code:

__global__ void usad4Cmbn(unsigned int *A, unsigned int *B, unsigned int *C)
  {
              C[threadIdx.x]=__vadd4(A[threadIdx.x],B[threadIdx.x]);
  }

Add. dump:

arch = sm_30
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_30
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                                 /* 0x22c04282c2804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x2c00000084001c04 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x180000001001dde2 */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x4001400500009c43 */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x208e80051000dc43 */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x4001400520011c43 */
        /*0038*/                   LD.E R2, [R2];                                /* 0x8400000000209c85 */
                                                                                 /* 0x22f04283f2c04287 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x208e800530015c43 */
        /*0050*/                   LD.E R4, [R4];                                /* 0x8400000000411c85 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x4001400540019c43 */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x208e80055001dc43 */
        /*0068*/                   VADD4.UD.U8.U8 R0, R2, R4, RZ;                /* 0x81fe844c10201c04 */
        /*0070*/                   ST.E [R6], R0;                                /* 0x9400000000601c85 */
        /*0078*/                   EXIT;                                         /* 0x8000000000001de7 */
        /*0080*/                   BRA 0x80;                                     /* 0x4003ffffe0001de7 */
        /*0088*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0090*/                   NOP;                                          /* 0x4000000000001de4 */
        /*0098*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00a0*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00a8*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00b0*/                   NOP;                                          /* 0x4000000000001de4 */
        /*00b8*/                   NOP;                                          /* 0x4000000000001de4 */
		.............................



Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_35
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x74000000021fc01e */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x92101c00289c000e */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x60c40800291c0012 */
        /*0038*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
                                                                                 /* 0x08bc10a0fcb010a0 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x92101c00299c0016 */
        /*0050*/                   LD.E R4, [R4];                                /* 0xc4800000001c1010 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c001a */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x92101c002a9c001e */
        /*0068*/                   VADD4.UD.U8.U8 R0, R2, R4, RZ;                /* 0xd0c3fda2021c0801 */
        /*0070*/                   ST.E [R6], R0;                                /* 0xe4800000001c1800 */
        /*0078*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0080*/                   BRA 0x80;                                     /* 0x12007ffffc1c003c */
        /*0088*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0090*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0098*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                                          /* 0x85800000001c3c02 */
		.............................



Fatbin elf code:
================
arch = sm_37
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_37
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM37 EF_CUDA_PTX_SM(EF_CUDA_SM37)"
                                                                                 /* 0x08b010a0b0a010c0 */
        /*0008*/                   MOV R1, c[0x0][0x44];                         /* 0x64c03c00089c0006 */
        /*0010*/                   S2R R0, SR_TID.X;                             /* 0x86400000109c0002 */
        /*0018*/                   MOV32I R7, 0x4;                               /* 0x74000000021fc01e */
        /*0020*/                   ISCADD R2.CC, R0, c[0x0][0x140], 0x2;         /* 0x60c40800281c000a */
        /*0028*/                   IMAD.U32.U32.HI.X R3, R0, R7, c[0x0][0x144];  /* 0x92101c00289c000e */
        /*0030*/                   ISCADD R4.CC, R0, c[0x0][0x148], 0x2;         /* 0x60c40800291c0012 */
        /*0038*/                   LD.E R2, [R2];                                /* 0xc4800000001c0808 */
                                                                                 /* 0x08bc10a0fcb010a0 */
        /*0048*/                   IMAD.U32.U32.HI.X R5, R0, R7, c[0x0][0x14c];  /* 0x92101c00299c0016 */
        /*0050*/                   LD.E R4, [R4];                                /* 0xc4800000001c1010 */
        /*0058*/                   ISCADD R6.CC, R0, c[0x0][0x150], 0x2;         /* 0x60c408002a1c001a */
        /*0060*/                   IMAD.U32.U32.HI.X R7, R0, R7, c[0x0][0x154];  /* 0x92101c002a9c001e */
        /*0068*/                   VADD4.UD.U8.U8 R0, R2, R4, RZ;                /* 0xd0c3fda2021c0801 */
        /*0070*/                   ST.E [R6], R0;                                /* 0xe4800000001c1800 */
        /*0078*/                   EXIT;                                         /* 0x18000000001c003c */
        /*0080*/                   BRA 0x80;                                     /* 0x12007ffffc1c003c */
        /*0088*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0090*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*0098*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                                          /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                                          /* 0x85800000001c3c02 */
		.............................



Fatbin elf code:
================
arch = sm_50
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_50
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9440fe2107f1 */
        /*0088*/                   LOP32I.AND R0, R2, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70200 */
        /*0090*/                   LOP32I.AND R9, R4.reuse, 0x7f7f7f7f ;     /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R4, c[0x2][0x0], R2, 0x48 ;  /* 0x0248010800070408 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, R9 ;                         /* 0x5c10000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_52
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9440fe2107f1 */
        /*0088*/                   LOP32I.AND R0, R2, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70200 */
        /*0090*/                   LOP32I.AND R9, R4.reuse, 0x7f7f7f7f ;     /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R4, c[0x2][0x0], R2, 0x48 ;  /* 0x0248010800070408 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, R9 ;                         /* 0x5c10000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_60
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9440fe2107f1 */
        /*0088*/                   LOP32I.AND R0, R2, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70200 */
        /*0090*/                   LOP32I.AND R9, R4.reuse, 0x7f7f7f7f ;     /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R4, c[0x2][0x0], R2, 0x48 ;  /* 0x0248010800070408 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, R9 ;                         /* 0x5c10000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_61
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                             /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                    /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                        /* 0xf0c8000002170000 */
        /*0018*/                   SHL R6, R0.reuse, 0x2 ;                   /* 0x3848000000270006 */
                                                                             /* 0x081fc840fec007f5 */
        /*0028*/                   SHR.U32 R7, R0, 0x1e ;                    /* 0x3828000001e70007 */
        /*0030*/                   IADD R2.CC, R6.reuse, c[0x0][0x140] ;     /* 0x4c10800005070602 */
        /*0038*/                   IADD.X R3, R7.reuse, c[0x0][0x144] ;      /* 0x4c10080005170703 */
                                                                             /* 0x001f8800f6c007f0 */
        /*0048*/         {         IADD R4.CC, R6, c[0x0][0x148] ;           /* 0x4c10800005270604 */
        /*0050*/                   LDG.E R2, [R2]         }
                                                                             /* 0xeed4200000070202 */
        /*0058*/                   IADD.X R5, R7, c[0x0][0x14c] ;            /* 0x4c10080005370705 */
                                                                             /* 0x001fdc00fec007b1 */
        /*0068*/                   LDG.E R4, [R4] ;                          /* 0xeed4200000070404 */
        /*0070*/                   IADD R6.CC, R6, c[0x0][0x150] ;           /* 0x4c10800005470606 */
        /*0078*/                   IADD.X R7, R7, c[0x0][0x154] ;            /* 0x4c10080005570707 */
                                                                             /* 0x001f9440fe2107f1 */
        /*0088*/                   LOP32I.AND R0, R2, 0x7f7f7f7f ;           /* 0x0407f7f7f7f70200 */
        /*0090*/                   LOP32I.AND R9, R4.reuse, 0x7f7f7f7f ;     /* 0x0407f7f7f7f70409 */
        /*0098*/                   LOP3.LUT R8, R4, c[0x2][0x0], R2, 0x48 ;  /* 0x0248010800070408 */
                                                                             /* 0x001fc400fe4007f6 */
        /*00a8*/                   IADD R9, R0, R9 ;                         /* 0x5c10000000970009 */
        /*00b0*/                   LOP.XOR R8, R9, R8 ;                      /* 0x5c47040000870908 */
        /*00b8*/                   STG.E [R6], R8 ;                          /* 0xeedc200000070608 */
                                                                             /* 0x001f8000ffe007ff */
        /*00c8*/                   EXIT ;                                    /* 0xe30000000007000f */
        /*00d0*/                   BRA 0xd0 ;                                /* 0xe2400fffff87000f */
        /*00d8*/                   NOP;                                      /* 0x50b0000000070f00 */
                                                                             /* 0x001f8000fc0007e0 */
        /*00e8*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f0*/                   NOP;                                      /* 0x50b0000000070f00 */
        /*00f8*/                   NOP;                                      /* 0x50b0000000070f00 */
		.............................



Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_70
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                 /* 0x00000a00ff017624 */
                                                                                           /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                           /* 0x000000fffffff389 */
                                                                                           /* 0x000fe200000e00ff */
        /*0020*/                   S2R R7, SR_TID.X ;                                      /* 0x0000000000077919 */
                                                                                           /* 0x000e220000002100 */
        /*0030*/                   MOV R10, 0x4 ;                                          /* 0x00000004000a7802 */
                                                                                           /* 0x000fca0000000f00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R7.reuse, R10.reuse, c[0x0][0x160] ;  /* 0x0000580007027625 */
                                                                                           /* 0x0c1fe400078e000a */
        /*0050*/                   IMAD.WIDE.U32 R4, R7, R10, c[0x0][0x168] ;              /* 0x00005a0007047625 */
                                                                                           /* 0x000fd000078e000a */
        /*0060*/                   LDG.E.SYS R2, [R2] ;                                    /* 0x0000000002027381 */
                                                                                           /* 0x000ea800001ee900 */
        /*0070*/                   LDG.E.SYS R4, [R4] ;                                    /* 0x0000000004047381 */
                                                                                           /* 0x000ee200001ee900 */
        /*0080*/                   LOP3.LUT R0, R2, 0x7f7f7f7f, RZ, 0xc0, !PT ;            /* 0x7f7f7f7f02007812 */
                                                                                           /* 0x004fe400078ec0ff */
        /*0090*/                   LOP3.LUT R6, R4.reuse, 0x7f7f7f7f, RZ, 0xc0, !PT ;      /* 0x7f7f7f7f04067812 */
                                                                                           /* 0x048fe400078ec0ff */
        /*00a0*/                   LOP3.LUT R8, R4, 0x80808080, R2, 0x48, !PT ;            /* 0x8080808004087812 */
                                                                                           /* 0x000fc600078e4802 */
        /*00b0*/                   IMAD.IADD R9, R0, 0x1, R6 ;                             /* 0x0000000100097824 */
                                                                                           /* 0x000fe400078e0206 */
        /*00c0*/                   IMAD.WIDE.U32 R6, R7, R10, c[0x0][0x170] ;              /* 0x00005c0007067625 */
                                                                                           /* 0x000fc600078e000a */
        /*00d0*/                   LOP3.LUT R8, R9, R8, RZ, 0x3c, !PT ;                    /* 0x0000000809087212 */
                                                                                           /* 0x000fd000078e3cff */
        /*00e0*/                   STG.E.SYS [R6], R8 ;                                    /* 0x0000000806007386 */
                                                                                           /* 0x000fe2000010e900 */
        /*00f0*/                   EXIT ;                                                  /* 0x000000000000794d */
                                                                                           /* 0x000fea0003800000 */
        /*0100*/                   BRA 0x100;                                              /* 0xfffffff000007947 */
                                                                                           /* 0x000fc0000383ffff */
        /*0110*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0120*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0130*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0140*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0150*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0160*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0170*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
		.............................



Fatbin elf code:
================
arch = sm_72
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_72
		Function : _Z9usad4CmbnPjS_S_
	.headerflags    @"EF_CUDA_SM72 EF_CUDA_PTX_SM(EF_CUDA_SM72)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                 /* 0x00000a00ff017624 */
                                                                                           /* 0x000fd000078e00ff */
        /*0010*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                           /* 0x000000fffffff389 */
                                                                                           /* 0x000fe200000e00ff */
        /*0020*/                   S2R R7, SR_TID.X ;                                      /* 0x0000000000077919 */
                                                                                           /* 0x000e220000002100 */
        /*0030*/                   MOV R10, 0x4 ;                                          /* 0x00000004000a7802 */
                                                                                           /* 0x000fca0000000f00 */
        /*0040*/                   IMAD.WIDE.U32 R2, R7.reuse, R10.reuse, c[0x0][0x160] ;  /* 0x0000580007027625 */
                                                                                           /* 0x0c1fe400078e000a */
        /*0050*/                   IMAD.WIDE.U32 R4, R7, R10, c[0x0][0x168] ;              /* 0x00005a0007047625 */
                                                                                           /* 0x000fd000078e000a */
        /*0060*/                   LDG.E.SYS R2, [R2] ;                                    /* 0x0000000002027381 */
                                                                                           /* 0x000ea800001ee900 */
        /*0070*/                   LDG.E.SYS R4, [R4] ;                                    /* 0x0000000004047381 */
                                                                                           /* 0x000ee200001ee900 */
        /*0080*/                   LOP3.LUT R0, R2, 0x7f7f7f7f, RZ, 0xc0, !PT ;            /* 0x7f7f7f7f02007812 */
                                                                                           /* 0x004fe400078ec0ff */
        /*0090*/                   LOP3.LUT R6, R4.reuse, 0x7f7f7f7f, RZ, 0xc0, !PT ;      /* 0x7f7f7f7f04067812 */
                                                                                           /* 0x048fe400078ec0ff */
        /*00a0*/                   LOP3.LUT R8, R4, 0x80808080, R2, 0x48, !PT ;            /* 0x8080808004087812 */
                                                                                           /* 0x000fc600078e4802 */
        /*00b0*/                   IMAD.IADD R9, R0, 0x1, R6 ;                             /* 0x0000000100097824 */
                                                                                           /* 0x000fe400078e0206 */
        /*00c0*/                   IMAD.WIDE.U32 R6, R7, R10, c[0x0][0x170] ;              /* 0x00005c0007067625 */
                                                                                           /* 0x000fc600078e000a */
        /*00d0*/                   LOP3.LUT R8, R9, R8, RZ, 0x3c, !PT ;                    /* 0x0000000809087212 */
                                                                                           /* 0x000fd000078e3cff */
        /*00e0*/                   STG.E.SYS [R6], R8 ;                                    /* 0x0000000806007386 */
                                                                                           /* 0x000fe2000010e900 */
        /*00f0*/                   EXIT ;                                                  /* 0x000000000000794d */
                                                                                           /* 0x000fea0003800000 */
        /*0100*/                   BRA 0x100;                                              /* 0xfffffff000007947 */
                                                                                           /* 0x000fc0000383ffff */
        /*0110*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0120*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0130*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0140*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0150*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0160*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
        /*0170*/                   NOP;                                                    /* 0x0000000000007918 */
                                                                                           /* 0x000fc00000000000 */
		.............................

From the dump information provided above,it’s seem that you are gradually removing SIMD hardware resources since Maxwell arch.

Can I confirm that or it’s problem of improper use of the nvcc compiler.

Hi,

We want to reproduce this issue first.
Would you mind to share a complete sample with us so we can pass it to our internal team for comment directly.

Thanks.

A practice sample
Code:

#include <cuda_runtime.h>
#include <stdio.h>
__global__ void usad4Cmbn(unsigned int *A, unsigned int *B, unsigned int *C) 
{
          C[threadIdx.x]=__vsub4(A[threadIdx.x],B[threadIdx.x]);
        //C[threadIdx.x]=__vabs4(A[threadIdx.x]);
        //C[threadIdx.x]=__vadd4(A[threadIdx.x],B[threadIdx.x]);
        //C[threadIdx.x]=__vadd4(__vabs4(__vsub4(A[threadIdx.x],B[threadIdx.x])),0);
        //C[threadIdx.x]=__vsadu4(A[threadIdx.x],B[threadIdx.x]);
}
int main()
{
        printf("1.select one of the instruction row in the kernel function above to test the accordingly simd instruction \n");
        printf("2.generate fatbin using nvcc options -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72\n");
        printf("3.dump the app[cuobjdump -saas app] compiled in step2 and you will get the specified GPU archs compound assembly codes \n");
}

Compile:

nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 vsadcuda.cu -o app

Dump:

cuobjdump -saas app

Finally , you can check the difference among the compound assembly codes that include specified GPU archs.
Thanks

Hi,

Thanks for the sample.
We will share more information with you once we got feedback from the developer team.

Thanks

Is there any update on this?

Hi,

Sorry that we are still checking this issue.
Will update more information with you once we got feedback from the CUDA team.

Thanks.

Hi,

Could you help to provide the runnable that showing perf numbers for us to compare?
Thanks.