Release 1.1 of SIMD-in-a-word functions posted

Release 1.1 of the SIMD-in-a-word functions was posted March 19, 4 pm PDT. The new release is currently only available from the new registered developer website at https://developer.nvidia.com/user/register. Log in, then click the green link “CUDA/GPU Computing Registered Developer Program”, then click the green link “Download” following “CUDA SIMD-within-a-word functions”. The downloaded file is named simd_functions_v1_1.tar.

Those currently at GTC may be interested to learn how to speed up life sciences applications with the Kepler SIMD video instructions, which these functions provide convenient access to. Check out the life sciences track here: http://registration.gputechconf.com/quicklink/eptj7dX

Release notes [technical difficulties prevent me from posting the entire list of functions, sorry]

/* 
  Release 1.1
  
   (1) Use of incorrect symbol in multiple-inclusion guard has been corrected.
   (2) 44 additional functions were added to the initial set of 38 functions.
   (3) The emulation paths for many existing functions were optimized for sm_2x.
 
  This header file contains inline functions that implement intra-word SIMD
  operations, that are hardware accelerated on sm_3x (Kepler) GPUs. Efficient
  emulation code paths are provided for earlier architectures (sm_1x, sm_2x)
  to make the code portable across all GPUs supported by CUDA. The following 
  functions are currently implemented:

  vabs2(a)        per-halfword absolute value, with wrap-around: |a|
  vabsdiffs2(a,b) per-halfword absolute difference of signed integer: |a - b|
  vabsdiffu2(a,b) per-halfword absolute difference of unsigned integer: |a - b|
  vabsss2(a)      per-halfword abs. value, with signed saturation: sat.s16(|a|)
  vadd2(a,b)      per-halfword (un)signed addition, with wrap-around: a + b
  vaddss2(a,b)    per-halfword addition with signed saturation: sat.s16 (a + b)
  vaddus2(a,b)    per-halfword addition with unsigned saturation: sat.u16 (a+b)
  vavgs2(a,b)     per-halfword signed rounded average: (a+b+((a+b)>=0)) >> 1
  vavgu2(a,b)     per-halfword unsigned rounded average: (a + b + 1) / 2
  vcmpeq2(a,b)    per-halfword (un)signed comparison: a == b ? 0xffff : 0
  vcmpges2(a,b)   per-halfword signed comparison: a >= b ? 0xffff : 0
  vcmpgeu2(a,b)   per-halfword unsigned comparison: a >= b ? 0xffff : 0
  vcmpgts2(a,b)   per-halfword signed comparison: a > b ? 0xffff : 0
  vcmpgtu2(a,b)   per-halfword unsigned comparison: a > b ? 0xffff : 0
  vcmples2(a,b)   per-halfword signed comparison: a <= b ? 0xffff : 0
  vcmpleu2(a,b)   per-halfword unsigned comparison: a <= b ? 0xffff : 0
  vcmplts2(a,b)   per-halfword signed comparison: a < b ? 0xffff : 0
  vcmpltu2(a,b)   per-halfword unsigned comparison: a < b ? 0xffff : 0
  vcmpne2(a,b)    per-halfword (un)signed comparison: a != b ? 0xffff : 0
  vhaddu2(a,b)    per-halfword unsigned average: (a + b) / 2
  vmaxs2(a,b)     per-halfword signed maximum: max(a, b)
  vmaxu2(a,b)     per-halfword unsigned maximum: max(a, b)
  vmins2(a,b)     per-halfword signed minimum: min(a, b)
  vminu2(a,b)     per-halfword unsigned minimum: min(a, b)
  vneg2(a)        per-halfword negation, with wrap-around: -a
  vnegss2(a)      per-halfword negation, with signed saturation: sat.s16(-a)
  vsads2(a,b)     per-halfword sum of abs diff of signed: sum{0,1}(|a-b|)
  vsadu2(a,b)     per-halfword sum of abs diff of unsigned: sum{0,1}(|a-b|)
  vseteq2(a,b)    per-halfword (un)signed comparison: a == b ? 1 : 0
  vsetges2(a,b)   per-halfword signed comparison: a >= b ? 1 : 0

  [...]
*/
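To make the difference between the wrap-around, saturating, and comparison variants concrete, here is a minimal reference sketch in plain C. It only models the per-halfword semantics described in the list above with a lane-by-lane loop; the names map2, sx16, and ref_* are hypothetical and are not part of the header, and this is not how the header implements these functions.

```c
#include <stdint.h>

/* A 16-bit operation applied independently to each halfword lane. */
typedef uint16_t (*lane_op16)(uint16_t, uint16_t);

/* Split both words into two 16-bit lanes, apply op per lane, repack. */
static inline uint32_t map2(uint32_t a, uint32_t b, lane_op16 op)
{
    uint32_t lo = op((uint16_t)(a & 0xffffu), (uint16_t)(b & 0xffffu));
    uint32_t hi = op((uint16_t)(a >> 16), (uint16_t)(b >> 16));
    return (hi << 16) | lo;
}

/* Interpret a 16-bit pattern as a signed value without relying on casts. */
static inline int32_t sx16(uint16_t x)
{
    return (x & 0x8000u) ? (int32_t)x - 65536 : (int32_t)x;
}

/* a + b, wrap-around (same bits for signed and unsigned). */
static inline uint16_t add_wrap(uint16_t x, uint16_t y)
{
    return (uint16_t)(x + y);
}

/* sat.u16(a + b): clamp to 0xffff instead of wrapping. */
static inline uint16_t add_sat_u16(uint16_t x, uint16_t y)
{
    uint32_t s = (uint32_t)x + (uint32_t)y;
    return (uint16_t)(s > 0xffffu ? 0xffffu : s);
}

/* sat.s16(a + b): clamp to [-32768, 32767]. */
static inline uint16_t add_sat_s16(uint16_t x, uint16_t y)
{
    int32_t s = sx16(x) + sx16(y);
    if (s > 32767) s = 32767;
    if (s < -32768) s = -32768;
    return (uint16_t)s;
}

/* a > b ? 0xffff : 0, unsigned comparison. */
static inline uint16_t cmp_gtu(uint16_t x, uint16_t y)
{
    return (uint16_t)(x > y ? 0xffffu : 0u);
}

static inline uint32_t ref_vadd2(uint32_t a, uint32_t b)    { return map2(a, b, add_wrap); }
static inline uint32_t ref_vaddus2(uint32_t a, uint32_t b)  { return map2(a, b, add_sat_u16); }
static inline uint32_t ref_vaddss2(uint32_t a, uint32_t b)  { return map2(a, b, add_sat_s16); }
static inline uint32_t ref_vcmpgtu2(uint32_t a, uint32_t b) { return map2(a, b, cmp_gtu); }
```

For example, with a = 0xffff0001 and b = 0x00010001 the upper lane overflows: the wrap-around model gives 0x00000002, while the unsigned-saturating model gives 0xffff0002.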

Function list continued.

/*
  vsetgeu2(a,b)   per-halfword unsigned comparison: a >= b ? 1 : 0
  vsetgts2(a,b)   per-halfword signed comparison: a > b ? 1 : 0
  vsetgtu2(a,b)   per-halfword unsigned comparison: a > b ? 1 : 0
  vsetles2(a,b)   per-halfword signed comparison: a <= b ? 1 : 0 
  vsetleu2(a,b)   per-halfword unsigned comparison: a <= b ? 1 : 0 
  vsetlts2(a,b)   per-halfword signed comparison: a < b ? 1 : 0
  vsetltu2(a,b)   per-halfword unsigned comparison: a < b ? 1 : 0
  vsetne2(a,b)    per-halfword (un)signed comparison: a != b ? 1 : 0
  vsub2(a,b)      per-halfword (un)signed subtraction, with wrap-around: a - b
  vsubss2(a,b)    per-halfword subtraction with signed saturation: sat.s16(a-b)
  vsubus2(a,b)    per-halfword subtraction w/ unsigned saturation: sat.u16(a-b)
  
  vabs4(a)        per-byte absolute value, with wrap-around: |a|
  vabsdiffs4(a,b) per-byte absolute difference of signed integer: |a - b|
  vabsdiffu4(a,b) per-byte absolute difference of unsigned integer: |a - b|
  vabsss4(a)      per-byte absolute value, with signed saturation: sat.s8(|a|)
  vadd4(a,b)      per-byte (un)signed addition, with wrap-around: a + b
  vaddss4(a,b)    per-byte addition with signed saturation: sat.s8 (a + b)
  vaddus4(a,b)    per-byte addition with unsigned saturation: sat.u8 (a + b)
  vavgs4(a,b)     per-byte signed rounded average: (a + b + ((a+b) >= 0)) >> 1
  vavgu4(a,b)     per-byte unsigned rounded average: (a + b + 1) / 2
  vcmpeq4(a,b)    per-byte (un)signed comparison: a == b ? 0xff : 0
  vcmpges4(a,b)   per-byte signed comparison: a >= b ? 0xff : 0
  vcmpgeu4(a,b)   per-byte unsigned comparison: a >= b ? 0xff : 0
  vcmpgts4(a,b)   per-byte signed comparison: a > b ? 0xff : 0
  vcmpgtu4(a,b)   per-byte unsigned comparison: a > b ? 0xff : 0
  vcmples4(a,b)   per-byte signed comparison: a <= b ? 0xff : 0
  vcmpleu4(a,b)   per-byte unsigned comparison: a <= b ? 0xff : 0
  vcmplts4(a,b)   per-byte signed comparison: a < b ? 0xff : 0
  vcmpltu4(a,b)   per-byte unsigned comparison: a < b ? 0xff : 0
  vcmpne4(a,b)    per-byte (un)signed comparison: a != b ? 0xff : 0
  vhaddu4(a,b)    per-byte unsigned average: (a + b) / 2
  vmaxs4(a,b)     per-byte signed maximum: max(a, b)
  vmaxu4(a,b)     per-byte unsigned maximum: max(a, b)
  vmins4(a,b)     per-byte signed minimum: min(a, b)
  vminu4(a,b)     per-byte unsigned minimum: min(a, b)
  vneg4(a)        per-byte negation, with wrap-around: -a
  vnegss4(a)      per-byte negation, with signed saturation: sat.s8(-a)
  vsads4(a,b)     per-byte sum of abs difference of signed: sum{0,3}(|a-b|)
  vsadu4(a,b)     per-byte sum of abs difference of unsigned: sum{0,3}(|a-b|)
  vseteq4(a,b)    per-byte (un)signed comparison: a == b ? 1 : 0
  vsetges4(a,b)   per-byte signed comparison: a >= b ? 1 : 0
  vsetgeu4(a,b)   per-byte unsigned comparison: a >= b ? 1 : 0
  vsetgts4(a,b)   per-byte signed comparison: a > b ? 1 : 0
  vsetgtu4(a,b)   per-byte unsigned comparison: a > b ? 1 : 0
  vsetles4(a,b)   per-byte signed comparison: a <= b ? 1 : 0
  vsetleu4(a,b)   per-byte unsigned comparison: a <= b ? 1 : 0
  vsetlts4(a,b)   per-byte signed comparison: a < b ? 1 : 0
  vsetltu4(a,b)   per-byte unsigned comparison: a < b ? 1 : 0
  vsetne4(a,b)    per-byte (un)signed comparison: a != b ? 1 : 0
  vsub4(a,b)      per-byte (un)signed subtraction, with wrap-around: a - b
  vsubss4(a,b)    per-byte subtraction with signed saturation: sat.s8 (a - b)
  vsubus4(a,b)    per-byte subtraction with unsigned saturation: sat.u8 (a - b)
*/
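For readers wondering how per-byte operations can be emulated with ordinary 32-bit integer instructions, here is a minimal sketch of the well-known masking tricks, written in plain C. The swar_* names are hypothetical, and this is an illustration of the general technique under that assumption, not a copy of the emulation paths in the header.

```c
#include <stdint.h>

/* Per-byte a + b with wrap-around: add the low 7 bits of every byte so
   no carry can cross a byte boundary, then fix up the top bit with XOR. */
static inline uint32_t swar_vadd4(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

/* Per-byte a - b with wrap-around: force the top bit of each byte of a
   to 1 so a borrow never leaves its byte, then fix up the top bit. */
static inline uint32_t swar_vsub4(uint32_t a, uint32_t b)
{
    uint32_t low = (a | 0x80808080u) - (b & 0x7f7f7f7fu);
    return low ^ ((a ^ ~b) & 0x80808080u);
}

/* Per-byte unsigned average rounded down: (a + b) / 2.
   Uses a + b == 2 * (a & b) + (a ^ b), so no 9-bit sum is needed. */
static inline uint32_t swar_vhaddu4(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7f7f7f7fu);
}

/* Per-byte unsigned average rounded up: (a + b + 1) / 2.
   Uses a + b == 2 * (a | b) - (a ^ b). */
static inline uint32_t swar_vavgu4(uint32_t a, uint32_t b)
{
    return (a | b) - (((a ^ b) >> 1) & 0x7f7f7f7fu);
}
```

The two averages differ only in rounding: for the byte pair (1, 2) the rounded-down form gives 1 and the rounded-up form gives 2, matching the (a + b) / 2 and (a + b + 1) / 2 descriptions in the list.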

This recent paper provides an interesting example of significant performance improvements achieved with the help of Kepler’s SIMD instructions:

Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions
BMC Bioinformatics 2013, 14:117
http://www.biomedcentral.com/content/pdf/1471-2105-14-117.pdf

The download page seems to be unavailable. I keep being redirected to the license page after clicking ‘Agree’.

I am unable to reproduce this download issue using my registered developer account. The problem may have been transient.

I want this, I get

Access Denied
You don’t have permission to access “http://developer.nvidia.com/user/register” on this server.
Reference #18.44240ac3.1369092315.fda50f

when I try to register!

There was a hardware failure. Please try again; it should be fixed now.

I think not:

Access Denied
You don’t have permission to access “http://developer.nvidia.com/user/register” on this server.
Reference #18.44240ac3.1369095616.102081e

There seems to be a technical issue with the site. I cannot reach the CUDA registered developer website at this time. I will notify the relevant team.

The technical problems appear to be fixed. At this point I am able to log into the registered developer website, and I successfully downloaded the file. Please try again.

It’s still broken:

Access Denied

You don’t have permission to access “http://developer.nvidia.com/user/register” on this server.
Reference #18.44240ac3.1369137378.1662e45

Sorry to hear the site is still not accessible.
Can you try this location:
https://developer.nvidia.com/registered-developer-programs

And follow the links to login or register for the CUDA Registered Developer Program.
If you still experience an access problem, try a different browser and let me know the results. You are welcome to message me directly since I may need some additional information.

Thanks again for your help getting to the bottom of this problem.

Tried it with Opera:

Access Denied
You don’t have permission to access “http://developer.nvidia.com/user/register” on this server.

Reference #18.69ff4317.1369183190.436fdf3

Access Denied
You don’t have permission to access “http://developer.nvidia.com/user/register” on this server.
Reference #18.17a9645f.1369721532.15525d0e

@birdwes: Have you tried contacting Nadeem through a PM, as he suggested above?

I can’t find any documentation on the SIMD instruction set. Can anyone point me to it, or is it just not there yet?

CUDA Toolkit Documentation

8.7.13 SIMD Video Instructions
The SIMD video instructions operate on pairs of 16-bit values and quads of 8-bit values.
The SIMD video instructions are:
 vadd2, vadd4
 vsub2, vsub4
 vavrg2, vavrg4
 vabsdiff2, vabsdiff4
 vmin2, vmin4
 vmax2, vmax4
 vset2, vset4
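The instruction list above has vset2/vset4 but no separate compare-to-mask instruction, while the header's function list offers both vset* (1 per lane) and vcmp* (all ones per lane). A minimal sketch in plain C of how the two result styles relate, and of how a mask can drive a per-byte select; all function names here are hypothetical, and whether the header derives one from the other this way is an assumption:

```c
#include <stdint.h>

/* Reference model of a vseteq4-style result: 1 in each byte where the
   corresponding bytes of a and b are equal, 0 elsewhere. */
static inline uint32_t ref_vseteq4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xffu;
        uint32_t y = (b >> (8 * i)) & 0xffu;
        r |= (uint32_t)(x == y) << (8 * i);
    }
    return r;
}

/* Widen per-byte 0/1 flags to 0x00/0xff masks: each flag f becomes
   256*f - f = 255*f in its own byte (arithmetic is modulo 2^32). */
static inline uint32_t flags_to_mask4(uint32_t flags)
{
    return (flags << 8) - flags;
}

/* Same idea for per-halfword flags: 0/1 becomes 0x0000/0xffff. */
static inline uint32_t flags_to_mask2(uint32_t flags)
{
    return (flags << 16) - flags;
}

/* Per-lane select: take bits of a where mask is set, bits of b elsewhere. */
static inline uint32_t select_by_mask(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}
```

The mask form is convenient for branch-free selection (for example, building a per-byte max from a comparison), while the 0/1 form is convenient for counting matches.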

Thanks!

Just happened to look at this file for the first time (I’ve mostly been writing PTX asm statements by hand whenever I need a fancier PTX call).

The big surprise… there’s a huge amount of code in this header to provide pre-Kepler equivalent code paths for each intrinsic! That’s awesome! I admit I just expected it to be only a C wrapper for the PTX asm statements and therefore be Kepler-only.

Kudos for that considerable extra effort!
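For anyone who has not opened the file, the structure described above (a PTX instruction on Kepler, an emulation path elsewhere) can be sketched roughly like this. This is my own minimal illustration, not code from the header: the names SIMD_SKETCH_FN and sketch_vadd2 are hypothetical, and the emulation shown is just one straightforward masking approach. It compiles as plain C (taking the emulation path) as well as under nvcc.

```c
#include <stdint.h>

/* Under nvcc, make this device code; in plain C, an ordinary inline function. */
#if defined(__CUDACC__)
#define SIMD_SKETCH_FN static __device__ __forceinline__
#else
#define SIMD_SKETCH_FN static inline
#endif

/* Per-halfword addition with wrap-around, in the style of vadd2(a,b). */
SIMD_SKETCH_FN uint32_t sketch_vadd2(uint32_t a, uint32_t b)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 300)
    /* sm_3x and later: a single PTX SIMD video instruction. The
       instruction format requires a third source operand; it is 0 here. */
    uint32_t r, c = 0;
    asm("vadd2.u32.u32.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
#else
    /* Earlier GPUs (and host code): add the low 15 bits of each halfword
       so no carry crosses the lane boundary, then fix up the top bit. */
    uint32_t low = (a & 0x7fff7fffu) + (b & 0x7fff7fffu);
    return low ^ ((a ^ b) & 0x80008000u);
#endif
}
```

Because the architecture check happens in the preprocessor, calling code stays the same on every GPU generation; only the body that gets compiled changes.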