Release 1.1 of the SIMD-in-word functions was posted March 19, 4 pm PDT. The new release is currently only available from the new registered developer website at [url]https://developer.nvidia.com/user/register[/url]. Log in, then click the green link "CUDA/GPU Computing Registered Developer Program", then click the green link "Download" following "CUDA SIMD-within-a-word functions". The downloaded file is named simd_functions_v1_1.tar.
Those currently at GTC may be interested to learn how to speed up life sciences applications with the Kepler SIMD video instructions, which these functions provide convenient access to. Check out the life sciences track here: [url]http://registration.gputechconf.com/quicklink/eptj7dX[/url]
Release notes [technical difficulties prevent me from posting the entire list of functions, sorry]
[code]
/*
Release 1.1
(1) Use of incorrect symbol in multiple-inclusion guard has been corrected.
(2) 44 additional functions were added to the initial set of 38 functions.
(3) The emulation paths for many existing functions were optimized for sm_2x
This header file contains inline functions that implement intra-word SIMD
operations, that are hardware accelerated on sm_3x (Kepler) GPUs. Efficient
emulation code paths are provided for earlier architectures (sm_1x, sm_2x)
to make the code portable across all GPUs supported by CUDA. The following
functions are currently implemented:
vabs2(a) per-halfword absolute value, with wrap-around: |a|
vabsdiffs2(a,b) per-halfword absolute difference of signed integer: |a - b|
vabsdiffu2(a,b) per-halfword absolute difference of unsigned integer: |a - b|
vabsss2(a) per-halfword abs. value, with signed saturation: sat.s16(|a|)
vadd2(a,b) per-halfword (un)signed addition, with wrap-around: a + b
vaddss2(a,b) per-halfword addition with signed saturation: sat.s16 (a + b)
vaddus2(a,b) per-halfword addition with unsigned saturation: sat.u16 (a+b)
vavgs2(a,b) per-halfword signed rounded average: (a+b+((a+b)>=0)) >> 1
vavgu2(a,b) per-halfword unsigned rounded average: (a + b + 1) / 2
vcmpeq2(a,b) per-halfword (un)signed comparison: a == b ? 0xffff : 0
vcmpges2(a,b) per-halfword signed comparison: a >= b ? 0xffff : 0
vcmpgeu2(a,b) per-halfword unsigned comparison: a >= b ? 0xffff : 0
vcmpgts2(a,b) per-halfword signed comparison: a > b ? 0xffff : 0
vcmpgtu2(a,b) per-halfword unsigned comparison: a > b ? 0xffff : 0
vcmples2(a,b) per-halfword signed comparison: a <= b ? 0xffff : 0
vcmpleu2(a,b) per-halfword unsigned comparison: a <= b ? 0xffff : 0
vcmplts2(a,b) per-halfword signed comparison: a < b ? 0xffff : 0
vcmpltu2(a,b) per-halfword unsigned comparison: a < b ? 0xffff : 0
vcmpne2(a,b) per-halfword (un)signed comparison: a != b ? 0xffff : 0
vhaddu2(a,b) per-halfword unsigned average: (a + b) / 2
vmaxs2(a,b) per-halfword signed maximum: max(a, b)
vmaxu2(a,b) per-halfword unsigned maximum: max(a, b)
vmins2(a,b) per-halfword signed minimum: min(a, b)
vminu2(a,b) per-halfword unsigned minimum: min(a, b)
vneg2(a) per-halfword negation, with wrap-around: -a
vnegss2(a) per-halfword negation, with signed saturation: sat.s16(-a)
vsads2(a,b) per-halfword sum of abs diff of signed: sum{0,1}(|a-b|)
vsadu2(a,b) per-halfword sum of abs diff of unsigned: sum{0,1}(|a-b|)
vseteq2(a,b) per-halfword (un)signed comparison: a == b ? 1 : 0
vsetges2(a,b) per-halfword signed comparison: a >= b ? 1 : 0
[...]
*/
[/code]
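
For anyone unsure how to read the per-halfword descriptions above, here is a minimal sketch in plain C that spells out the semantics of four of the listed functions one 16-bit lane at a time. The `ref_` names are hypothetical helpers made up for illustration; they are not part of the header and are not how the header implements these operations.

```c
#include <stdint.h>

/* Hypothetical reference helpers (not from simd_functions.h). Each one
   handles the low and high 16-bit lanes of a 32-bit word separately. */

/* Compare vadd2: a + b per halfword, wrapping modulo 2^16. */
static uint32_t ref_vadd2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xffffu;                  /* carry out of lane dropped */
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xffffu;
    return (hi << 16) | lo;
}

/* Compare vaddus2: a + b per halfword, clamped to 0xffff. */
static uint32_t ref_vaddus2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a & 0xffffu) + (b & 0xffffu);
    uint32_t hi = (a >> 16) + (b >> 16);
    if (lo > 0xffffu) lo = 0xffffu;                   /* unsigned saturation */
    if (hi > 0xffffu) hi = 0xffffu;
    return (hi << 16) | lo;
}

/* Compare vcmpeq2: 0xffff in each lane where a == b, else 0. */
static uint32_t ref_vcmpeq2(uint32_t a, uint32_t b)
{
    uint32_t lo = ((a & 0xffffu) == (b & 0xffffu)) ? 0xffffu : 0u;
    uint32_t hi = ((a >> 16) == (b >> 16)) ? 0xffffu : 0u;
    return (hi << 16) | lo;
}

/* Compare vavgu2: (a + b + 1) / 2 per halfword, unsigned. */
static uint32_t ref_vavgu2(uint32_t a, uint32_t b)
{
    uint32_t lo = ((a & 0xffffu) + (b & 0xffffu) + 1u) >> 1;
    uint32_t hi = ((a >> 16) + (b >> 16) + 1u) >> 1;
    return (hi << 16) | lo;
}
```

For example, with a = 0xffff0001 and b = 0x00010001 the wrapping add gives 0x00000002, while the saturating add gives 0xffff0002, because only the high lane overflows.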

Function list continued.
[code]
/*
vsetgeu2(a,b) per-halfword unsigned comparison: a >= b ? 1 : 0
vsetgts2(a,b) per-halfword signed comparison: a > b ? 1 : 0
vsetgtu2(a,b) per-halfword unsigned comparison: a > b ? 1 : 0
vsetles2(a,b) per-halfword signed comparison: a <= b ? 1 : 0
vsetleu2(a,b) per-halfword unsigned comparison: a <= b ? 1 : 0
vsetlts2(a,b) per-halfword signed comparison: a < b ? 1 : 0
vsetltu2(a,b) per-halfword unsigned comparison: a < b ? 1 : 0
vsetne2(a,b) per-halfword (un)signed comparison: a != b ? 1 : 0
vsub2(a,b) per-halfword (un)signed subtraction, with wrap-around: a - b
vsubss2(a,b) per-halfword subtraction with signed saturation: sat.s16(a-b)
vsubus2(a,b) per-halfword subtraction w/ unsigned saturation: sat.u16(a-b)
vabs4(a) per-byte absolute value, with wrap-around: |a|
vabsdiffs4(a,b) per-byte absolute difference of signed integer: |a - b|
vabsdiffu4(a,b) per-byte absolute difference of unsigned integer: |a - b|
vabsss4(a) per-byte absolute value, with signed saturation: sat.s8(|a|)
vadd4(a,b) per-byte (un)signed addition, with wrap-around: a + b
vaddss4(a,b) per-byte addition with signed saturation: sat.s8 (a + b)
vaddus4(a,b) per-byte addition with unsigned saturation: sat.u8 (a + b)
vavgs4(a,b) per-byte signed rounded average: (a + b + ((a+b) >= 0)) >> 1
vavgu4(a,b) per-byte unsigned rounded average: (a + b + 1) / 2
vcmpeq4(a,b) per-byte (un)signed comparison: a == b ? 0xff : 0
vcmpges4(a,b) per-byte signed comparison: a >= b ? 0xff : 0
vcmpgeu4(a,b) per-byte unsigned comparison: a >= b ? 0xff : 0
vcmpgts4(a,b) per-byte signed comparison: a > b ? 0xff : 0
vcmpgtu4(a,b) per-byte unsigned comparison: a > b ? 0xff : 0
vcmples4(a,b) per-byte signed comparison: a <= b ? 0xff : 0
vcmpleu4(a,b) per-byte unsigned comparison: a <= b ? 0xff : 0
vcmplts4(a,b) per-byte signed comparison: a < b ? 0xff : 0
vcmpltu4(a,b) per-byte unsigned comparison: a < b ? 0xff : 0
vcmpne4(a,b) per-byte (un)signed comparison: a != b ? 0xff: 0
vhaddu4(a,b) per-byte unsigned average: (a + b) / 2
vmaxs4(a,b) per-byte signed maximum: max(a, b)
vmaxu4(a,b) per-byte unsigned maximum: max(a, b)
vmins4(a,b) per-byte signed minimum: min(a, b)
vminu4(a,b) per-byte unsigned minimum: min(a, b)
vneg4(a) per-byte negation, with wrap-around: -a
vnegss4(a) per-byte negation, with signed saturation: sat.s8(-a)
vsads4(a,b) per-byte sum of abs difference of signed: sum{0,3}(|a-b|)
vsadu4(a,b) per-byte sum of abs difference of unsigned: sum{0,3}(|a-b|)
vseteq4(a,b) per-byte (un)signed comparison: a == b ? 1 : 0
vsetges4(a,b) per-byte signed comparison: a >= b ? 1 : 0
vsetgeu4(a,b) per-byte unsigned comparison: a >= b ? 1 : 0
vsetgts4(a,b) per-byte signed comparison: a > b ? 1 : 0
vsetgtu4(a,b) per-byte unsigned comparison: a > b ? 1 : 0
vsetles4(a,b) per-byte signed comparison: a <= b ? 1 : 0
vsetleu4(a,b) per-byte unsigned comparison: a <= b ? 1 : 0
vsetlts4(a,b) per-byte signed comparison: a < b ? 1 : 0
vsetltu4(a,b) per-byte unsigned comparison: a < b ? 1 : 0
vsetne4(a,b) per-byte (un)signed comparison: a != b ? 1: 0
vsub4(a,b) per-byte (un)signed subtraction, with wrap-around: a - b
vsubss4(a,b) per-byte subtraction with signed saturation: sat.s8 (a - b)
vsubus4(a,b) per-byte subtraction with unsigned saturation: sat.u8 (a - b)
*/
[/code]
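
To give a feel for what "emulation code path" can mean for the per-byte functions, here is a minimal sketch in plain C of two well-known SIMD-within-a-word tricks plus a straightforward loop-based reference. This is an assumption-laden illustration, not the code in the header: the `swar_` and `ref_` names are hypothetical, and the header's actual emulation paths may look different.

```c
#include <stdint.h>

/* Hypothetical helpers (not from simd_functions.h). */

/* Compare vadd4: per-byte a + b with wrap-around.
   Add the low 7 bits of every byte (no carry can cross a byte boundary),
   then patch in the top bit of each byte with XOR. */
static uint32_t swar_vadd4(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* Compare vhaddu4: per-byte unsigned average (a + b) / 2, rounded down.
   Uses a + b == 2 * (a & b) + (a ^ b); the mask stops bit 0 of one byte
   from shifting into bit 7 of the byte below it. */
static uint32_t swar_vhaddu4(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7f7f7f7fu);
}

/* Compare vsadu4: sum over the four bytes of |a - b|, unsigned.
   Plain loop, written for clarity rather than speed. */
static uint32_t ref_vsadu4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xffu;
        uint32_t y = (b >> (8 * i)) & 0xffu;
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}
```

The point of the first two helpers is that all four byte lanes are processed with a handful of ordinary 32-bit operations, which is the general idea behind running this kind of code on hardware without the SIMD video instructions.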

This recent paper provides an interesting example of significant performance improvements achieved with the help of Kepler's SIMD instructions:
Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt
CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions
BMC Bioinformatics 2013, 14:117
[url]http://www.biomedcentral.com/content/pdf/1471-2105-14-117.pdf[/url]

I want this, but when I try to register I get:
Access Denied
You don't have permission to access "http://developer.nvidia.com/user/register" on this server.
Reference #18.44240ac3.1369092315.fda50f

I think not:
Access Denied
You don't have permission to access "http://developer.nvidia.com/user/register" on this server.
Reference #18.44240ac3.1369095616.102081e

The technical problems appear to be fixed. At this point I am able to log into the registered developer website, and I successfully downloaded the file. Please try again.

It's still broken:
Access Denied
You don't have permission to access "http://developer.nvidia.com/user/register" on this server.
Reference #18.44240ac3.1369137378.1662e45

Sorry to hear the site is still not accessible.
Can you try this location:
https://developer.nvidia.com/registered-developer-programs
Then follow the links to log in or register for the CUDA Registered Developer Program.
If you still experience an access problem, try a different browser and let me know the results. You are welcome to message me directly, since I may need some additional information.
Thanks again for your help getting to the bottom of this problem.

Tried it with Opera:
Access Denied
You don't have permission to access "http://developer.nvidia.com/user/register" on this server.
Reference #18.69ff4317.1369183190.436fdf3

Access Denied
You don't have permission to access "http://developer.nvidia.com/user/register" on this server.
Reference #18.17a9645f.1369721532.15525d0e
