Release 1.1 of the SIMD-in-word functions was posted March 19, 4 pm PDT. The new release is currently only available from the new registered developer website at https://developer.nvidia.com/user/register. Log in, then click the green link “CUDA/GPU Computing Registered Developer Program”, then click the green link “Download” following “CUDA SIMD-within-a-word functions”. The downloaded file is named simd_functions_v1_1.tar
Those currently at GTC may be interested to learn how to speed up life sciences applications with the Kepler SIMD video instructions, which these functions provide convenient access to. Check out the life sciences track here: http://registration.gputechconf.com/quicklink/eptj7dX
Release notes [technical difficulties prevent me from posting the entire list of functions, sorry]
/*
Release 1.1
(1) Use of incorrect symbol in multiple-inclusion guard has been corrected.
(2) 44 additional functions were added to the initial set of 38 functions.
(3) The emulation paths for many existing functions were optimized for sm_2x.

This header file contains inline functions that implement intra-word SIMD
operations that are hardware accelerated on sm_3x (Kepler) GPUs. Efficient
emulation code paths are provided for earlier architectures (sm_1x, sm_2x)
to make the code portable across all GPUs supported by CUDA. The following
functions are currently implemented:
vabs2(a) per-halfword absolute value, with wrap-around: |a|
vabsdiffs2(a,b) per-halfword absolute difference of signed integer: |a - b|
vabsdiffu2(a,b) per-halfword absolute difference of unsigned integer: |a - b|
vabsss2(a) per-halfword abs. value, with signed saturation: sat.s16(|a|)
vadd2(a,b) per-halfword (un)signed addition, with wrap-around: a + b
vaddss2(a,b) per-halfword addition with signed saturation: sat.s16 (a + b)
vaddus2(a,b) per-halfword addition with unsigned saturation: sat.u16 (a+b)
vavgs2(a,b) per-halfword signed rounded average: (a+b+((a+b)>=0)) >> 1
vavgu2(a,b) per-halfword unsigned rounded average: (a + b + 1) / 2
vcmpeq2(a,b) per-halfword (un)signed comparison: a == b ? 0xffff : 0
vcmpges2(a,b) per-halfword signed comparison: a >= b ? 0xffff : 0
vcmpgeu2(a,b) per-halfword unsigned comparison: a >= b ? 0xffff : 0
vcmpgts2(a,b) per-halfword signed comparison: a > b ? 0xffff : 0
vcmpgtu2(a,b) per-halfword unsigned comparison: a > b ? 0xffff : 0
vcmples2(a,b) per-halfword signed comparison: a <= b ? 0xffff : 0
vcmpleu2(a,b) per-halfword unsigned comparison: a <= b ? 0xffff : 0
vcmplts2(a,b) per-halfword signed comparison: a < b ? 0xffff : 0
vcmpltu2(a,b) per-halfword unsigned comparison: a < b ? 0xffff : 0
vcmpne2(a,b) per-halfword (un)signed comparison: a != b ? 0xffff : 0
vhaddu2(a,b) per-halfword unsigned average: (a + b) / 2
vmaxs2(a,b) per-halfword signed maximum: max(a, b)
vmaxu2(a,b) per-halfword unsigned maximum: max(a, b)
vmins2(a,b) per-halfword signed minimum: min(a, b)
vminu2(a,b) per-halfword unsigned minimum: min(a, b)
vneg2(a) per-halfword negation, with wrap-around: -a
vnegss2(a) per-halfword negation, with signed saturation: sat.s16(-a)
vsads2(a,b) per-halfword sum of abs diff of signed: sum{0,1}(|a-b|)
vsadu2(a,b) per-halfword sum of abs diff of unsigned: sum{0,1}(|a-b|)
vseteq2(a,b) per-halfword (un)signed comparison: a == b ? 1 : 0
vsetges2(a,b) per-halfword signed comparison: a >= b ? 1 : 0
[...]
*/
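To make the per-halfword semantics in the list above concrete, here is a minimal host-side sketch of what a few of the listed operations compute on a 32-bit word holding two 16-bit lanes. This is not the code from the header: the ref_ names and the lo16/hi16/pack2 helpers are hypothetical, made up for illustration, and the real functions run on the device and use the video instructions on sm_3x. It only models the arithmetic described in the release notes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of the header): split and pack two 16-bit lanes.
static inline uint32_t lo16(uint32_t x) { return x & 0xffffu; }
static inline uint32_t hi16(uint32_t x) { return x >> 16; }
static inline uint32_t pack2(uint32_t hi, uint32_t lo) {
    return ((hi & 0xffffu) << 16) | (lo & 0xffffu);
}

// Clamp an intermediate sum to the unsigned 16-bit range.
static inline uint32_t sat_u16(uint32_t x) { return x > 0xffffu ? 0xffffu : x; }

// Absolute difference of two unsigned lane values.
static inline uint32_t absdiff_u(uint32_t x, uint32_t y) { return x > y ? x - y : y - x; }

// Models vadd2: per-halfword addition with wrap-around (carry never crosses lanes).
uint32_t ref_vadd2(uint32_t a, uint32_t b) {
    return pack2(hi16(a) + hi16(b), lo16(a) + lo16(b));
}

// Models vaddus2: per-halfword addition with unsigned saturation.
uint32_t ref_vaddus2(uint32_t a, uint32_t b) {
    return pack2(sat_u16(hi16(a) + hi16(b)), sat_u16(lo16(a) + lo16(b)));
}

// Models vabsdiffu2: per-halfword absolute difference of unsigned integers.
uint32_t ref_vabsdiffu2(uint32_t a, uint32_t b) {
    return pack2(absdiff_u(hi16(a), hi16(b)), absdiff_u(lo16(a), lo16(b)));
}

// Models vavgu2: per-halfword unsigned rounded average, (a + b + 1) / 2.
uint32_t ref_vavgu2(uint32_t a, uint32_t b) {
    return pack2((hi16(a) + hi16(b) + 1u) / 2u, (lo16(a) + lo16(b) + 1u) / 2u);
}

// Models vcmpeq2: per-halfword comparison, a == b ? 0xffff : 0.
uint32_t ref_vcmpeq2(uint32_t a, uint32_t b) {
    return pack2(hi16(a) == hi16(b) ? 0xffffu : 0u,
                 lo16(a) == lo16(b) ? 0xffffu : 0u);
}

// Small usage example: the upper lane overflows, the lower lane does not.
void ref_selfcheck() {
    assert(ref_vadd2(0xffff0001u, 0x00010001u) == 0x00000002u);    // upper lane wraps to 0
    assert(ref_vaddus2(0xffff0001u, 0x00010001u) == 0xffff0002u);  // upper lane saturates
}
```

The point of the example is the lane isolation: an overflow in one halfword never leaks into its neighbour, which is what distinguishes these operations from an ordinary 32-bit add. A reference model like this can also be handy for checking device results on the host.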