The following content was posted to the registered developer website on January 31, 2013
(new: https://developer.nvidia.com/user/register; old: https://partners.nvidia.com):
SIMD-in-a-word functions
simd_functions.h contains a collection of inline functions for processing byte and half-word data packed into 32-bit words: each function operates on either four 8-bit lanes or two 16-bit lanes per word. The functions are hardware accelerated on Kepler platforms. Efficient emulation code is provided for earlier platforms, so the functions are portable across all compute capabilities. The functionality should be useful for image processing tasks and other application areas that work with packed 8-bit or 16-bit data.
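To make the packed layout and the idea of emulation concrete, here is a minimal host-side C++ sketch. It is not taken from simd_functions.h: the names pack4, pack2, swar_add4 and swar_add2 are hypothetical, lane 0 is assumed to occupy the least significant bits, and the carry-masking trick shown is one well-known way to emulate a wrap-around per-lane addition, not necessarily the method the header uses.

```cpp
#include <cstdint>

// Hypothetical helpers (not part of simd_functions.h).
// Assumption: lane 0 sits in the least significant bits of the word.
inline uint32_t pack4(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return uint32_t(b0) | (uint32_t(b1) << 8) |
           (uint32_t(b2) << 16) | (uint32_t(b3) << 24);
}

inline uint32_t pack2(uint16_t h0, uint16_t h1) {
    return uint32_t(h0) | (uint32_t(h1) << 16);
}

// Per-byte wrap-around addition using plain 32-bit arithmetic.
// Step 1: add only the low 7 bits of each byte, so a carry can reach
//         bit 7 of its own lane but never spill into the next lane.
// Step 2: fold the original top bits back in with XOR.
inline uint32_t swar_add4(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

// Same idea for two 16-bit lanes: protect bit 15 of each half-word.
inline uint32_t swar_add2(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7fff7fffu) + (b & 0x7fff7fffu);
    return low ^ ((a ^ b) & 0x80008000u);
}
```

For example, swar_add4(pack4(0xFF, 0x01, 0x7F, 0x80), pack4(0x01, 0x01, 0x01, 0x80)) yields pack4(0x00, 0x02, 0x80, 0x00): every lane wraps independently and no carry crosses a lane boundary.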
The list of supported functions is as follows:
vabsdiffu2(a,b) per-halfword unsigned absolute difference: |a - b|
vadd2(a,b) per-halfword (un)signed addition, with wrap-around: a + b
vavgu2(a,b) per-halfword unsigned rounded average: (a + b + 1) / 2
vcmpeq2(a,b) per-halfword (un)signed comparison: a == b ? 0xffff : 0
vcmpgeu2(a,b) per-halfword unsigned comparison: a >= b ? 0xffff : 0
vcmpgtu2(a,b) per-halfword unsigned comparison: a > b ? 0xffff : 0
vcmpleu2(a,b) per-halfword unsigned comparison: a <= b ? 0xffff : 0
vcmpltu2(a,b) per-halfword unsigned comparison: a < b ? 0xffff : 0
vcmpne2(a,b) per-halfword (un)signed comparison: a != b ? 0xffff : 0
vhaddu2(a,b) per-halfword unsigned average: (a + b) / 2
vmaxu2(a,b) per-halfword unsigned maximum: max(a, b)
vminu2(a,b) per-halfword unsigned minimum: min(a, b)
vseteq2(a,b) per-halfword (un)signed comparison: a == b ? 1 : 0
vsetgeu2(a,b) per-halfword unsigned comparison: a >= b ? 1 : 0
vsetgtu2(a,b) per-halfword unsigned comparison: a > b ? 1 : 0
vsetleu2(a,b) per-halfword unsigned comparison: a <= b ? 1 : 0
vsetltu2(a,b) per-halfword unsigned comparison: a < b ? 1 : 0
vsetne2(a,b) per-halfword (un)signed comparison: a != b ? 1 : 0
vsub2(a,b) per-halfword (un)signed subtraction, with wrap-around: a - b
vabsdiffu4(a,b) per-byte unsigned absolute difference: |a - b|
vadd4(a,b) per-byte (un)signed addition, with wrap-around: a + b
vavgu4(a,b) per-byte unsigned rounded average: (a + b + 1) / 2
vcmpeq4(a,b) per-byte (un)signed comparison: a == b ? 0xff : 0
vcmpgeu4(a,b) per-byte unsigned comparison: a >= b ? 0xff : 0
vcmpgtu4(a,b) per-byte unsigned comparison: a > b ? 0xff : 0
vcmpleu4(a,b) per-byte unsigned comparison: a <= b ? 0xff : 0
vcmpltu4(a,b) per-byte unsigned comparison: a < b ? 0xff : 0
vcmpne4(a,b) per-byte (un)signed comparison: a != b ? 0xff : 0
vhaddu4(a,b) per-byte unsigned average: (a + b) / 2
vmaxu4(a,b) per-byte unsigned maximum: max(a, b)
vminu4(a,b) per-byte unsigned minimum: min(a, b)
vseteq4(a,b) per-byte (un)signed comparison: a == b ? 1 : 0
vsetgeu4(a,b) per-byte unsigned comparison: a >= b ? 1 : 0
vsetgtu4(a,b) per-byte unsigned comparison: a > b ? 1 : 0
vsetleu4(a,b) per-byte unsigned comparison: a <= b ? 1 : 0
vsetltu4(a,b) per-byte unsigned comparison: a < b ? 1 : 0
vsetne4(a,b) per-byte (un)signed comparison: a != b ? 1 : 0
vsub4(a,b) per-byte (un)signed subtraction, with wrap-around: a - b
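The per-lane formulas in the list above can be checked against a straightforward lane-by-lane reference. The following host-side C++ sketch is an illustration of the listed semantics only. The ref_ prefixed names and the map_bytes / map_halves helpers are hypothetical and do not appear in simd_functions.h, and the code assumes lane 0 is the least significant byte or half-word. It is written for clarity, not speed.

```cpp
#include <cstdint>

// Apply op to each of the four 8-bit lanes and repack the results.
template <typename Op>
uint32_t map_bytes(uint32_t a, uint32_t b, Op op) {
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t x = (a >> (8 * i)) & 0xffu;
        uint32_t y = (b >> (8 * i)) & 0xffu;
        r |= (op(x, y) & 0xffu) << (8 * i);   // mask gives wrap-around
    }
    return r;
}

// Apply op to each of the two 16-bit lanes and repack the results.
template <typename Op>
uint32_t map_halves(uint32_t a, uint32_t b, Op op) {
    uint32_t r = 0;
    for (int i = 0; i < 2; ++i) {
        uint32_t x = (a >> (16 * i)) & 0xffffu;
        uint32_t y = (b >> (16 * i)) & 0xffffu;
        r |= (op(x, y) & 0xffffu) << (16 * i);
    }
    return r;
}

// Reference versions of a few listed functions (hypothetical names).
inline uint32_t ref_vadd4(uint32_t a, uint32_t b) {
    return map_bytes(a, b, [](uint32_t x, uint32_t y) { return x + y; });
}
inline uint32_t ref_vabsdiffu4(uint32_t a, uint32_t b) {
    return map_bytes(a, b, [](uint32_t x, uint32_t y) { return x > y ? x - y : y - x; });
}
inline uint32_t ref_vavgu4(uint32_t a, uint32_t b) {
    return map_bytes(a, b, [](uint32_t x, uint32_t y) { return (x + y + 1) / 2; });
}
inline uint32_t ref_vcmpgeu4(uint32_t a, uint32_t b) {
    return map_bytes(a, b, [](uint32_t x, uint32_t y) { return x >= y ? 0xffu : 0u; });
}
inline uint32_t ref_vsetgeu4(uint32_t a, uint32_t b) {
    return map_bytes(a, b, [](uint32_t x, uint32_t y) { return x >= y ? 1u : 0u; });
}
inline uint32_t ref_vsub2(uint32_t a, uint32_t b) {
    return map_halves(a, b, [](uint32_t x, uint32_t y) { return x - y; });
}
inline uint32_t ref_vmaxu2(uint32_t a, uint32_t b) {
    return map_halves(a, b, [](uint32_t x, uint32_t y) { return x > y ? x : y; });
}
```

Note the difference between the two comparison families: for the same inputs, ref_vcmpgeu4(0x01800500, 0x02800400) gives 0x00ffffff (a full-lane mask per true lane), while ref_vsetgeu4 gives 0x00010101 (a 0/1 flag per lane). The mask form is convenient for bitwise selection, and the flag form for counting matching lanes.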