Hi all,
we’ve used Kepler’s shuffle intrinsics and warp vote functions to implement some unsigned-integer bignum maths. The idea is that one warp holds and processes the 32 limbs of a 1024-bit integer (each limb being 32 bits). Extension to 2048 bits would be trivial: just change the limb type to uint64_t. Distributing the data (= registers) across the entire warp keeps the register count per thread nicely low and allows 100% occupancy in many applications.
What we currently have: add, addc, sub, comparison, left shift by an arbitrary number of bits, lshift1, rshift1, mul with a 1024-bit result, mul with a 2048-bit result (split into hi and lo parts), and division.
Here’s the add code that started it all. simd1024 is typedef’d to uint32_t. This could certainly be improved by using the PTX addc instruction instead of the carry = r < a check…
// returns carry of msb
__device__ bool simd1024_add(simd1024 &r, simd1024 a, const simd1024 &b)
{
    int laneId = threadIdx.x & (warpSize - 1);
    r = a + b;
    bool carry, carry_hi = false;
    do {
        // carry propagation
        carry = r < a;
        // accumulate all carries on the high limb
        carry_hi |= __shfl((int)carry, 31);
        carry = (uint32_t)__shfl_up((int)carry, 1) && laneId > 0;
        a = r; r += carry;
    } while (__any(carry)); // warp vote
    // return highest carry
    return carry_hi;
}
Wait, I have more. Here’s a mul returning the lower 1024 bits of the result.
It’s all very straightforward once you do warp-synchronous programming.
// 1024 x 1024 bit multiplication with 1024 bit result
__device__ void simd1024_mul(simd1024 &r, const simd1024 &a, const simd1024 &b)
{
    int laneId = threadIdx.x & (warpSize - 1);
    r = 0;
    for (int limb = 31; limb >= 0; --limb)
    {
        simd1024 factor = (simd1024)__shfl((int)a, limb); // broadcast a's limb
        simd1024_add(r, r, __umulhi(factor, b));          // add high words
        r = (simd1024)__shfl_up((int)r, 1);               // shift result up one limb
        r = (laneId == 0) ? 0 : r;
        simd1024_add(r, r, factor * b);                   // add low words
    }
}
On top of these foundations we’ve implemented
a) modular exponentiation with the exponentiation-by-squaring method (awfully slow)
b) modular exponentiation (base 2) using Montgomery multiplication
We use it for additional primality testing (Fermat checks) after sieving numbers with wheel factorization + a sieve of Eratosthenes, in order to find constellations of primes that lie close together (e.g. six-tuples with very small distances). There are some cryptocurrencies that use this task as a proof-of-work function - so essentially found prime tuples are worth money ;)
If there is some interest, I could clean up the maths code and post it.
There’s still some potential for optimization, e.g. saving ALU operations when squaring a number: compared to a regular multiplication, up to 1/2 of the partial-product multiplications can be skipped.
Christian