Bignum maths with warp-synchronous programming

Hi all,

we’ve used Kepler’s shuffle intrinsics and warp vote functions to implement some unsigned integer bignum maths. The idea is that one warp holds and processes 32 limbs of a 1024 bit integer (each limb being 32 bits). Extension to 2048 bits would be trivial by changing the limb type to uint64_t. Distributing the data (=registers) across the entire warp keeps the register count per thread nicely low and allows for 100% occupancy in many applications.

What we currently have is add, addc, sub, comparison, left shift by arbitrary bits, lshift1, rshift1, mul with 1024 bit result, mul with 2048 bit result (split into hi and lo part) and division.

Here’s the add code that started it all. simd1024 is typedef’d to be uint32_t. This could certainly be improved to use the PTX addc instruction instead of the carry = r < a check…

// returns carry of msb
__device__ bool simd1024_add(simd1024 &r, simd1024 a, const simd1024 &b)
{
	int laneId = threadIdx.x & (warpSize-1);
	r = a + b;
	bool carry, carry_hi = false;
	do {
		// carry propagation
		carry = r < a;
		// accumulate all carries on the high limb
		carry_hi |= __shfl((int)carry, 31);
		carry = (uint32_t)__shfl_up((int)carry, 1) && laneId > 0;
		a = r; r += carry;
	} while(__any(carry)); // warp vote

	// return highest carry
	return carry_hi;
}

Wait, I have more. Here’s a mul returning the lower 1024 bits of the result.
It’s all very straightforward when you do warp synchronous programming.

// 1024 x 1024 bit Multiplication with 1024 bit result
__device__ void simd1024_mul(simd1024 &r, const simd1024 &a, const simd1024 &b)
{
	int laneId = threadIdx.x & (warpSize-1);

	r = 0;
	for (int limb=31; limb >= 0; --limb)
	{
		simd1024 factor = (simd1024)__shfl((int)a,limb);
		simd1024_add(r, r, __umulhi(factor, b));
		r = (simd1024)__shfl_up((int)r, 1);
		r = (laneId==0) ? 0 : r;
		simd1024_add(r, r, factor*b);
	}
}

On top of these foundations we’ve implemented
a) modular exponentiation with the exponentiation by squaring method (awfully slow)
b) modular exponentiation (base 2) using Montgomery multiplication

We use it for additional primality testing (Fermat checks) after sieving numbers with wheel factorization + sieve of Eratosthenes, in order to find constellations of primes that are close together (e.g. six-tuples with very small distances). There are some crypto currencies that use this task as a proof of work function - so essentially found prime tuples are worth money ;)

If there is some interest, I could clean up the maths code up and post it.

There’s still some potential for optimization, e.g. by saving some ALU operations in the squaring of numbers (up to 1/2 of the multiplications could be saved) compared to regular multiplication.

Christian

I haven’t looked at your code in detail, but requests for big integer (add/multiply) handling come up with some regularity. If it were available in a github repo with usage examples, I’m quite sure there would be interest. CUMP provides GPU extended precision capabillity, but only for floating point AFAIK.

This is a typical example of the requests that I see:

[url]c - large integer multiplication and addition on gpu - Stack Overflow

I have tested an implementation of big integer add here:

[url]c - large integer addition with CUDA - Stack Overflow

and for amusement purposes I was thinking of trying to hack together a naive big integer (1024 or 2048 x32bit) multiply routine.

Using the same idea as proposed in the StackOverflow answer, you may perform the carry propagation in constant time in warp-synchronous style, using code like this:

// returns carry of msb
__device__ bool simd1024_add(simd1024 &r, simd1024 a, const simd1024 &b)
{
    int laneId = threadIdx.x & (warpSize-1);
    r = a + b;
    uint32_t allcarries, p, g;
    bool carry, carry_hi = false;

    // carry propagation
    carry = r < a;
    g = __ballot(carry);	// Carry generate
    p = __ballot(r == 0xffffffff | carry);	// Carry propagate
    allcarries = ((p | g) + g);	// Propagate all carries
    carry_hi = allcarries < g;	// Detect final overflow
    allcarries = (allcarries ^ p) | g;	// Get effective carries
    r += (allcarries << laneId) & 1;

    // return highest carry
    return carry_hi;
}

Note I have not tested it at all, so I may have got the boolean calculations wrong.

This is mostly of academic interest though: in the average case, the original code needs only one iteration to propagate carries, and may still be faster…