Hello everyone. I need help with an FNV function; I'm tired of fighting with it.
In its simplest form, the FNV function looks like this:
uint32_t fnv4(uint32_t x, uint32_t y)
{
    return (x * 0x01000193u) ^ y;  // multiply by the FNV prime (wraps mod 2^32), then XOR
}
I am writing it in PTX:
mov.u32 round,0x00;
$LLBfnv1:
ld.global.u64 %rM,[mixzero];
ld.global.u64 %rA,[mixzero+128];
mul.hi.u64 %rt0,%rM,0x01000193;
shl.b64 %rt1,%rM, 32;
mul.hi.u64 %rt1,%rt1,0x01000193;
shl.b64 %rt0,%rt0, 32;
xor.b64 %rt0,%rt0,%rt1;
xor.b64 %rM,%rt0,%rA;
add.u32 round,round,1;
setp.lt.u32 p,round,64;
@p bra.uni $LLBfnv1;
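For checking the PTX round above on the host, here is a minimal C sketch of a reference. It assumes the intent is to apply the 32-bit FNV step independently to the low and the high half of each 64-bit word; that is my reading of the code, not something the listing states. The names fnv32 and fnv32x2 are hypothetical.

```c
#include <stdint.h>

#define FNV_PRIME 0x01000193u

/* 32-bit FNV step: the multiply wraps modulo 2^32, then XOR. */
static uint32_t fnv32(uint32_t x, uint32_t y)
{
    return (x * FNV_PRIME) ^ y;
}

/* Apply fnv32 separately to the low and high 32-bit halves of 64-bit words.
   Assumption: this is what one round is meant to do to %rM and %rA. */
static uint64_t fnv32x2(uint64_t m, uint64_t a)
{
    uint32_t lo = fnv32((uint32_t)m, (uint32_t)a);
    uint32_t hi = fnv32((uint32_t)(m >> 32), (uint32_t)(a >> 32));
    return ((uint64_t)hi << 32) | lo;
}
```

Feeding the same %rM and %rA values to fnv32x2 and comparing with the register contents after one round should show whether the mul.hi/shl sequence produces the intended halves. Note that the two halves must not influence each other: a carry out of the low product must not reach the high half.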
I need a way to process 128 bytes in 64 rounds. If I compute it with 16 threads in parallel, 2 bytes each, the result has to be saved after each round, because %rM changes depending on the round results.
If I store it in shared memory, it turns out that I can run only 49152 / 128 = 384 threads simultaneously. That is very few.
At the moment I get 6,800,000 function executions on a GTX 660. The alternative is not to parallelize the function itself, but to compute the 128 bytes sequentially in each thread.
Then there is no need to save anything, because the thread already sees the results for all 128 bytes.
Here is an example in PureBasic to show why the results have to be visible after each round:
For i = 0 To 63
p=fnv(i ! ValueL(*s), ValueL(*mix+i % w) ) % (n /mixhashes) * mixhashes
fnv64BI(*mix,*Fullarray+p*#HASH_BYTES,*mix)
fnv64BI(*mix+#HASH_BYTES,*Fullarray+(p+1)*#HASH_BYTES,*mix+#HASH_BYTES)
Next i
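The same loop as a C sketch, to make the dependency explicit: the index p for round i is computed from a word of mix that earlier rounds have already overwritten, so round i cannot start before round i-1 has finished. Assumptions, none of them confirmed by the PureBasic code: values are 32-bit words, #HASH_BYTES is 64, mixhashes is 2 (so w is 32 words), and ValueL(*mix+i % w) reads mix word number i % w. The names mix_rounds, s0 and full are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define FNV_PRIME  0x01000193u
#define HASH_BYTES 64u                        /* assumed value of #HASH_BYTES */
#define HASH_WORDS (HASH_BYTES / 4u)          /* 16 words per hash            */
#define MIX_HASHES 2u                         /* assumed: 128 / HASH_BYTES    */
#define MIX_WORDS  (MIX_HASHES * HASH_WORDS)  /* w = 32 words = 128 bytes     */

static uint32_t fnv(uint32_t x, uint32_t y)
{
    return (x * FNV_PRIME) ^ y;
}

/* mix:  128-byte state, updated in place
   s0:   first 32-bit word of *s
   full: array of n hashes, HASH_BYTES each (n must be at least MIX_HASHES) */
static void mix_rounds(uint32_t mix[MIX_WORDS], uint32_t s0,
                       const uint32_t *full, uint32_t n)
{
    for (uint32_t i = 0; i < 64; i++) {
        /* p depends on the current contents of mix */
        uint32_t p = fnv(i ^ s0, mix[i % MIX_WORDS]) % (n / MIX_HASHES) * MIX_HASHES;
        const uint32_t *src = full + (size_t)p * HASH_WORDS;

        /* entries p and p+1 are contiguous, so one pass covers both fnv64BI calls */
        for (uint32_t k = 0; k < MIX_WORDS; k++)
            mix[k] = fnv(mix[k], src[k]);
    }
}
```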
This is the variant I am using. All the data is kept in 16 registers, %rA0-%rA15.
Only the data needed for the XOR operation is read from global memory:
.func _fnv(){
mov.u64 %rB0,%rd0; // Pointer to array in global memory
mov.u32 round,0x00;
$LLBfnv1:
ld.global.u64 t,[%rB0];
mul.hi.u64 %rt0,%rA0,0x01000193;
shl.b64 %rt1,%rA0, 32;
mul.hi.u64 %rt1,%rt1,0x01000193;
shl.b64 %rt0,%rt0, 32;
xor.b64 %rt0,%rt0,%rt1;
xor.b64 %rA0,%rt0,t;
ld.global.u64 t,[%rB0+8];
mul.hi.u64 %rt0,%rA1,0x01000193;
shl.b64 %rt1,%rA1, 32;
mul.hi.u64 %rt1,%rt1,0x01000193;
shl.b64 %rt0,%rt0, 32;
xor.b64 %rt0,%rt0,%rt1;
xor.b64 %rA1,%rt0,t;
// and so on
ld.global.u64 t,[%rB0+112];
mul.hi.u64 %rt0,%rA15,0x01000193;
shl.b64 %rt1,%rA15, 32;
mul.hi.u64 %rt1,%rt1,0x01000193;
shl.b64 %rt0,%rt0, 32;
xor.b64 %rt0,%rt0,%rt1;
xor.b64 %rA15,%rt0,t;
add.u32 round,round,1;
setp.lt.u32 p,round,64;
@p bra.uni $LLBfnv1;
ret.uni;
}
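A plain C sketch with the same structure as the unrolled PTX, usable as a host-side check of what one thread should produce. It assumes the 32-bit FNV step is applied to each half of a 64-bit word, and, like the listing, it reads the same 16 words in every round. The names fnv32x2 and fnv_rounds are hypothetical.

```c
#include <stdint.h>

#define FNV_PRIME 0x01000193u

/* 32-bit FNV step on each half of a 64-bit word (assumed intent of the PTX). */
static uint64_t fnv32x2(uint64_t m, uint64_t a)
{
    uint32_t lo = ((uint32_t)m * FNV_PRIME) ^ (uint32_t)a;
    uint32_t hi = ((uint32_t)(m >> 32) * FNV_PRIME) ^ (uint32_t)(a >> 32);
    return ((uint64_t)hi << 32) | lo;
}

/* One thread's work: acc[] plays the role of %rA0..%rA15,
   src[] the role of the 16 words at [%rB0] .. [%rB0+120]. */
static void fnv_rounds(uint64_t acc[16], const uint64_t src[16])
{
    for (uint32_t round = 0; round < 64; round++)
        for (uint32_t k = 0; k < 16; k++)
            acc[k] = fnv32x2(acc[k], src[k]);
}
```

Each acc[k] only depends on its own previous value and on src[k], so the 16 chains inside one round are independent of each other; only the rounds themselves are sequential.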
But it still performs only 6 million function calls, which is not enough for my video card.
The most interesting thing is that if the function processes only the first 64-bit word, the result is 30M function calls; for 2 words it is 29M, for 3 words 25M, and for 4 words it immediately drops to 10M.