I’m porting a program to CUDA that uses long arrays of 40 bit (5 byte) words.
x86 has no problem, and not even a performance penalty, reading/writing 64 bit words
at a 5 byte stride. To perform these unaligned accesses on CUDA, I have for now resorted to the
following pair of routines:
Not being a CUDA expert, I wonder to what extent these routines can be optimized.
Also, I wonder how much slower these are compared to aligned 64 bit reads/writes.
Please let me know if you can shed light on these questions.
tera: thanks for the SO link! that looks mighty helpful.
i will eventually have separate u32 and u8 arrays. that will be an extensive rewrite.
for now i’m trying to see how well i can do with this “quick fix”.
BulatZiganshin: i’m not doing a memcpy. i’m reading 40 bit values from one place, then generating new 40 bit values
from that and writing that to different new places (in a bucketsort).
obviously you could change the byte count from 5 to 6 or 8 if that is important for your refactoring stage.
Probably best to actually test such a thing before drawing conclusions, but I don’t think the compiler is smart enough to intelligently convert that to “optimized” reads/writes of e.g. u32 or higher quantities.
yeah, i mean copying those 5 bytes into/from a local var with memcpy and praying to nvcc to work a miracle :)
small edit for txbob's code:
u64 foo;
requires initialization:
u64 foo = 0;
John, do you know about the radix sort in CUB? taking into account that it's an order of magnitude faster than naive implementations like the one in Boost.Compute, i believe it's very hard to make something even close.
the entire comments section there is great reading
one more comment: if you copy just 5 bytes as i initially proposed, the compiler can't convert it into optimized code like yours (because the compiler doesn't know that there are accessible bytes after the array end). instead, you may try to copy the entire 8 bytes and then mask the result
BulatZiganshin + txbob. ah, i should’ve realized that’s what you suggested. neat!
this results in a roughly 20% speedup of my code. Thanks for elaborating.
BulatZiganshin: what i’m doing is not quite a regular radix sort. i have half a billion (2^29) edges that have random endpoints in a billion (2^30) node bipartite graph. and i repeatedly want to identify edges that have no adjacent edge on one side. i first sort into 2^7 big buckets, and then sort each bucket into 2^7 sub buckets, and finally use a bytemap to count incidences.
i use the random distribution to limit the size of buckets, only provisioning a few extra % of space.
although each edge has 2x29=58 bits of info about its endpoints, i only need to store 40 bits, with the rest implicit in the
location of the bucket and the ordering within the bucket.
This is all part of a proof of work algorithm which will ultimately be optimized to extract every last bit of performance
(by more skilled miner authors).
does the CUB radix sort have source code available somewhere?
mining+graph makes me think that it's about mining a cryptocurrency without exhaustive search, i.e. by solving a set of linear equations or something like that?
did you get this 20% speedup by using memcpy or PRMT? if you used memcpy, did you look at the generated code? i mean that the memcpy code may be suboptimal, and hand-optimized PRMT may be even faster
if i understand you right, you apply some hashing function to the (node1,node2) 58-bit value in order to mix the bits up. if so, i can propose an even better bit mixing function. Look at that: smhasher/MurmurHash3.cpp at master · aappleby/smhasher · GitHub - you can even apply them multiple times. I've even seen a reversible variant of these functions, if you need one. Another possibility is CRC hashing - it's reversible and has great distribution, but requires one 8-byte load (from shared mem) per 8-12 bits of data, which may be slower than multiplication-based hashing
CUB and ModernGPU are invaluable sources of ready-to-use algorithms as well as CUDA optimization techniques and code snippets - just read their docs and enjoy. In particular, they support histogramming and segmented sort (i.e. sorting multiple independent arrays in a single operation)
In my experience, 2^8 buckets is optimal for modern CPUs, and afaik 2^4 buckets are used on each step of the CUB sort. Why do you use 2^7?
Why do you think that CUB radix sort is inappropriate for your goal? In particular, it seems that your algo is memory-constrained. What about this technique: histogram the src data into 16 buckets, then extract (i.e. partition) and sort each bucket individually before going to the next bucket. This way you will need 16x less extra space for sorting the entire array.
Overall, GPUs are very efficient at radix sort, so you may end up with a simple radix sort procedure followed by a trivial linear scan rather than the sophisticated approach you described
Actually, googling “CUB radix sort” is enough :) BTW, it's also part of Thrust, which is part of the CUDA installation and extremely easy to learn/use, although it doesn't provide access to all CUB radix sort features. So you already have it installed :)
The 20% speedup is from replacing my original routines with the memcpy based ones.
I didn’t try PRMT, since it seems limited to constructing 32 bit values, where I need 40 bits.
I did not check what the memcpy compiles to.
No; the 2x29 bits are already the result of a siphash-2-4 applied to a 29-bit edge index and a 1-bit endpoint selector.
I expect a custom sort to outperform a library sort, for several reasons: a) it can generate inputs on the fly, b) it can pack data in 5 rather than 8 bytes, c) it can exploit the even distribution of values, d) it can accommodate the 2 different sorting orders.
2^7 worked slightly better on the CPU implementation. I will test which one is better on GPU.