Ok, if you really want multi-dimensional dynamic arrays then I’d like to say two things about it:
- C/C++ kinda sucks at it, because it has no built-in support for that.
- Thus this leaves the “Delphi” approach.
Which basically means:
An array of pointers pointing to an array of pointers pointing to an array of pointers and so forth.
So your first dimensional array will be pointers.
So your second dimensional array will be pointers.
And so forth, for every dimension in between.
So your last dimensional array will be data.
The added benefit of this approach is that each dimension can also have its own length.
Which basically means each array also has its own length field to indicate the length of that array.
So the first array will have 1 length field.
The second level will have 1 pointer + 1 length field per array, to specify the size of each second-level array.
So this allows each dimension, and each array at that dimension, to have its own length.
In your case this seems to be needed.
Especially at the molecule level… some molecules might have 10 atoms, some might have 100, some might have 1000 and so forth.
The lengths are at the same dimensional level, but still different for each instance.
Ok so long story short:
What you need to do to make this work in CUDA, at least thinking about it at the binary level, is to “translate” each pointer into CUDA’s “memory space”.
Think of them as “local/host pointers” and “remote/device pointers”.
Each pointer will need to be translated from host to remote.
Then each array/piece of data can be copied from host pointer to remote pointer.
Maybe by now there are common data structures available that can help you with that… maybe some kind of CUDA-specific vector (Thrust’s device_vector, for example) or some kind of array template… not sure about that…
But at least now you understand hopefully… at the binary level what is required to make this possible.
This is one of the reasons why a linear layout is advised… it makes things a whole lot simpler… but it might not always be possible… in your case it may or may not be… it depends on how wildly different molecules can be… and what the maximum size could be.
For example, if a molecule can never be greater than say N atoms… and if it’s ok to waste space… then a linear approach could still be taken… and then perhaps at the expense of index calculations.
Again this is where I think Delphi might do better than C/C++, since Delphi has properties which can hide index calculations to keep the code much cleaner. Some C/C++ compilers support properties as an extension… but it is not standardized.
Other solutions could be overloading the index operator (operator[] in C++; not entirely sure about that, it could be limited or problematic for multiple dimensions).
Other ideas could be open array parameters, or in C what I think is called variadic functions.
Or perhaps some inlineable functions to keep code clean but still fast code generation. Though having to pass indexes, and sizes still somewhat sucks, lots of stuff to type… but at least gets rid of calculation code and might offer cleaner 1 line solutions.
So first you will need to determine:
- Is it ok to waste space or not? What’s the maximum number of atoms? What’s the minimum number of atoms? How large is the difference between them?
If the answer is no:
- Then “partial arrays”, “sparse arrays”, sparse solutions are needed, which is what is described above…
I am not exactly sure what kind of data processing you want to do… I guess it will be a lot, and intensive… and intermix data from multiple molecules and files…
But if that is not the case… if there is some kind of processing per molecule…
Then maybe it can be done just per molecule which would then simplify it… but I guess you already thought of that or maybe not… and it’s probably not possible/not wanted.
- Once you figure out whether it’s option 1 (linear, wasting space) or option 2 (sparse)… then suppose it’s 2… I’d like to point out some problems:
Accessing/retrieving those pointers is what is known as “pointer chasing”. It’s basically a “gate problem”. All your data is only accessible via “gates”. It’s like jumping through hoops.
1D->2D->3D->4D->5D->DATA.
Each dimension represents a gate/problem. To get to the DATA the dimensions have to be traversed… this could be a problem for your algorithm/processing, it could introduce “gate/pointer/access delays” so to speak… the GPU will try to hide it by switching to other threads which are stored on chip… but it has limited amounts of those… so the chance is high it will run into “stalls”.
Then there is also the problem of caching. Limited amounts of cache… though nowadays more cache is available on the latest graphics cards… which could help store these pointers for faster access.
However data needs to be stored as well if it is to be processed multiple times and so forth… so these conflict with each other… those massive amounts of pointers might start to thrash the cache… so try to process molecules in such a way that not too many of them are accessed at the same time… or process linearly/one by one as much as possible… and try to scale it up to do parallel processing, but in such a way that it does not overflow the resources on the chip. So query the device for its resources and try to apply that to any algorithm.
Finally, beware of deadlocks… threads cannot wait on each other’s results if it’s fully parallel. It will have to be “batch based”, based on the maximum number of threads available on the chip.
If the number of threads is exceeded, the kernel will either not launch… or if the kernel itself assumes more threads were launched than actually were, then it will wait on non-existing threads, which is kinda funny ;) :)
Complex isn’t it ? :)
Just like to give you a heads up on what’s coming towards you, and whether you still think GPU/CUDA is worth the trouble… compared to perhaps somewhat easier CPU solutions ;) which at least in the serial/single-threaded case won’t deadlock… though multi-threading on the CPU can also introduce race conditions/locking etc.
(Also as a side note, nowadays modern GPUs can have support for unified memory, thus the translating could happen automatically, it can have performance loss though).
Also translating is done as follows:
- Allocate memory on host/local.
- Allocate memory on device/remote.
Do it the same way… that way you know how to translate 1 to 2, and 2 back to 1.
In other words, each pointer from 1 (local/host) is associated with pointer from 2 (remote/device).
I’ll give away a little secret of mine.
You could simply create some kind of unified pointer/variable structure as follows:
TUnifiedPointer = record
  Local  : Pointer;
  Remote : DevicePtr;
end;
Which translates roughly to something like this in C:
typedef struct {
  void*       Local;
  CUdeviceptr Remote;  /* device pointer type from the CUDA driver API */
} TUnifiedPointer;