Ok, if you really want multi-dimensional dynamic arrays then I’d like to say two things about it:
- C/C++ kinda sucks at it, because it has no built-in support for that.
- Thus this leaves the “Delphi” approach.
Which basically means:
An array of pointers pointing to an array of pointers pointing to an array of pointers and so forth.
So your first dimensional array will be pointers.
So your second dimensional array will be pointers.
And so forth, for every dimension in between.
So your last dimensional array will be data.
The added benefit of this approach is that each dimension can also have its own length.
Which basically means each array also has its own length field to indicate the length of that array.
So the first array will have 1 length field.
The second level will have 1 pointer + 1 length field per array, to specify the size of each second-level array.
So this allows each dimension, and each array at that dimension, to have its own length.
In your case this seems to be needed.
Especially at the molecule level… some molecules might have 10 atoms, some might have 100, some might have 1000 and so forth.
The lengths are at the same dimensional level, but still different for each instance.
Ok so long story short:
What you need to do to make this work in CUDA, at least thinking about it at the binary level, is to “translate” each pointer into CUDA’s “memory space”.
Think of them as “local/host pointers” and “remote/device pointers”.
Each pointer will need to be translated from host to remote.
Then each array/piece of data can be copied from host pointer to remote pointer.
Maybe by now there are common data structures available that can help you with that… maybe some kind of CUDA-specific vector (Thrust’s device_vector, for example) or some kind of array template… not sure about that…
But at least now you understand hopefully… at the binary level what is required to make this possible.
This is one of the reasons why a linear layout is advised… it makes things a whole lot simpler… but it might not always be possible… in your case it may or may not be… it depends on how wildly different molecules can be… and what the maximum size could be.
For example, if a molecule can never be greater than say N atoms… and if it’s ok to waste space… then a linear approach could still be taken… and then perhaps at the expense of index calculations.
Again this is where I think Delphi might do better than C/C++, since Delphi has properties which can hide index calculations to keep the code much cleaner. Some C/C++ compilers support properties as an extension… but it is not standardized.
Other solutions could be overloading the index operator (operator[] in C++; not entirely sure about that, it could be limited or problematic for multiple dimensions).
Other ideas could be open array parameters, or in C what I think is called variadic functions.
Or perhaps some inlineable functions to keep code clean but still fast code generation. Though having to pass indexes, and sizes still somewhat sucks, lots of stuff to type… but at least gets rid of calculation code and might offer cleaner 1 line solutions.
So first you will need to determine:
- Is it ok to waste space or not? What’s the maximum number of atoms? What’s the minimum number of atoms? How large is the difference between them?
If the answer is no:
- Then “partial arrays”, “sparse arrays”, sparse solutions are needed, which is what is described above…
I am not exactly sure what kind of data processing you want to do… I guess it will be a lot, and intensive… and intermix data from multiple molecules and files…
But if that is not the case… if there is some kind of processing per molecule…
Then maybe it can be done just per molecule which would then simplify it… but I guess you already thought of that or maybe not… and it’s probably not possible/not wanted.
- Once you figure out whether it’s option 1 (linear, wasting space) or option 2 (sparse)… then suppose it’s 2… I’d like to point out some problems:
Accessing/retrieving those pointers is what is known as “pointer chasing”. It’s basically a “gate problem”. All your data is only accessible via “gates”. It’s like jumping through hoops.
1D->2D->3D->4D->5D->DATA.
Each dimension represents a gate/problem. To get to the DATA the dimensions have to be traversed… this could be a problem for your algorithm/processing, it could introduce “gate/pointer/access delays” so to speak… the GPU will try to hide it by switching to other threads which are stored on chip… but it has limited amounts of those… so the chance is high it will run into “stalls”.
Then there is also the problem of caching. Limited amounts of cache… though nowadays more cache is available on the latest graphics cards… which could help store these pointers for faster access.
However data needs to be stored as well if it is to be processed multiple times and so forth… so these conflict with each other… those massive amounts of pointers might start to thrash the cache… so try to process molecules in such a way that not too many of them are accessed at the same time… or process linearly/one by one as much as possible… and try to scale it up to do parallel processing, but in such a way that it does not overflow the resources on the chip. So query the device for its resources and try to apply that to any algorithm.
Finally, beware of deadlocks… threads cannot wait on each other’s results if it’s fully parallel. It will have to be “batch based”, based on the maximum number of threads available on the chip.
If the number of threads is exceeded, the kernel will either not launch… or if the kernel itself assumes more threads were launched than actually were, then it will wait on non-existing threads, which is kinda funny ;) :)
Complex isn’t it ? :)
Just like to give you a heads up on what’s coming towards you, and whether you still think GPU/CUDA is worth the trouble… compared to perhaps somewhat easier CPU solutions ;) which at least in the serial/single-threaded case won’t deadlock… though multi-threading on the CPU can also introduce race conditions/locking etc.
(Also as a side note, nowadays modern GPUs can have support for unified memory, thus the translating could happen automatically, it can have performance loss though).
Also translating is done as follows:
- Allocate memory on host/local.
- Allocate memory on device/remote.
Do it the same way… that way you know how to translate 1 to 2, and 2 back to 1.
In other words, each pointer from 1 (local/host) is associated with pointer from 2 (remote/device).
I’ll give away a little secret of mine.
You could simply create some kind of unified pointer/variable structure as follows:
TUnifiedPointer = record
  Local  : Pointer;
  Remote : DevicePtr;
end;
Which translates roughly to something like this in C:
typedef struct {
  void*       Local;
  CUdeviceptr Remote;  /* device pointer type from the CUDA driver API */
} TUnifiedPointer;