Global load efficiency is not as it should be (100%)

Okay, so I’m trying to do some calculations. I got a good tip here about allocating private data for the sake of global memory access. Sounds like a good idea to me so I tried with a 2d array (a douple float pointer) in some of my code and I did achieve 100% load efficiency.

I was very happy.

But now I’m trying to do the same thing but with only a 1d array.

I’m storing tetrahedron data so the way I arranged it is or at least tried to arrange it is :

x0
y0
z0
x1
y1
z1
x2
y2
z2
x3
y3
z3

,
where each row is defined by the number of tetrahedra or some other parameter, n. So it’s a 12 x n matrix and each coordinate is of course a float.

Points are stored in a similar order,

x0
y0
z0

,
so this time we have a 3 x n matrix.

I’m not sure if I’m just messing this up really badly or not but I have a load efficiency of like 51% but my store efficiency is as it should be, 100%.

Compile code with :

nvcc -gencode arch=compute_50,code=sm_50 -O3 -o test test.cu

Profile with :

nvprof --metrics gld_efficiency,gst_efficiency ./test

test.cu

#include <iostream>

#include <thrust/device_vector.h>
#include <thrust/copy.h>

#include "predicates.h"

// This code is to be used as a time test
// for various calculations used by regulus

const int tpb = 64,
          bpg = 512,
          gs = tpb * bpg;

__global__
void orientation_tests
(
const int num_points,
const float * __restrict__ tet_data,
const float * __restrict__ pt_data,
const float * __restrict__ predConsts,
int * __restrict__ la,
int * __restrict__ fs
)
{
    const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

    // for every point...
    for (int tid = thread_id; tid < num_points; tid += gs)
    {
        const float x0 = tet_data[0 * num_points + tid];
        const float y0 = tet_data[1 * num_points + tid];
        const float z0 = tet_data[2 * num_points + tid];

        const float x1 = tet_data[3 * num_points + tid];
        const float y1 = tet_data[4 * num_points + tid];
        const float z1 = tet_data[5 * num_points + tid];

        const float x2 = tet_data[6 * num_points + tid];
        const float y2 = tet_data[7 * num_points + tid];
        const float z2 = tet_data[8 * num_points + tid];

        const float x3 = tet_data[9 * num_points + tid];
        const float y3 = tet_data[10 * num_points + tid];
        const float z3 = tet_data[11 * num_points + tid];

        const float px = pt_data[0 * num_points + tid];
        const float py = pt_data[1 * num_points + tid];
        const float pz = pt_data[2 * num_points + tid];

        //printf("%.00f %.00f %.00f\n%.00f %.00f %.00f\n%.00f %.00f %.00f\n%.00f %.00f %.00f\nis being tested against\n%.00f %.00f %.00f\n\n", x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, px, py, pz);

        const float a[3] = { x0, y0, z0 };
        const float b[3] = { x1, y1, z1 };
        const float c[3] = { x2, y2, z2 };
        const float d[3] = { x3, y3, z3 };

        const float p[3] = { px, py, pz };

        // orienation of p vs every face
		const int ort0 = orientation(predConsts, d, c, b, p); // 321
		const int ort1 = orientation(predConsts, a, c, d, p); // 023
		const int ort2 = orientation(predConsts, a, d, b, p); // 031
		const int ort3 = orientation(predConsts, a, b, c, p); // 012

		assert(ort0 != -1 && ort1 != -1 && ort2 != -1 && ort3 != -1);

        // write location association
        int x = 0;

        x |= (ort0 << 0);
        x |= (ort1 << 1);
        x |= (ort2 << 2);
        x |= (ort3 << 3);

        la[tid] = x;

        // fracture size = sum of orientations
        // 4 for 1-to-4, 3 for 1-to-3, 2 for 1-to-2
        fs[tid] = ort0 + ort1 + ort2 + ort3;
    }
}

int main(void)
{
    // number of points we are testing
    const int bl = 4;
    const int rl = (bl - 1) * 3;
    const int num_points = bl * bl * bl;

    // Allocate data for tetrahedra
    float *tet_data = 0;
    cudaMallocHost(&tet_data, num_points * 4 * 3 * sizeof(*tet_data));

    // Allocate data for the points
    float *pt_data = 0;
    cudaMallocHost(&pt_data, num_points * 3 * sizeof(*pt_data));

    // For simpler bookkeeping
    int rs[12] = { 0 };

    for (int i = 0; i < 12; ++i)
        rs[i] = num_points * i;

    // Write tetrahedra
    for (int i = 0; i < num_points; ++i)
    {
        tet_data[rs[0] + i] = 0; // x
        tet_data[rs[1] + i] = 0; // y
        tet_data[rs[2] + i] = 0; // z

        tet_data[rs[3] + i] = rl;
        tet_data[rs[4] + i] = 0;
        tet_data[rs[5] + i] = 0;

        tet_data[rs[6] + i] = 0;
        tet_data[rs[7] + i] = rl;
        tet_data[rs[8] + i] = 0;

        tet_data[rs[9] + i] = 0;
        tet_data[rs[10] + i] = 0;
        tet_data[rs[11] + i] = rl;
    }

    // write Cartesian points
    for (int i = 0; i < num_points; ++i)
    {
        const float x = (float ) (i / (bl * bl)),
                    y = (float ) ((i / bl) % bl),
                    z = (float ) (i % bl);

        pt_data[0 * num_points + i] = x;
        pt_data[1 * num_points + i] = y;
        pt_data[2 * num_points + i] = z;
    }

    // copy to device
    thrust::device_vector<float> d_tet_data(num_points * 4 * 3),
                                 d_pt_data(num_points * 3);

    thrust::copy(tet_data, tet_data + 12 * num_points, d_tet_data.begin());
    thrust::copy(pt_data, pt_data + 3 * num_points, d_pt_data.begin());

    // Build predicate data
    PredicateInfo preds;
    initPredicate(preds);

    // allocate storage to write to
    thrust::device_vector<int> fs(num_points, -1),
                               la(num_points, -1);

    // test routine
    orientation_tests<<<bpg, tpb>>>
                     (num_points,
                      thrust::raw_pointer_cast(d_tet_data.data()),
                      thrust::raw_pointer_cast(d_pt_data.data()),
                      preds._consts,
                      thrust::raw_pointer_cast(la.data()),
                      thrust::raw_pointer_cast(fs.data()));
    cudaDeviceSynchronize();

    // visual confirmation
    const bool print0 = true;
    if (print0)
        for (int i = 0; i < num_points; ++i)
            std::cout << "fs : " << fs[i] << ", la : " << la[i] << std::endl;

    cudaFreeHost(pt_data);
    cudaFreeHost(tet_data);

    return 0;
}

predicates.h

/*
Author: Ashwin Nanjappa and Cao Thanh Tung
Filename: GDelShewchukDevice.h

===============================================================================

Copyright (c) 2013, School of Computing, National University of Singapore.
All rights reserved.

Project homepage: http://www.comp.nus.edu.sg/~tants/gdel3d.html

If you use gStar4D and you like it or have comments on its usefulness etc., we
would love to hear from you at <tants@comp.nus.edu.sg>. You may share with us
your experience and any possibilities that we may improve the work/code.

===============================================================================

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer. Redistributions in binary form must reproduce
the above copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution.

Neither the name of the National University of University nor the names of its contributors
may be used to endorse or promote products derived from this software without specific
prior written permission from the National University of Singapore.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE  GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.

*/

#pragma once

#include "GDelShewchukDevice.h"

enum Orient
{
    OrientNeg   = -1,
    OrientZero  = +0,
    OrientPos   = +1
};

__device__ Orient detToOrient( float det )
{
    return ( det > 0 ) ? OrientPos : ( ( det < 0 ) ? OrientNeg : OrientZero );
}

int PredThreadNum       = 32 * 32;

template< typename T >
T* cuNew( int num )
{
    T* loc              = NULL;
    const size_t space  = num * sizeof( T );
    cudaMalloc( &loc, space );

    return loc;
}

void initPredicate(PredicateInfo &DPredicateInfo)
{
    DPredicateInfo.init();

    // Predicate constants
    DPredicateInfo._consts = cuNew< float >( DPredicateBoundNum );

    // Predicate arrays
    DPredicateInfo._data = cuNew< float >( PredicateTotalSize * PredThreadNum );

    // Set predicate constants
    kerInitPredicate<<< 1, 1 >>>( DPredicateInfo._consts );

    return;
}

__device__ Orient orientation
(
const float* predConsts,
const float* p0,
const float* p1,
const float* p2,
const float* p3
)
{
    float det = orient3dfast( predConsts, p0, p1, p2, p3 ); 
//printf("det = %f\n", det);
    // Need exact check
    if ( det == FLT_MAX )
    { //printf("Calling exact routine...\n");
        det = orient3dexact( predConsts, p0, p1, p2, p3 );
    }
//printf("%.00f\n", det);
    return detToOrient( det );
}

GDelShewchukDevice.h is quite large and must be taken from : regulus_v1.5/GDelShewchukDevice.h at master · cmazakas/regulus_v1.5 · GitHub

2d arrays are generally padded 1d arrays; hence, if you register a drop in load efficiency when moving from 2d arrays to 1d arrays, without having really changed array indexing, it may be because the device enjoys ‘puffy’ arrays, in turn because of section 5.3.2. Device Memory Accesses:

A common global memory access pattern is when each thread of index (tx,ty) uses the
following address to access one element of a 2D array of width width, located at address
BaseAddress of type type* (where type meets the requirement described in Maximize
Utilization):
BaseAddress + width * ty + tx

For these accesses to be fully coalesced, both the width of the thread block and the width
of the array must be a multiple of the warp size (or only half the warp size for devices of
compute capability 1.x).
In particular, this means that an array whose width is not a multiple of this size will be
accessed much more efficiently if it is actually allocated with a width rounded up to the
closest multiple of this size and its rows padded accordingly. The cudaMallocPitch()
and cuMemAllocPitch() functions and associated memory copy functions described in
the reference manual enable programmers to write non-hardware-dependent code to
allocate arrays that conform to these constraints.

for example here:

const float x1 = tet_data[3 * num_points + tid];
const float y1 = tet_data[4 * num_points + tid];
const float z1 = tet_data[5 * num_points + tid];

much may depend on num_points in my opinion
you may indeed be able to salvage the situation by ensuring num_points is a multiple of the warp size

I think you’re right.

You know what’s weird? I never used cudaMallocPitch() or anything like that. I handled it all maybe kind of in a cumbersome way. I rewrote the test.cu file, if you’d like to take a look :

#include <iostream>

#include <thrust/device_vector.h>
#include <thrust/copy.h>

#include "predicates.h"

// This code is to be used as a time test
// for various calculations used by regulus

const int tpb = 64,
          bpg = 512,
          gs = tpb * bpg;

__global__
void orientation_tests
(
const int num_points,
const float * __restrict__ tet_data,
const float * __restrict__ pt_data,
const float * __restrict__ predConsts,
int * __restrict__ la,
int * __restrict__ fs
)
{
    const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

    // for every point...
    for (int tid = thread_id; tid < num_points; tid += gs)
    {
        const float x0 = tet_data[0 * num_points + tid];
        const float y0 = tet_data[1 * num_points + tid];
        const float z0 = tet_data[2 * num_points + tid];

        const float x1 = tet_data[3 * num_points + tid];
        const float y1 = tet_data[4 * num_points + tid];
        const float z1 = tet_data[5 * num_points + tid];

        const float x2 = tet_data[6 * num_points + tid];
        const float y2 = tet_data[7 * num_points + tid];
        const float z2 = tet_data[8 * num_points + tid];

        const float x3 = tet_data[9 * num_points + tid];
        const float y3 = tet_data[10 * num_points + tid];
        const float z3 = tet_data[11 * num_points + tid];

        const float px = pt_data[0 * num_points + tid];
        const float py = pt_data[1 * num_points + tid];
        const float pz = pt_data[2 * num_points + tid];

        //printf("%.00f %.00f %.00f\n%.00f %.00f %.00f\n%.00f %.00f %.00f\n%.00f %.00f %.00f\nis being tested against\n%.00f %.00f %.00f\n\n", x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, px, py, pz);

        const float a[3] = { x0, y0, z0 };
        const float b[3] = { x1, y1, z1 };
        const float c[3] = { x2, y2, z2 };
        const float d[3] = { x3, y3, z3 };

        const float p[3] = { px, py, pz };

        // orienation of p vs every face
		const int ort0 = orientation(predConsts, d, c, b, p); // 321
		const int ort1 = orientation(predConsts, a, c, d, p); // 023
		const int ort2 = orientation(predConsts, a, d, b, p); // 031
		const int ort3 = orientation(predConsts, a, b, c, p); // 012

		assert(ort0 != -1 && ort1 != -1 && ort2 != -1 && ort3 != -1);

        // write location association
        int x = 0;

        x |= (ort0 << 0);
        x |= (ort1 << 1);
        x |= (ort2 << 2);
        x |= (ort3 << 3);

        la[tid] = x;

        // fracture size = sum of orientations
        // 4 for 1-to-4, 3 for 1-to-3, 2 for 1-to-2
        fs[tid] = ort0 + ort1 + ort2 + ort3;
    }
}

struct device_matrix
{
    float **data;

    int nr, nc; // number of rows, columns

    device_matrix(const int num_rows, const int num_cols);
    ~device_matrix(void);

    __host__ __device__
    float* operator[](const int x);
    void copy_from_1d_host(const float *h_data);
};

void device_matrix::copy_from_1d_host(const float *h_data)
{
    float **tmp = 0;
    cudaMallocHost(&tmp, nr * sizeof(*tmp));
    cudaMemcpy(tmp, data, nr * sizeof(*tmp), cudaMemcpyDeviceToHost);

    for (int i = 0; i < nr; ++i)
        cudaMemcpy(tmp[i], h_data + (nc * i), nc * sizeof(*data[i]), cudaMemcpyHostToDevice);

    cudaFreeHost(tmp);
}

__host__ __device__
float* device_matrix::operator[](const int x)
{
    return data[x];
}

device_matrix::~device_matrix(void)
{
    float **tmp = 0;
    cudaMallocHost(&tmp, nr * sizeof(*tmp));
    cudaMemcpy(tmp, data, nr * sizeof(*data), cudaMemcpyDeviceToHost);

    cudaFree(data);

    for (int i = 0; i < nr; ++i)
        cudaFree(tmp[i]);

    cudaFreeHost(tmp);
}

device_matrix::device_matrix(const int num_rows, const int num_cols)
{
    nr = num_rows;
    nc = num_cols;

    float **tmp;
    cudaMallocHost(&tmp, nr * sizeof(*tmp));

    for (int i = 0; i < nr; ++i)
        cudaMalloc(&tmp[i], nc * sizeof(*tmp[i]));

    cudaMalloc(&data, nr * sizeof(*data));
    cudaMemcpy(data, tmp, nr * sizeof(*data), cudaMemcpyHostToDevice);

    cudaFreeHost(tmp);
}

__global__
void test_kernel
(
const int num_points, 
device_matrix tet_matrix,
device_matrix pt_matrix,
const float * __restrict__ predConsts,
int * __restrict__ la,
int * __restrict__ fs
)
{
    const int thread_id = threadIdx.x + blockIdx.x * blockDim.x;

    for (int tid = thread_id; tid < num_points; tid += gs)
    {
        const float x0 = tet_matrix[0][tid];
        const float y0 = tet_matrix[1][tid];
        const float z0 = tet_matrix[2][tid];

        const float x1 = tet_matrix[3][tid];
        const float y1 = tet_matrix[4][tid];
        const float z1 = tet_matrix[5][tid];

        const float x2 = tet_matrix[6][tid];
        const float y2 = tet_matrix[7][tid];
        const float z2 = tet_matrix[8][tid];

        const float x3 = tet_matrix[9][tid];
        const float y3 = tet_matrix[10][tid];
        const float z3 = tet_matrix[11][tid];

        const float px = pt_matrix[0][tid];
        const float py = pt_matrix[1][tid];
        const float pz = pt_matrix[2][tid];

        //printf("%.00f %.00f %.00f\n%.00f %.00f %.00f\n%.00f %.00f %.00f\n%.00f %.00f %.00f\nis being tested against\n%.00f %.00f %.00f\n\n", x0, y0, z0, x1, y1, z1, x2, y2, z2, x3, y3, z3, px, py, pz);

        const float a[3] = { x0, y0, z0 };
        const float b[3] = { x1, y1, z1 };
        const float c[3] = { x2, y2, z2 };
        const float d[3] = { x3, y3, z3 };

        const float p[3] = { px, py, pz };

        // orienation of p vs every face
		const int ort0 = orientation(predConsts, d, c, b, p); // 321
		const int ort1 = orientation(predConsts, a, c, d, p); // 023
		const int ort2 = orientation(predConsts, a, d, b, p); // 031
		const int ort3 = orientation(predConsts, a, b, c, p); // 012

		assert(ort0 != -1 && ort1 != -1 && ort2 != -1 && ort3 != -1);

        // write location association
        int x = 0;

        x |= (ort0 << 0);
        x |= (ort1 << 1);
        x |= (ort2 << 2);
        x |= (ort3 << 3);

        la[tid] = x;

        // fracture size = sum of orientations
        // 4 for 1-to-4, 3 for 1-to-3, 2 for 1-to-2
        fs[tid] = ort0 + ort1 + ort2 + ort3;
    }
}

int main(void)
{
    // number of points we are testing
    const int bl = 4;
    const int rl = (bl - 1) * 3;
    const int num_points = bl * bl * bl;

    // Allocate data for tetrahedra
    float *tet_data = 0;
    cudaMallocHost(&tet_data, num_points * 4 * 3 * sizeof(*tet_data));

    // Allocate data for the points
    float *pt_data = 0;
    cudaMallocHost(&pt_data, num_points * 3 * sizeof(*pt_data));

    // For simpler bookkeeping
    int rs[12] = { 0 };

    for (int i = 0; i < 12; ++i)
        rs[i] = num_points * i;

    // Write tetrahedra
    for (int i = 0; i < num_points; ++i)
    {
        tet_data[rs[0] + i] = 0; // x
        tet_data[rs[1] + i] = 0; // y
        tet_data[rs[2] + i] = 0; // z

        tet_data[rs[3] + i] = rl;
        tet_data[rs[4] + i] = 0;
        tet_data[rs[5] + i] = 0;

        tet_data[rs[6] + i] = 0;
        tet_data[rs[7] + i] = rl;
        tet_data[rs[8] + i] = 0;

        tet_data[rs[9] + i] = 0;
        tet_data[rs[10] + i] = 0;
        tet_data[rs[11] + i] = rl;
    }

    // write Cartesian points
    for (int i = 0; i < num_points; ++i)
    {
        const float x = (float ) (i / (bl * bl)),
                    y = (float ) ((i / bl) % bl),
                    z = (float ) (i % bl);

        pt_data[0 * num_points + i] = x;
        pt_data[1 * num_points + i] = y;
        pt_data[2 * num_points + i] = z;
    }

    // copy to device
    thrust::device_vector<float> d_tet_data(num_points * 4 * 3),
                                 d_pt_data(num_points * 3);

    thrust::copy(tet_data, tet_data + 12 * num_points, d_tet_data.begin());
    thrust::copy(pt_data, pt_data + 3 * num_points, d_pt_data.begin());

    // Build predicate data
    PredicateInfo preds;
    initPredicate(preds);

    // allocate storage to write to
    thrust::device_vector<int> fs(num_points, -1),
                               la(num_points, -1);

    // test routine
    /*orientation_tests<<<bpg, tpb>>>
                     (num_points,
                      thrust::raw_pointer_cast(d_tet_data.data()),
                      thrust::raw_pointer_cast(d_pt_data.data()),
                      preds._consts,
                      thrust::raw_pointer_cast(la.data()),
                      thrust::raw_pointer_cast(fs.data()));*/
    cudaDeviceSynchronize();

    device_matrix tet_matrix(12, num_points);
    device_matrix pt_matrix(3, num_points);
    cudaDeviceSynchronize();

    tet_matrix.copy_from_1d_host(tet_data);
    pt_matrix.copy_from_1d_host(pt_data);

    test_kernel<<<bpg, tpb>>>(num_points, 
                              tet_matrix,
                              pt_matrix,
                              preds._consts,
                              thrust::raw_pointer_cast(la.data()),
                              thrust::raw_pointer_cast(fs.data()));
    cudaDeviceSynchronize();

    // visual confirmation
    const bool print0 = true;
    if (print0)
        for (int i = 0; i < num_points; ++i)
            std::cout << "fs : " << fs[i] << ", la : " << la[i] << std::endl;

    cudaFreeHost(pt_data);
    cudaFreeHost(tet_data);

    return 0;
}

Output from nvprof this time :

Kernel: test_kernel(int, device_matrix, device_matrix, float const *, int*, int*)
          1                            gld_efficiency             Global Memory Load Efficiency     100.00%     100.00%     100.00%
          1                            gst_efficiency            Global Memory Store Efficiency     100.00%     100.00%     100.00%

So I’m pretty happy even if I didn’t do anything particularly special.

Let me know what you think of my device_matrix class. I’m trying to make something workable but I don’t know if it’s any good or not, really.

I think I’ll just stick to doing it this way because, well, it works. And I can hide all the implementation details as well so it’s nice.

“I never used cudaMallocPitch() or anything like that”

i do not think it is mandatory to use cudaMallocPitch() to ensure alignment when you offset within a 1d array
you can simply inform the kernel of the number of points num_points, such that it does not attempt to do too much work (move beyond the actual number of points), but also inform the kernel as to the within-array offset it should use, when incrementing/ offsetting within the array

so something like:

for num_points
{
const float x1 = tet_data[3 * num_points_padded + tid];
const float y1 = tet_data[4 * num_points_padded + tid];
const float z1 = tet_data[5 * num_points_padded + tid];
}

regardless, you still have significant redundancy in your code; i shall comment more on this later on

// Write tetrahedra
for (int i = 0; i < num_points; ++i)
{
tet_data[rs[0] + i] = 0; // x
tet_data[rs[1] + i] = 0; // y
tet_data[rs[2] + i] = 0; // z

tet_data[rs[3] + i] = rl;
tet_data[rs[4] + i] = 0;
tet_data[rs[5] + i] = 0;

tet_data[rs[6] + i] = 0;
tet_data[rs[7] + i] = rl;
tet_data[rs[8] + i] = 0;

tet_data[rs[9] + i] = 0;
tet_data[rs[10] + i] = 0;
tet_data[rs[11] + i] = rl;
}

// write Cartesian points
for (int i = 0; i < num_points; ++i)
{
const float x = (float ) (i / (bl * bl)),
y = (float ) ((i / bl) % bl),
z = (float ) (i % bl);

pt_data[0 * num_points + i] = x;
pt_data[1 * num_points + i] = y;
pt_data[2 * num_points + i] = z;
}

perhaps this is just test data, or perhaps i am missing something
but, this seems rather redundant, with very little information entropy
to the point that it may be better to simply send the primary data to the device, and have the device inflate the arrays it is to use, itself
perhaps

This is 100% test data. You’re right, I should just allocate the tetrahedron once but in the “real” thing, there’s going to be many different tetrahedron being written. I just wanted something similar to what I was going to be using and to also practice using the 2d arrays the way I thought.

I know one optimization I can do is write all the tetrahedron I need to test in a contiguous 1d buffer and then align an integer array to my allocated points and access the tetrahedron through that.

Something like,

t0 t1 t2 t3...

0  0  0  1  1  2  2  2  3 ...  // place in tet buffer
p0 p1 p2 p3 p4 p5 p6 p7 p8 ...

Thank you for taking the time to look through my code!

Out of curiosity, say that I wanted to design a new method for my device_matrix class (a host-side one).

I think it was you that suggested I should overlap memory copying and kernel execution. I was thinking, I design a method similar to my copy_from_1d_host() one but this time, it takes a function pointer to a global function.

We can have function pointers to kernels, right? Profiling the time of my code, most of it is spent just reading from memory.

Is there any code that shows how to use kernels with function pointers?

“…to design a new method for my device_matrix class (a host-side one).”

i much prefer low level; hence i am not sure whether i would use classes on the device; and i am not sure whether i would use the device to initiate a lot of memory operations from the host, as is contained in device_matrix
this is perfectly something the host can do, and given that it is likely idle and thus bored, it would or can be made to cost next to nothing; why task the device with something the host can do for free?

“I think it was you that suggested I should overlap memory copying and kernel execution”

yes, but i had streams and juggling from the host in mind
if i am not mistaken, you prep and then launch 1 big kernel (64 * 512) in 1 stream
my idea was to prep smaller blocks and launch several smaller kernels in different streams

We can have function pointers to kernels, right?
i am not sure i understand the question correctly
one has global, device and host; global generally constitutes kernels callable from the host or device
device would imply device functions callable from the device, by the device

“Profiling the time of my code, most of it is spent just reading from memory.”
i am not really surprised, considering the ratio of global memory operations, to other arithmetic, etc operations
you previously noted calculating intersections; the ratio may change if you later on increase functionality
if you have maximized potential data reuse (via shared memory and across functions perhaps) and if you have no redundancy in global memory operations, and global memory operations are ideal, i do not see how you could further improve that aspect

Technically, I don’t think my class even has a device function.

My idea was, once the matrix is allocated, it’s going to be that size until death (i.e. no reallocatons though it would be very possible). I wanted to create a simple to use 2d array because for me, it’s the only thing I got to work correctly.

But my “big” idea was to create a class that would arrange host data, copy host data to the device, execute kernel, copy device data back to the host.

But I wanted to do as you suggested and have this pipeline run in different streams.

for (i = 0; i < num_streams; ++i)
{
  arrange_data_for_device(input, offset[i], ...);
  cudaMemcpyAsync(...);
  /* dereference function pointer for kernel execution */
  cudaMemcpyAsync(...);
}

I want to sort of imitate the thrust functionality of sorting with zip iterators or transform iterators.

If users could take the class and then pass device/host functions to it, I’d be pretty happy. I’m not sure if it’s practical though. But it’s fun to approach CUDA with so much C++.

“But my “big” idea was to create a class that would arrange host data, copy host data to the device, execute kernel, copy device data back to the host.”

arrange_data_for_device(input, offset[i], …);

this should only really be necessary in cases where the input data that the device consumes, is scattered, given that the device favours reading (accessing) global memory in a a) coalesced and b) aligned manner
gathering scattered data is far easier for the host, than the device, as scattered data is more ‘individualistic’, like the host

if the data is not really scattered, one can simply offset the host/ device memory pointers pertaining to the underlying memory copies appropriately

even though the device 1d array likely needs padding when accessed as a 2d array, the corresponding host array hardly does
waste not want not