ArrayFire library performance across new L4T/CUDA versions
Hi everyone,

I don't see many posts from users mentioning the ArrayFire library (although it's efficient and highly portable), but if someone runs into issues with it, this may be of interest.

I had some code using ArrayFire working fine and fast on a TX1 with R23.2 (the last 32-bit L4T) and CUDA 7.0.

When I upgraded to R24.2 (64-bit) and CUDA 8.0, I faced memory exhaustion and had to call the garbage collector explicitly in the loop.
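For illustration, the workaround looked roughly like this (a minimal sketch; the sizes and operations are placeholders, af::deviceGC() is ArrayFire's explicit garbage-collection call):

#include <arrayfire.h>

int main() {
    af::info();  // print backend/device info
    af::array acc = af::constant(0.f, 1024, 1024);
    for (int i = 0; i < 10000; ++i) {
        // stand-in for the real per-iteration input and processing
        af::array frame = af::randu(1024, 1024);
        acc += 0.5f * frame;
        // explicitly release unreferenced device buffers each iteration
        af::deviceGC();
    }
    return 0;
}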

Moving to a TX2 with R27.0.1 didn't change much for my case. Upgrading then to R28.1 and CUDA 8.0.84 led to JIT errors, and I had to add many explicit evaluations and synchronizations to get it working.
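Roughly, the kind of change I mean (a sketch only; process() is a placeholder for my real ArrayFire expression chain):

for (int i = 0; i < niter; ++i) {
    af::array tmp = process(input);  // hypothetical chain of element-wise ops (JIT nodes)
    af::eval(tmp);                   // force evaluation of the JIT tree now
    out += tmp;
    af::eval(out);
    af::sync();                      // block until the CUDA backend has finished
}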

Now, after upgrading to R28.2-preview, it's even worse. Full story here: https://github.com/arrayfire/arrayfire/issues/1910 and https://github.com/arrayfire/arrayfire/issues/2028.

It seems that using
export AF_CUDA_MAX_JIT_LEN=10
instead of ArrayFire's default value of 100 is a workaround, but a value of 1 may also be used...
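The same thing can also be set from the application itself, assuming the CUDA backend reads the variable when it initializes (untested sketch):

#include <cstdlib>
#include <arrayfire.h>

int main() {
    // must be set before the first ArrayFire call so the CUDA backend sees it
    setenv("AF_CUDA_MAX_JIT_LEN", "10", 1);
    af::info();
    // ... rest of the application
    return 0;
}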

The ArrayFire developers think these regressions come from the NVIDIA side.
I should also say that each time I also tried building the older ArrayFire version, and the problem appeared to be linked to the CUDA version rather than the ArrayFire version.

Does anyone at NVIDIA know about this? If so, please share any insight.

#1
Posted 12/22/2017 12:38 AM   
Hi,

Sorry, we don't have much experience with ArrayFire.

If this is a regression, we may need to reproduce this issue before reporting to the internal team.

Two recommended ways to reproduce are:
1. Modify one of our native CUDA samples
2. A simple vanilla sample that hits this error

Thanks.

#2
Posted 12/22/2017 06:33 AM   
Hi AastaLLL,

You can reproduce something similar this way:

1. Get the arrayfire 3.6.0 master branch from GitHub: https://github.com/arrayfire/arrayfire.
2. Check/install the required dependencies (https://github.com/arrayfire/arrayfire/wiki/Build-Instructions-for-Linux). I also downloaded cub-1.7.4 and copied its cub directory into /usr/include.
3. Configure with cmake-gui (all defaults should be fine, but you may have to click Configure twice to get everything configured, then Generate).
4. Build. There is no need to install for this test.
5. In the build directory, launch:
./test/jit_cuda --gtest_filter=JIT.ISSUE_1646

Setting AF_CUDA_MAX_JIT_LEN to 19 or more leads to failure, but it succeeds with 18 or less.

Here is the failing kernel with value 19:
typedef unsigned int uint;
typedef long long dim_t;
/*******************************************************
* Copyright (c) 2014, ArrayFire
* All rights reserved.
*
* This file is distributed under 3-clause BSD license.
* The complete license agreement can be obtained at:
* http://arrayfire.com/licenses/BSD-3-Clause
********************************************************/

typedef float2 cuFloatComplex;
typedef cuFloatComplex cfloat;

typedef double2 cuDoubleComplex;
typedef cuDoubleComplex cdouble;

// ----------------------------------------------
// REAL NUMBER OPERATIONS
// ----------------------------------------------
#define sign(in) signbit((in))
#define __noop(a) (a)
#define __add(lhs, rhs) (lhs) + (rhs)
#define __sub(lhs, rhs) (lhs) - (rhs)
#define __mul(lhs, rhs) (lhs) * (rhs)
#define __div(lhs, rhs) (lhs) / (rhs)
#define __and(lhs, rhs) (lhs) && (rhs)
#define __or(lhs, rhs) (lhs) || (rhs)

#define __lt(lhs, rhs) (lhs) < (rhs)
#define __gt(lhs, rhs) (lhs) > (rhs)
#define __le(lhs, rhs) (lhs) <= (rhs)
#define __ge(lhs, rhs) (lhs) >= (rhs)
#define __eq(lhs, rhs) (lhs) == (rhs)
#define __neq(lhs, rhs) (lhs) != (rhs)

#define __conj(in) (in)
#define __real(in) (in)
#define __imag(in) (0)
#define __abs(in) abs(in)
#define __sigmoid(in) (1.0/(1 + exp(-(in))))

#define __bitor(lhs, rhs) ((lhs) | (rhs))
#define __bitand(lhs, rhs) ((lhs) & (rhs))
#define __bitxor(lhs, rhs) ((lhs) ^ (rhs))
#define __bitshiftl(lhs, rhs) ((lhs) << (rhs))
#define __bitshiftr(lhs, rhs) ((lhs) >> (rhs))

#define __min(lhs, rhs) ((lhs) < (rhs)) ? (lhs) : (rhs)
#define __max(lhs, rhs) ((lhs) > (rhs)) ? (lhs) : (rhs)
#define __rem(lhs, rhs) ((lhs) % (rhs))
#define __mod(lhs, rhs) ((lhs) % (rhs))
#define __pow(lhs, rhs) fpow((float)lhs, (float)rhs)

#define __convert_char(val) (char)((val) != 0)
#define fpow(lhs, rhs) pow((lhs), (rhs))
#define frem(lhs, rhs) remainder((lhs), (rhs))
#define iszero(a) ((a) == 0)

// ----------------------------------------------
// COMPLEX FLOAT OPERATIONS
// ----------------------------------------------

#define __crealf(in) ((in).x)
#define __cimagf(in) ((in).y)
#define __cabsf(in) hypotf(in.x, in.y)

__device__ cfloat __cplx2f(float x, float y)
{
cfloat res = {x, y};
return res;
}

__device__ cfloat __cconjf(cfloat in)
{
cfloat res = {in.x, -in.y};
return res;
}

__device__ cfloat __caddf(cfloat lhs, cfloat rhs)
{
cfloat res = {lhs.x + rhs.x, lhs.y + rhs.y};
return res;
}

__device__ cfloat __csubf(cfloat lhs, cfloat rhs)
{
cfloat res = {lhs.x - rhs.x, lhs.y - rhs.y};
return res;
}

__device__ cfloat __cmulf(cfloat lhs, cfloat rhs)
{
cfloat out;
out.x = lhs.x * rhs.x - lhs.y * rhs.y;
out.y = lhs.x * rhs.y + lhs.y * rhs.x;
return out;
}

__device__ cfloat __cdivf(cfloat lhs, cfloat rhs)
{
// Normalize by absolute value and multiply
float rhs_abs = __cabsf(rhs);
float inv_rhs_abs = 1.0f / rhs_abs;
float rhs_x = inv_rhs_abs * rhs.x;
float rhs_y = inv_rhs_abs * rhs.y;
cfloat out = {lhs.x * rhs_x + lhs.y * rhs_y,
lhs.y * rhs_x - lhs.x * rhs_y};
out.x *= inv_rhs_abs;
out.y *= inv_rhs_abs;
return out;
}

__device__ cfloat __cminf(cfloat lhs, cfloat rhs)
{
return __cabsf(lhs) < __cabsf(rhs) ? lhs : rhs;
}

__device__ cfloat __cmaxf(cfloat lhs, cfloat rhs)
{
return __cabsf(lhs) > __cabsf(rhs) ? lhs : rhs;
}
#define __candf(lhs, rhs) __cabsf(lhs) && __cabsf(rhs)
#define __corf(lhs, rhs) __cabsf(lhs) || __cabsf(rhs)
#define __ceqf(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
#define __cneqf(lhs, rhs) !__ceqf((lhs), (rhs))
#define __cltf(lhs, rhs) (__cabsf(lhs) < __cabsf(rhs))
#define __clef(lhs, rhs) (__cabsf(lhs) <= __cabsf(rhs))
#define __cgtf(lhs, rhs) (__cabsf(lhs) > __cabsf(rhs))
#define __cgef(lhs, rhs) (__cabsf(lhs) >= __cabsf(rhs))
#define __convert_cfloat(real) __cplx2f(real, 0)
#define __convert_c2c(in) (in)
#define __convert_z2c(in) __cplx2f((float)in.x, (float)in.y)

// ----------------------------------------------
// COMPLEX DOUBLE OPERATIONS
// ----------------------------------------------
#define __creal(in) ((in).x)
#define __cimag(in) ((in).y)
#define __cabs(in) hypot(in.x, in.y)

__device__ cdouble __cplx2(double x, double y)
{
cdouble res = {x, y};
return res;
}

__device__ cdouble __cconj(cdouble in)
{
cdouble res = {in.x, -in.y};
return res;
}

__device__ cdouble __cadd(cdouble lhs, cdouble rhs)
{
cdouble res = {lhs.x + rhs.x, lhs.y + rhs.y};
return res;
}

__device__ cdouble __csub(cdouble lhs, cdouble rhs)
{
cdouble res = {lhs.x - rhs.x, lhs.y - rhs.y};
return res;
}

__device__ cdouble __cmul(cdouble lhs, cdouble rhs)
{
cdouble out;
out.x = lhs.x * rhs.x - lhs.y * rhs.y;
out.y = lhs.x * rhs.y + lhs.y * rhs.x;
return out;
}

__device__ cdouble __cdiv(cdouble lhs, cdouble rhs)
{
// Normalize by absolute value and multiply
double rhs_abs = __cabs(rhs);
double inv_rhs_abs = 1.0 / rhs_abs;
double rhs_x = inv_rhs_abs * rhs.x;
double rhs_y = inv_rhs_abs * rhs.y;
cdouble out = {lhs.x * rhs_x + lhs.y * rhs_y,
lhs.y * rhs_x - lhs.x * rhs_y};
out.x *= inv_rhs_abs;
out.y *= inv_rhs_abs;
return out;
}

__device__ cdouble __cmin(cdouble lhs, cdouble rhs)
{
return __cabs(lhs) < __cabs(rhs) ? lhs : rhs;
}

__device__ cdouble __cmax(cdouble lhs, cdouble rhs)
{
return __cabs(lhs) > __cabs(rhs) ? lhs : rhs;
}
#define __cand(lhs, rhs) __cabs(lhs) && __cabs(rhs)
#define __cor(lhs, rhs) __cabs(lhs) || __cabs(rhs)
#define __ceq(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
#define __cneq(lhs, rhs) !__ceq((lhs), (rhs))
#define __clt(lhs, rhs) (__cabs(lhs) < __cabs(rhs))
#define __cle(lhs, rhs) (__cabs(lhs) <= __cabs(rhs))
#define __cgt(lhs, rhs) (__cabs(lhs) > __cabs(rhs))
#define __cge(lhs, rhs) (__cabs(lhs) >= __cabs(rhs))
#define __convert_cdouble(real) __cplx2(real, 0)
#define __convert_z2z(in) (in)
#define __convert_c2z(in) __cplx2((double)in.x, (double)in.y)



template<typename T>
struct Param
{
T *ptr;
dim_t dims[4];
dim_t strides[4];
};

extern "C" __global__ void
KER3024106651647577914(
float *in0_ptr,
float *in1_ptr,
float *in2_ptr,
float *in5_ptr,
float *in8_ptr,
float *in11_ptr,
float *in14_ptr,
float *in17_ptr,
float *in20_ptr,
float *in23_ptr,
float *in26_ptr,
float *in29_ptr,
float *in32_ptr,
float *in35_ptr,
float *in38_ptr,
float *in41_ptr,
float *in44_ptr,
float *in47_ptr,
float *in50_ptr,
float *in53_ptr,
Param<float> out55,
uint blocks_x, uint blocks_y, uint blocks_x_total, uint num_odims)
{

Param<float> outref = out55;

for (int blockIdx_x = blockIdx.x; blockIdx_x < blocks_x_total; blockIdx_x += gridDim.x) {

uint threadId = threadIdx.x;
int idx = blockIdx_x * blockDim.x * blockDim.y + threadId;
if (idx >= outref.dims[3] * outref.strides[3]) return;
int idx0 = idx;
int idx1 = idx;
int idx2 = idx;
int idx5 = idx;
int idx8 = idx;
int idx11 = idx;
int idx14 = idx;
int idx17 = idx;
int idx20 = idx;
int idx23 = idx;
int idx26 = idx;
int idx29 = idx;
int idx32 = idx;
int idx35 = idx;
int idx38 = idx;
int idx41 = idx;
int idx44 = idx;
int idx47 = idx;
int idx50 = idx;
int idx53 = idx;
float val0 = in0_ptr[idx0];
float val1 = in1_ptr[idx1];
float val2 = in2_ptr[idx2];
float val3 = __add(val1, val2);
float val4 = __add(val0, val3);
float val5 = in5_ptr[idx5];
float val6 = __add(val3, val5);
float val7 = __add(val4, val6);
float val8 = in8_ptr[idx8];
float val9 = __add(val6, val8);
float val10 = __add(val7, val9);
float val11 = in11_ptr[idx11];
float val12 = __add(val9, val11);
float val13 = __add(val10, val12);
float val14 = in14_ptr[idx14];
float val15 = __add(val12, val14);
float val16 = __add(val13, val15);
float val17 = in17_ptr[idx17];
float val18 = __add(val15, val17);
float val19 = __add(val16, val18);
float val20 = in20_ptr[idx20];
float val21 = __add(val18, val20);
float val22 = __add(val19, val21);
float val23 = in23_ptr[idx23];
float val24 = __add(val21, val23);
float val25 = __add(val22, val24);
float val26 = in26_ptr[idx26];
float val27 = __add(val24, val26);
float val28 = __add(val25, val27);
float val29 = in29_ptr[idx29];
float val30 = __add(val27, val29);
float val31 = __add(val28, val30);
float val32 = in32_ptr[idx32];
float val33 = __add(val30, val32);
float val34 = __add(val31, val33);
float val35 = in35_ptr[idx35];
float val36 = __add(val33, val35);
float val37 = __add(val34, val36);
float val38 = in38_ptr[idx38];
float val39 = __add(val36, val38);
float val40 = __add(val37, val39);
float val41 = in41_ptr[idx41];
float val42 = __add(val39, val41);
float val43 = __add(val40, val42);
float val44 = in44_ptr[idx44];
float val45 = __add(val42, val44);
float val46 = __add(val43, val45);
float val47 = in47_ptr[idx47];
float val48 = __add(val45, val47);
float val49 = __add(val46, val48);
float val50 = in50_ptr[idx50];
float val51 = __add(val48, val50);
float val52 = __add(val49, val51);
float val53 = in53_ptr[idx53];
float val54 = __add(val51, val53);
float val55 = __add(val52, val54);
out55.ptr[idx] = val55;
}



}



Unrelated, but you might also be interested in what happens with the canny_cuda Otsu threshold test:
./test/canny_cuda --gtest_filter=CannyEdgeDetector.OtsuThreshold

#3
Posted 12/27/2017 04:52 PM   
Hi, Honey_Patouceul

Thanks for sharing the detailed reproduction steps.
We are discussing this issue internally and will update you later.

Thanks

#4
Posted 12/29/2017 06:55 AM   
Hi,

Please try to reproduce this issue with JetPack3.2 DP and let us know the results.
Thanks.

#5
Posted 01/02/2018 01:37 AM   
@AastaLLL,

I am a bit confused... as explained in post #1, this happens with JetPack 3.2 DP and CUDA 9.0.
[EDIT: I have also confirmed the same on my TX1 with R28.1/CUDA 8. Same maximum value of 18.]
Am I missing something?

#6
Posted 01/02/2018 06:57 AM   
Hi, Honey

Sorry for missing that.
In summary, this issue occurs on both rel-28.1 and rel-28.2.

We are checking this issue internally and will update you later.
Thanks, and Happy New Year :)

#7
Posted 01/03/2018 02:01 AM   
Hi,

Could you provide the complete failure logs?

#8
Posted 01/03/2018 06:25 AM   
Hi Honey_Patouceul,

I set the AF_CUDA_MAX_JIT_LEN value to 19, 20, and 30, and the test passed in each case.
Tested on JetPack 3.2 DP / TX2.
nvidia@tegra-ubuntu:~/arrayfire/build$ ./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN ] JIT.ISSUE_1646
[ OK ] JIT.ISSUE_1646 (3834 ms)
[----------] 1 test from JIT (3834 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3835 ms total)
[ PASSED ] 1 test.

#9
Posted 01/03/2018 07:24 AM   
Hi AastaLLL and carolyu,

Thanks for your help and happy new year.

That's interesting...in my case I get:
head -1 /etc/nv_tegra_release 
# R28 (release), REVISION: 2.0, GCID: 10136452, BOARD: t186ref, EABI: aarch64, DATE: Fri Dec 1 14:20:33 UTC 2017

/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.2
Total amount of global memory: 7851 MBytes (8232407040 bytes)
( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
GPU Max Clock rate: 1301 MHz (1.30 GHz)
Memory Clock rate: 1600 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS


./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN ] JIT.ISSUE_1646
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)


In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[ FAILED ] JIT.ISSUE_1646 (3129 ms)
[----------] 1 test from JIT (3130 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3130 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] JIT.ISSUE_1646

1 FAILED TEST


So the question is why it fails on my TX1/TX2 but succeeds on yours.
Could you attach your ArrayFire build CMakeCache.txt so that I can check whether an option is different?

Thanks again

#10
Posted 01/03/2018 07:56 AM   
Hi Honey_Patouceul,

Attached my CMakeCache.txt for you check.
Attachments

CMakeCache.txt

#11
Posted 01/03/2018 08:40 AM   
Thanks for providing this. Indeed, and that's good news: the Release build works.
You should be able to reproduce the error with CMAKE_BUILD_TYPE=Debug or the default (undefined).
Is the -O3 flag mandatory for the CUDA JIT to work properly?

#12
Posted 01/03/2018 10:00 PM   
It seems not to be related to optimization, but rather to the "--device-debug" option being passed to nvrtcCompileProgram(). Commenting out line 286 in src/backend/cuda/jit.cpp allows a debug build to pass the test.
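To make the difference concrete, here is a stripped-down sketch of such a compile with and without that option (error checking omitted; kernel_src stands for the generated source shown in post #3):

#include <nvrtc.h>
#include <cstdio>

nvrtcResult compile_jit_kernel(const char* kernel_src, bool device_debug) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernel_src, "af_jit.cu", 0, nullptr, nullptr);

    // a Debug build of arrayfire adds "--device-debug" to the option list
    const char* dbg_opts[] = { "--device-debug" };
    nvrtcResult res = nvrtcCompileProgram(prog,
                                          device_debug ? 1 : 0,
                                          device_debug ? dbg_opts : nullptr);

    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    if (log_size > 1) {
        char* log = new char[log_size];
        nvrtcGetProgramLog(prog, log);
        printf("%s\n", log);
        delete[] log;
    }
    nvrtcDestroyProgram(&prog);
    // arrayfire then loads the resulting PTX with cuModuleLoadData and launches it with cuLaunchKernel
    return res;
}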

#13
Posted 01/04/2018 09:02 AM   
Hi,

Thanks for looking into this.
Here is the NVRTC documentation: http://docs.nvidia.com/cuda/nvrtc/index.html

It looks like the --device-debug flag only makes the compiler generate debug information.
We will ask our internal team for more information and update you.

Thanks

#14
Posted 01/05/2018 02:40 AM   
Using the debug build on my TX2-R28.2-DP with cuda-gdb I see this warning before failure:
cuda-gdb test/jit_cuda 
NVIDIA (R) CUDA Debugger
9.0 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test/jit_cuda...done.
(cuda-gdb) set args --gtest_filter=JIT.ISSUE_1646
(cuda-gdb) run
Starting program: /media/nvidia/Data/arrayfire/github/build/TX2-R28.2-DP_Debug/test/jit_cuda --gtest_filter=JIT.ISSUE_1646
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN ] JIT.ISSUE_1646
[New Thread 0x7f93b35240 (LWP 19089)]
warning: Cuda API error detected: cuLaunchKernel returned (0x2bd)

unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)


In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[ FAILED ] JIT.ISSUE_1646 (57409 ms)
[----------] 1 test from JIT (57410 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (57411 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] JIT.ISSUE_1646

1 FAILED TEST
[Thread 0x7f93b35240 (LWP 19089) exited]
[Inferior 1 (process 19078) exited with code 01]
What does the 0x2bd return value from cuLaunchKernel mean? Is it CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701?
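For reference, 0x2bd is 701 in decimal; a small driver API snippet can print the corresponding name (link with -lcuda):

#include <cuda.h>
#include <cstdio>

int main() {
    const char* name = nullptr;
    const char* desc = nullptr;
    cuGetErrorName(static_cast<CUresult>(0x2bd), &name);    // 0x2bd == 701
    cuGetErrorString(static_cast<CUresult>(0x2bd), &desc);
    printf("%s: %s\n", name ? name : "?", desc ? desc : "?");
    return 0;
}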

#15
Posted 01/06/2018 01:47 AM   