ArrayFire library performance across new L4T/CUDA versions

Hi everyone,

I don’t see many posts from users mentioning the ArrayFire library (although it is efficient and highly portable), but if someone faces issues with it, this may be of interest.

I had some code using ArrayFire that worked fine and fast on TX1 with R23.2 (the last 32-bit L4T) and CUDA 7.0.

When I upgraded to R24.2 (64-bit) and CUDA 8.0, I faced memory exhaustion and had to call the garbage collector explicitly in the loop.

Moving to TX2 and R27.0.1 didn’t change much for my case. Upgrading then to R28.1 and CUDA 8.0.84 led to JIT errors, and I had to add many explicit evaluations and synchronizations to get it working.
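
To illustrate the workarounds (a minimal sketch, not my actual application; sizes and loop count are arbitrary), explicit garbage collection plus explicit evaluation/synchronization look like this with the ArrayFire C++ API:

#include <arrayfire.h>

int main()
{
    // Toy accumulation loop standing in for the real processing loop.
    af::array acc = af::constant(0.0, 1024, 1024);
    for (int i = 0; i < 1000; ++i) {
        af::array frame = af::randu(1024, 1024);
        acc += frame;      // each iteration grows the JIT expression tree
        af::eval(acc);     // force evaluation so the JIT tree stays short
        af::sync();        // wait for the generated kernel to finish
        af::deviceGC();    // explicitly release unused device buffers
    }
    return 0;
}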

Now, upgrading to R28.2-preview, it’s even worse. Full story here: https://github.com/arrayfire/arrayfire/issues/1910 and https://github.com/arrayfire/arrayfire/issues/2028.

It seems that using

export AF_CUDA_MAX_JIT_LEN=10

instead of the default AF value of 100 is a workaround; a value of 1 may also be used…

The ArrayFire developers think these regressions come from the NVIDIA side.
I should also say that each time I tried building the older ArrayFire version, the problem appeared to be linked to the CUDA version rather than the ArrayFire version.

Does anyone at NVIDIA know about this? If so, please share any insight.

Hi,

Sorry, we don’t have much experience with ArrayFire.

If this is a regression, we may need to reproduce this issue before reporting to the internal team.

Two recommended ways to reproduce it are:
1. Modify one of our native CUDA samples
2. A simple vanilla sample that hits this error

Thanks.

Hi AastaLLL,

You can reproduce something similar this way:

  1. Get arrayfire 3.6.0 master from github: https://github.com/arrayfire/arrayfire.
  2. Check/install the required dependencies (https://github.com/arrayfire/arrayfire/wiki/Build-Instructions-for-Linux). I also downloaded cub-1.7.4 and copied its cub directory into /usr/include.
  3. Configure with cmake-gui (all defaults should be OK, but you may have to click Configure twice to get everything configured, then Generate).
  4. Build. No need to install for this test.
  5. In the build directory, launch:
./test/jit_cuda --gtest_filter=JIT.ISSUE_1646

Setting AF_CUDA_MAX_JIT_LEN to 19 or more leads to failure, but it succeeds with 18 or fewer.

Here is the failing kernel with value 19:

typedef unsigned int uint;
typedef long long dim_t;
/*******************************************************
 * Copyright (c) 2014, ArrayFire
 * All rights reserved.
 *
 * This file is distributed under 3-clause BSD license.
 * The complete license agreement can be obtained at:
 * http://arrayfire.com/licenses/BSD-3-Clause
 ********************************************************/

typedef float2 cuFloatComplex;
typedef cuFloatComplex cfloat;

typedef double2 cuDoubleComplex;
typedef cuDoubleComplex cdouble;

// ----------------------------------------------
// REAL NUMBER OPERATIONS
// ----------------------------------------------
#define sign(in) signbit((in))
#define __noop(a) (a)
#define __add(lhs, rhs) (lhs) + (rhs)
#define __sub(lhs, rhs) (lhs) - (rhs)
#define __mul(lhs, rhs) (lhs) * (rhs)
#define __div(lhs, rhs) (lhs) / (rhs)
#define __and(lhs, rhs) (lhs) && (rhs)
#define __or(lhs, rhs) (lhs) || (rhs)

#define __lt(lhs, rhs) (lhs) < (rhs)
#define __gt(lhs, rhs) (lhs) > (rhs)
#define __le(lhs, rhs) (lhs) <= (rhs)
#define __ge(lhs, rhs) (lhs) >= (rhs)
#define __eq(lhs, rhs) (lhs) == (rhs)
#define __neq(lhs, rhs) (lhs) != (rhs)

#define __conj(in) (in)
#define __real(in) (in)
#define __imag(in) (0)
#define __abs(in) abs(in)
#define __sigmoid(in) (1.0/(1 + exp(-(in))))

#define __bitor(lhs, rhs) ((lhs) | (rhs))
#define __bitand(lhs, rhs) ((lhs) & (rhs))
#define __bitxor(lhs, rhs) ((lhs) ^ (rhs))
#define __bitshiftl(lhs, rhs) ((lhs) << (rhs))
#define __bitshiftr(lhs, rhs) ((lhs) >> (rhs))

#define __min(lhs, rhs) ((lhs) < (rhs)) ? (lhs) : (rhs)
#define __max(lhs, rhs) ((lhs) > (rhs)) ? (lhs) : (rhs)
#define __rem(lhs, rhs) ((lhs) % (rhs))
#define __mod(lhs, rhs) ((lhs) % (rhs))
#define __pow(lhs, rhs) fpow((float)lhs, (float)rhs)

#define __convert_char(val) (char)((val) != 0)
#define fpow(lhs, rhs) pow((lhs), (rhs))
#define frem(lhs, rhs) remainder((lhs), (rhs))
#define iszero(a) ((a) == 0)

// ----------------------------------------------
// COMPLEX FLOAT OPERATIONS
// ----------------------------------------------

#define __crealf(in) ((in).x)
#define __cimagf(in) ((in).y)
#define __cabsf(in) hypotf(in.x, in.y)

__device__ cfloat __cplx2f(float x, float y)
{
    cfloat res = {x, y};
    return res;
}

__device__ cfloat __cconjf(cfloat in)
{
    cfloat res = {in.x, -in.y};
    return res;
}

__device__ cfloat __caddf(cfloat lhs, cfloat rhs)
{
    cfloat res = {lhs.x + rhs.x, lhs.y + rhs.y};
    return res;
}

__device__ cfloat __csubf(cfloat lhs, cfloat rhs)
{
    cfloat res = {lhs.x - rhs.x, lhs.y - rhs.y};
    return res;
}

__device__ cfloat __cmulf(cfloat lhs, cfloat rhs)
{
    cfloat out;
    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
    return out;
}

__device__ cfloat __cdivf(cfloat lhs, cfloat rhs)
{
    // Normalize by absolute value and multiply
    float rhs_abs = __cabsf(rhs);
    float inv_rhs_abs = 1.0f / rhs_abs;
    float rhs_x = inv_rhs_abs * rhs.x;
    float rhs_y = inv_rhs_abs * rhs.y;
    cfloat out = {lhs.x * rhs_x + lhs.y * rhs_y,
                  lhs.y * rhs_x - lhs.x * rhs_y};
    out.x *= inv_rhs_abs;
    out.y *= inv_rhs_abs;
    return out;
}

__device__ cfloat __cminf(cfloat lhs, cfloat rhs)
{
    return __cabsf(lhs) < __cabsf(rhs) ? lhs : rhs;
}

__device__ cfloat __cmaxf(cfloat lhs, cfloat rhs)
{
    return __cabsf(lhs) > __cabsf(rhs) ? lhs : rhs;
}
#define __candf(lhs, rhs) __cabsf(lhs) && __cabsf(rhs)
#define __corf(lhs, rhs) __cabsf(lhs) || __cabsf(rhs)
#define __ceqf(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
#define __cneqf(lhs, rhs) !__ceqf((lhs), (rhs))
#define __cltf(lhs, rhs) (__cabsf(lhs) < __cabsf(rhs))
#define __clef(lhs, rhs) (__cabsf(lhs) <= __cabsf(rhs))
#define __cgtf(lhs, rhs) (__cabsf(lhs) > __cabsf(rhs))
#define __cgef(lhs, rhs) (__cabsf(lhs) >= __cabsf(rhs))
#define __convert_cfloat(real) __cplx2f(real, 0)
#define __convert_c2c(in) (in)
#define __convert_z2c(in) __cplx2f((float)in.x, (float)in.y)

// ----------------------------------------------
// COMPLEX DOUBLE OPERATIONS
// ----------------------------------------------
#define __creal(in) ((in).x)
#define __cimag(in) ((in).y)
#define __cabs(in) hypot(in.x, in.y)

__device__ cdouble __cplx2(double x, double y)
{
    cdouble res = {x, y};
    return res;
}

__device__ cdouble __cconj(cdouble in)
{
    cdouble res = {in.x, -in.y};
    return res;
}

__device__ cdouble __cadd(cdouble lhs, cdouble rhs)
{
    cdouble res = {lhs.x + rhs.x, lhs.y + rhs.y};
    return res;
}

__device__ cdouble __csub(cdouble lhs, cdouble rhs)
{
    cdouble res = {lhs.x - rhs.x, lhs.y - rhs.y};
    return res;
}

__device__ cdouble __cmul(cdouble lhs, cdouble rhs)
{
    cdouble out;
    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
    return out;
}

__device__ cdouble __cdiv(cdouble lhs, cdouble rhs)
{
    // Normalize by absolute value and multiply
    double rhs_abs = __cabs(rhs);
    double inv_rhs_abs = 1.0 / rhs_abs;
    double rhs_x = inv_rhs_abs * rhs.x;
    double rhs_y = inv_rhs_abs * rhs.y;
    cdouble out = {lhs.x * rhs_x + lhs.y * rhs_y,
                   lhs.y * rhs_x - lhs.x * rhs_y};
    out.x *= inv_rhs_abs;
    out.y *= inv_rhs_abs;
    return out;
}

__device__ cdouble __cmin(cdouble lhs, cdouble rhs)
{
    return __cabs(lhs) < __cabs(rhs) ? lhs : rhs;
}

__device__ cdouble __cmax(cdouble lhs, cdouble rhs)
{
    return __cabs(lhs) > __cabs(rhs) ? lhs : rhs;
}
#define __cand(lhs, rhs) __cabs(lhs) && __cabs(rhs)
#define __cor(lhs, rhs) __cabs(lhs) || __cabs(rhs)
#define __ceq(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
#define __cneq(lhs, rhs) !__ceq((lhs), (rhs))
#define __clt(lhs, rhs) (__cabs(lhs) < __cabs(rhs))
#define __cle(lhs, rhs) (__cabs(lhs) <= __cabs(rhs))
#define __cgt(lhs, rhs) (__cabs(lhs) > __cabs(rhs))
#define __cge(lhs, rhs) (__cabs(lhs) >= __cabs(rhs))
#define __convert_cdouble(real) __cplx2(real, 0)
#define __convert_z2z(in) (in)
#define __convert_c2z(in) __cplx2((double)in.x, (double)in.y)

template<typename T>
struct Param
{
  T *ptr;
  dim_t dims[4];
  dim_t strides[4];
};

extern "C" __global__ void
KER3024106651647577914(
float *in0_ptr,
float *in1_ptr,
float *in2_ptr,
float *in5_ptr,
float *in8_ptr,
float *in11_ptr,
float *in14_ptr,
float *in17_ptr,
float *in20_ptr,
float *in23_ptr,
float *in26_ptr,
float *in29_ptr,
float *in32_ptr,
float *in35_ptr,
float *in38_ptr,
float *in41_ptr,
float *in44_ptr,
float *in47_ptr,
float *in50_ptr,
float *in53_ptr,
Param<float> out55, 
uint blocks_x, uint blocks_y, uint blocks_x_total, uint num_odims)
{

Param<float> outref = out55;

    for (int blockIdx_x = blockIdx.x; blockIdx_x < blocks_x_total; blockIdx_x += gridDim.x) {
    
        uint threadId = threadIdx.x;
        int idx = blockIdx_x * blockDim.x * blockDim.y + threadId;
        if (idx >= outref.dims[3] * outref.strides[3]) return;
        int idx0 = idx;
int idx1 = idx;
int idx2 = idx;
int idx5 = idx;
int idx8 = idx;
int idx11 = idx;
int idx14 = idx;
int idx17 = idx;
int idx20 = idx;
int idx23 = idx;
int idx26 = idx;
int idx29 = idx;
int idx32 = idx;
int idx35 = idx;
int idx38 = idx;
int idx41 = idx;
int idx44 = idx;
int idx47 = idx;
int idx50 = idx;
int idx53 = idx;
float val0 = in0_ptr[idx0];
float val1 = in1_ptr[idx1];
float val2 = in2_ptr[idx2];
float val3 = __add(val1, val2);
float val4 = __add(val0, val3);
float val5 = in5_ptr[idx5];
float val6 = __add(val3, val5);
float val7 = __add(val4, val6);
float val8 = in8_ptr[idx8];
float val9 = __add(val6, val8);
float val10 = __add(val7, val9);
float val11 = in11_ptr[idx11];
float val12 = __add(val9, val11);
float val13 = __add(val10, val12);
float val14 = in14_ptr[idx14];
float val15 = __add(val12, val14);
float val16 = __add(val13, val15);
float val17 = in17_ptr[idx17];
float val18 = __add(val15, val17);
float val19 = __add(val16, val18);
float val20 = in20_ptr[idx20];
float val21 = __add(val18, val20);
float val22 = __add(val19, val21);
float val23 = in23_ptr[idx23];
float val24 = __add(val21, val23);
float val25 = __add(val22, val24);
float val26 = in26_ptr[idx26];
float val27 = __add(val24, val26);
float val28 = __add(val25, val27);
float val29 = in29_ptr[idx29];
float val30 = __add(val27, val29);
float val31 = __add(val28, val30);
float val32 = in32_ptr[idx32];
float val33 = __add(val30, val32);
float val34 = __add(val31, val33);
float val35 = in35_ptr[idx35];
float val36 = __add(val33, val35);
float val37 = __add(val34, val36);
float val38 = in38_ptr[idx38];
float val39 = __add(val36, val38);
float val40 = __add(val37, val39);
float val41 = in41_ptr[idx41];
float val42 = __add(val39, val41);
float val43 = __add(val40, val42);
float val44 = in44_ptr[idx44];
float val45 = __add(val42, val44);
float val46 = __add(val43, val45);
float val47 = in47_ptr[idx47];
float val48 = __add(val45, val47);
float val49 = __add(val46, val48);
float val50 = in50_ptr[idx50];
float val51 = __add(val48, val50);
float val52 = __add(val49, val51);
float val53 = in53_ptr[idx53];
float val54 = __add(val51, val53);
float val55 = __add(val52, val54);
out55.ptr[idx] = val55;
}

}

Unrelated, but you might also be interested in what happens with the canny_cuda Otsu threshold test:

./test/canny_cuda --gtest_filter=CannyEdgeDetector.OtsuThreshold

Hi, Honey_Patouceul

Thanks for sharing the detailed reproduction steps.
We are discussing this issue internally and will update you later.

Thanks

Hi,

Please try to reproduce this issue with JetPack 3.2 DP and let us know the results.
Thanks.

@AastaLLL,

I am a bit confused… As explained in post #1, this happens with JetPack 3.2 DP and CUDA 9.0.
[EDIT: Also confirmed the same on my TX1 with R28.1/CUDA 8. Same maximum value of 18.]
Am I missing something?

Hi, Honey

Sorry for missing that.
In summary, this issue occurs on both rel-28.1 and rel-28.2.

We are checking this issue internally. Will update information with you later.
Thanks and Happy New Year : )

Hi,

Could you help provide the complete failure logs?

Hi Honey_Patouceul,

I set AF_CUDA_MAX_JIT_LEN to 19, 20, and 30, and the test passed each time.
Tested on JetPack 3.2 DP/TX2.

nvidia@tegra-ubuntu:~/arrayfire/build$ ./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN      ] JIT.ISSUE_1646
[       OK ] JIT.ISSUE_1646 (3834 ms)
[----------] 1 test from JIT (3834 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3835 ms total)
[  PASSED  ] 1 test.

Hi AastaLLL and carolyu,

Thanks for your help and happy new year.

That’s interesting…in my case I get:

head -1 /etc/nv_tegra_release 
# R28 (release), REVISION: 2.0, GCID: 10136452, BOARD: t186ref, EABI: aarch64, DATE: Fri Dec  1 14:20:33 UTC 2017

/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery 
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7851 MBytes (8232407040 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1301 MHz (1.30 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS


./test/jit_cuda --gtest_filter=JIT.ISSUE_1646
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN      ] JIT.ISSUE_1646
unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)


In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[  FAILED  ] JIT.ISSUE_1646 (3129 ms)
[----------] 1 test from JIT (3130 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3130 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] JIT.ISSUE_1646

 1 FAILED TEST

So the question is: why does it fail on my TX1/TX2 but succeed on yours?
Could you attach your ArrayFire build CMakeCache.txt so that I can check whether an option is different?

Thanks again

Hi Honey_Patouceul,

Attached is my CMakeCache.txt for you to check.
CMakeCache.txt (56.9 KB)

Thanks for providing this. Indeed, and that’s good news, the Release build works.
You should be able to reproduce the error with CMAKE_BUILD_TYPE=Debug or the default (undefined).
Is the -O3 flag mandatory for properly using the CUDA JIT?

It seems not related to optimization, but rather to the option "--device-debug" being passed to nvrtcCompileProgram(). Commenting out line 286 in src/backend/cuda/jit.cpp allows a debug build to pass the test.
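
For reference, here is a minimal standalone NVRTC sketch (not the ArrayFire code path; the compute_62 target is my assumption for TX2) showing where that option comes into play and how the resulting PTX can be retrieved. Build with -lnvrtc; pass any argument to enable --device-debug:

#include <nvrtc.h>
#include <cstdio>
#include <vector>

static const char *src =
    "extern \"C\" __global__ void dummy(float *p) { p[threadIdx.x] += 1.0f; }";

int main(int argc, char **argv)
{
    bool device_debug = (argc > 1);   // any argument enables --device-debug
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "dummy.cu", 0, nullptr, nullptr);

    std::vector<const char *> opts = {"--gpu-architecture=compute_62"};
    if (device_debug)
        opts.push_back("--device-debug");   // the option found at jit.cpp:286

    nvrtcResult res = nvrtcCompileProgram(prog, (int)opts.size(), opts.data());
    size_t logSize = 0;
    nvrtcGetProgramLogSize(prog, &logSize);
    std::vector<char> log(logSize + 1, 0);
    nvrtcGetProgramLog(prog, log.data());
    printf("compile: %s\nlog: %s\n", nvrtcGetErrorString(res), log.data());

    // Retrieve the PTX (the same call that could be used to dump the JIT kernels).
    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize, 0);
    nvrtcGetPTX(prog, ptx.data());
    printf("%s\n", ptx.data());

    nvrtcDestroyProgram(&prog);
    return 0;
}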

Hi,

Thanks for looking into this.
Here is the NVRTC documentation: NVRTC (Runtime Compilation) :: CUDA Toolkit Documentation

It looks like the --device-debug flag is only meant to output some debug information.
We will ask our internal team for more information and update you.

Thanks

Using the debug build on my TX2 with R28.2-DP under cuda-gdb, I see this warning before the failure:

cuda-gdb test/jit_cuda 
NVIDIA (R) CUDA Debugger
9.0 release
Portions Copyright (C) 2007-2017 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-elf-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test/jit_cuda...done.
(cuda-gdb) set args --gtest_filter=JIT.ISSUE_1646
(cuda-gdb) run
Starting program: /media/nvidia/Data/arrayfire/github/build/TX2-R28.2-DP_Debug/test/jit_cuda --gtest_filter=JIT.ISSUE_1646
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
Running main() from gtest_main.cc
Note: Google Test filter = JIT.ISSUE_1646
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from JIT
[ RUN      ] JIT.ISSUE_1646
[New Thread 0x7f93b35240 (LWP 19089)]
warning: Cuda API error detected: cuLaunchKernel returned (0x2bd)

unknown file: Failure
C++ exception with description "ArrayFire Exception (Internal error:998):
In function void cuda::evalNodes(std::vector<cuda::Param<T> >&, std::vector<cuda::JIT::Node*>) [with T = float]
In file src/backend/cuda/jit.cpp:471
CU Error (701)

In function af::array& af::array::operator+=(const af::array&)
In file src/api/cpp/array.cpp:814" thrown in the test body.
[  FAILED  ] JIT.ISSUE_1646 (57409 ms)
[----------] 1 test from JIT (57410 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (57411 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] JIT.ISSUE_1646

 1 FAILED TEST
[Thread 0x7f93b35240 (LWP 19089) exited]
[Inferior 1 (process 19078) exited with code 01]

What does the 0x2bd return value from cuLaunchKernel mean? CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES = 701?
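
For my own checking, a tiny driver-API sketch (built with -lcuda) translates the code; 0x2bd is 701 in decimal:

#include <cuda.h>
#include <cstdio>

int main()
{
    // Ask the driver API for the name and description of error 0x2bd (701).
    const char *name = nullptr;
    const char *desc = nullptr;
    cuGetErrorName((CUresult)0x2bd, &name);
    cuGetErrorString((CUresult)0x2bd, &desc);
    printf("0x2bd = %d: %s (%s)\n", 0x2bd, name, desc);
    return 0;
}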

Hi, Honey

Thanks for this information. We have passed this log to the internal developers.
We are still checking this issue and will update here once we have a further suggestion.

Thanks.

Hi,

Error 701 is CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES, indicating that the application may have run out of resources.
Our guess is that the compiler ends up using more registers in the --device-debug path (due to unoptimized code), so if the application launches many threads, the kernel launch may fail.
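
To illustrate this guess (a standalone driver-API sketch, not one of the logs we need; it assumes the PTX from nvrtcGetPTX() has been dumped to a file and uses the kernel name from your dump above), the register count per thread can be queried directly:

#include <cuda.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main(int argc, char **argv)
{
    if (argc < 3) {
        printf("usage: %s <ptx file> <kernel name>\n", argv[0]);
        return 1;
    }
    // Read the dumped PTX from a file.
    std::ifstream file(argv[1]);
    std::stringstream ss;
    ss << file.rdbuf();
    std::string ptx = ss.str();

    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, ptx.c_str());      // JIT-compile the PTX for this GPU
    cuModuleGetFunction(&fn, mod, argv[2]);   // e.g. "KER3024106651647577914"

    int regs = 0, maxThreads = 0;
    cuFuncGetAttribute(&regs, CU_FUNC_ATTRIBUTE_NUM_REGS, fn);
    cuFuncGetAttribute(&maxThreads, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, fn);
    printf("%s: %d registers/thread, max %d threads/block\n", argv[2], regs, maxThreads);

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}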

To give further suggestions, could you help provide the following logs?
1. The output of nvrtcGetPTX() called in your application
2. The compute cache, collected as follows:

rm -rf ~/.nv/ComputeCache
export CUDA_FORCE_PTX_JIT=1
run the application
zip/tar ~/.nv/ComputeCache

Thanks.

Hi AastaLLL,

Thanks for your support.
Attached is the log for AF_CUDA_MAX_JIT_LEN set to 19 (note that the default value is 100).
Let me know if you need more experiments.
jit.issue_1646.log (85.6 KB)
ComputeCache.tar.gz (31.9 KB)

Thanks.

We will update you later.

Hi,

Please help provide the following information:

1. The nvrtcGetPTX() log of the successful case
2. How many threads per block are launched?

Thanks.