I have noticed that the NVidia OpenCL implementation does not behave as expected with respect to structs. We have an application that passes the following struct as a kernel argument in global memory:
typedef struct {
    float4 row1;
    float4 row2;
    float4 row3;
} mat3x3;

typedef struct {
    mat3x3 transform;
    mat3x3 inverse;
    float4 arg1;
    float4 arg2;
    float4 arg3;
    float4 arg4;
} transformation_t;
In the OpenCL code it is used like this:
float3 transform_coordinate(mat3x3 transformation, float3 coord) {
    float3 transformed_coord;
    transformed_coord.x = dot(transformation.row1.xyz, coord);
    transformed_coord.y = dot(transformation.row2.xyz, coord);
    transformed_coord.z = dot(transformation.row3.xyz, coord);
    return transformed_coord;
}
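For cross-checking the GPU output against a CPU result, the same row-major math can be written as a plain-C reference (a sketch with names of my own choosing, not from any OpenCL header; float[4] rows stand in for float4):

```c
#include <assert.h>

/* Host-side reference of transform_coordinate above: each output
   component is the dot product of one matrix row (first 3 of 4
   floats) with the input coordinate. Hypothetical helper, only
   meant for validating device results on the CPU. */
static void transform_coordinate_host(const float m[3][4],
                                      const float in[3],
                                      float out[3])
{
    for (int r = 0; r < 3; r++)
        out[r] = m[r][0] * in[0] + m[r][1] * in[1] + m[r][2] * in[2];
}
```

With the identity matrix this should map any coordinate to itself, which gives a quick sanity check before comparing against kernel output.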
kernel void my_kernel(..., global transformation_t *trans, ...) {
    float3 float_coord = ...;
    float3 transformed = transform_coordinate(trans->transform, float_coord);
    ...
}
In the CPU code we simply create an array of 160 bytes containing all the floats (i.e. no padding). The pointer passed to clEnqueueWriteBuffer is aligned to a 16-byte boundary.
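The 160-byte packed layout can be sanity-checked on the host with a mirror struct (a sketch; float[4] stands in for float4, the type names are mine, and this assumes the device packs each float4 as 16 contiguous bytes with no inter-field padding):

```c
#include <stddef.h>
#include <assert.h>

/* Host-side mirror of the device structs. Note: float[4] only has
   4-byte alignment on the host, unlike device float4 (16-byte),
   but the sizes coincide here because every member is a float. */
typedef struct {
    float row1[4];
    float row2[4];
    float row3[4];
} mat3x3_host;              /* 3 * 16 = 48 bytes */

typedef struct {
    mat3x3_host transform;  /* offset   0 */
    mat3x3_host inverse;    /* offset  48 */
    float arg1[4];          /* offset  96 */
    float arg2[4];          /* offset 112 */
    float arg3[4];          /* offset 128 */
    float arg4[4];          /* offset 144 */
} transformation_host;      /* 160 bytes total */
```

Asserting sizeof(transformation_host) == 160 and checking a few offsetof() values at startup makes it easy to verify the host buffer matches the layout the kernel expects.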
In the past I have already noticed some glitches (e.g. some kernel executions read/write wrong values):

1) Linux + NV driver 331: using int4 for 'arg1' causes artifacts
   → worked around by using float4
2) Windows + NV driver 340: using float3 for mat3x3 causes artifacts
   → worked around by using float4

Now I have a new issue after updating my driver:

3) Windows + NV driver 352.63: using float4 for mat3x3 causes artifacts
   → using float3 again seems to avoid the issue
So I wonder:
- Can anybody explain this behavior?
- Did anything change recently in the NVidia drivers that could explain this?
- Is there anything that I might be doing wrong?