cudaErrorIllegalAddress on Kepler GPUs, but program runs fine on Fermi GPUs

Hi!
I’m having a very weird problem with my program. Essentially I’m doing a matrix multiplication on part of a matrix. The program apparently runs fine on non-Kepler cards (tried on my notebooks GPU) but crashes on Kepler cards (tried on a K40c, a GTX 780). Someone else was able to reproduce it on a K20m and failed to reproduce it on a Tesla S2050 (= a Fermi card). This was tried on several Linux platforms, some of them using CUDA 5.5 and some using CUDA 6.0.

The initial program was written in PyCUDA (see here for the pycuda mailing list entry), but I’ve since managed to boil it down to the following minimal example written in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>


int main(int argc, char **argv)
{
    cublasStatus_t status;
    float *A = 0;
    float *B = 0;
    float *C = 0;
    float alpha = 1.0f;
    float beta = 0.0f;
    float *oldA, *oldB, *oldC;
    cublasHandle_t handle;
    int n = 131;
    int m = 2483;
    int k = 3;
    int i;

    CUcontext ctx;
    cuInit(0);
    status = cuCtxCreate(&ctx, 0, 0);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! Context creation error: %d\n", status);
        return EXIT_FAILURE;
    }

    status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    for (i = 0; i < 5; ++i) {
        printf("Iteration %d\n", i);

        if (cudaMalloc((void **)&B, m * k * sizeof(B[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate B)\n");
            return EXIT_FAILURE;
        }

        if (cudaMalloc((void **)&C, m * m * sizeof(C[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate C)\n");
            return EXIT_FAILURE;
        }

        if (cudaMalloc((void **)&A, n * m * sizeof(A[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate A)\n");
            return EXIT_FAILURE;
        }

        int s = 3;
        float * A_slice = A + 128*m;
        status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, s,
            &alpha, A_slice, m, B, k, &beta, C, m);
        if (status != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "!!!! kernel execution error.\n");
            return EXIT_FAILURE;
        }

        if (i == 0) {
            oldA = A;
            oldB = B;
            oldC = C;
        } else if (i == 1) {
            status = cudaFree(oldA);
            if (status != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free A, %d)\n", status);
                return EXIT_FAILURE;
            }
            if (cudaFree(oldB) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free B)\n");
                return EXIT_FAILURE;
            }
            if (cudaFree(oldC) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free C)\n");
                return EXIT_FAILURE;
            }
        }
    }
    status = cublasDestroy(handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! shutdown error (A)\n");
        return EXIT_FAILURE;
    }

    cuCtxDestroy(ctx);
    return 0;
}

I only free memory in the 2nd iteration of the for loop to mimic the behavior of the original python program. The program will crash in the 2nd iteration of the for-loop when it tries to free “A”, with cudaFree returning a cudaErrorIllegalAddress error.

There are several things to note here:

  • the order of the malloc calls matters. If I allocate A before B things run fine
  • the numerical constants matter a bit. For some values (e.g. n=30) no crash occurs, for others there is a crash
  • The order of the free/malloc calls matter. If I free the memory in the same iteration where I allocate, everything works just fine

At this point I’m pretty desperate. I don’t see why or where I’m doing anything wrong. If anyone could help me, I’d really appreciate it.

I forgot: I’m using the current 331 drivers. (The problem occured with both 331.62 and 331.75).

Also, to compile above program, I simply used:

nvcc test.c -o test -lcublas -lcuda