cuSolverSp_LinearSolver produces different results with different order methods

pepe2 · September 14, 2017, 9:40am

Hi NVidia,

I am running cuSolverSp_LinearSolver with the matrix that you provided (lap2D_5pt_n100.mtx) and what I noticed is that the solution vector X, has completely different solutions when the order method is the default symrcm (Reverse Cuthill-McKee) or the alternative symamd (Approximate Minimum Degree).

method symrcm (I am only outputing the last element value the x9999):
./cuSolverSp_LinearSolver -file=lap2D_5pt_n100.mtx
…
x9999 = 2.756075
b9999 = 1.000000
(GPU) |b - Ax| = 1.818989E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 7.513384E+02
(GPU) |b - Ax|/(|A|*|x|) = 3.026248E-16
timing chol: CPU = 0.193306 sec , GPU = 0.141850 sec

method symamd:
./cuSolverSp_LinearSolver -file=lap2D_5pt_n100.mtx -P=symamd
…
x9999 = 32.066522
b9999 = 1.000000
(GPU) |b - Ax| = 2.273737E-12
(GPU) |A| = 8.000000E+00
(GPU) |x| = 7.513384E+02
(GPU) |b - Ax|/(|A|*|x|) = 3.782810E-16
timing chol: CPU = 0.078467 sec , GPU = 0.043648 sec

So the x9999 (as well as the other X pameters) show completely different solutions.
BTW, by using my own simple 3x3 matrix, I have confirmed that only the second ordering method has shown correct results.

Any ideas?

Simple print function to print the X and B vector
int i;
for (i=0; i < colsA; i++)
{
printf (“x%d = %f\n”, i, h_x[i]);
printf (" b%d = %f\n", i, h_b[i]);
}

Robert_Crovella · September 16, 2017, 2:57pm

The measure of correctness here is the various normalized residuals calculated. They are pretty small and nearly the same. I don’t think you can compare just one or even a small number of datapoints out of 10000 and say there is a problem.

In general, I would not expect two different numerical methods to produce identically the same result.

njuffa · September 16, 2017, 4:22pm

Presumably different methods exist and are supported by the library because they support different trade-off scenarios (e.g. with respect to performance, accuracy, robustness) or use cases. A library user should have some domain knowledge to determine which method best suits the needs of their particular use case.

That does not mean there couldn’t be a bug somewhere, but to demonstrate the existence of a bug a lot more evidence would be needed than this example.

pepe2 · September 19, 2017, 7:59am

Hi All, thank you for your support. That makes full sense for me.

The reason why I have raised the topic was because I was using a simple matrix (3x3) as shown below.

%%MatrixMarket matrix coordinate real symmetric
% Generated 15-Sep-2017
3 3 6
1 1 1
2 1 3
3 1 -4
2 2 10
3 2 -7
3 3 42

which by using the two reordering algorithms yielded very different results.
The matrix A is then:
1 3 -4
3 10 -7
-4 -7 42

The b vector used is [1; 2; 3].

Solving manually or with octave one gets:
x0 = 232
x1 = -61
x2 = 12

Solving with ./cuSolverSp_LinearSolver symrcm one gets (wrong results):
x0 = 24
x1 = 467
x2 = -123

Solving with ./cuSolverSp_LinearSolver symamd
x0 = 232
x1 = -61
x2 = 12

The reason for this wrong results with symrcm seems to be that symrcm reorders my matrix like [3; 1; 2], therefore generating a reordered matrix as:
42 -4 -7
-4 1 3
-7 3 10

but it does not reorder b at all (before or when solving), and therefore solves a different system:
42 -4 -7 * x0 = 1
-4 1 3 x1 2
-7 3 10 x2 3

x0 = 24
x1 = 467
x2 = -123

I can confirm that when I change my b to the reorder expected and run again with symrcm, I now get the correct results.

This problem does not affect symamd, as for my matrix, with the symamd method there is no reordering of matrix A, and so the problem is never seen.

This problem is also not detected by the given matrix example (lap2D_5pt_n100.mtx), because the vector b used is set to all ones (so the lack of reordering of b does not matter, just the x9999 ran with one method may not match the x9999 ran with another method, but will match other x, there is just a reorder issue for this example, but the values are correct).

Please let me know if my understanding is correct.

Robert_Crovella · September 19, 2017, 6:46pm

Yes, you need to reorder the RHS vector to match the A reordering. The h_Q vector in that sample code provides the necessary reordering indices/map. The cuda sample code does not demonstrate that step, and it will be needed except in the demonstrated case where the RHS vector is a 1 so reordering is meaningless.

jcmarginal · September 20, 2017, 8:52pm

How can I reorder the RHS vector?

I tried

p = symcrm(A);
R = A(p,p);
b = b(p);

But I still get a different result.

Robert_Crovella · September 21, 2017, 1:05am

My suggestion was to use the h_Q vector to reorder the b vector. You don’t seem to be doing that. Using the h_Q vector to reorder the b vector is not the only possible method, but I’ll demonstrate that since it is the one I suggested.

Here is what I had in mind. The following test case uses the short matrix market example provided by OP, and acknowledges that the solution vector x = {232, -61, 12} (possibly reordered - see below) is the correct solution. First we compile and run both symamd and symrcm cases, and just as OP reported, one case is correct, one is not. Then we compile and run with -DUSE_REORDERB and we see that the outputs match (albeit in different order). The output vectors are in a different order, but this is due to the fact that one of the cases uses reordering, one does not, just as OP reported. Since the sample code overwrites the original A matrix in the reordering case, we should expect the result vector to be a reordered version of the “correct solution” and that is exactly what we see:

$ cat cuSolverSp_LinearSolver.cpp
/*
 * Copyright 2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/*
 *  Test three linear solvers, including Cholesky, LU and QR.
 *  The user has to prepare a sparse matrix of "matrix market format" (with extension .mtx).
 *  For example, the user can download matrices in Florida Sparse Matrix Collection.
 *  (http://www.cise.ufl.edu/research/sparse/matrices/)
 *
 *  The user needs to choose a solver by the switch -R<solver> and
 *  to provide the path of the matrix by the switch -F<file>, then
 *  the program solves
 *          A*x = b  where b = ones(m,1)
 *  and reports relative error
 *          |b-A*x|/(|A|*|x|)
 *
 *  The elapsed time is also reported so the user can compare efficiency of different solvers.
 *
 *  The runtime of linear solver contains symbolic analysis, numerical factorization and solve.
 *  The user can set environmental variable OMP_NUM_THREADS to configure number of cores in CPU path.
 *
 *  How to use
        /cuSolverSp_LinearSolver            // Default: Cholesky, symrcm & file=lap2D_5pt_n100.mtx
 *     ./cuSolverSp_LinearSolver -R=chol  -file=<file>   // cholesky factorization
 *     ./cuSolverSp_LinearSolver -R=lu -P=symrcm -file=<file>     // symrcm + LU with partial pivoting
 *     ./cuSolverSp_LinearSolver -R=qr -P=symamd -file=<file>     // symamd + QR factorization
 *
 *
 *  Remark: the absolute error on solution x is meaningless without knowing condition number of A.
 *     The relative error on residual should be close to machine zero, i.e. 1.e-15.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

#include <cuda_runtime.h>

#include "cusparse.h"
#include "cusolverSp.h"

#include "helper_cuda.h"
#include "helper_cusolver.h"

template <typename T_ELEM>
int loadMMSparseMatrix(
    char *filename,
    char elem_type,
    bool csrFormat,
    int *m,
    int *n,
    int *nnz,
    T_ELEM **aVal,
    int **aRowInd,
    int **aColInd,
    int extendSymMatrix);

void UsageSP(void)
{
    printf( "<options>\n");
    printf( "-h          : display this help\n");
    printf( "-R=<name>   : choose a linear solver\n");
    printf( "              chol (cholesky factorization), this is default\n");
    printf( "              qr   (QR factorization)\n");
    printf( "              lu   (LU factorization)\n");
    printf( "-P=<name>    : choose a reordering\n");
    printf( "              symrcm (Reverse Cuthill-McKee)\n");
    printf( "              symamd (Approximate Minimum Degree)\n");
    printf( "-file=<filename> : filename containing a matrix in MM format\n");
    printf( "-device=<device_id> : <device_id> if want to run on specific GPU\n");

    exit( 0 );
}

void parseCommandLineArguments(int argc, char *argv[], struct testOpts &opts)
{
    memset(&opts, 0, sizeof(opts));

    if (checkCmdLineFlag(argc, (const char **)argv, "-h"))
    {
        UsageSP();
    }

    if (checkCmdLineFlag(argc, (const char **)argv, "R"))
    {
        char *solverType = NULL;
        getCmdLineArgumentString(argc, (const char **)argv, "R", &solverType);

        if (solverType)
        {
            if ((STRCASECMP(solverType, "chol") != 0) && (STRCASECMP(solverType, "lu") != 0) && (STRCASECMP(solverType, "qr") != 0))
            {
                printf("\nIncorrect argument passed to -R option\n");
                UsageSP();
            }
            else
            {
                opts.testFunc = solverType;
            }
        }
    }

    if (checkCmdLineFlag(argc, (const char **)argv, "P"))
    {
        char *reorderType = NULL;
        getCmdLineArgumentString(argc, (const char **)argv, "P", &reorderType);

        if (reorderType)
        {
            if ((STRCASECMP(reorderType, "symrcm") != 0) && (STRCASECMP(reorderType, "symamd") != 0))
            {
                printf("\nIncorrect argument passed to -P option\n");
                UsageSP();
            }
            else
            {
                opts.reorder = reorderType;
            }
        }
    }

    if (!opts.reorder)
    {
        opts.reorder = "symrcm"; // Setting default reordering to be symrcm.
    }

    if (checkCmdLineFlag(argc, (const char **)argv, "file"))
    {
        char *fileName = 0;
        getCmdLineArgumentString(argc, (const char **)argv, "file", &fileName);

        if (fileName)
        {
            opts.sparse_mat_filename = fileName;
        }
        else
        {
            printf("\nIncorrect filename passed to -file \n ");
            UsageSP();
        }
    }
}

int main (int argc, char *argv[])
{
    struct testOpts opts;
    cusolverSpHandle_t handle = NULL;
    cusparseHandle_t cusparseHandle = NULL; // used in residual evaluation
    cudaStream_t stream = NULL;
    cusparseMatDescr_t descrA = NULL;

    int rowsA = 0; // number of rows of A
    int colsA = 0; // number of columns of A
    int nnzA  = 0; // number of nonzeros of A
    int baseA = 0; // base index in CSR format

    // CSR(A) from I/O
    int *h_csrRowPtrA = NULL;
    int *h_csrColIndA = NULL;
    double *h_csrValA = NULL;

    double *h_x = NULL; // x = A \ b
    double *h_b = NULL; // b = ones(m,1)
    double *h_r = NULL; // r = b - A*x

    int *h_Q = NULL; // <int> n
                     // reorder to reduce zero fill-in
                     // Q = symrcm(A) or Q = symamd(A)
    // B = Q*A*Q^T
    int *h_csrRowPtrB = NULL; // <int> n+1
    int *h_csrColIndB = NULL; // <int> nnzA
    double *h_csrValB = NULL; // <double> nnzA
    int *h_mapBfromA = NULL;  // <int> nnzA

    size_t size_perm = 0;
    void *buffer_cpu = NULL; // working space for permutation: B = Q*A*Q^T

    int *d_csrRowPtrA = NULL;
    int *d_csrColIndA = NULL;
    double *d_csrValA = NULL;
    double *d_x = NULL; // x = A \ b
    double *d_b = NULL; // a copy of h_b
    double *d_r = NULL; // r = b - A*x

    double tol = 1.e-12;
    int reorder = 0; // no reordering
    int singularity = 0; // -1 if A is invertible under tol.

    // the constants are used in residual evaluation, r = b - A*x
    const double minus_one = -1.0;
    const double one = 1.0;

    double x_inf = 0.0;
    double r_inf = 0.0;
    double A_inf = 0.0;
    int errors = 0;
    int issym = 0;

    double start, stop;
    double time_solve_cpu;
    double time_solve_gpu;

    parseCommandLineArguments(argc, argv, opts);

    if (NULL == opts.testFunc)
    {
        opts.testFunc = "chol"; // By default running Cholesky as NO solver selected with -R option.
    }

    findCudaDevice(argc, (const char **)argv);

    if (opts.sparse_mat_filename == NULL)
    {
        opts.sparse_mat_filename =  sdkFindFilePath("lap2D_5pt_n100.mtx", argv[0]);
        if (opts.sparse_mat_filename != NULL)
            printf("Using default input file [%s]\n", opts.sparse_mat_filename);
        else
            printf("Could not find lap2D_5pt_n100.mtx\n");
    }
    else
    {
        printf("Using input file [%s]\n", opts.sparse_mat_filename);
    }

    printf("step 1: read matrix market format\n");

    if (opts.sparse_mat_filename == NULL)
    {
        fprintf(stderr, "Error: input matrix is not provided\n");
        return EXIT_FAILURE;
    }

    if (loadMMSparseMatrix<double>(opts.sparse_mat_filename, 'd', true , &rowsA, &colsA,
           &nnzA, &h_csrValA, &h_csrRowPtrA, &h_csrColIndA, true))
    {
        exit(EXIT_FAILURE);
    }
    baseA = h_csrRowPtrA[0]; // baseA = {0,1}
    printf("sparse matrix A is %d x %d with %d nonzeros, base=%d\n", rowsA, colsA, nnzA, baseA);

    if ( rowsA != colsA ){
        fprintf(stderr, "Error: only support square matrix\n");
        return 1;
    }

    checkCudaErrors(cusolverSpCreate(&handle));
    checkCudaErrors(cusparseCreate(&cusparseHandle));

    checkCudaErrors(cudaStreamCreate(&stream));
    checkCudaErrors(cusolverSpSetStream(handle, stream));

    checkCudaErrors(cusparseSetStream(cusparseHandle, stream));

    checkCudaErrors(cusparseCreateMatDescr(&descrA));

    checkCudaErrors(cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL));

    if (baseA)
    {
        checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE));
    }
    else
    {
        checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO));
    }

    h_x = (double*)malloc(sizeof(double)*colsA);
    h_b = (double*)malloc(sizeof(double)*rowsA);
    h_r = (double*)malloc(sizeof(double)*rowsA);
    assert(NULL != h_x);
    assert(NULL != h_b);
    assert(NULL != h_r);

    checkCudaErrors(cudaMalloc((void **)&d_csrRowPtrA, sizeof(int)*(rowsA+1)));
    checkCudaErrors(cudaMalloc((void **)&d_csrColIndA, sizeof(int)*nnzA));
    checkCudaErrors(cudaMalloc((void **)&d_csrValA   , sizeof(double)*nnzA));
    checkCudaErrors(cudaMalloc((void **)&d_x, sizeof(double)*colsA));
    checkCudaErrors(cudaMalloc((void **)&d_b, sizeof(double)*rowsA));
    checkCudaErrors(cudaMalloc((void **)&d_r, sizeof(double)*rowsA));

    // verify if A has symmetric pattern or not
    checkCudaErrors(cusolverSpXcsrissymHost(
        handle, rowsA, nnzA, descrA, h_csrRowPtrA, h_csrRowPtrA+1, h_csrColIndA, &issym));

    if ( 0 == strcmp(opts.testFunc, "chol") )
    {
        if (!issym)
        {
            printf("Error: A has no symmetric pattern, please use LU or QR \n");
            exit(EXIT_FAILURE);
        }
    }

    if (NULL != opts.reorder)
    {
        printf("step 2: reorder the matrix A to minimize zero fill-in\n");
        printf("        if the user choose a reordering by -P=symrcm or -P=symamd\n");
        printf("        The reordering will overwrite A such that \n");
        printf("            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)\n");

        h_Q          = (int*   )malloc(sizeof(int)*colsA);
        h_csrRowPtrB = (int*   )malloc(sizeof(int)*(rowsA+1));
        h_csrColIndB = (int*   )malloc(sizeof(int)*nnzA);
        h_csrValB    = (double*)malloc(sizeof(double)*nnzA);
        h_mapBfromA  = (int*   )malloc(sizeof(int)*nnzA);

        assert(NULL != h_Q);
        assert(NULL != h_csrRowPtrB);
        assert(NULL != h_csrColIndB);
        assert(NULL != h_csrValB   );
        assert(NULL != h_mapBfromA);

        if ( 0 == strcmp(opts.reorder, "symrcm") )
        {
            checkCudaErrors(cusolverSpXcsrsymrcmHost(
                handle, rowsA, nnzA,
                descrA, h_csrRowPtrA, h_csrColIndA,
                h_Q));
        }
        else if ( 0 == strcmp(opts.reorder, "symamd") )
        {
            checkCudaErrors(cusolverSpXcsrsymamdHost(
                handle, rowsA, nnzA,
                descrA, h_csrRowPtrA, h_csrColIndA,
                h_Q));
        }
        else
        {
            fprintf(stderr, "Error: %s is unknown reordering\n", opts.reorder);
            return 1;
        }

        // B = Q*A*Q^T
        memcpy(h_csrRowPtrB, h_csrRowPtrA, sizeof(int)*(rowsA+1));
        memcpy(h_csrColIndB, h_csrColIndA, sizeof(int)*nnzA);

        checkCudaErrors(cusolverSpXcsrperm_bufferSizeHost(
            handle, rowsA, colsA, nnzA,
            descrA, h_csrRowPtrB, h_csrColIndB,
            h_Q, h_Q,
            &size_perm));

        if (buffer_cpu)
        {
            free(buffer_cpu);
        }
        buffer_cpu = (void*)malloc(sizeof(char)*size_perm);
        assert(NULL != buffer_cpu);

        // h_mapBfromA = Identity
        for(int j = 0 ; j < nnzA ; j++)
        {
            h_mapBfromA[j] = j;
        }
        checkCudaErrors(cusolverSpXcsrpermHost(
            handle, rowsA, colsA, nnzA,
            descrA, h_csrRowPtrB, h_csrColIndB,
            h_Q, h_Q,
            h_mapBfromA,
            buffer_cpu));

        // B = A( mapBfromA )
        for(int j = 0 ; j < nnzA ; j++)
        {
            h_csrValB[j] = h_csrValA[ h_mapBfromA[j] ];
        }

        // A := B
        memcpy(h_csrRowPtrA, h_csrRowPtrB, sizeof(int)*(rowsA+1));
        memcpy(h_csrColIndA, h_csrColIndB, sizeof(int)*nnzA);
        memcpy(h_csrValA   , h_csrValB   , sizeof(double)*nnzA);

        printf("step 2.1: set right hand side vector (b) to 1\n");
    }
    else
    {
        printf("step 2: set right hand side vector (b) to 1\n");
    }

for(int row = 0 ; row < rowsA ; row++)
    {
#ifdef USE_123
        h_b[row] = row+1.0;
#else
        h_b[row] = 1.0;
#endif
    }
#ifdef USE_REORDERB
    double *t_b = (double*)malloc(sizeof(double)*rowsA);
    for(int row = 0 ; row < rowsA ; row++)
    {
      t_b[row] = h_b[h_Q[row]];
    }
    memcpy(h_b, t_b, sizeof(double)*rowsA);
#endif

for (int row = 0; row < rowsA; row++)

    printf("step 3: prepare data on device\n");
    checkCudaErrors(cudaMemcpy(d_csrRowPtrA, h_csrRowPtrA, sizeof(int)*(rowsA+1), cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_csrColIndA, h_csrColIndA, sizeof(int)*nnzA     , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_csrValA   , h_csrValA   , sizeof(double)*nnzA  , cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_b, h_b, sizeof(double)*rowsA, cudaMemcpyHostToDevice));

    printf("step 4: solve A*x = b on CPU\n");
    // A and b are read-only
    start = second();
    start = second();

    if ( 0 == strcmp(opts.testFunc, "chol") )
    {
        checkCudaErrors(cusolverSpDcsrlsvcholHost(
            handle, rowsA, nnzA,
            descrA, h_csrValA, h_csrRowPtrA, h_csrColIndA,
            h_b, tol, reorder, h_x, &singularity));
    }
    else if ( 0 == strcmp(opts.testFunc, "lu") )
    {
        checkCudaErrors(cusolverSpDcsrlsvluHost(
            handle, rowsA, nnzA,
            descrA, h_csrValA, h_csrRowPtrA, h_csrColIndA,
            h_b, tol, reorder, h_x, &singularity));

    }
    else if ( 0 == strcmp(opts.testFunc, "qr") )
    {
        checkCudaErrors(cusolverSpDcsrlsvqrHost(
            handle, rowsA, nnzA,
            descrA, h_csrValA, h_csrRowPtrA, h_csrColIndA,
            h_b, tol, reorder, h_x, &singularity));

    }
    else
    {
        fprintf(stderr, "Error: %s is unknown function\n", opts.testFunc);
        return 1;
    }
    stop = second();

    time_solve_cpu = stop - start;

    if (0 <= singularity)
    {
        printf("WARNING: the matrix is singular at row %d under tol (%E)\n", singularity, tol);
    }

    printf("step 5: evaluate residual r = b - A*x (result on CPU)\n");
    checkCudaErrors(cudaMemcpy(d_r, h_b, sizeof(double)*rowsA, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_x, h_x, sizeof(double)*colsA, cudaMemcpyHostToDevice));
    checkCudaErrors(cusparseDcsrmv(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,
        rowsA,
        colsA,
        nnzA,
        &minus_one,
        descrA,
        d_csrValA,
        d_csrRowPtrA,
        d_csrColIndA,
        d_x,
        &one,
        d_r));
    checkCudaErrors(cudaMemcpy(h_r, d_r, sizeof(double)*rowsA, cudaMemcpyDeviceToHost));

    x_inf = vec_norminf(colsA, h_x);
    r_inf = vec_norminf(rowsA, h_r);
    A_inf = csr_mat_norminf(rowsA, colsA, nnzA, descrA, h_csrValA, h_csrRowPtrA, h_csrColIndA);

    printf("(CPU) |b - A*x| = %E \n", r_inf);
    printf("(CPU) |A| = %E \n", A_inf);
    printf("(CPU) |x| = %E \n", x_inf);
    printf("(CPU) |b - A*x|/(|A|*|x|) = %E \n", r_inf/(A_inf * x_inf));

    printf("step 6: solve A*x = b on GPU\n");
    // d_A and d_b are read-only
    start = second();
    start = second();

    if ( 0 == strcmp(opts.testFunc, "chol") )
    {
        checkCudaErrors(cusolverSpDcsrlsvchol(
            handle, rowsA, nnzA,
            descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA,
            d_b, tol, reorder, d_x, &singularity));

    }
    else if ( 0 == strcmp(opts.testFunc, "lu") )
    {
        printf("WARNING: no LU available on GPU \n");
    }
    else if ( 0 == strcmp(opts.testFunc, "qr") )
    {
        checkCudaErrors(cusolverSpDcsrlsvqr(
            handle, rowsA, nnzA,
            descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA,
            d_b, tol, reorder, d_x, &singularity));
    }
    else
    {
        fprintf(stderr, "Error: %s is unknow function\n", opts.testFunc);
        return 1;
    }
    checkCudaErrors(cudaDeviceSynchronize());
    stop = second();

    time_solve_gpu = stop - start;

    if (0 <= singularity)
    {
        printf("WARNING: the matrix is singular at row %d under tol (%E)\n", singularity, tol);
    }

    printf("step 7: evaluate residual r = b - A*x (result on GPU)\n");
    checkCudaErrors(cudaMemcpy(d_r, d_b, sizeof(double)*rowsA, cudaMemcpyDeviceToDevice));

    checkCudaErrors(cusparseDcsrmv(cusparseHandle,
        CUSPARSE_OPERATION_NON_TRANSPOSE,
        rowsA,
        colsA,
        nnzA,
        &minus_one,
        descrA,
        d_csrValA,
        d_csrRowPtrA,
        d_csrColIndA,
        d_x,
        &one,
        d_r));

    checkCudaErrors(cudaMemcpy(h_x, d_x, sizeof(double)*colsA, cudaMemcpyDeviceToHost));
    checkCudaErrors(cudaMemcpy(h_r, d_r, sizeof(double)*rowsA, cudaMemcpyDeviceToHost));

    x_inf = vec_norminf(colsA, h_x);
    r_inf = vec_norminf(rowsA, h_r);

    if ( 0 != strcmp(opts.testFunc, "lu") )
    {
        // only cholesky and qr have GPU version
        printf("(GPU) |b - A*x| = %E \n", r_inf);
        printf("(GPU) |A| = %E \n", A_inf);
        printf("(GPU) |x| = %E \n", x_inf);
        printf("(GPU) |b - A*x|/(|A|*|x|) = %E \n", r_inf/(A_inf * x_inf));
    }

    fprintf (stdout, "timing %s: CPU = %10.6f sec , GPU = %10.6f sec\n", opts.testFunc, time_solve_cpu, time_solve_gpu);

for (int i=colsA-3; i < colsA; i++)
{
printf ("x%d = %f\n", i, h_x[i]);
}

if (handle) { checkCudaErrors(cusolverSpDestroy(handle)); }
    if (cusparseHandle) { checkCudaErrors(cusparseDestroy(cusparseHandle)); }
    if (stream) { checkCudaErrors(cudaStreamDestroy(stream)); }
    if (descrA) { checkCudaErrors(cusparseDestroyMatDescr(descrA)); }

    if (h_csrValA   ) { free(h_csrValA); }
    if (h_csrRowPtrA) { free(h_csrRowPtrA); }
    if (h_csrColIndA) { free(h_csrColIndA); }
    if (h_x) { free(h_x); }
    if (h_b) { free(h_b); }
    if (h_r) { free(h_r); }

    if (h_Q) { free(h_Q); }

    if (h_csrRowPtrB) { free(h_csrRowPtrB); }
    if (h_csrColIndB) { free(h_csrColIndB); }
    if (h_csrValB   ) { free(h_csrValB   ); }
    if (h_mapBfromA ) { free(h_mapBfromA ); }

    if (buffer_cpu) { free(buffer_cpu); }

    if (d_csrValA   ) { checkCudaErrors(cudaFree(d_csrValA)); }
    if (d_csrRowPtrA) { checkCudaErrors(cudaFree(d_csrRowPtrA)); }
    if (d_csrColIndA) { checkCudaErrors(cudaFree(d_csrColIndA)); }
    if (d_x) { checkCudaErrors(cudaFree(d_x)); }
    if (d_b) { checkCudaErrors(cudaFree(d_b)); }
    if (d_r) { checkCudaErrors(cudaFree(d_r)); }

    return 0;
}

$ g++ -Wno-write-strings -fopenmp -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc mmio.c mmio_wrapper.cpp  cuSolverSp_LinearSolver.cpp -o test -L/usr/local/cuda/lib64 -lcusolver -lcusparse -lcudart -DUSE_123
$ ./test  -file=./test.mtx -P=symrcm
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

Using input file [./test.mtx]
step 1: read matrix market format
sparse matrix A is 3 x 3 with 9 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b) to 1
step 3: prepare data on device
step 3: prepare data on device
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 2.273737E-13
(CPU) |A| = 5.300000E+01
(CPU) |x| = 4.670000E+02
(CPU) |b - A*x|/(|A|*|x|) = 9.186444E-18
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 2.273737E-13
(GPU) |A| = 5.300000E+01
(GPU) |x| = 4.670000E+02
(GPU) |b - A*x|/(|A|*|x|) = 9.186444E-18
timing chol: CPU =   0.009646 sec , GPU =   0.002064 sec
x0 = 24.000000
x1 = 467.000000
x2 = -123.000000
$ ./test  -file=./test.mtx -P=symamd
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

Using input file [./test.mtx]
step 1: read matrix market format
sparse matrix A is 3 x 3 with 9 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b) to 1
step 3: prepare data on device
step 3: prepare data on device
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 0.000000E+00
(CPU) |A| = 5.300000E+01
(CPU) |x| = 2.320000E+02
(CPU) |b - A*x|/(|A|*|x|) = 0.000000E+00
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 0.000000E+00
(GPU) |A| = 5.300000E+01
(GPU) |x| = 2.320000E+02
(GPU) |b - A*x|/(|A|*|x|) = 0.000000E+00
timing chol: CPU =   0.007672 sec , GPU =   0.001756 sec
x0 = 232.000000
x1 = -61.000000
x2 = 12.000000
$ g++ -Wno-write-strings -fopenmp -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc mmio.c mmio_wrapper.cpp  cuSolverSp_LinearSolver.cpp -o test -L/usr/local/cuda/lib64 -lcusolver -lcusparse -lcudart -DUSE_123 -DUSE_REORDERB
$ ./test  -file=./test.mtx -P=symrcm                                                            
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

Using input file [./test.mtx]
step 1: read matrix market format
sparse matrix A is 3 x 3 with 9 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b) to 1
step 3: prepare data on device
step 3: prepare data on device
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 2.273737E-13
(CPU) |A| = 5.300000E+01
(CPU) |x| = 2.320000E+02
(CPU) |b - A*x|/(|A|*|x|) = 1.849168E-17
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 2.273737E-13
(GPU) |A| = 5.300000E+01
(GPU) |x| = 2.320000E+02
(GPU) |b - A*x|/(|A|*|x|) = 1.849168E-17
timing chol: CPU =   0.004246 sec , GPU =   0.004758 sec
x0 = 12.000000
x1 = 232.000000
x2 = -61.000000
$ ./test  -file=./test.mtx -P=symamd                                                           
 GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

Using input file [./test.mtx]
step 1: read matrix market format
sparse matrix A is 3 x 3 with 9 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm or -P=symamd
        The reordering will overwrite A such that
            A := A(Q,Q) where Q = symrcm(A) or Q = symamd(A)
step 2.1: set right hand side vector (b) to 1
step 3: prepare data on device
step 3: prepare data on device
step 3: prepare data on device
step 4: solve A*x = b on CPU
step 5: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = 0.000000E+00
(CPU) |A| = 5.300000E+01
(CPU) |x| = 2.320000E+02
(CPU) |b - A*x|/(|A|*|x|) = 0.000000E+00
step 6: solve A*x = b on GPU
step 7: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = 0.000000E+00
(GPU) |A| = 5.300000E+01
(GPU) |x| = 2.320000E+02
(GPU) |b - A*x|/(|A|*|x|) = 0.000000E+00
timing chol: CPU =   0.005667 sec , GPU =   0.002042 sec
x0 = 232.000000
x1 = -61.000000
x2 = 12.000000
$

If you desired an identical solution, you could reorder the resultant h_x vector in a “reverse” fashion to the way the above code (with -DUSE_REORDERB) reorders the b vector.