Accurate computation of modified Bessel functions - Using netlib Fortran routines in CUDA?

JFSebastian · January 2, 2013, 2:45pm

I’m dealing with the problem of accurately calculating the modified Bessel function of zero-th order I0 in CUDA.

For a long time, I have been using a rational Chebyshev approximation according to the paper

J.M. Blair, “Rational Chebyshev approximations for the modified Bessel functions I_0(x) and I_1(x)”, Math. Comput., vol. 28, n. 126, pp. 581-583, Apr. 1974.

which, as compared to the result provided by Matlab, gives an average error of the order of 1e-29. Unfortunately, this seemingly high accuracy is not anymore enough for a new application I’m working on.

Matlab uses the Fortran routines developed by D.E. Amos

Amos, D.E., “A Subroutine Package for Bessel Functions of a Complex Argument and Nonnegative Order,” Sandia National Laboratory Report, SAND85-1018, May, 1985.

Amos, D.E., “A Portable Package for Bessel Functions of a Complex Argument and Nonnegative Order,” Trans. Math. Software, 1986.

which are available for download from the netlib/amos website.

There are ways to use those Fortran routines in a C/C++ code by compiling them in a library file and then using a C/C++ wrapper (see for example netlib_wrapping). I’m wondering if there is any means to make device functions out of those Fortran routines to be then called by a CUDA kernel).

njuffa · January 2, 2013, 6:03pm

Highly accurate double-precision implementations of mathematical functions provide a maximum error on the order of 1e-16. Your question implies you are looking for implementations accurate to quadruple precision or even better?

Since CUDA is basically C / C++, you should be able to simply translate the Fortran code into C by hand which would take a few hours of porting time per function. The bigger problem would likely be that CUDA does mot provide a built-in quadruple precision data type. NVIDIA provides a sample implementation of double-double arithmetic (addition, subtraction, multiplication, division, square root) on the registered developer website which could use; the basic operations are accurate to about 1e-31 if I recall correctly.

pasoleatis · January 2, 2013, 8:21pm

Hello,

I have a program which does a Hankel transform and it uses the Bessel functions. In C I just used the gsl library:

#include 
#include 
#include 
#include 
#include 
#include 

m1[i]=fabs(gsl_sf_bessel_J1(bsz[i]))/rmax;

Will this code work also in a kernel ?

njuffa · January 2, 2013, 8:44pm

CUDA supports those Bessel functions traditionally supported with Unix / Posix, namely j0, j1, jn, y0, y1, yn, in both single-precision and double-precision flavors. I am not familiar with GSL, other than being aware of its existence.

I am curious in which application areas CUDA developers use Bessel functions. Any pointers are appreciated.

njuffa · January 3, 2013, 8:33pm

I checked the GSL documentation and the protoype of gsl_sf_bessel_J1() appears to be:

double  gsl_sf_bessel_j1 (const double x)

Based on that, you should be able to simply invoke j1() with a double-precision argument in the corresponding device code.

JFSebastian · January 3, 2013, 9:46pm

Thank you pasoleatis for having pointed out the GSL library. Could I use the gsl Bessel routines (for example, gsl_sg_bessel_j1) as device functions? Is the zeroth order modified Bessel function I0 included in that library? Unfortunately, modified Bessel functions are not available in the mentioned CUDA set of Bessel functions (j0, j1, jn, y0, y1, yn).

Thank you njuffa for providing your information as well as the prototype of the function. To answer your question, Bessel functions arise in many fields, for example microwaves, optics and signal processing (filtering).

njuffa · January 3, 2013, 10:10pm

[url]GNU Scientific Library — GSL 2.7 documentation

From the GSL documentation (see link above) it would appear that I0 and I1 are supported. Given that there are probably a bunch of dependencies on other GSL functions, it might not be trivial to port the code to CUDA (I haven’t looked or tried).

The algorithms by Blair look very straightforward to implement. Assuming the basic approximations are sound this should result in code with good performance and good accuracy. I intend to take them for a spin and report my findings back here (may take a few days).

In general, customers should feel free to suggest expansions to the CUDA math library, by filing a bug through the registered developer website. The more widely used such a function is, and the more cross-platform consensus there is as to specifications (e.g. the set of Bessel functions specified in Posix) the higher the likelihood that it might be added. All functions added to the CUDA math library beyond what is specified in C99 were added on the basis of customer feedback and needs.

njuffa · January 4, 2013, 7:39am

I got half of Blair’s algorithm for i0 to work, the part that handles |x|

I guess the forum software does not like the “less than” character, and duly deleted the rest of my post …

njuffa · January 4, 2013, 7:53am

Don’t feel like re-typing my post right now, here is the code if anybody is interested. I had a very hard time reading the coefficients from the paper due to poor print quality, especially as far as the digits ‘0’ and ‘8’ are concerned, so no guarantees. A quick test of the code below revealed a maximum ulp error of 10.06 at 11.799507711536606

[code]
/* Based on: J. M. Blair, Rational Chebyshev Approximations of the Modified
Bessel Functions I0(x) and I1(x). Mathematics of Computation, Vol. 28,
No. 126, April 1974, pp. 581-583.
*/
device double besseli0 (double a)
{
double p, q, a2;

if (fabs (a)

pasoleatis · January 4, 2013, 10:57am

Hello,

I am personally using them for the so called Hankel transform. This type of transform appears for a function f(x,y) where f(x,y)=f(r) with r=sqrt(xx+yy) . When a Fourier transform (f(r) → f(k)) is made of this function we can change the coordinates from x,y to polar (cylindrical) coordinates r,angle. Now we can first integral over angle and obtain the lowest order Bessel function. The transform can now be written as a matrix vector multiplication with coefficient in the matrix being related to the j1(k*r). Of course I can defined this matrix on cpu and then transfer everything to gpu and use blas library, but the transform matrix is of order N^2, where N is the number of points in the interval (0,rmax]. Án alternative is to call directly in the kernel the Bessel function to go around the memory restriction.

My question is if I can use in a kernel a function defined in the gsl library in the same way I use for example sin or exp function.

njuffa · January 6, 2013, 4:49am

GSL is a host library. Host code cannot be called from device code, i.e. a CUDA kernel. If the CUDA math library does not provide a particular function you need, you could try to port the relevant GSL code to CUDA. I am not not familiar with the GSL code base, and do not know how difficult this would be.

You mentioned that your code requires the Bessel function j1(). This function is provided by the CUDA math library, so you can use that in CUDA kernels, and there is no need for GSL’s j1() implementation. Simply invoke j1() in your CUDA code as you would invoke any other math function such as sin(), exp(), and log().

pasoleatis · January 6, 2013, 9:10am

Thanks. I hope this answered also the question of the OP.

njuffa · January 6, 2013, 9:16am

Here is the implementation of the modified Bessel function i0(). I used Blair’s basic algorithmic framework, but substituted my own minimax approximations for a bit of added accuracy. In terms of accuracy, the interval [0,15] has much larger error, as the rational approximation doesn’t have very favorable numerical properties. One might consider splitting into different intervals, and using a different kind of approximation around zero. However this code should compare favorably with other double-precision implementations.

/* Algorithm derived from: J. M. Blair, Rational Chebyshev Approximations for
   the Modified Bessel Functions I0(x) and I1(x). Mathematics of Computation,
   Vol. 28, No. 126, April 1974, pp. 581-583
*/
__device__ double my_besseli0 (double a)
{
    double p, q;

    a = fabs (a);
    if (a > 15.0) {
        /* worst case error about 5 ulps */
        q = 1.0 / a;
        p =            -4.4979236558557991E+006;
        p = fma (p, q,  2.7472555659426521E+006);
        p = fma (p, q, -6.4572046640793153E+005);
        p = fma (p, q,  8.5476214845610564E+004);
        p = fma (p, q, -7.1127665397362362E+003);
        p = fma (p, q,  4.1710918140001479E+002);
        p = fma (p, q, -1.3787683843558749E+001);
        p = fma (p, q,  1.1452802345029696E+000);
        p = fma (p, q,  2.1935487807470277E-001);
        p = fma (p, q,  9.0727240339987830E-002);
        p = fma (p, q,  4.4741066428061006E-002);
        p = fma (p, q,  2.9219412078729436E-002);
        p = fma (p, q,  2.8050629067165909E-002);
        p = fma (p, q,  4.9867785050221047E-002);
        p = fma (p, q,  3.9894228040143265E-001);
        p = p * q;
        q = sqrt (a);
        p = p * q;
        q = exp (0.5 * a);  /* prevent premature overflow */
        p = p * q;
        p = p * q;
    } else {
        /* worst case error about 9 ulps */
        a = a * a;
        p =             9.8607740386136838E-030;
        p = fma (p, a,  2.5272752843984919E-026);
        p = fma (p, a,  3.5478634725286819E-023);
        p = fma (p, a,  3.3285536355723051E-020);
        p = fma (p, a,  2.2291803503378560E-017);
        p = fma (p, a,  1.0887882347249224E-014);
        p = fma (p, a,  3.8764307466171632E-012);
        p = fma (p, a,  9.9012829763755340E-010);
        p = fma (p, a,  1.7594962832748317E-007);
        p = fma (p, a,  2.0714408281080058E-005);
        p = fma (p, a,  1.4969826091882862E-003);
        p = fma (p, a,  5.8564901942292975E-002);
        p = fma (p, a,  1.0000000000000000E+000);
        q =            -3.1940857378019824E-015;
        q = fma (q, a,  1.3713171027237885E-011);
        q = fma (q, a, -2.6023145863317146E-008);
        q = fma (q, a,  2.7260506735327429E-005);
        q = fma (q, a, -1.5740392230827929E-002);
        q = fma (q, a,  4.0000000000000000E+000);
        p = p / q;
        p = fma (p, a, 1.0);
    } 
    return p;
}

sergeyn · January 7, 2013, 1:38am

Hi,
Do you happen to have bessel function for complex numbers ?

Sergey.

njuffa · January 7, 2013, 3:47am

CUDA comes with a header file cuComplex.h which implements basic complex arithmetic. Beyond that there is currently no support for complex arithmetic.

While I have significant experience with the implementation of math functions with real arguments, I have no experience with the implementation of transcendental functions with complex arguments, let alone special functions like the Bessel functions. The code above represents my first work ever on Bessel functions. So I am afraid I am unable to come up with code for complex Bessel functions at this point.

You may want to check whether CUDA Fortran ([url]http://www.pgroup.com/resources/cudafortran.htm[/url]) has support for complex Bessel functions, since Fortran has a long history of support for complex arithmetic.

JFSebastian · January 8, 2013, 3:29pm

Thanks for your code Njuffa.

At [url]http://www.orangeowlsolutions.com/archives/256[/url] you could find our original CUDA code for the calculation of I0 according to Blair’s algorithm. Of course, our code could be improved by using fma’s, as you already do.

Did you further optimize the polynomial coefficients by minimax? Against which reference did you perform the minimax optimization and you made the assessment of the results of your routine?

njuffa · January 8, 2013, 6:39pm

I kept Blair’s original algorithmic framework, but generated new minimax approximations from scratch, using the Remez algorithm, operating with 1024-bit arithmetic for all intermediate computations. Relative error of both approximations was reported to be < 1e-17, the error peaks were leveled to less than 1e-33. For multi-precision arithmetic, I use Brent’s MP code ([url]http://www.netlib.org/toms/524[/url]) which I have used for many years and consider reliable.

I used the series 10.25.2 in the DLMF ([url]DLMF: 10.25 Definitions) as my reference, using the first 10*k+100 terms, where k is the argument to i0().

I kicked off a more intensive test with 100 million test cases last night to get a more accurate reading of the maximum ulp error for the posted code. All the worst case errors occur for arguments in [10,15]. I will report here the worst case ulp error found and the corresponding argument once the run is done.

njuffa · January 9, 2013, 7:02am

My test run finally finished using 100 million test case in [1e-3, 713]. Reference was implemented from the series mentioned above, using the first 1.25*k+35 terms [where k is the input to the i0() function], evaluated in 140-bit arithmetic. This provides slightly more than quadruple precision reference results. The worst case error found was 9.20648 ulps for input of 1.2558077675329185e+01. About 71% of results are rounded correctly.

The error of my rational approximation is higher than I would like. Comparing to SFSebastian’s code based on Blair’s paper, I notice he used the rational approximatiom of degree 16 + 3 from table VIII, listed as providing a precision of 18.38, more than an order of magnitude order better than my rational approximation of degree 12 + 5. Looks like I should try a rational approximation of higher degree.

SFSebastian, I noticed a few minor issue with the code on your website, that may or may not be worth fixing. Larger negative arguments are not handled correctly, there is a missing fabs(). The function overflows prematurely for arguments greater than about 709.7 in magnitude, this could be avoided by multiplying twice by exp(0.5). Specifying the coefficient as products of two double-precision numbers is likely to introduce additional rounding error as these products are not exactly representable as double-precision numbers; it would be better to specify the coefficients as a single literal. For the code branch dealing with arguments greater than 15, I would suggest multiplying by rsqrt() instead of dividing by sqrt(). While CUDA’s double-precision rsqrt() is not IEEE-rounded its maximum ulp error is very close to 0.5, so the substitution should improve performance at negligible impact to accuracy.

While I don’t have the time to run full test run with 100 million test cases against this code, based on preliminary testing it seems the the maximum ulp error could be around 8.5 ulps, with about 55% of results correctly rounded.

JFSebastian · January 10, 2013, 1:55pm

Thank you very much, Njuffa.

Below, please find attached the version of the code revised according to your recommendations:

[code]
device double bessi0(double x)
{

double num, den, x2, y, tn_2, tn_1, tn, tn_1_old;

x2 = abs(x*x);

if (abs(x)

JFSebastian · January 10, 2013, 1:56pm